CEPTORD.NET Fujimoto Seiji December 2018 Updated: 2021-01-11PolylevenTable of Contents 1. Introduction 2. How to Install 3. Usage 4. Benchmark 5. Internal Mechanism 6. License 7. Contact Information1. IntroductionPolyleven is a Levenshtein distance library for Python, whose focus is put on efficiency. The name "polyleven" comes from the fact that it combines several algorithms behind the scenes (poly is a Greek-origin word that means many). The basic idea is that we can gain the best of different algorithms by switching between them depending on the kinds of input strings.2. How to InstallPyPI Run the following command on your console for quick installation. $ pip install polyleven Source Distribution polyleven-0.7.zip (2021-02-23, 439kb) * Optimize the implementation of myers1999_block() polyleven-0.6.zip (2021-01-11, 429kb) * Fix the implementation of blockhash polyleven-0.5.zip (2020-02-14, 429kb) * Add Windows platform support. GIT Repository Visit cgit/polyleven. This is the main development repository. GitHub mirror is available at github.com/fujimotos/polyleven.3. Usage3.1. Basic UsagePolyleven provides a single interface function "levenshtein()". You can use this function to measure the similarity of two strings. >>> from polyleven import levenshtein >>> levenshtein('aaa', 'ccc') 33.2. Advanced UsageIf you only care about distances under a certain threshold, you can pass the max threshold to the third argument. >>> levenshtein('acc', 'ccc', 1) 1 >>> levenshtein('aaa', 'ccc', 1) 2 In general, you can gain a noticeable speed boost with threshold k ≤ 3.4. Benchmark4.1 English WordsThe following table answers the question "how efficient are these libraries at processing English words?". I created a million English word pairs, and measured how long each program takes to process all of them. Function Name | TIME[sec] | SPEED[pairs/s] ------------------------------- | ----------- | ---------------- edlib | 4.763 | 208216 editdistance | 1.943 | 510450 jellyfish.levenshtein_distance | 0.722 | 1374081 distance.levenshtein | 0.623 | 1591396 Levenshtein.distance | 0.500 | 1982764 polyleven.levenshtein | 0.431 | 2303420 You may notice that edlib and editdistance appear to be slower than other libraries. Why? Because both internally use Myers' algorithm for computing the Levenshtein distance. The idea behind Myers' algorithm is to use bit parallelism to speed up the computation. In order to perform the computation efficiently, we need to pre-compute a table of bit patterns for each character. For short inputs like English words (the average length is just 7-8 characters), the cost of pre-computation dwarfs the benefit of parallelism. * Measured using Python 3.5.3 on Debian Jessie with Intel Core i3-4010U (1.70GHz)4.2. Longer InputsThe following table answers the question "what happens if the input gets longer?" I created 5000 pairs of random strings of size 16, 32, 64, 128, 256, 512 and 1024, and measured how fast each library can process them. Library | N=16 | N=32 | N=64 | N=128 | N=256 | N=512 | N=1024 ----------- | ----- | ----- | ----- | ----- | ----- | ----- | ------ edlib | 0.040 | 0.063 | 0.094 | 0.205 | 0.432 | 0.908 | 2.089 editdistance| 0.027 | 0.049 | 0.086 | 0.178 | 0.336 | 0.740 | 58.139 jellyfish | 0.009 | 0.032 | 0.118 | 0.470 | 1.874 | 8.877 | 42.848 distance | 0.007 | 0.029 | 0.109 | 0.431 | 1.726 | 6.950 | 27.998 Levenshtein | 0.006 | 0.022 | 0.085 | 0.336 | 1.328 | 5.286 | 21.097 polyleven | 0.003 | 0.005 | 0.010 | 0.043 | 0.149 | 0.550 | 2.109 You may notice that edlib and editdistance are good at handling longer inputs. Other libraries have a steep cost curve as the string length increases. The good thing about polyleven is that it can process both cases (shorter and longer inputs) with reasonable efficiency. * Measured using Python 3.5.3 on Debian Jessie with Intel Core i3-4010U (1.70GHz)4.3. List of Librariesedlib v1.2.1 https://github.com/Martinsos/edlib editdistance v0.4 https://github.com/aflc/editdistance jellyfish v0.5.6 https://github.com/jamesturk/jellyfish distance v0.1.3 https://github.com/doukremt/distance Levenshtein v0.12 https://github.com/ztane/python-Levenshtein polyleven v0.3 http://ceptord.net/cgit/polyleven5. Internal Mechanism5.1 How polyleven chooses an algorithmAs of v0.5, polyleven uses the following heuristics to choose an algorithm. +===============+ Yes +---------------------+ || k = 0 ||---------->| PyUnicode_Compare() | +===============+ +---------------------+ | No V +===============+ Yes +-------------------+ || k ≤ 3 ||---------->| mbleven algorithm | +===============+ +-------------------+ | No V +===============+ Yes +------------------+ || len(s) ≤ 64 ||---------->| Myers' algorithm | +===============+ +------------------+ | No V +-------------------------------+ | Myers algorithm (with blocks) | +-------------------------------+ Before 0.4, polyleven used the Wagner-Fischer algorithm for shorter strings, but, alas, it turned out that reasonably optimized Myers' algorithm almost always performs better.6. LicenseI hereby place my work "polyleven" into the public domain. All my copyrights, including related and neighbouring rights, of this work are abandoned. 2018 Fujimoto Seiji <fujimoto@ceptord.net> Back to index page