Polyleven – Fast Pythonic Levenshtein Library¶
- Author:
Fujimoto Seiji
- Published:
2018-12-15
- Updated:
2021-02-05
- Copyright:
This document has been placed in the public domain.
1. Introduction¶
Polyleven is a Levenshtein distance library for Python, with a special attention to efficiency.
The basic idea behind this library is that we can gain the best of different algorithms by switching between them depending on the kinds of input strings.
2. How to Install¶
2.1. PyPI¶
$ pip install polyleven
2.2. Source Distribution¶
-
Optimize the implementation of
myers1999_block()
.Migrate the license to MIT License #2
-
Fix the implementation of blockhash
-
Add Windows platform support.
2.3. GIT Repository¶
$ git clone https://github.com/fujimotos/polyleven
3. Usage¶
3.1. Basic Usage¶
Polyleven provides a single interface function levenshtein()
. You
can use this function to measure the similarity of two strings.
>>> from polyleven import levenshtein
>>> levenshtein('aaa', 'ccc')
3
3.2. Advanced Usage¶
If you only care about distances under a certain threshold, you can pass the max threshold to the third argument.
>>> levenshtein('acc', 'ccc', 1)
1
>>> levenshtein('aaa', 'ccc', 1)
2
In general, you can gain a noticeable speed boost with threshold \(k < 3\).
4. Benchmark¶
4.1 English Words¶
To compare Polyleven with other Pythonic edit distance libraries, a million word pairs was generated from SCOWL.
Each library was measured how long it takes to evaluate all of these words. The following table summarises the result:
Function Name |
TIME[sec] |
SPEED[pairs/s] |
---|---|---|
edlib |
4.763 |
208216 |
editdistance |
1.943 |
510450 |
jellyfish.levenshtein_distance |
0.722 |
1374081 |
distance.levenshtein |
0.623 |
1591396 |
Levenshtein.distance |
0.500 |
1982764 |
polyleven.levenshtein |
0.431 |
2303420 |
You may notice that edlib and editdistance appear to be slower than other libraries. This is because both internally use Myers’ algorithm for computing the Levenshtein distance.
The idea behind Myers’ algorithm is to use bit parallelism to speed up the computation. In order to perform the computation efficiently, we need to pre-compute a table of bit patterns for each character. For short inputs like English words (the average length is just 7-8 characters), the cost of pre-computation dwarfs the benefit of parallelism.
4.2. Longer Inputs¶
To evaluate the efficiency for longer inputs, I careted 5000 pairs of random strings of size 16, 32, 64, 128, 256, 512 and 1024.
Each library was measured how fast it can process these entries. [1]
Library |
N=16 |
N=32 |
N=64 |
N=128 |
N=256 |
N=512 |
N=1024 |
---|---|---|---|---|---|---|---|
edlib |
0.040 |
0.063 |
0.094 |
0.205 |
0.432 |
0.908 |
2.089 |
editdistance |
0.027 |
0.049 |
0.086 |
0.178 |
0.336 |
0.740 |
58.139 |
jellyfish |
0.009 |
0.032 |
0.118 |
0.470 |
1.874 |
8.877 |
42.848 |
distance |
0.007 |
0.029 |
0.109 |
0.431 |
1.726 |
6.950 |
27.998 |
Levenshtein |
0.006 |
0.022 |
0.085 |
0.336 |
1.328 |
5.286 |
21.097 |
polyleven |
0.003 |
0.005 |
0.010 |
0.043 |
0.149 |
0.550 |
2.109 |
You may notice that edlib and editdistance are good at handling longer inputs. Other libraries have a steep cost curve as the string length increases.
The good thing about polyleven is that it can process both short and long inputs efficiently.
4.3. List of Libraries¶
Library |
Version |
URL |
---|---|---|
edlib |
v1.2.1 |
|
editdistance |
v0.4 |
|
jellyfish |
v0.5.6 |
|
distance |
v0.1.3 |
|
Levenshtein |
v0.12 |
|
polyleven |
v0.3 |
5. Implementation Note¶
As of version 0.5, polyleven uses the following heuristics to choose an algorithm:
+===============+ Yes +---------------------+
|| k = 0 ||---------->| PyUnicode_Compare() |
+===============+ +---------------------+
| No
V
+===============+ Yes +-------------------+
|| k < 4 ||---------->| mbleven algorithm |
+===============+ +-------------------+
| No
V
+===============+ Yes +------------------+
|| len(s) < 65 ||---------->| Myers' algorithm |
+===============+ +------------------+
| No
V
+-------------------------------+
| Myers algorithm (with blocks) |
+-------------------------------+
Before 0.4, polyleven used the Wagner-Fischer algorithm for shorter strings, but, alas, it turned out that reasonably optimized Myers’ algorithm almost always performs better.