Version 2019.06.02

Polyleven

Hyper fast Levenshtein distance library

Polyleven is a very efficient Levenshtein distance library for Python. Intended to be used in performance critical situations.

The source code of the latest version is available on polyleven-0.3.tar.gz or GitHub (requires Python ≥ 3.4).

How to install

$ pip install polyleven

Usage

Use levenshtein() to compute the edit distance between a pair of words.

>>> from polyleven import levenshtein
>>> levenshtein('hazel', 'maze')
2

Benchmark

I've benchmarked several Levenshtein libraries I found on PyPI, modelling a classical spell check problem.

Here is how I set up the test: I choose an English dictionary with 100k words as a test data set. I picked up a random word, then compute the Levenshtein distance with every word in the dictionary. I repeated this procedure 10 times (so 1 million computations) for each library.

The system used for this benchmark was:

All the benchmark resources are available in the test directory.

Result

 Test target                    |  TIME[sec]  |  SPEED[calls/s]
------------------------------- | ----------- | ----------------
 edlib.align                    |      4.763  |         208216
 editdistance.eval              |      1.943  |         510450
 jellyfish.levenshtein_distance |      0.722  |        1374081
 distance.levenshtein           |      0.623  |        1591396
 Levenshtein.distance           |      0.500  |        1982764
 polyleven.levenshtein          |      0.431  |        2303420

Benchmark (longer inputs)

This benchmark illustrates how each Levenshtein distance library behaves as the size of input strings increases. For this test, I generated 5000 pairs of random strings for each size 16, 32, 64, 128, 256, 512, 1024, and then, measured the time each library takes to process all the inputs.

Result

 Test target                   | N=16  | N=32  | N=64  | N=128 | N=256 | N=512 | N=1024
------------------------------ | ----- | ----- | ----- | ----- | ----- | ----- | ------
edlib.align                    | 0.040 | 0.063 | 0.094 | 0.205 | 0.432 | 0.908 |  2.089
editdistance.eval              | 0.027 | 0.049 | 0.086 | 0.178 | 0.336 | 0.740 | 58.139
jellyfish.levenshtein_distance | 0.009 | 0.032 | 0.118 | 0.470 | 1.874 | 8.877 | 42.848
distance.levenshtein           | 0.007 | 0.029 | 0.109 | 0.431 | 1.726 | 6.950 | 27.998
Levenshtein.distance           | 0.006 | 0.022 | 0.085 | 0.336 | 1.328 | 5.286 | 21.097
polyleven.levenshtein          | 0.003 | 0.005 | 0.010 | 0.043 | 0.149 | 0.550 |  2.109

Links

Below is the list of libraries benchmarked above.

edlib v1.2.1 https://github.com/Martinsos/edlib
editdistance v0.4 https://github.com/aflc/editdistance
jellyfish v0.5.6 https://github.com/jamesturk/jellyfish
distance v0.1.3 https://github.com/doukremt/distance
Python-Levenshtein v0.12 https://github.com/ztane/python-Levenshtein
polyleven v0.3 https://github.com/fujimotos/polyleven

License

I hereby place my work "polyleven" into the public domain. All my copyrights, including related and neighbouring rights, of this work are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>


Back to ceptord.net