Polyleven – Fast Pythonic Levenshtein Library

Author:

Fujimoto Seiji

Published:

2018-12-15

Updated:

2021-02-05

Copyright:

This document has been placed in the public domain.

1. Introduction

Polyleven is a Levenshtein distance library for Python, with a special attention to efficiency.

The basic idea behind this library is that we can gain the best of different algorithms by switching between them depending on the kinds of input strings.

2. How to Install

2.1. PyPI

$ pip install polyleven

2.2. Source Distribution

2.3. GIT Repository

$ git clone https://github.com/fujimotos/polyleven

3. Usage

3.1. Basic Usage

Polyleven provides a single interface function levenshtein(). You can use this function to measure the similarity of two strings.

>>> from polyleven import levenshtein
>>> levenshtein('aaa', 'ccc')
3

3.2. Advanced Usage

If you only care about distances under a certain threshold, you can pass the max threshold to the third argument.

>>> levenshtein('acc', 'ccc', 1)
1
>>> levenshtein('aaa', 'ccc', 1)
2

In general, you can gain a noticeable speed boost with threshold \(k < 3\).

4. Benchmark

4.1 English Words

To compare Polyleven with other Pythonic edit distance libraries, a million word pairs was generated from SCOWL.

Each library was measured how long it takes to evaluate all of these words. The following table summarises the result:

Function Name

TIME[sec]

SPEED[pairs/s]

edlib

4.763

208216

editdistance

1.943

510450

jellyfish.levenshtein_distance

0.722

1374081

distance.levenshtein

0.623

1591396

Levenshtein.distance

0.500

1982764

polyleven.levenshtein

0.431

2303420

You may notice that edlib and editdistance appear to be slower than other libraries. This is because both internally use Myers’ algorithm for computing the Levenshtein distance.

The idea behind Myers’ algorithm is to use bit parallelism to speed up the computation. In order to perform the computation efficiently, we need to pre-compute a table of bit patterns for each character. For short inputs like English words (the average length is just 7-8 characters), the cost of pre-computation dwarfs the benefit of parallelism.

4.2. Longer Inputs

To evaluate the efficiency for longer inputs, I careted 5000 pairs of random strings of size 16, 32, 64, 128, 256, 512 and 1024.

Each library was measured how fast it can process these entries. [1]

Library

N=16

N=32

N=64

N=128

N=256

N=512

N=1024

edlib

0.040

0.063

0.094

0.205

0.432

0.908

2.089

editdistance

0.027

0.049

0.086

0.178

0.336

0.740

58.139

jellyfish

0.009

0.032

0.118

0.470

1.874

8.877

42.848

distance

0.007

0.029

0.109

0.431

1.726

6.950

27.998

Levenshtein

0.006

0.022

0.085

0.336

1.328

5.286

21.097

polyleven

0.003

0.005

0.010

0.043

0.149

0.550

2.109

You may notice that edlib and editdistance are good at handling longer inputs. Other libraries have a steep cost curve as the string length increases.

The good thing about polyleven is that it can process both short and long inputs efficiently.

4.3. List of Libraries

Library

Version

URL

edlib

v1.2.1

https://github.com/Martinsos/edlib

editdistance

v0.4

https://github.com/aflc/editdistance

jellyfish

v0.5.6

https://github.com/jamesturk/jellyfish

distance

v0.1.3

https://github.com/doukremt/distance

Levenshtein

v0.12

https://github.com/ztane/python-Levenshtein

polyleven

v0.3

https://github.com/fujimotos/polyleven

5. Implementation Note

As of version 0.5, polyleven uses the following heuristics to choose an algorithm:

+===============+    Yes    +---------------------+
||   k = 0     ||---------->| PyUnicode_Compare() |
+===============+           +---------------------+
      | No
      V
+===============+    Yes    +-------------------+
||   k < 4     ||---------->| mbleven algorithm |
+===============+           +-------------------+
      | No
      V
+===============+    Yes    +------------------+
|| len(s) < 65 ||---------->| Myers' algorithm |
+===============+           +------------------+
      | No
      V
+-------------------------------+
| Myers algorithm (with blocks) |
+-------------------------------+

Before 0.4, polyleven used the Wagner-Fischer algorithm for shorter strings, but, alas, it turned out that reasonably optimized Myers’ algorithm almost always performs better.