CEPTORD.NET                                               Fujimoto Seiji
                                                           December 2018
                                                     Updated: 2021-01-11

                                Polyleven

Table of Contents

    1.  Introduction
    2.  How to Install
    3.  Usage
    4.  Benchmark
    5.  Internal Mechanism
    6.  License
    7.  Contact Information

1.  Introduction

    Polyleven is a Levenshtein distance library for Python, whose focus
    is put on efficiency.

    The name "polyleven" comes from the fact that it combines several
    algorithms behind the scenes (poly is a Greek-origin word that means
    many).

    The basic idea is that we can gain the best of different algorithms
    by switching between them depending on the kinds of input strings.

2.  How to Install

    PyPI

      Run the following command on your console for quick installation.

      $ pip install polyleven

    Source Distribution

      polyleven-0.7.zip (2021-02-23, 439kb)

      * Optimize the implementation of myers1999_block()
      * Migrate the license to MIT License issue#2

      polyleven-0.6.zip (2021-01-11, 429kb)

      * Fix the implementation of blockhash

      polyleven-0.5.zip (2020-02-14, 429kb)

      * Add Windows platform support.

    GIT Repository

      Visit cgit/polyleven. This is the main development repository.

      GitHub mirror is available at github.com/fujimotos/polyleven.

3.  Usage

3.1. Basic Usage

    Polyleven provides a single interface function "levenshtein()". You
    can use this function to measure the similarity of two strings.

    >>> from polyleven import levenshtein
    >>> levenshtein('aaa', 'ccc')
    3

3.2. Advanced Usage

    If you only care about distances under a certain threshold, you can
    pass the max threshold to the third argument.

    >>> levenshtein('acc', 'ccc', 1)
    1
    >>> levenshtein('aaa', 'ccc', 1)
    2

    In general, you can gain a noticeable speed boost with threshold
    k ≤ 3.

4.  Benchmark

4.1  English Words

    The following table answers the question "how efficient are these
    libraries at processing English words?". I created a million English
    word pairs, and measured how long each program takes to process
    all of them.

     Function Name                  |  TIME[sec]  |  SPEED[pairs/s]
    ------------------------------- | ----------- | ----------------
     edlib                          |      4.763  |         208216
     editdistance                   |      1.943  |         510450
     jellyfish.levenshtein_distance |      0.722  |        1374081
     distance.levenshtein           |      0.623  |        1591396
     Levenshtein.distance           |      0.500  |        1982764
     polyleven.levenshtein          |      0.431  |        2303420

    You may notice that edlib and editdistance appear to be slower than
    other libraries. Why? Because both internally use Myers' algorithm
    for computing the Levenshtein distance.

    The idea behind Myers' algorithm is to use bit parallelism to speed
    up the computation. In order to perform the computation efficiently,
    we need to pre-compute a table of bit patterns for each character.
    For short inputs like English words (the average length is just 7-8
    characters), the cost of pre-computation dwarfs the benefit of
    parallelism.

    * Measured using Python 3.5.3 on Debian Jessie with Intel Core
      i3-4010U (1.70GHz)

4.2. Longer Inputs

    The following table answers the question "what happens if the input
    gets longer?" I created 5000 pairs of random strings of size 16, 32,
    64, 128, 256, 512 and 1024, and measured how fast each library can
    process them.

     Library    | N=16  | N=32  | N=64  | N=128 | N=256 | N=512 | N=1024
    ----------- | ----- | ----- | ----- | ----- | ----- | ----- | ------
    edlib       | 0.040 | 0.063 | 0.094 | 0.205 | 0.432 | 0.908 |  2.089
    editdistance| 0.027 | 0.049 | 0.086 | 0.178 | 0.336 | 0.740 | 58.139
    jellyfish   | 0.009 | 0.032 | 0.118 | 0.470 | 1.874 | 8.877 | 42.848
    distance    | 0.007 | 0.029 | 0.109 | 0.431 | 1.726 | 6.950 | 27.998
    Levenshtein | 0.006 | 0.022 | 0.085 | 0.336 | 1.328 | 5.286 | 21.097
    polyleven   | 0.003 | 0.005 | 0.010 | 0.043 | 0.149 | 0.550 |  2.109

    You may notice that edlib and editdistance are good at handling
    longer inputs. Other libraries have a steep cost curve as the string
    length increases.

    The good thing about polyleven is that it can process both cases
    (shorter and longer inputs) with reasonable efficiency.

    * Measured using Python 3.5.3 on Debian Jessie with Intel Core
      i3-4010U (1.70GHz)

4.3. List of Libraries

    edlib          v1.2.1 https://github.com/Martinsos/edlib
    editdistance   v0.4   https://github.com/aflc/editdistance
    jellyfish      v0.5.6 https://github.com/jamesturk/jellyfish
    distance       v0.1.3 https://github.com/doukremt/distance
    Levenshtein    v0.12  https://github.com/ztane/python-Levenshtein
    polyleven      v0.3   https://git.ceptord.net/polyleven

5.  Internal Mechanism

5.1  How polyleven chooses an algorithm

    As of v0.5, polyleven uses the following heuristics to choose an
    algorithm.

        +===============+    Yes    +---------------------+
        ||   k = 0     ||---------->| PyUnicode_Compare() |
        +===============+           +---------------------+
              | No
              V
        +===============+    Yes    +-------------------+
        ||   k ≤ 3     ||---------->| mbleven algorithm |
        +===============+           +-------------------+
              | No
              V
        +===============+    Yes    +------------------+
        || len(s) ≤ 64 ||---------->| Myers' algorithm |
        +===============+           +------------------+
              | No
              V
        +-------------------------------+
        | Myers algorithm (with blocks) |
        +-------------------------------+

    Before 0.4, polyleven used the Wagner-Fischer algorithm for shorter
    strings, but, alas, it turned out that reasonably optimized Myers'
    algorithm almost always performs better.

6. License

    Polyleven is licensed under MIT License. For the full license text,
    refer to LICENSE.

                              Back to index page