Version 2017.12.24

mbleven is a fast, memory-efficient algorithm to compute k-bounded Levenshtein distance.

If the upper-bound parameter `k` is relatively small (especially k=1,2), it's hard to beat this algorithm in speed.

mbleven measures edit distance by testing a collection of edit sequences
(*models* in mbleven terms). Given a upper-bound parameter `k`,
mbleven first enumerates all possible combinations of edit operations,
and then checks each model to see if any of them can convert one input
string into another.

For example, suppose we are trying to measure the Levenshtein distance between strA and strB under the constraint k=1. By the definition of Levenshtein distance, there exist three edit models:

- An insertion
- A deletion
- A substitution

If at least one of these edit models is found to be able to transform strA into strB, then mbleven concludes that the Levenshtein distance between strA and strB is one. Otherwise, it reports that these strings are more than one edit distance away.

As explained later, each edit model can be verified in linear time (in
the length of the input strings), with a constant memory space. So as
long as the number of edit models remains small, this algorithm will run
quite faster than the common O(N^{2}) algorithm.

One important aspect of mbleven is very efficient pruning of search space. Specifically, we can reject most of the models just by checking the length of the input strings.

Suppose that input strings are 'foo' and 'bar', and the upper-bound parameter `k` is 1, we can immediately conclude that a substitution is the only possible edit model we need to check. Why? Because 'foo' and 'bar' have the same string lengths, and applying a deletion (or an insertion) will make them unequal in length.

The same argument can easily be applied to other cases as well. For example, if the input strings are 'foo' and 'b' with k=2, we only need to check if two deletions can convert 'foo' into 'b', because all other combinations of edit operations cannot even make them the same length.

So how can we verify each edit model? Here is the algorithm (implemented in Python):

def check_model(str1, str2, model): len1 = len(str1) len2 = len(str2) k = len(model) i, j, c = 0, 0, 0 while (i < len1) and (j < len2): if str1[i] != str2[j]: if k <= c: return c + 1 if model[c] == 'd': # deletion i += 1 elif model[c] == 'i': # insertion j += 1 elif model[c] == 'r': # replacement/substitution i += 1 j += 1 c += 1 else: i += 1 j += 1 return c + (len1 - i) + (len2 - j)

This function returns the number of the operations it consumed to convert str1 into str2. If the return value is greater than `k`, it means that the specified `model` cannot transform str1 into str2.

*Note* This is a simplified algorithm based on the assumption that
one chose such a `model` which makes the input strings the same length.
In other words, those models that won't even make the input strings
the same length should be pruned in the previous pruning phase.

As you can see, each iteration of the while loop increments i or j, so the main loop will break after at most O(len1 + len2) steps. Thus, this function runs in linear time in the length of the input strings.

A simple benchmark was done to evaluate the performance of mbleven. For a comparison, this benchmark also includes the result of a slightly optimized implementation of the Wagner-Fischer algorithm.

For this test, I generated a set *S* which contains all binary
strings (like '01010') of length between 0 and 10. Then, for each string
pair from the cartesian product *S* x *S*, I computed the
Levenshtein distance between them (~= 4 million string pairs).

Each function was employed for this task, and the time it took to complete the task was measured using gettimeofday(2).

Here is the source code of this benchmark: mbleven-benchmark.c

You can execute this benchmark program as follows:

$ cc -o mbleven-benchmark mbleven-benchmark.c $ ./mbleven-benchmark

The system used for this benchmark was:

- Intel Core i3-4010U (1.70GHz)
- Linux x86_64 (Debian Stretch)
- GCC 6.3.0

Here is the result:

Function | k | TIME[s] | SPEED[calls/s] | (mbleven / WF) |
---|---|---|---|---|

mbleven | 1 | 0.408 | 10,279,682 | 11.1x |

mbleven | 2 | 0.832 | 5,042,668 | 6.6x |

mbleven | 3 | 2.873 | 1,459,895 | 2.0x |

wagner_fischer | 1 | 4.547 | 922,434 | - |

wagner_fischer | 2 | 5.491 | 763,915 | - |

wagner_fischer | 3 | 5.929 | 707,466 | - |

So, where k < 4, mbleven can run 2x - 11x faster than the Wagner-Fischer algorithm.

polyleven implements this algorithm in C.

distance contains an implementation of this algorithm. (fast_comp)

mbleven is the reference implementation written in Python.

I hereby place this document into the public domain. All my copyrights, including related and neighbouring rights, of this document and the accompanied benchmark program are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>