*mbleven* is a fast algorithm to compute k-bounded Levenshtein
distance. In general, it's one of the fastest algorithm for cases
where the bound parameter is small (k ≤ 3).

To illustrate the performance characteristic, I conducted a benchmark test that compares mbleven and the Wagner-Fischer algorithm.

The measurement was done by count the time each algorithm takes to process 4 million binary strings. Here is the graph that shows the result (on Intel Core i3-4010U with GCC 6.3.0):

The benchmark program is available from mbleven-benchmark.c. To run this program:

$ cc -O2 -o mbleven-benchmark mbleven-benchmark.c $ ./mbleven-benchmark

There are several implementations available.

- mbleven is the reference implementation written in Python. This implementation also supports Damerau-Levenshtein distance.
- distance contains a C implementation (fast_comp).
- polyleven contains another C implementation.

mbleven is a "hypothesis-based" algorithm, which means that it solves the edit distance problem by testing a collection of hypotheses.

Suppose you are trying to compute the distance between `s` and `t` under
the constraint k = 1. The mbleven algorithm first enumerates all the
edit operations that are possible under the threshold `k`:

- Can an insertion transform
`s`to`t`? - Can a deletion transform
`s`to`t`? - Can a substitution transform
`s`to`t`?

... then mbleven checks each hypothesis one by one to find out if any of them holds true.

As explained later, each hypothesis can be verified in O(n) time
using O(1) space, so it can run quite faster than the common
algorithms that require O(n^{2}) time and O(n) space.

One important aspect of mbleven is very efficient pruning of search space. In particular, we can reject most of hypotheses just by looking at the length of input strings.

Consider the task of computing the edit distance between s = 'foo' and t = 'bar' under the constraint k = 1. We can immediately see that a substitution is the only hypothesis that we need to check.

Why? It's because 'foo' and 'bar' have the same string lengths (3 chars), so operations such as "one insertion" or "one deletion" can't convert one into another.

We can expand this argument to other cases as well. If s = 'foo' and t = 'fo' with k=2, we only need to check "one deletion + one substitution". If s = 'foobar' and t = 'bar' with k=3, we just need to test "three deletions".

As mentioned above, each hypothesis can be verified in O(n) time. The following code shows how the verification can be done:

def check_model(s, t, model): m = len(s) n = len(t) k = len(model) i, j, c = 0, 0, 0 while (i < m) and (j < n): if s[i] != t[j]: if k <= c: return c + 1 if model[c] == 'd': # deletion i += 1 elif model[c] == 'i': # insertion j += 1 elif model[c] == 'r': # replacement/substitution i += 1 j += 1 c += 1 else: i += 1 j += 1 return c + (m - i) + (n - j)

This function returns the number of the operations it consumed to convert s into t. If the return value is greater than k, it means that the specified model cannot transform s into t.

As you can see, each iteration of the while loop increments i or j, so the main loop will break after at most (n + m) steps. Thus, this function runs in linear time to the length of the input strings.

I hereby place my work "mbleven" into the public domain. All my copyrights, including related and neighbouring rights, of mbleven, this document, and the accompanied benchmark program are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>