A faster way to compute edit distance with a bound

Fujimoto Seiji <fujimoto@ceptord.net>

mbleven is a fast algorithm to compute k-bounded Levenshtein distance. In general, it's one of the fastest algorithm for cases where the bound parameter is small (k ≤ 3).

Benchmark

To illustrate the performance characteristic, I conducted a benchmark test that compares mbleven and the Wagner-Fischer algorithm.

The measurement was done by count the time each algorithm takes to process 4 million binary strings. Here is the graph that shows the result (on Intel Core i3-4010U with GCC 6.3.0):

mbleven benchmark

The benchmark program is available from mbleven-benchmark.c. To run this program:

      $ cc -O2 -o mbleven-benchmark mbleven-benchmark.c
      $ ./mbleven-benchmark

Implementations

There are several implementations available.

How it works

mbleven is a "hypothesis-based" algorithm, which means that it solves the edit distance problem by testing a collection of hypotheses.

Suppose you are trying to compute the distance between s and t under the constraint k = 1. The mbleven algorithm first enumerates all the edit operations that are possible under the threshold k:

  1. Can an insertion transform s to t?
  2. Can a deletion transform s to t?
  3. Can a substitution transform s to t?

... then mbleven checks each hypothesis one by one to find out if any of them holds true.

As explained later, each hypothesis can be verified in O(n) time using O(1) space, so it can run quite faster than the common algorithms that require O(n2) time and O(n) space.

Efficient Pruning

One important aspect of mbleven is very efficient pruning of search space. In particular, we can reject most of hypotheses just by looking at the length of input strings.

Consider the task of computing the edit distance between s = 'foo' and t = 'bar' under the constraint k = 1. We can immediately see that a substitution is the only hypothesis that we need to check.

Why? It's because 'foo' and 'bar' have the same string lengths (3 chars), so operations such as "one insertion" or "one deletion" can't convert one into another.

We can expand this argument to other cases as well. If s = 'foo' and t = 'fo' with k=2, we only need to check "one deletion + one substitution". If s = 'foobar' and t = 'bar' with k=3, we just need to test "three deletions".

Verification Algorithm

As mentioned above, each hypothesis can be verified in O(n) time. The following code shows how the verification can be done:

    def check_model(s, t, model):
        m = len(s)
        n = len(t)
        k = len(model)
        i, j, c = 0, 0, 0
        while (i < m) and (j < n):
            if s[i] != t[j]:
                if k <= c:
                    return c + 1
                if model[c] == 'd':  # deletion
                    i += 1
                elif model[c] == 'i':  # insertion
                    j += 1
                elif model[c] == 'r':  # replacement/substitution
                    i += 1
                    j += 1
                c += 1
            else:
                i += 1
                j += 1
        return c + (m - i) + (n - j)

This function returns the number of the operations it consumed to convert s into t. If the return value is greater than k, it means that the specified model cannot transform s into t.

As you can see, each iteration of the while loop increments i or j, so the main loop will break after at most (n + m) steps. Thus, this function runs in linear time to the length of the input strings.

License

I hereby place my work "mbleven" into the public domain. All my copyrights, including related and neighbouring rights, of mbleven, this document, and the accompanied benchmark program are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>