Version 2019.06.28

mbleven

mbleven is a fast, memory-efficient algorithm to compute k-bounded Levenshtein distance.

If the upper-bound parameter k is relatively small (k=1,2), it's hard to beat this algorithm in speed.

  1. Quick Introduction
  2. Benchmark
  3. Algorithm Details
  4. Links
  5. License

Quick Introduction

The de-facto solution for edit distance problem is the Wagner-Fischer algorithm. It's very generic and reasonably fast algorithm that is directly applicable to real life problems. So there is no wonder that a large portion of historical research efforts were devoted to this algorithm.

However, it is sometimes forgotten that this (flood-fill-the-matrix) is not the only approach for the problem, nor the best one. Indeed, if your problem is domain-specific, it is often very rewarding to visit other approaches that rely on different paradigms.

mbleven is one of such alternative approaches. Its strength is space and time efficiency in computing k-bounded distance. It's especially effective to spell-checking problem, where you need to filter through a large word set.

Benchmark

Here is a simple benchmark to illustrate the performance characteristics of mbleven, compared to the Wagner-Fischer algorithm.

The script used in this benchmark is downloadable from mbleven-benchmark.c. To run this program, follow the steps below.

$ wget https://ceptord.net/fastcomp/mbleven-benchmark.c
$ cc -o mbleven-benchmark mbleven-benchmark.c
$ ./mbleven-benchmark

The numbers in the graph above were obtained using a machine running on Intel Core i3-4010U (1.70GHz) with GCC 6.3.0.

Algorithm Details

mbleven is a "hypothesis-based" algorithm, that is, it solves the edit distance problem by testing a collection of hypotheses.

Suppose you are trying to compute the distance between s and t under the constraint k = 1. mbleven first enumerates all the edit operations that are possible under the threshold k=1:

  1. Can an insertion transform s to t?
  2. Can a deletion transform s to t?
  3. Can a substitution transform s to t?

... then it checks each hypothesis one by one to find out if any of them holds true.

As explained later, each hypothesis can be verified in O(n) time using O(1) space, so it can run quite faster than the Wagner Fischer algorithm that requires O(n2) time and O(n) space.

Pruning

One important aspect of mbleven is very efficient pruning of search space. Specifically, we can reject most of hypotheses just by looking at the length of the input strings.

Consider the task of computing the edit distance between s = 'foo' and t = 'bar' under the constraint k = 1. We can immediately see that a substitution is the only hypothesis that we need to check. Why? Since 'foo' and 'bar' have the same string lengths (3 characters), other edit operations (an insertoin or a deletion) cannot convert one into another.

We can expand this argument to other cases as well. If s = 'foo' and t = 'fo' with k=2, we only need to check if one deletion + one substitution can convert one into another. If s = 'foobar' and t = 'bar' with k=3, we just need to test three deletions.

Verification

So how can we verify each hypthesis? Here is the algorithm for it implemented in Python.

def check_model(s, t, model):
    m = len(s)
    n = len(t)
    k = len(model)
    i, j, c = 0, 0, 0
    while (i < m) and (j < n):
        if s[i] != t[j]:
            if k <= c:
                return c + 1
            if model[c] == 'd':  # deletion
                i += 1
            elif model[c] == 'i':  # insertion
                j += 1
            elif model[c] == 'r':  # replacement/substitution
                i += 1
                j += 1
            c += 1
        else:
            i += 1
            j += 1
    return c + (m - i) + (n - j)

This function returns the number of the operations it consumed to convert s into t. If the return value is greater than k, it means that the specified model cannot transform s into t.

As you can see, each iteration of the while loop increments i or j, so the main loop will break after at most (n + m) steps. Thus, this function runs in linear time in the length of the input strings.

mbleven is the reference implementation written in Python.

distance contains an implementation of mbleven.

polyleven implements this algorithm in C.

License

I hereby place this document into the public domain. All my copyrights, including related and neighbouring rights, of this document and the accompanied benchmark program are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>


Back to ceptord.net