Version 2019.07.07

Can we optimize the Wagner-Fischer algorithm?

Answer: Yes.

  1. Overview
  2. How to Reduce Memory Usage
  3. How to Reduce Computation Time
  4. Benchmark
  5. License

1. Overview

The purpose of this expository note is to explain several practical techniques to optimize the Wagner-Fischer algorithm, which is the most popular algorithm for computing Levenshtein distance.

In this note, I also carry out a simple benchmark test to evaluate these techniques. The result suggests that it is possible to reduce the computation time by 20-40% in general cases.

2. How to Reduce Memory Usage

It is often said that Wagner-Fischer is an algorithm that uses O(mn) space. In other words, the memory usage grows quadratically with the length of the input strings. This common explanation comes from the fact that an array of (m + 1) x (n + 1) elements is required to hold the entire edit distance matrix.

However, there is a trivial improvement that reduces the space requirement from O(mn) to O(m). Here is how it works.

  1. Allocate a single array of length (m + 1).
  2. Initialize the array buffer with a series of integers {0, 1, 2, ..., m}.
  3. Compute the second row while updating the buffer incrementally.
  4. Repeat step 3 until you reach the last row.

A more detailed explanation follows below.

An interesting thing to note is that this technique also reduces the computation time, since it makes the code slightly simpler. In general, the memory-optimized implementation runs a couple of percent faster than the naive O(mn) implementation. You can confirm this effect by running the benchmark script below.

Step-by-step Explanation

Suppose you are trying to compute the Levenshtein distance between 'ab' and 'cd'. In this case, you need to fill a 3 x 3 matrix (shown below). Now the task is to solve this matrix using a single array of length 3.

    c d
  . . .   buf := [0, 0, 0]
a . . .
b . . .

Let's solve the matrix row by row. The first row is the easiest part because it's just an increasing integer sequence from 0 to m. So just fill the array with the values.

    c d
  0 1 2   buf[0] := 0
a . . .   buf[1] := 1
b . . .   buf[2] := 2

The tricky part starts here. For the second row, the value of the leftmost cell is always 1, but we cannot simply assign it to buf[0], or we lose the information required to compute the cell to its right. So we first save buf[0] to a temporary variable and then update buf[0]:

    c d
  0 1 2   tmp := buf[0]
a 1 . .   buf[0] := 1
b . . .

Now we can trivially compute the middle cell of the second row.

buf[0] = the cell to the left
buf[1] = the cell above
tmp    = the cell in the upper left

In other words, since 'a' differs from 'c':

<the value of the middle cell> = min(buf[0], buf[1], tmp) + 1 = 1

Of course, we must not assign the result to buf[1] in a hurry; again, that would make the next cell incomputable. So we save the value of buf[1] to tmp first, and then update buf[1]:

    c d
  0 1 2   cost := min(buf[0], buf[1], tmp) + 1
a 1 1 .   tmp := buf[1]
b . . .   buf[1] := cost

We can subsequently compute the last cell of the row. In this case, the value is

<the value of the right cell> = min(buf[1], buf[2], tmp) + 1

So update buf[2] with this value, and the second row is finished.

    c d
  0 1 2   cost := min(buf[1], buf[2], tmp) + 1
a 1 1 2   tmp := buf[2]
b . . .   buf[2] := cost

We can repeat the same procedure to fill the third row, which leads to the following result.

    c d
  0 1 2   buf = [2, 2, 2]
a 1 1 2
b 2 2 2

So the distance between 'ab' and 'cd' turns out to be 2.

Implementation in Python

The following is a straightforward implementation in Python.

def wagner_fischer_O1(s, t):
    n = len(s)
    m = len(t)
    # Initialize the buffer with the first row {0, 1, 2, ..., m}.
    buf = list(range(m + 1))

    for i in range(1, n + 1):
        # The leftmost cell of this row is always i. Save the old
        # buf[0] (the upper-left neighbor) to tmp before overwriting it.
        tmp = buf[0]
        buf[0] = i

        for j in range(1, m + 1):
            # At this point:
            #   buf[j - 1] = the cell to the left
            #   buf[j]     = the cell above
            #   tmp        = the cell in the upper left
            if s[i - 1] == t[j - 1]:
                tmp, buf[j] = buf[j], tmp
            else:
                tmp, buf[j] = buf[j], min(buf[j], buf[j - 1], tmp) + 1
    return buf[m]
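
As a quick sanity check, you can run the function against the worked
example above. The 'kitten'/'sitting' pair is a classic textbook case
(not taken from this note) whose distance is known to be 3.

assert wagner_fischer_O1('ab', 'cd') == 2        # the worked example
assert wagner_fischer_O1('kitten', 'sitting') == 3
assert wagner_fischer_O1('', 'abc') == 3         # edge case: empty string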

3. How to Reduce Computation Time

This section explains the so-called Ukkonen's optimization. It is well known that we only need to fill at most n x (2k + 1) cells to compute the edit distance, if we do not care about distances greater than a threshold k.
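
As an illustration, the following is a minimal Python sketch of the
thresholded variant. The function name levenshtein_bounded and its
interface are mine; they do not appear in the benchmark script.

def levenshtein_bounded(s, t, k):
    # Return the Levenshtein distance if it is <= k, otherwise None.
    # Only the cells within distance k of the main diagonal are filled,
    # i.e. at most (2k + 1) cells per row.
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None
    INF = k + 1  # any value > k acts as "unreachable"
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1]
            else:
                cur[j] = min(prev[j - 1], prev[j], cur[j - 1]) + 1
        prev = cur
    return prev[m] if prev[m] <= k else None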

What is less known, however, is that Ukkonen's optimization is also applicable to the general case where there is no threshold parameter.

Upper Bound Theorem

First let me introduce a theorem. Suppose we have a pair of strings S and T, whose lengths are n and m respectively with n ≤ m. We can guarantee that the Levenshtein distance between them cannot exceed m.

LevenshteinDistance(S, T) ≤ m

Here is a proof. Let's denote the i-th character of a string X by Xi. Using this notation, we can write S and T as follows.

S = S1 S2 S3 ... Sn-1 Sn
T = T1 T2 T3 ... Tm-1 Tm

Since n ≤ m, we can rewrite T as:

S = S1 S2 S3 ... Sn-1 Sn
T = T1 T2 T3 ... Tn-1 Tn ... Tm

It is obvious that we can always convert T to S using the following two operations.

  1. Substitute the sequence T1 ... Tn with S1 ... Sn.
  2. Delete the trailing sequence Tn+1 ... Tm.

The cost of this conversion is at most m, since it requires at most n substitutions and (m - n) deletions. Thus the theorem is proved. □
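
The theorem is also easy to check empirically. Here is a small script
of my own (it assumes wagner_fischer_O1() from the previous section)
that tests the bound on random inputs:

import random
import string

def check_upper_bound(trials=1000):
    # LevenshteinDistance(S, T) must never exceed max(n, m).
    for _ in range(trials):
        s = ''.join(random.choices(string.ascii_lowercase,
                                   k=random.randint(0, 12)))
        t = ''.join(random.choices(string.ascii_lowercase,
                                   k=random.randint(0, 12)))
        assert wagner_fischer_O1(s, t) <= max(len(s), len(t))

check_upper_bound()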

How to Reduce the Search Space

Let's use this theorem to optimize the algorithm. As a starting point, consider the cell in the upper-right corner of the matrix. For example, if S = 'abcd' and T = 'pqrs', this cell is the one marked below.

    p q r s
  . . . . *
a . . . . .
b . . . . .
c . . . . .
d . . . . .

The point is that there is only a single path that steps through this cell, i.e. the edit path that involves n insertions and m deletions. Here is what the path looks like.

    p q r s
  0 1 2 3 4
a . . . . 5
b . . . . 6
c . . . . 7
d . . . . 8

As you may notice, this is indeed the most expensive path to convert S into T, since it always costs (n + m).

Since we know that the edit distance between S and T cannot exceed m, there is no reason to consider a path that costs n + m. This means that the corner cell is irrelevant when computing the edit distance. In other words, we can safely skip the cell when filling the matrix, and we still get the correct answer.

The same argument can be applied to the other cells as well. For example, let's consider the cells adjacent to the corner cell.

    p q r s
  . . . * .
a . . . . *
b . . . . .
c . . . . .
d . . . . .

Although there are several edit paths that step through either or both of these cells, any such path contains at least the following edit operations.

(n - 1) insertions + (m - 1) deletions

Following the reasoning above, if the inequality m ≤ (m - 1) + (n - 1) holds, we can safely ignore these cells. Simplifying the inequality reduces it to 2 ≤ n. In other words, as long as the length of S is at least 2, we do not need to bother computing these cells.

Let's generalize this observation. Choose a cell C in the matrix and denote the Manhattan distance from it to the nearest corner cell (the upper-right one or the lower-left one) by Dc. Since any path passing through the cell requires at least (n - Dc) insertions and (m - Dc) deletions, we can ignore this cell if the condition m ≤ (n - Dc) + (m - Dc) holds.

In short, we can safely ignore any cell satisfying 2Dc ≤ n, and the answer will still be correct.
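
To make the rule concrete, the skip test can be written as a few lines
of Python. This helper (the name skipped is mine) is used again in the
sketch at the end of this section.

def skipped(i, j, n, m):
    # Manhattan distance from the cell (i, j) to the nearest of the
    # upper-right corner (0, m) and the lower-left corner (n, 0).
    dc = min(i + (m - j), (n - i) + j)
    return 2 * dc <= n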

Actual Example

Let's get back to the example where S = "abcd" and T = "pqrs". In this case, n = 4, so the following 12 cells satisfy the condition 2Dc ≤ n.

    p q r s
  . . * * *
a . . . * *
b * . . . *
c * * . . .
d * * * . .

So, in order to compute the edit distance, we can just focus on filling the remaining 13 cells. For example, to fill the cell marked Z below, we only need to consider the cell to the left (marked X) and the cell in the upper left (marked Y), since the cell directly above Z is one of the ignored cells.

    p q r s
  . Y * * *
a . X Z * *
b * . . . *
c * * . . .
d * * * . .

The following shows the Wagner-Fischer matrix after filling all the relevant cells. You can see that the edit distance between "abcd" and "pqrs" is computed correctly.

    p q r s
  0 1 * * *
a 1 1 2 * *
b * 2 2 3 *
c * * 3 3 4
d * * * 4 4

How Can I Implement It?

If you are looking for an example implementation, please refer to wagner_fischer_O2() contained in the benchmark script below.
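
For reference, the following is a minimal Python sketch of the same
idea (the actual wagner_fischer_O2() is written in C; this sketch keeps
the full matrix purely for clarity):

import math

def wagner_fischer_O2_sketch(s, t):
    if len(s) > len(t):
        s, t = t, s  # ensure n <= m
    n, m = len(s), len(t)
    if n == 0:
        return m

    def skipped(i, j):
        # 2Dc <= n, where Dc is the Manhattan distance to the nearest
        # of the upper-right and lower-left corners.
        return 2 * min(i + (m - j), (n - i) + j) <= n

    INF = math.inf
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        if not skipped(0, j):
            d[0][j] = j
    for i in range(1, n + 1):
        if not skipped(i, 0):
            d[i][0] = i
        for j in range(1, m + 1):
            if skipped(i, j):
                continue  # ignored cells stay at infinity
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j - 1],
                              d[i - 1][j],
                              d[i][j - 1]) + 1
    return d[n][m]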

4. Benchmark

Method

For benchmarking purposes, I implemented three versions of the algorithm to compute Levenshtein distance:

  1. Normal Wagner-Fischer
  2. Wagner-Fischer + Memory Reduction
  3. Wagner-Fischer + Memory Reduction + CPU Time Reduction

See the benchmark script benchmark.c for details on how I set up this performance test.
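
The original harness is written in C. As a rough stand-in, a Python
timing sketch (assuming wagner_fischer_O1() and wagner_fischer_O2_sketch()
defined earlier in this note) might look like this:

import random
import string
import timeit

def random_string(k):
    return ''.join(random.choices(string.ascii_lowercase, k=k))

pairs = [(random_string(64), random_string(64)) for _ in range(100)]

for func in (wagner_fischer_O1, wagner_fischer_O2_sketch):
    elapsed = timeit.timeit(
        lambda: [func(s, t) for s, t in pairs], number=10)
    print(func.__name__, elapsed)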

Result

The following numbers were obtained on an Intel Core i3-4010U (1.70GHz) with GCC 6.3.0.

5. License

I hereby place this document in the public domain. I abandon all my copyrights, including related and neighbouring rights, to this document and the accompanying benchmark program.

2018 Fujimoto Seiji <fujimoto@ceptord.net>

