Version 2018.07.16

*Answer: Yes.*

The purpose of this expository note is to explain several practical techniques for optimizing the Wagner-Fischer algorithm (the most common algorithm for computing Levenshtein distance). A simple benchmark test was carried out to evaluate these optimization techniques. The results suggest that these techniques, when combined, can reduce the computation time by 20-40% in typical cases.

A naive implementation of WF allocates a two-dimensional array of size (n + 1) x (m + 1), and hence requires O(nm) space. However, there is an easy fix that reduces the memory requirement to O(m).

1. Allocate a single array of length m + 1.
2. Initialize the array with the integers {0, 1, 2, ..., m}, which represent the first row of the WF matrix.
3. Compute the second row of the matrix, updating the buffer in place.
4. Repeat step 3 until you reach the last row.

In short, we compute the matrix row by row, using a buffer that holds the
contents of only a single row. On the surface, it might appear that two
array buffers are required for this task, but careful examination reveals
that we can perform the whole computation with *a single* array of size
(m + 1).

Here is how it works. Suppose we are trying to compute the Levenshtein distance between 'ab' and 'cd', and for this task, we have allocated an array of size 3. In this case, we need to fill the 3 x 3 matrix below:

```
    c d
  . . .    buf = [0,0,0]
a . . .
b . . .
```

Let's solve this matrix row by row. The first row is the easiest part because it's just an increasing integer sequence from 0 to m. So just fill the buffer with it:

```
    c d
  0 1 2    buf[0] := 0
a . . .    buf[1] := 1
b . . .    buf[2] := 2
```

The tricky part starts here. For the second row, the value of the leftmost cell is always 1, but we cannot simply assign it to buf[0] or we lose the information required to compute the cell to its right. So first we save buf[0] to a temporary variable and then update buf[0]:

```
    c d
  0 1 2    tmp := buf[0]
a 1 . .    buf[0] := 1
b . . .
```

Now we can trivially compute the middle cell of the second row using the following variables:

```
buf[0] = the cell to the left
buf[1] = the cell above
tmp    = the cell in the upper left
```

In other words:

```
<the value of the middle cell> = min(buf[0], buf[1], tmp) + 1 = 1
```

Of course, we must not assign the result to buf[1] in a hurry; Again, this makes the next cell incomputable. So we need to save the value of buf[1] to tmp first, and then update buf[1]:

```
    c d
  0 1 2    cost := min(buf[0], buf[1], tmp) + 1
a 1 1 .    tmp := buf[1]
b . . .    buf[1] := cost
```

We can subsequently compute the last cell of the row. In this case, the value is

```
<the value of the right cell> = min(buf[1], buf[2], tmp) + 1 = 2
```

So update buf[2] and now we finish the second row:

```
    c d
  0 1 2    cost := min(buf[1], buf[2], tmp) + 1
a 1 1 2    tmp := buf[2]
b . . .    buf[2] := cost
```

We can repeat the same procedure above to fill the third row, which leads to the following result:

```
    c d
  0 1 2    buf = [2,2,2]
a 1 1 2
b 2 2 2
```

Therefore, we can conclude that the Levenshtein distance between 'ab' and 'cd' is 2.

Below is a straightforward implementation of this memory-efficient algorithm in Python:

```python
def wagner_fischer_opt(s, t):
    n = len(s)
    m = len(t)
    buf = list(range(m + 1))
    for i in range(1, n + 1):
        tmp = buf[0]
        buf[0] = i
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                tmp, buf[j] = buf[j], tmp
            else:
                tmp, buf[j] = buf[j], min(buf[j], buf[j - 1], tmp) + 1
    return buf[m]
```

Please note that this technique not only makes the computation use less
space, but also *less time*. In general, this memory-optimized
implementation runs 5-10% faster than the naive O(nm)-space
implementation. You can confirm this in the benchmark results below.
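For reference, here is what the naive O(nm)-space baseline might look like in Python. This is an illustrative sketch of the textbook algorithm (the benchmark itself is written in C, and the function name here is mine):

```python
# Naive Wagner-Fischer: allocate the full (n + 1) x (m + 1) matrix.
# Shown for comparison with the O(m)-space version above.
def wagner_fischer_naive(s, t):
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # first column: 0, 1, ..., n
    for j in range(m + 1):
        d[0][j] = j                      # first row: 0, 1, ..., m
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[n][m]
```

The single-buffer version computes exactly the same values row by row; only the storage strategy differs.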

The fundamental observation of this section can be summarized as follows:

In general, we don't need to fill every cell in a Wagner-Fischer matrix to compute edit distance.

The rest of this section provides a theoretical base for the observation, and shows how we can apply it to actual problems.

First let me introduce a useful theorem. Suppose we have a pair of strings s and t, whose lengths are n and m respectively with n ≤ m. We can guarantee that the Levenshtein distance between them cannot exceed the length of the longer string t:

LevenshteinDistance(s, t) ≤ m

Here is a proof. Let's denote "i-th character of a string x" by "x_{i}".
Using this notation, we can represent the strings s and t as follows.

```
s = s_{1}s_{2}s_{3} ... s_{n-1}s_{n}
t = t_{1}t_{2}t_{3} ... t_{m-1}t_{m}
```

Since n ≤ m, we can rewrite the above expressions into:

```
s = s_{1}s_{2}s_{3} ... s_{n-1}s_{n}
t = t_{1}t_{2}t_{3} ... t_{n-1}t_{n} ... t_{m}
```

Now think about this edit operation:

- Substitute the sequence t_{1} ... t_{n} to s_{1} ... s_{n}, then
- delete the trailing sequence t_{n+1} ... t_{m}.

With this operation, we can always convert t into s. And since this procedure consists of m edit steps (= n substitutions + (m - n) deletions), it cannot cost more than m. This proves the theorem. □
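The theorem can also be checked numerically. The sketch below uses a small helper of my own (not part of the benchmark code) to verify that the distance never exceeds the length of the longer string:

```python
# Compact single-buffer Levenshtein DP, used only for this check.
def levenshtein(s, t):
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            # prev holds the upper-left cell, d[j] the cell above,
            # d[j - 1] the cell to the left.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (cs != ct))
    return d[-1]

# LevenshteinDistance(s, t) <= max(len(s), len(t)) for every pair.
for s, t in [("ab", "cd"), ("abcd", "pqrs"), ("kitten", "sitting"), ("", "xyz")]:
    assert levenshtein(s, t) <= max(len(s), len(t))
```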

Note: From another angle, this m-step edit operation can be visually described as a "diagonal" path in the WF matrix. For example, if s = "abcd" and t = "pqrs", this m-step operation takes the path below:

```
    p q r s
  o . . . .
a . o . . .
b . . o . .
c . . . o .
d . . . . o
```

This theorem, trivial as it may seem, is in fact quite effective for cutting the cost of the Wagner-Fischer algorithm.

As a first step, let's consider the cell in the upper-right corner of a WF matrix. For example, using the last example from the previous section, this cell is marked below:

```
    p q r s
  . . . . *
a . . . . .
b . . . . .
c . . . . .
d . . . . .
```

From the basic property of a WF matrix, we can immediately see that

- there is only one edit path that steps through this cell, and
- this edit path always costs (n + m) (= n insertions + m deletions).

Here is what the edit path looks like:

```
    p q r s
  0 1 2 3 4
a . . . . 5
b . . . . 6
c . . . . 7
d . . . . 8
```

Recalling the previous section, we already have an edit path that can
transform s into t with cost m *in the worst case*. So it does not make
sense to consider a path that is guaranteed to cost at least m *even in
the best case*. This means that we can safely ignore the corner cell
when filling the matrix.

```
    p q r s
  A B B B B    A := the "m-step" path
a . A . . B    B := the path passing through the corner cell
b . . A . B
c . . . A B    cost(A) ≤ m ≤ cost(B)
d . . . . A
```

The same argument can be applied to the other cells in the matrix as well. For example, consider the cells adjacent to the corner cell:

```
    p q r s
  . . . * .
a . . . . *
b . . . . .
c . . . . .
d . . . . .
```

Although there are several edit paths that step through one or both of these cells, all of them share the following edit operations:

(n - 1) insertions + (m - 1) deletions

Using the argument above, it follows that there is no reason to compute
these cells if m ≤ (m - 1) + (n - 1). By subtracting m from both sides,
we can eliminate the variable m from the inequality and reduce the
condition to *2 ≤ n*. In other words, unless the length of the shorter
string is less than 2, we do not need to bother computing these cells.

Let's generalize this observation. Choose a cell in a WF matrix and
denote the Manhattan distance from it to the upper-right corner by *k*.
Since any path passing through the chosen cell requires at least
(n - k) insertions + (m - k) deletions, we can ignore this cell if the
condition *m ≤ (n - k) + (m - k)* holds. By simplifying this
inequality, we can conclude that cells satisfying *2k ≤ n* can be
safely left blank.

Of course, we can trivially apply the same argument to cells in the bottom-left corner.
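The skippable region is easy to enumerate in code. In the sketch below, `skippable_cells` is an illustrative helper of my own (not part of the benchmark) that marks every cell whose Manhattan distance k to the upper-right or the bottom-left corner satisfies 2k ≤ n:

```python
# Enumerate cells of the (n + 1) x (m + 1) WF matrix that satisfy
# 2k <= n, where k is the Manhattan distance to the upper-right or
# the bottom-left corner.  Assumes n <= m (n = shorter string).
def skippable_cells(n, m):
    skipped = set()
    for i in range(n + 1):
        for j in range(m + 1):
            k_upper_right = i + (m - j)
            k_bottom_left = (n - i) + j
            if 2 * k_upper_right <= n or 2 * k_bottom_left <= n:
                skipped.add((i, j))
    return skipped

# For s = "abcd" and t = "pqrs" (n = m = 4), 12 cells can be skipped.
print(len(skippable_cells(4, 4)))  # → 12
```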

This section shows how to apply the observation in the previous section to an actual problem.

Back to the example where s = "abcd" and t = "pqrs". In this case, the
length of the shorter string is 4 (so n=4), thus the following 12 cells
satisfy the condition *2k ≤ n*.

```
    p q r s
  . . * * *
a . . . * *
b * . . . *
c * * . . .
d * * * . .
```

So, in order to compute the edit distance, we can just focus on filling the remaining 13 cells while leaving these marked cells blank. For example, to fill the cell marked "Z" below, we only need to consider the cell to the left (marked "X") and the cell in the upper-left (marked "Y").

```
    p q r s
  . Y * * *
a . X Z * *    Z = min(X, Y) + 1
b * . . . *
c * * . . .
d * * * . .
```

Below is the WF matrix after filling in the remaining cells. You can see that the edit distance between "abcd" and "pqrs" is computed correctly.

```
    p q r s
  0 1 * * *
a 1 1 2 * *
b * 2 2 3 *
c * * 3 3 4
d * * * 4 4
```
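The whole pruning procedure can be sketched as follows. This is a simplified full-matrix version written for clarity, with an illustrative function name; it is not the memory-optimized `wagner_fischer_O2()` used in the benchmark:

```python
# Wagner-Fischer with branch pruning: cells whose Manhattan distance k
# to the upper-right or bottom-left corner satisfies 2k <= n are left
# blank (kept at infinity).  Full-matrix version, for clarity only.
def wagner_fischer_pruned(s, t):
    if len(s) > len(t):
        s, t = t, s              # ensure n <= m
    n, m = len(s), len(t)
    if n == 0:
        return m                 # pruning assumes a non-empty shorter string
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            k_ur = i + (m - j)   # distance to the upper-right corner
            k_bl = (n - i) + j   # distance to the bottom-left corner
            if 2 * k_ur <= n or 2 * k_bl <= n:
                continue         # provably irrelevant cell: leave blank
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            else:
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,
                              d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
    return d[n][m]

print(wagner_fischer_pruned("abcd", "pqrs"))  # → 4
```

Blank neighbors are simply treated as infinity, so they never win the `min()`; the skipped cells provably cannot lie on an optimal path, which keeps the result exact.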

Although the gist of this algorithm is relatively easy to grasp, there
are, in fact, a number of pitfalls when actually implementing it. If you
are interested in the gory details, please look at
`wagner_fischer_O2()` in the benchmark script below.

For benchmarking purposes, I implemented three versions of the algorithm to compute Levenshtein distance:

- Normal Wagner-Fischer
- Wagner-Fischer + Memory Reduction
- Wagner-Fischer + Memory Reduction + Branch Pruning

See the benchmark script benchmark.c for details on how I set up this performance test.

The system used for this benchmark was:

- Intel Core i3-4010U (1.70GHz)
- Linux x86_64 (Debian 9.4)
- GCC 6.3.0

Here is the result:

| Implementation | TIME [s] | SPEED [calls/s] |
|---|---|---|
| Normal Wagner-Fischer | 6.382 | 657,245 |
| + (Memory Reduction) | 5.903 | 710,527 |
| + (Memory Reduction) + (Branch Pruning) | 3.993 | 1,050,380 |

The result shows that the most optimized version runs (1 - 3.993 / 6.382) ≈ 37.4% faster than the naive implementation.

There is a Python module named polyleven which implements the optimized algorithm.

I hereby place this document into the public domain. All my copyrights, including related and neighbouring rights, to this document and the accompanying benchmark program are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>