The purpose of this expository note is to explain several practical techniques for optimizing the Wagner-Fischer algorithm, the most popular algorithm for computing Levenshtein distance.
In this note, I also carry out a simple benchmark test to evaluate these techniques. The results suggest that the computation time can be reduced by 20-40% in general cases.
It is often said that Wagner-Fischer is an algorithm that uses O(mn) space. In other words, the memory usage grows quadratically with the length of the input strings. This common explanation comes from the fact that an array of size (m + 1) x (n + 1) is required to hold the entire edit distance matrix.
However, there is a trivial improvement that reduces the space requirement from O(nm) to O(m). Here is how it works.
A more detailed explanation follows below.
An interesting thing to note is that this technique also reduces the computation time, since the optimization makes the code slightly simpler. In general, the memory-optimized implementation runs a couple of percent faster than the naive O(nm) implementation. You can confirm this effect by running the benchmark script below.
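For reference, the naive O(nm) implementation mentioned above can be sketched as follows (the function name wagner_fischer_naive is mine, chosen for illustration; it is not taken from the benchmark script):

```python
def wagner_fischer_naive(s, t):
    """Textbook Wagner-Fischer: fills the whole (n+1) x (m+1) matrix."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                        # delete i characters of s
    for j in range(m + 1):
        d[0][j] = j                        # insert j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j],       # deletion
                              d[i][j - 1],       # insertion
                              d[i - 1][j - 1]    # substitution
                              ) + 1
    return d[n][m]
```

For example, wagner_fischer_naive("ab", "cd") returns 2, matching the worked example below.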
Suppose you are trying to compute the Levenshtein distance between 'ab' and 'cd'. In this case, you need to fill a 3 x 3 matrix (shown below). Now the task is to solve this matrix using a single array of length 3.
           c  d
        .  .  .        buf := [0, 0, 0]
     a  .  .  .
     b  .  .  .
Let's solve the matrix row by row. The first row is the easiest part because it's just an increasing integer sequence from 0 to m. So just fill the array with the values.
           c  d
        0  1  2        buf[0] := 0
     a  .  .  .        buf[1] := 1
     b  .  .  .        buf[2] := 2
The tricky part starts here. For the second row, the value of the leftmost cell is always 1, but we cannot simply assign it to buf[0], or we lose the information required to compute the cell to its right. So first we save buf[0] to a temporary variable and then update buf[0]:
           c  d
        0  1  2        tmp := buf[0]
     a  1  .  .        buf[0] := 1
     b  .  .  .
Now we can trivially compute the middle cell of the second row.
    buf[0] = the cell to the left
    buf[1] = the cell above
    tmp    = the cell in the upper left
In other words:
    <the value of the middle cell> = min(buf[0], buf[1], tmp) + 1
                                   = min(1, 1, 0) + 1
                                   = 1
Of course, we must not assign the result to buf[1] in a hurry; again, that would make the next cell incomputable. So we need to save the value of buf[1] to tmp first, and then update buf[1]:
           c  d
        0  1  2        cost := min(buf[0], buf[1], tmp) + 1
     a  1  1  .        tmp := buf[1]
     b  .  .  .        buf[1] := cost
We can subsequently compute the last cell of the row. In this case, the value is
    <the value of the right cell> = min(buf[1], buf[2], tmp) + 1
                                  = min(1, 2, 1) + 1
                                  = 2
So update buf[2] with this value, and we have finished the second row.
           c  d
        0  1  2        cost := min(buf[1], buf[2], tmp) + 1
     a  1  1  2        tmp := buf[2]
     b  .  .  .        buf[2] := cost
We can repeat the same procedure to fill the third row, which leads to the following result.
           c  d
        0  1  2        buf = [2, 2, 2]
     a  1  1  2
     b  2  2  2
So the distance between 'ab' and 'cd' turns out to be 2.
The following is a straightforward implementation in Python.
    def wagner_fischer_O1(s, t):
        n = len(s)
        m = len(t)
        buf = list(range(m + 1))
        for i in range(1, n + 1):
            tmp = buf[0]       # save the upper-left value before overwriting
            buf[0] = i         # the leftmost cell of the current row
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    tmp, buf[j] = buf[j], tmp
                else:
                    tmp, buf[j] = buf[j], min(buf[j], buf[j - 1], tmp) + 1
        return buf[m]
This section explains (so-called) Ukkonen's optimization. It is very well known that we only need to fill at most n x (2k + 1) cells to compute the edit distance, if we do not care about distances greater than a threshold k.
A less-known fact, however, is that Ukkonen's optimization is also applicable to the general case where there is no threshold parameter.
First let me introduce a theorem. Suppose we have a pair of strings S and T, whose lengths are n and m respectively with n ≤ m. We can guarantee that the Levenshtein distance between them cannot exceed m.
LevenshteinDistance(S, T) ≤ m
Here is a proof. Let's denote the i-th character of a string X by Xi. Using this notation, we can write S and T as follows.
S = S1 S2 S3 ... Sn-1 Sn T = T1 T2 T3 ... Tm-1 Tm
Since n ≤ m, we can rewrite T to:
S = S1 S2 S3 ... Sn-1 Sn T = T1 T2 T3 ... Tn-1 Tn ... Tm
It is obvious that we can always convert T into S using the following operations:

 1. Substitute T1 T2 ... Tn with S1 S2 ... Sn.
 2. Delete Tn+1 ... Tm.

The total cost is at most m, since step 1 requires at most n substitutions and step 2 requires (m - n) deletions. Thus the theorem is proved. □
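To make the proof concrete, the conversion can be spelled out in code (convert_to is a hypothetical helper written for this note, not part of the benchmark script):

```python
def convert_to(t, s):
    """Convert T into S by substituting the first n characters and
    deleting the rest, assuming len(s) <= len(t).  Returns the result
    and the number of edit operations used."""
    n, m = len(s), len(t)
    assert n <= m
    out = list(t)
    subs = 0
    for i in range(n):
        if out[i] != s[i]:     # substitute Ti with Si when they differ
            out[i] = s[i]
            subs += 1
    del out[n:]                # delete Tn+1 ... Tm
    deletions = m - n
    return "".join(out), subs + deletions
```

Since subs ≤ n, the total number of operations is at most n + (m - n) = m, which is exactly the bound stated by the theorem.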
Let's use this theorem to optimize the algorithm. As a starter, consider the cell in the upper-right corner of a matrix. For example, if S = 'abcd' and T = 'pqrs', this cell is marked below.
           p  q  r  s
        .  .  .  .  *
     a  .  .  .  .  .
     b  .  .  .  .  .
     c  .  .  .  .  .
     d  .  .  .  .  .
The point is that there is only a single path that steps through this cell, i.e. the edit path that involves n insertions and m deletions. Here is what the path looks like.
           p  q  r  s
        0  1  2  3  4
     a  .  .  .  .  5
     b  .  .  .  .  6
     c  .  .  .  .  7
     d  .  .  .  .  8
As you may notice, this is indeed the most expensive path to convert S into T, since it always costs (n + m).
Since we know that the edit distance between S and T cannot exceed m, there is no reason to consider a path that costs n + m. This means that the corner cell is irrelevant when computing the edit distance. In other words, we can safely skip the cell when filling the matrix, and we still get the correct answer.
The same argument can be applied to the other cells as well. For example, let's consider the cells adjacent to the corner cell.
           p  q  r  s
        .  .  .  *
     a  .  .  .  .  *
     b  .  .  .  .  .
     c  .  .  .  .  .
     d  .  .  .  .  .
Although there are several edit paths that pass through either or both of these cells, such paths always contain the following edit sequence.
(n - 1) insertions + (m - 1) deletions
Following the reasoning above, if the inequality m ≤ (m - 1) + (n - 1) holds, we can safely ignore these cells. Simplifying the inequality reduces it to 2 ≤ n. In other words, unless the length of S is less than 2, we do not need to bother computing these cells.
Let's generalize this observation. Choose a cell C in the matrix and denote by Dc the Manhattan distance from C to the nearest corner cell (the upper-right one or the lower-left one). Since any path passing through C requires (n - Dc) insertions and (m - Dc) deletions, we can ignore C if the condition m ≤ (n - Dc) + (m - Dc) holds.
In short, we can safely ignore the cells satisfying 2Dc ≤ n, and the answer will still be correct.
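As a sanity check, the set of ignorable cells can be enumerated directly from this condition (skippable_cells is a helper invented for this note):

```python
def skippable_cells(n, m):
    """Cells (i, j) of the (n+1) x (m+1) matrix satisfying 2*Dc <= n,
    where Dc is the Manhattan distance to the nearer of the upper-right
    corner (0, m) and the lower-left corner (n, 0)."""
    cells = set()
    for i in range(n + 1):
        for j in range(m + 1):
            dc = min(i + (m - j), (n - i) + j)
            if 2 * dc <= n:
                cells.add((i, j))
    return cells
```

For n = m = 4 this yields 12 cells, including the upper-right corner (0, 4) but not any cell on the main diagonal.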
Back to the example where S = "abcd" and T = "pqrs". In this case n = 4, thus the following 12 cells satisfy the condition 2Dc ≤ n.
           p  q  r  s
        .  .  *  *  *
     a  .  .  .  *  *
     b  *  .  .  .  *
     c  *  *  .  .  .
     d  *  *  *  .  .
So, in order to compute the edit distance, we can just focus on filling the remaining 13 cells. For example, to fill the cell marked Z below, we only need to consider the cell to the left (marked X) and the cell in the upper-left (marked Y), because the cell directly above Z is one of the skipped cells.
           p  q  r  s
        .  Y  *  *  *
     a  .  X  Z  *  *
     b  *  .  .  .  *
     c  *  *  .  .  .
     d  *  *  *  .  .
The Wagner-Fischer matrix after filling the remaining cells is shown below. You can see that the edit distance between "abcd" and "pqrs" is computed correctly.
           p  q  r  s
        0  1  *  *  *
     a  1  1  2  *  *
     b  *  2  2  3  *
     c  *  *  3  3  4
     d  *  *  *  4  4
If you are looking for an example implementation, please refer to wagner_fischer_O2() contained in the benchmark script below.
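The benchmark script itself is written in C, but the band-limited idea can also be sketched in Python along the following lines (this is my own sketch under the assumptions above, not the actual wagner_fischer_O2() from the script):

```python
def wagner_fischer_O2(s, t):
    """Band-limited Wagner-Fischer sketch: cells with 2*Dc <= n are
    never filled, where Dc is the Manhattan distance to the nearer of
    the upper-right and lower-left corners."""
    if len(s) > len(t):
        s, t = t, s                     # ensure n <= m, as the theorem assumes
    n, m = len(s), len(t)
    if n == 0:
        return m                        # otherwise the skip rule would swallow the answer cell
    INF = n + m + 1                     # placeholder for skipped cells

    def skipped(i, j):
        dc = min(i + (m - j), (n - i) + j)
        return 2 * dc <= n

    prev = [j if not skipped(0, j) else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if not skipped(i, 0):
            cur[0] = i
        for j in range(1, m + 1):
            if skipped(i, j):
                continue
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1]    # the diagonal neighbor is always inside the band
            else:
                cur[j] = min(prev[j], cur[j - 1], prev[j - 1]) + 1
        prev = cur
    return prev[m]
```

Note that a cell and its upper-left neighbor have the same Dc, so the diagonal value used in the match case is always available inside the band.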
For benchmarking purposes, I implemented three versions of the algorithm to compute Levenshtein distance:
See the benchmark script benchmark.c for details on how I set up this perf test.
The following numbers were obtained using an Intel Core i3-4010U (1.70 GHz) with GCC 6.3.0.
I hereby place this document in the public domain. All my copyrights, including related and neighbouring rights, in this document and the accompanying benchmark program are abandoned.
2018 Fujimoto Seiji <firstname.lastname@example.org>