The purpose of this expository note is to explain several practical techniques for optimizing the Wagner-Fischer algorithm (the most common algorithm for computing Levenshtein distance). A simple benchmark was carried out to evaluate these optimization techniques. The results suggest that these techniques, when combined, can reduce the computation time by 20-40% in common cases.
A naive implementation of Wagner-Fischer allocates a two-dimensional array of size (n + 1) x (m + 1), and hence requires O(nm) space. However, there is an easy fix that reduces the memory requirement to O(m).
In short, we compute the matrix row by row, using a buffer that keeps only the contents of a single row. On the surface, it might appear that two array buffers are required for this task, but careful examination reveals that the whole computation can be performed with a single array of size (m + 1).
Here is how it works. Suppose we are trying to compute the Levenshtein distance between 'ab' and 'cd', and for this task, we have allocated an array of size 3. In this case, we need to fill the 3 x 3 matrix below:
      c  d
   .  .  .     buf = [0,0,0]
a  .  .  .
b  .  .  .
Let's solve this matrix row by row. The first row is the easiest part because it's just an increasing integer sequence from 0 to m. So just fill the buffer with it:
      c  d
   0  1  2     buf[0] := 0
a  .  .  .     buf[1] := 1
b  .  .  .     buf[2] := 2
The tricky part starts here. For the second row, the value of the leftmost cell is always 1, but we cannot simply assign it to buf[0], or we lose the information required to compute the cell to its right. So we first save buf[0] to a temporary variable and then update buf[0]:
      c  d
   0  1  2     tmp := buf[0]
a  1  .  .     buf[0] := 1
b  .  .  .
Now we can trivially compute the middle cell of the second row using the following variables:
buf[0] = the cell to the left
buf[1] = the cell above
tmp    = the cell in the upper left
In other words:
<the value of the middle cell> = min(buf[0], buf[1], tmp) + 1 = 1
Of course, we must not assign the result to buf[1] in a hurry; again, this would make the next cell incomputable. So we need to save the old value of buf[1] to tmp first, and then update buf[1]:
      c  d
   0  1  2     cost := min(buf[0], buf[1], tmp) + 1
a  1  1  .     tmp := buf[1]
b  .  .  .     buf[1] := cost
We can subsequently compute the last cell of the row. In this case, the value is
<the value of the right cell> = min(buf[1], buf[2], tmp) + 1 = 2.
So update buf[2], and now we have finished the second row:
      c  d
   0  1  2     cost := min(buf[1], buf[2], tmp) + 1
a  1  1  2     tmp := buf[2]
b  .  .  .     buf[2] := cost
We can repeat the same procedure above to fill the third row, which leads to the following result:
      c  d
   0  1  2     buf = [2,2,2]
a  1  1  2
b  2  2  2
Therefore, we can conclude that the Levenshtein distance between 'ab' and 'cd' is 2.
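The buffer states above can be replayed mechanically. Here is a small sketch (trace_rows is a name I made up for illustration) that performs exactly the single-buffer update described in this section; it also handles matching characters via a cost term, which the 'ab' vs. 'cd' example never exercises:

```python
def trace_rows(s, t):
    # Replay the worked example: fill the WF matrix row by row with a
    # single buffer, recording the buffer contents after each row.
    m = len(t)
    buf = list(range(m + 1))              # first row: 0, 1, ..., m
    rows = [buf[:]]
    for i, sc in enumerate(s, start=1):
        tmp, buf[0] = buf[0], i           # save the upper-left cell, set buf[0]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1   # 0 when the characters match
            tmp, buf[j] = buf[j], min(buf[j] + 1, buf[j - 1] + 1, tmp + cost)
        rows.append(buf[:])
    return rows

print(trace_rows("ab", "cd"))  # → [[0, 1, 2], [1, 1, 2], [2, 2, 2]]
```

The three printed rows match the three matrix snapshots shown above.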
Below is a straightforward implementation of this memory-efficient algorithm in Python:
def wagner_fischer_opt(s, t):
    n = len(s)
    m = len(t)
    buf = list(range(m + 1))
    for i in range(1, n + 1):
        tmp = buf[0]
        buf[0] = i
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                tmp, buf[j] = buf[j], tmp
            else:
                tmp, buf[j] = buf[j], min(buf[j], buf[j - 1], tmp) + 1
    return buf[m]
Please note that this technique not only makes the computation use less space, but also makes it use less time. In general, this memory-optimized implementation runs 5-10% faster than the naive O(nm)-space implementation. You can confirm this in the benchmark result below.
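For reference, here is what such a naive O(nm)-space baseline looks like; this is my own sketch of the textbook algorithm, not code taken from the benchmark script:

```python
def wagner_fischer_naive(s, t):
    # Textbook Wagner-Fischer: allocate the full (n + 1) x (m + 1) matrix.
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # first column: 0, 1, ..., n
    for j in range(m + 1):
        d[0][j] = j          # first row: 0, 1, ..., m
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion (cell above)
                          d[i][j - 1] + 1,         # insertion (cell to the left)
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m]
```

It returns the same distances as the single-buffer version (e.g. wagner_fischer_naive('ab', 'cd') == 2) while allocating all n + 1 rows instead of one.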
The fundamental observation of this section can be summarized as follows:
In general, we don't need to fill every cell in a Wagner-Fischer matrix to compute edit distance.
The rest of this section provides a theoretical basis for the observation, and shows how we can apply it to actual problems.
First let me introduce a useful theorem. Suppose we have a pair of strings s and t, whose lengths are n and m respectively with n ≤ m. We can guarantee that the Levenshtein distance between them cannot exceed the length of the longer string t:
LevenshteinDistance(s, t) ≤ m
Here is a proof. Let's denote "i-th character of a string x" by "xi". Using this notation, we can represent the strings s and t as follows.
s = s1 s2 s3 ... sn-1 sn
t = t1 t2 t3 ... tm-1 tm
Since n ≤ m, we can rewrite the above expressions into:
s = s1 s2 s3 ... sn-1 sn
t = t1 t2 t3 ... tn-1 tn ... tm
Now think about this edit operation: first substitute t1 t2 ... tn with s1 s2 ... sn, character by character, and then delete the trailing characters tn+1 ... tm.
With this operation, we can always convert t into s. And since this procedure consists of m edit steps (= n substitutions + (m - n) deletions), it cannot cost more than m. So the theorem has been proved. □
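This bound is easy to check empirically. The sketch below (my own test harness, not part of the note's benchmark) computes the distance for random string pairs with a plain single-buffer Wagner-Fischer and asserts that it never exceeds the longer length:

```python
import random

def levenshtein(s, t):
    # Single-buffer Wagner-Fischer, as described earlier in this note.
    n, m = len(s), len(t)
    buf = list(range(m + 1))
    for i in range(1, n + 1):
        tmp, buf[0] = buf[0], i
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            tmp, buf[j] = buf[j], min(buf[j] + 1, buf[j - 1] + 1, tmp + cost)
    return buf[m]

random.seed(0)
for _ in range(1000):
    s = "".join(random.choices("ab", k=random.randint(0, 8)))
    t = "".join(random.choices("ab", k=random.randint(0, 8)))
    # The theorem: LevenshteinDistance(s, t) <= length of the longer string.
    assert levenshtein(s, t) <= max(len(s), len(t))
```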
Note: From another angle, this m-step edit operation can be visually described as a "diagonal" path in the WF matrix. For example, if s = "abcd" and t = "pqrs", this m-step operation takes the path below:
      p  q  r  s
   o  .  .  .  .
a  .  o  .  .  .
b  .  .  o  .  .
c  .  .  .  o  .
d  .  .  .  .  o
This theorem, trivial as it may seem, is in fact quite effective for cutting the cost of the Wagner-Fischer algorithm.
As a starter, let's consider the cell in the upper-right corner of a WF matrix. For example, using the last example in the previous section, this cell can be shown as below:
      p  q  r  s
   .  .  .  .  *
a  .  .  .  .  .
b  .  .  .  .  .
c  .  .  .  .  .
d  .  .  .  .  .
From the basic properties of a WF matrix, we can immediately see that the value of this corner cell is always m: the first row of the matrix is just the sequence 0, 1, ..., m. Moreover, only one edit path passes through this cell, namely the one consisting of m insertions followed by n deletions, so its cost is exactly n + m. Here is what that edit path looks like:
      p  q  r  s
   0  1  2  3  4
a  .  .  .  .  5
b  .  .  .  .  6
c  .  .  .  .  7
d  .  .  .  .  8
Recalling the previous section, we already have an edit path that transforms s into t at a cost of at most m. So it makes no sense to consider a path that is guaranteed to cost at least m even in the best-case scenario. This means that we can safely ignore the corner cell when filling the matrix.
      p  q  r  s
   A  B  B  B  B     A := the "m-step" path
a  .  A  .  .  B     B := the path passing through the corner cell
b  .  .  A  .  B
c  .  .  .  A  B     cost(A) ≤ m ≤ cost(B)
d  .  .  .  .  A
The same argument can be applied to the other cells in the matrix as well. For example, consider the cells adjacent to the corner cell:
      p  q  r  s
   .  .  .  *
a  .  .  .  .  *
b  .  .  .  .  .
c  .  .  .  .  .
d  .  .  .  .  .
Although there are several edit paths that step through one or two of these cells, all of them share the following edit sequences:
(n - 1) insertions + (m - 1) deletions
Using the argument above, it follows that there is no reason to compute these cells if m ≤ (m - 1) + (n - 1). By subtracting m from both sides, we can remove the variable m from the inequality and reduce the condition to 2 ≤ n. In other words, unless the length of the shorter string is less than 2, we do not need to bother computing these cells.
Let's generalize this observation. Choose a cell in a WF matrix and denote the Manhattan distance from it to the cell in the upper-right corner by k. Since any path passing through the chosen cell requires at least (n - k) insertions + (m - k) deletions, we can ignore this cell if the condition m ≤ (n - k) + (m - k) holds. By simplifying this inequality, we can conclude that cells satisfying 2k ≤ n can be safely left blank.
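The condition is easy to turn into code. The helper below (prunable_cells is a hypothetical name for this sketch) enumerates every cell whose Manhattan distance k to the upper-right or bottom-left corner satisfies 2k ≤ n:

```python
def prunable_cells(n, m):
    # Cells (i, j) of the (n + 1) x (m + 1) WF matrix that can be left blank.
    cells = set()
    for i in range(n + 1):
        for j in range(m + 1):
            k_ur = i + (m - j)      # Manhattan distance to the upper-right corner (0, m)
            k_bl = (n - i) + j      # Manhattan distance to the bottom-left corner (n, 0)
            if 2 * k_ur <= n or 2 * k_bl <= n:
                cells.add((i, j))
    return cells

# For s = "abcd", t = "pqrs" (n = m = 4) this yields 12 skippable cells.
print(len(prunable_cells(4, 4)))  # → 12
```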
Of course, we can trivially apply the same argument to the cells near the bottom-left corner.
This section shows how to apply the observation in the previous section to an actual problem.
Back to the example where s = "abcd" and t = "pqrs". In this case, the length of the shorter string is 4 (so n=4), thus the following 12 cells satisfy the condition 2k ≤ n.
      p  q  r  s
   .  .  *  *  *
a  .  .  .  *  *
b  *  .  .  .  *
c  *  *  .  .  .
d  *  *  *  .  .
So, in order to compute the edit distance, we can just focus on filling the remaining 13 cells while leaving these marked cells blank. For example, to fill the cell marked "Z" below, we only need to consider the cell to the left (marked "X") and the cell in the upper-left (marked "Y").
      p  q  r  s
   .  Y  *  *  *
a  .  X  Z  *  *     Z = min(X, Y) + 1
b  *  .  .  .  *
c  *  *  .  .  .
d  *  *  *  .  .
Below is the WF matrix after flood-filling. You can see that the edit distance between "abcd" and "pqrs" is computed correctly.
      p  q  r  s
   0  1  *  *  *
a  1  1  2  *  *
b  *  2  2  3  *
c  *  *  3  3  4
d  *  *  *  4  4
Although the gist of this algorithm is relatively easy to grasp, there are, in fact, a number of pitfalls when actually implementing it. If you are interested in the gory details, please look at wagner_fischer_O2() in the benchmark script below.
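The C implementation is not reproduced here, but the pruning idea can be sketched in Python. The function below is my own illustrative version, not the note's wagner_fischer_O2(): it simply treats every cell satisfying the 2k ≤ n condition as +infinity, so pruned cells never win a min(). A real implementation would instead iterate only over the surviving band of each row; this sketch favors clarity over speed.

```python
INF = float("inf")

def levenshtein_pruned(s, t):
    # Keep s as the shorter string, matching the n <= m convention in the text.
    if len(s) > len(t):
        s, t = t, s
    n, m = len(s), len(t)
    if n == 0:
        # With n = 0 the 2k <= n test would prune the answer cell itself,
        # so handle the empty-string case directly.
        return m

    def prunable(i, j):
        k_ur = i + (m - j)      # Manhattan distance to the upper-right corner
        k_bl = (n - i) + j      # Manhattan distance to the bottom-left corner
        return 2 * k_ur <= n or 2 * k_bl <= n

    # Row 0, with pruned cells set to infinity.
    prev = [INF if prunable(0, j) else j for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if not prunable(i, 0):
            cur[0] = i
        for j in range(1, m + 1):
            if prunable(i, j):
                continue  # leave the cell "blank" (infinite)
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion (cell above)
                         cur[j - 1] + 1,      # insertion (cell to the left)
                         prev[j - 1] + cost)  # substitution / match (upper-left)
        prev = cur
    return prev[m]
```

On the running example, levenshtein_pruned("abcd", "pqrs") returns 4, matching the flood-filled matrix shown earlier.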
For benchmarking purposes, I implemented three versions of the algorithm for computing Levenshtein distance:

1. Normal Wagner-Fischer: the naive implementation with O(nm) space.
2. + (Memory Reduction): the single-buffer implementation described above.
3. + (Memory Reduction) + (Branch Pruning): the single-buffer implementation that also skips prunable cells.
See the benchmark script benchmark.c for details on how I set up this performance test.
The system used for this benchmark was:
Here is the result:
|Implementation|Time (sec)|Ops per sec|
|Normal Wagner-Fischer|6.382|657,245|
|+ (Memory Reduction)|5.903|710,527|
|+ (Memory Reduction) + (Branch Pruning)|3.993|1,050,380|
The result shows that the most optimized version runs (1 - 3.993 / 6.382) ≈ 37.4% faster than the naive implementation.
There is a Python module named polyleven which implements the optimized algorithm.
I hereby place this document into the public domain. All my copyrights, including related and neighbouring rights, of this document and the accompanying benchmark program are abandoned.
2018 Fujimoto Seiji <firstname.lastname@example.org>