Fujimoto Seiji
Version 2018.07.16

# Can we optimize the Wagner-Fischer algorithm?

Answer: Yes.

## 1. Overview

The purpose of this expository note is to explain several practical techniques for optimizing the Wagner-Fischer algorithm (the most common algorithm for computing Levenshtein distance). A simple benchmark was carried out to evaluate these optimization techniques. The results suggest that these techniques, when combined, can reduce the computation time by 20-40% in typical cases.

## 2. Memory Usage Reduction

### 2.1. Overview

A naive implementation of WF allocates a two-dimensional array of size (n + 1) x (m + 1), and hence requires O(nm) space. However, there is an easy fix that reduces the memory requirement to O(m).

1. Allocate a single array of length m+1.
2. Initialize the array buffer with a series of integers {0, 1, 2, ..., m}, which represents the first row of the WF matrix.
3. Compute the second row of the matrix, while updating the buffer incrementally.
4. Repeat step 3 until you reach the last row.

In short, we compute the matrix row by row, using a buffer that holds the contents of only a single row. On the surface, it might appear that two array buffers are required for this task, but careful examination reveals that the whole computation can be performed with a single array of size (m + 1).

### 2.2. Step-by-step Example

Here is how it works. Suppose we are trying to compute the Levenshtein distance between 'ab' and 'cd', and for this task, we have allocated an array of size 3. In this case, we need to fill the 3 x 3 matrix below:

```
    c d
  . . .   buf = [0,0,0]
a . . .
b . . .
```

Let's solve this matrix row by row. The first row is the easiest part because it's just an increasing integer sequence from 0 to m. So just fill the buffer with it:

```
    c d
  0 1 2   buf[0] := 0
a . . .   buf[1] := 1
b . . .   buf[2] := 2
```

The tricky part starts here. For the second row, the value of the leftmost cell is always 1, but we cannot simply assign it to buf[0] or we lose the information required to compute the cell to its right. So first we save buf[0] to a temporary variable and then update buf[0]:

```
    c d
  0 1 2   tmp := buf[0]
a 1 . .   buf[0] := 1
b . . .
```

Now we can trivially compute the middle cell of the second row using the following variables:

```
buf[0] = the cell to the left
buf[1] = the cell above
tmp    = the cell in the upper left
```

In other words:

```
<the value of the middle cell> = min(buf[0], buf[1], tmp) + 1 = 1
```

Of course, we must not assign the result to buf[1] right away; again, that would make the next cell impossible to compute. So we first save the value of buf[1] to tmp, and then update buf[1]:

```
    c d
  0 1 2   cost := min(buf[0], buf[1], tmp) + 1
a 1 1 .   tmp := buf[1]
b . . .   buf[1] := cost
```

Next, we can compute the last cell of the row. In this case, the value is

```
<the value of the right cell> = min(buf[1], buf[2], tmp) + 1 = 2
```

So update buf[2], and we have now finished the second row:

```
    c d
  0 1 2   cost := min(buf[1], buf[2], tmp) + 1
a 1 1 2   tmp := buf[2]
b . . .   buf[2] := cost
```

We can repeat the same procedure above to fill the third row, which leads to the following result:

```
    c d
  0 1 2   buf = [2,2,2]
a 1 1 2
b 2 2 2
```

Therefore, we can conclude that the Levenshtein distance between 'ab' and 'cd' is 2.
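
The row-by-row procedure above can be sketched directly in Python. The following tracing version (the function name and the print calls are mine, for illustration only) prints the buffer after each row, reproducing the states shown above:

```
def wagner_fischer_trace(s, t):
    m = len(t)
    buf = list(range(m + 1))       # first row: 0, 1, ..., m
    print(buf)
    for i in range(1, len(s) + 1):
        tmp, buf[0] = buf[0], i    # save the upper-left cell, set the leftmost cell
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                cost = tmp         # characters match: copy the diagonal value
            else:
                cost = min(buf[j], buf[j - 1], tmp) + 1
            tmp, buf[j] = buf[j], cost
        print(buf)
    return buf[m]

wagner_fischer_trace('ab', 'cd')   # prints [0, 1, 2], [1, 1, 2], [2, 2, 2]
```

Running it on 'ab' and 'cd' prints exactly the three buffer states walked through above and returns 2.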

### 2.3. Implementation in Python

Below is a straightforward implementation of this memory-efficient algorithm in Python:

```
def wagner_fischer_opt(s, t):
    n = len(s)
    m = len(t)
    buf = list(range(m + 1))

    for i in range(1, n + 1):
        tmp = buf[0]
        buf[0] = i

        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                tmp, buf[j] = buf[j], tmp
            else:
                tmp, buf[j] = buf[j], min(buf[j], buf[j - 1], tmp) + 1
    return buf[m]
```

Please note that this technique not only makes the computation use less space, but also less time. In general, the memory-optimized implementation runs 5-10% faster than the naive O(nm)-space implementation. You can confirm this in the benchmark results below.
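
For reference, here is a sketch of what such a naive O(nm)-space implementation looks like (a generic textbook version, not the exact code used in the benchmark):

```
def wagner_fischer_naive(s, t):
    n, m = len(s), len(t)
    # Full (n + 1) x (m + 1) matrix: O(nm) space.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1]) + 1
    return d[n][m]
```

It always returns the same distances as the memory-optimized version; only the allocation and memory access pattern differ.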

## 3. Branch Pruning

### 3.1. Overview

The fundamental observation of this section can be summarized as follows:

In general, we don't need to fill every cell in a Wagner-Fischer matrix to compute edit distance.

The rest of this section provides a theoretical basis for this observation, and shows how we can apply it to actual problems.

### 3.2. Upper bound Theorem

First, let me introduce a useful theorem. Suppose we have a pair of strings s and t whose lengths are n and m respectively, with n ≤ m. Then we can guarantee that the Levenshtein distance between them cannot exceed the length of the longer string t:

```
LevenshteinDistance(s, t) ≤ m
```

Here is a proof. Let's denote "i-th character of a string x" by "xi". Using this notation, we can represent the strings s and t as follows.

```
s = s1 s2 s3 ... sn-1 sn
t = t1 t2 t3 ... tm-1 tm
```

Since n ≤ m, we can rewrite the above expressions into:

```
s = s1 s2 s3 ... sn-1 sn
t = t1 t2 t3 ... tn-1 tn ... tm
```

Now consider the following edit operation:

- Substitute the sequence t1 ... tn with s1 ... sn, then
- delete the trailing sequence tn+1 ... tm.

With this operation, we can always convert t into s. Since this procedure consists of m edit steps (= n substitutions + (m - n) deletions), it cannot cost more than m. This proves the theorem. □

Note: From another angle, this m-step edit operation can be visually described as a "diagonal" path in the WF matrix. For example, if s = "abcd" and t = "pqrs", this m-step operation takes the path below:

```
    p q r s
  o . . . .
a . o . . .
b . . o . .
c . . . o .
d . . . . o
```

This theorem, trivial as it may seem, is in fact quite effective for cutting the cost of the Wagner-Fischer algorithm.
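
As a sanity check, the bound can be verified exhaustively for short strings. The sketch below (the helper `lev` is mine, implemented as a simple memoized recursion) confirms the theorem over a tiny alphabet:

```
from functools import lru_cache
from itertools import product

def lev(s, t):
    # Plain memoized recursion; adequate for the tiny strings below.
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,                           # deletion
                   d(i, j - 1) + 1,                           # insertion
                   d(i - 1, j - 1) + (s[i - 1] != t[j - 1]))  # substitution
    return d(len(s), len(t))

# Check LevenshteinDistance(s, t) <= max(len(s), len(t)) for every
# pair of strings of length <= 3 over the alphabet {'a', 'b'}.
words = [''.join(p) for k in range(4) for p in product('ab', repeat=k)]
assert all(lev(s, t) <= max(len(s), len(t)) for s in words for t in words)
```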

### 3.3. How to Reduce the Search Space

As a starter, let's consider the cell in the upper-right corner of a WF matrix. For example, using the last example in the previous section, this cell can be shown as below:

```
    p q r s
  . . . . *
a . . . . .
b . . . . .
c . . . . .
d . . . . .
```

From the basic property of a WF matrix, we can immediately see that

1. there is only one edit path that passes through this cell, and
2. this edit path always costs (n + m) (= n insertions + m deletions).

Here is what the edit path looks like:

```
    p q r s
  0 1 2 3 4
a . . . . 5
b . . . . 6
c . . . . 7
d . . . . 8
```

Recall from the previous section that we already have an edit path that can transform s into t at a cost of at most m. So it does not make sense to consider a path that is guaranteed to cost (n + m), which is no less than m, even in the best-case scenario. This means that we can safely ignore the corner cell when filling the matrix.

```
    p q r s
  A B B B B   A := the "m-step" path
a . A . . B   B := the path passing through the corner cell
b . . A . B
c . . . A B        cost(A) ≤ m ≤ cost(B)
d . . . . A
```

The same argument can be applied to the other cells in the matrix as well. For example, consider the cells adjacent to the corner cell:

```
    p q r s
  . . . * .
a . . . . *
b . . . . .
c . . . . .
d . . . . .
```

Although there are several edit paths that pass through one or both of these cells, all of them share the following edit sequence:

```
(n - 1) insertions + (m - 1) deletions
```

Using the argument above, it follows that there is no reason to compute these cells if m ≤ (m - 1) + (n - 1). By subtracting m from both sides, we can eliminate m from the inequality and reduce the condition to 2 ≤ n. In other words, unless the shorter string has fewer than 2 characters, we do not need to bother computing these cells.

Let's generalize this observation. Choose a cell in the WF matrix and let k denote the Manhattan distance from it to the upper-right corner cell. Since any path passing through the chosen cell requires at least (n - k) insertions + (m - k) deletions, we can ignore the cell if the condition m ≤ (n - k) + (m - k) holds. Simplifying this inequality, we conclude that cells satisfying 2k ≤ n can safely be left blank.

Of course, the same argument applies symmetrically to cells near the bottom-left corner.
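
The pruning condition can be checked mechanically. Here is a small illustrative helper (the function name is mine) that enumerates the cells of an (n + 1) x (m + 1) matrix satisfying 2k ≤ n for either corner, assuming n ≤ m:

```
def prunable_cells(n, m):
    # Cells (i, j) whose Manhattan distance k to the upper-right corner
    # (0, m) or to the bottom-left corner (n, 0) satisfies 2k <= n.
    cells = set()
    for i in range(n + 1):
        for j in range(m + 1):
            k_ur = i + (m - j)
            k_bl = (n - i) + j
            if 2 * k_ur <= n or 2 * k_bl <= n:
                cells.add((i, j))
    return cells

print(len(prunable_cells(4, 4)))   # → 12
```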

### 3.4. Actual Example

This section shows how to apply the observation in the previous section to an actual problem.

Back to the example where s = "abcd" and t = "pqrs". Here the length of the shorter string is 4 (so n = 4), and the following 12 cells satisfy the condition 2k ≤ n:

```
    p q r s
  . . * * *
a . . . * *
b * . . . *
c * * . . .
d * * * . .
```

So, in order to compute the edit distance, we can focus on filling the remaining 13 cells while leaving the marked cells blank. For example, to fill the cell marked "Z" below, we only need to consider the cell to its left (marked "X") and the cell to its upper left (marked "Y"):

```
    p q r s
  . Y * * *
a . X Z * *     Z = min(X, Y) + 1
b * . . . *
c * * . . .
d * * * . .
```

Below is the WF matrix after filling in the remaining cells. You can see that the edit distance between "abcd" and "pqrs" is computed correctly:

```
    p q r s
  0 1 * * *
a 1 1 2 * *
b * 2 2 3 *
c * * 3 3 4
d * * * 4 4
```

### 3.5. How Can We Implement This?

Although the gist of this algorithm is relatively easy to grasp, there are in fact a number of pitfalls when actually implementing it. If you are interested in the gory details, please look at `wagner_fischer_O2()` in the benchmark script below.
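
As a rough illustration of the idea (this is my own Python sketch, not the tuned C implementation from the benchmark), one straightforward approach is to skip every cell satisfying the pruning condition and treat it as unreachable:

```
def wagner_fischer_pruned(s, t):
    # Branch pruning sketch: a cell is skipped (left as INF) when its
    # Manhattan distance k to the upper-right corner (0, m) or to the
    # bottom-left corner (n, 0) satisfies 2k <= n.
    if len(s) > len(t):
        s, t = t, s                      # ensure n <= m
    n, m = len(s), len(t)
    if n == 0:
        return m
    INF = n + m + 1                      # larger than any possible distance
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if 2 * (i + m - j) <= n or 2 * (n - i + j) <= n:
                continue                 # pruned cell
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            elif s[i - 1] == t[j - 1]:
                d[i][j] = min(d[i - 1][j - 1],
                              d[i - 1][j] + 1,
                              d[i][j - 1] + 1)
            else:
                d[i][j] = min(d[i - 1][j - 1],
                              d[i - 1][j],
                              d[i][j - 1]) + 1
    return d[n][m]

wagner_fischer_pruned('abcd', 'pqrs')    # → 4, matching the matrix above
```

The sketch keeps the full three-way minimum even in the match case, so that INF values coming from pruned neighbors are handled uniformly.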

## 4. Benchmark

### 4.1. Method

For benchmarking purposes, I implemented three versions of the Levenshtein distance algorithm:

1. Normal Wagner-Fischer
2. Wagner-Fischer + Memory Reduction
3. Wagner-Fischer + Memory Reduction + Branch Pruning

See the benchmark script `benchmark.c` for the details of how I set up this performance test.

### 4.2. Result

The system used for this benchmark was:

- Intel Core i3-4010U (1.70GHz)
- Linux x86_64 (Debian 9.4)
- GCC 6.3.0

Here is the result:

|                                         | TIME [s] | SPEED [calls/s] |
|-----------------------------------------|---------:|----------------:|
| Normal Wagner-Fischer                   |    6.382 |         657,245 |
| + (Memory Reduction)                    |    5.903 |         710,527 |
| + (Memory Reduction) + (Branch Pruning) |    3.993 |       1,050,380 |

The result shows that the most optimized version runs (1 - 3.993 / 6.382) ≈ 37.4% faster than the naive implementation.

There is a Python module named `polyleven` which implements the optimized algorithm.

## 5. License

I hereby place this document in the public domain. All my copyrights, including related and neighbouring rights, to this document and the accompanying benchmark program are abandoned.

2018 Fujimoto Seiji <fujimoto@ceptord.net>