CEPTORD.NET                                               Fujimoto Seiji
                                                               July 2019
                                                     Updated: 2020-04-26

              Can We Optimize the Wagner-Fischer Algorithm?


Table of Contents

    1.  Introduction
    2.  How to Reduce Memory Usage
    3.  How to Reduce Computation Time
    4.  Benchmark
    5.  References
    6.  License

Answer: Yes.

1.  Introduction

    In this expository note, I explain several techniques to optimize
    the Wagner-Fischer algorithm.

    This note contains a small benchmark test program as well. The
    result suggests that it is possible to reduce the computation time
    by 20-40% in general cases.

2.  How to Reduce Memory Usage

    It is often claimed that the Wagner-Fischer algorithm requires
    O(mn) space. Since we need to handle an (n + 1) x (m + 1) matrix,
    it might seem obvious that the computation requires an O(mn)
    buffer.

    However, there is a rather simple technique to reduce the memory
    usage to O(m). Here is the outline:

      (1) Prepare an array of length (m + 1).

      (2) Fill the array with an increasing series {0, 1, 2, ..., m}.

      (3) Compute the second row of the matrix, while updating the array
          incrementally. 

      (4) Repeat step 3 until you reach the last row.
 
    Obviously the most important step is (3). If we can compute a row
    of the DP matrix using only an array of size O(m), we can simply
    repeat that step up to the last row. So how can we do that? The
    following explanation should clarify it.

2.1. Step-by-step Explanation

    Suppose we want to compute the Levenshtein distance between 'ab' and
    'cd'. In this case, we need to fill the matrix M of size 3 x 3 as
    shown below.

         c d
        . . .   buf := [0, 0, 0]
      a . . .
      b . . .

    Let's solve this matrix using the buffer.

    The first row is the easiest part because we just need to fill the
    buffer with numbers from 0 to m.

          c d   buf[0] := 0
        0 1 2   buf[1] := 1
      a . . .   buf[2] := 2
      b . . .

    Moving to the second row, we know that the value of the leftmost
    cell is 1. But the assignment should be done with care, because
    M[0][0] is required to compute the next cell M[1][1].

    So first we save buf[0] to a temporary variable and then update
    buf[0]:

          c d
        0 1 2   tmp := buf[0]
      a 1 . .   buf[0] := 1
      b . . .

    Now we can compute M[1][1] using the following values.

      buf[0] = the cell to the left
      buf[1] = the cell above
      tmp    = the cell in the upper left

    In other words, since 'a' and 'c' differ:

       buf[1] := min(buf[0], buf[1], tmp) + 1

    (Had the two characters matched, the new value would simply be
    tmp.)

    To avoid losing information required to compute M[1][2], we need to
    update the buffer in the following steps.

          c d
        0 1 2   val := min(buf[0], buf[1], tmp) + 1
      a 1 1 .   tmp := buf[1]
      b . . .   buf[1] := val

    In much the same way, we can compute the last cell of the row.

          c d
        0 1 2   val := min(buf[1], buf[2], tmp) + 1
      a 1 1 2   tmp := buf[2]
      b . . .   buf[2] := val

    We can just repeat the procedure to fill the third row. This leads
    us to the following result.

          c d
        0 1 2   buf = [2, 2, 2]
      a 1 1 2
      b 2 2 2

    So the distance between 'ab' and 'cd' turns out to be 2.

2.2. Implementation in Python

    The following is a straightforward implementation in Python.

      def wagner_fischer_O1(s, t):
          n = len(s)
          m = len(t)
          buf = list(range(m + 1))

          for i in range(1, n + 1):
              tmp = buf[0]
              buf[0] = i

              for j in range(1, m + 1):
                  if s[i - 1] == t[j - 1]:
                      tmp, buf[j] = buf[j], tmp
                  else:
                      val = min(buf[j], buf[j - 1], tmp) + 1
                      tmp, buf[j] = buf[j], val
          return buf[m]
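
    As a sanity check, here is the same function cross-checked against
    a naive full-matrix version (the helper wagner_fischer_full and
    its name are mine, added only for illustration):

```python
def wagner_fischer_O1(s, t):
    # Single-buffer version, identical to the one above.
    n, m = len(s), len(t)
    buf = list(range(m + 1))
    for i in range(1, n + 1):
        tmp, buf[0] = buf[0], i
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                tmp, buf[j] = buf[j], tmp
            else:
                val = min(buf[j], buf[j - 1], tmp) + 1
                tmp, buf[j] = buf[j], val
    return buf[m]

def wagner_fischer_full(s, t):
    # Naive (n + 1) x (m + 1) matrix version, used only as a reference.
    n, m = len(s), len(t)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        M[i][0] = i
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            M[i][j] = min(M[i - 1][j] + 1,
                          M[i][j - 1] + 1,
                          M[i - 1][j - 1] + cost)
    return M[n][m]

# The two implementations must agree on every input.
for s, t in [("ab", "cd"), ("kitten", "sitting"), ("", "abc"), ("abc", "abc")]:
    assert wagner_fischer_O1(s, t) == wagner_fischer_full(s, t)
```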

    An interesting aspect of this technique is that it often improves
    not only the memory usage but also the computation time (only a
    few percent improvement at best, though).

    See the benchmark result in 4.2 for more details.

3.  How to Reduce Computation Time

    This section explains "generalized" Ukkonen's optimization.

    Ukkonen's technique is well known: it states that if we do not
    care about edit distances larger than a threshold k, we only need
    to compute n x (2k + 1) cells at most.

    The less known fact is that we can apply this technique to general
    cases where there is no threshold parameter.

3.1. Upper Bound Theorem

    First let me introduce a theorem. Suppose we have two strings S and T,
    whose lengths are n and m respectively with n ≤ m.  We can guarantee
    that the Levenshtein distance cannot exceed m.

      LevenshteinDistance(S, T) ≤ m

    Here is a proof. Let's notate "i-th character of a string X" by "Xi".
    Using this notation, we can write S and T as follows.

      S = S1 S2 S3 ... Sn-1 Sn
      T = T1 T2 T3 ... Tm-1 Tm

    Since n ≤ m, we can rewrite T to:

      S = S1 S2 S3 ... Sn-1 Sn
      T = T1 T2 T3 ... Tn-1 Tn ... Tm

    Now we can see that we can convert T to S using the following steps.

      * Substitute T1 ... Tn with S1 ... Sn.
      * Delete Tn+1 ... Tm.

    The cost of performing this conversion is at most m, because it
    requires at most n substitutions and exactly (m - n) deletions.
    Thus the theorem is proved. □

3.2. Optimization

    Let's use this theorem to optimize the algorithm. For starters,
    let's focus on the cell in the upper-right corner of a matrix.

    For example, if S='abcd' and T='pqrs', this cell can be shown as
    below.

          p q r s
        . . . . *
      a . . . . .
      b . . . . .
      c . . . . .
      d . . . . .

    The thing worth noting here is that there is exactly one edit
    path that goes through this cell, namely the path consisting of
    n deletions + m insertions.

    Here is what the path looks like.

          p q r s
        0 1 2 3 4
      a . . . . 5
      b . . . . 6
      c . . . . 7
      d . . . . 8

    This is indeed the most expensive way to convert S to T. Since it
    needs (n + m) steps, the cost to follow the path is always (n + m).

    We know that the edit distance between S and T cannot exceed m. So
    there is no reason to consider a path that costs n + m. This means
    that the corner cell is irrelevant when computing the edit distance.
    We can always skip the cell and the final result is still correct.

    The same argument can be applied to the other cells as well. For
    example, let's consider the cells adjacent to the corner cell.

          p q r s
        . . . *
      a . . . . *
      b . . . . .
      c . . . . .
      d . . . . .

    Although there are several edit paths that step through either or
    both of them, such paths always contain at least the following
    steps:

      (n - 1) deletions + (m - 1) insertions

    Following the reasoning above, if the inequality condition m ≤
    (m - 1) + (n - 1) holds, we can safely ignore these cells. By
    simplifying the inequality condition, we get 2 ≤ n. In other
    words, unless the length of S is less than 2, we do not need to
    bother computing these cells.

    Let's generalize this observation. Choose a cell C in a matrix and
    denote the Manhattan distance from it to the nearest corner cell
    (the upper-right one or the lower-left one) by Dc.

    Since any path passing through the cell requires at least
    (n - Dc) deletions + (m - Dc) insertions, we can ignore this cell
    if the condition m ≤ (n - Dc) + (m - Dc) holds.

    In short, we can safely ignore the cells satisfying 2Dc ≤ n,
    and the answer is still correct.
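
    As a sketch, the rule can be written as a small predicate (the
    function name skippable is mine):

```python
def skippable(i, j, n, m):
    # Dc: Manhattan distance from cell (i, j) to the nearest of the
    # upper-right corner (0, m) and the lower-left corner (n, 0).
    dc = min(i + (m - j), (n - i) + j)
    # Any path through (i, j) costs at least (n - dc) + (m - dc), so
    # the cell is irrelevant whenever m <= (n - dc) + (m - dc),
    # i.e. whenever 2 * dc <= n.
    return 2 * dc <= n

# For a 5 x 5 matrix (n = m = 4), 12 of the 25 cells can be skipped.
skipped = sum(skippable(i, j, 4, 4) for i in range(5) for j in range(5))
```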

3.3. Actual Example

    Back to the example above where S = "abcd" and T = "pqrs". In this
    case n is 4, thus the following 12 cells satisfy 2Dc ≤ n.

          p q r s
        . . * * *
      a . . . * *
      b * . . . *
      c * * . . .
      d * * * . .

    So we need to fill only the remaining 13 cells to compute the
    distance. For example, to fill the cell marked Z below, we only
    need to
    consider the cell to the left (marked X) and the cell in the upper-
    left (marked Y).

          p q r s
        . Y * * *
      a . X Z * *
      b * . . . *
      c * * . . .
      d * * * . .

    The matrix below shows the result after flood-filling. You can see
    that the edit distance between "abcd" and "pqrs" is computed
    correctly.

          p q r s
        0 1 * * *
      a 1 1 2 * *
      b * 2 2 3 *
      c * * 3 3 4
      d * * * 4 4

3.4. How Can We Implement This?

    If you are looking for an example implementation, please refer to
    wagner_fischer_O2() contained in the benchmark script below.
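
    The benchmark script itself is written in C and is not reproduced
    in this note, so here is a hypothetical Python sketch of the same
    idea (the names are mine, not those of the actual script).
    Skipped cells are treated as infinity, and each remaining cell
    takes the full three-way minimum so that a missing diagonal
    neighbor never hides a valid cheaper path:

```python
def wagner_fischer_banded(s, t):
    # Sketch of Wagner-Fischer with the cells satisfying 2*Dc <= n
    # skipped, combined with the single-buffer (two-row) layout.
    if len(s) > len(t):
        s, t = t, s              # ensure n <= m, as the theorem assumes
    n, m = len(s), len(t)
    if n == 0:
        return m                 # degenerate case: every cell would be skipped
    INF = float("inf")

    def skippable(i, j):
        # Manhattan distance to the nearest expensive corner.
        return 2 * min(i + (m - j), (n - i) + j) <= n

    prev = [j if not skippable(0, j) else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if not skippable(i, 0):
            cur[0] = i
        for j in range(1, m + 1):
            if skippable(i, j):
                continue         # provably not on any optimal path
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete
                         cur[j - 1] + 1,       # insert
                         prev[j - 1] + cost)   # substitute (or keep)
        prev = cur
    return prev[m]
```

    A production version would compute the first and last relevant
    column of each row directly instead of testing every cell; that is
    where the actual speedup comes from.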

4.  Benchmark

4.1. Method

    I implemented three versions of the Wagner-Fischer algorithm.

      1. Wagner-Fischer (no optimization)
      2. Wagner-Fischer + Memory Reduction
      3. Wagner-Fischer + Memory Reduction + CPU Time Reduction

    See the benchmark script benchmark.c for details.

4.2. Result

    The following numbers are retrieved using Intel Core i3-4010U
    (1.70GHz) with GCC 6.3.0.

      Function                TIME[s]  SPEED[calls/s]
      wagner_fischer            6.382          657245
      wagner_fischer_O1         5.903          710527
      wagner_fischer_O2         3.993         1050380

    The following graph shows the result graphically.

    [Figure: Comparison of the three implementations of the
    Wagner-Fischer algorithm]

5.  References

    1. Wagner, Robert A., and Michael J. Fischer. "The string-to-
       string correction problem." Journal of the ACM 21.1 (1974):
       168-173.
       https://dl.acm.org/doi/pdf/10.1145/321796.321811

    2. Ukkonen, Esko. "Algorithms for approximate string matching."
       Information and control 64.1-3 (1985): 100-118.
       https://core.ac.uk/download/pdf/81135723.pdf

6.  License

    I hereby place this document into the public domain. All my
    copyrights, including related and neighboring rights, of this
    document and the accompanied program are abandoned.

    2019 Fujimoto Seiji <fujimoto@ceptord.net>
