Wagner-Fischer vs Myers' Algorithm

Fujimoto Seiji <fujimoto@ceptord.net>

This article compares the two most common algorithms for computing Levenshtein distance: the Wagner-Fischer algorithm and Myers' bit-parallel algorithm.

The key result is that Myers' algorithm almost always outperforms the Wagner-Fischer algorithm; the only exception is very short inputs of length ≤ 6. The performance gap widens as the input strings become longer.

Introduction

Algorithm selection is a cornerstone of problem solving: which algorithm you apply can make a major difference in performance. Yet the empirical evidence on which algorithm should be applied, and when, is often limited.

To fill the gap, I wrote implementations of the Wagner-Fischer algorithm and Myers' bit-parallel algorithm, and tested both against the same data set.

The test data consists of random hex strings ranging from 2 to 64 characters. The computation time was measured by processing 1 million pairs of strings.
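For illustration, a timing harness of this kind might look like the following minimal sketch. This is not the actual benchmark.c: levenshtein() is a placeholder prototype for whichever implementation is under test (such as one of the sketches in the Implementation Notes below), and the string length is fixed at 16 for brevity.

    /* Sketch of a timing harness for random hex strings.
     * levenshtein() is a placeholder; link against the
     * implementation under test. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int levenshtein(const char *a, int m, const char *b, int n);

    /* Fill buf with `len` random hex digits plus a terminating NUL. */
    static void random_hex(char *buf, int len)
    {
        static const char digits[] = "0123456789abcdef";
        for (int i = 0; i < len; i++)
            buf[i] = digits[rand() % 16];
        buf[len] = '\0';
    }

    int main(void)
    {
        enum { PAIRS = 1000000, LEN = 16 };
        char a[LEN + 1], b[LEN + 1];
        srand(42);

        clock_t start = clock();
        for (int i = 0; i < PAIRS; i++) {
            random_hex(a, LEN);
            random_hex(b, LEN);
            levenshtein(a, LEN, b, LEN);
        }
        double sec = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%d pairs of length %d: %.3f sec\n", PAIRS, LEN, sec);
        return 0;
    }

Note that string generation is included in the measured time in this sketch; a more careful harness would pre-generate the pairs before starting the clock.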

Key Result

Figure 1 above shows the raw result data. Figure 2 below shows the same results on a log scale, to better illustrate the performance difference for shorter strings.

The "break-even" point seems to be six characters. Longer than that, Myers's algorithm works better than the Wagner-Fischer algorithm. Shorter than that, the converse holds true.

The numbers were measured on an Intel Xeon(R) E5-2660 (2.60GHz) with GCC 8.3.0.

You can download the benchmark program from benchmark.c and compile and run it as follows:

$ cc -o benchmark -O2 benchmark.c
$ ./benchmark

Implementation Notes

For the Wagner-Fischer algorithm, I applied Ukkonen's optimization to improve performance. My article "Can we optimize the Wagner-Fischer algorithm?" contains a detailed explanation of this technique. To my knowledge, this is the best generic implementation of the algorithm.
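For reference, the underlying dynamic program is the classic two-row formulation sketched below. The Ukkonen refinement keeps the same recurrence but skips cells that lie far from the main diagonal; this sketch omits the banding for clarity.

    #include <stdlib.h>

    /* Plain two-row Wagner-Fischer (no banding). Returns the
     * Levenshtein distance between a[0..m) and b[0..n). */
    int wagner_fischer(const char *a, int m, const char *b, int n)
    {
        int *prev = malloc((n + 1) * sizeof(int));
        int *curr = malloc((n + 1) * sizeof(int));

        for (int j = 0; j <= n; j++)
            prev[j] = j;                      /* row for the empty prefix of a */

        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int del = prev[j] + 1;        /* delete a[i-1] */
                int ins = curr[j - 1] + 1;    /* insert b[j-1] */
                int sub = prev[j - 1] + cost; /* substitute (or match) */
                int best = del < ins ? del : ins;
                curr[j] = best < sub ? best : sub;
            }
            int *tmp = prev; prev = curr; curr = tmp;
        }

        int dist = prev[n];
        free(prev);
        free(curr);
        return dist;
    }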

For Myers' bit-parallel algorithm, I applied an unpublished optimization technique to speed up the computation. In particular, the original paper [Myers 1999] required computing a lookup table entry for every letter of the alphabet (e.g. 52 entries for the ASCII letters, or 10,000+ for Unicode). This requirement is often impractical for real-world applications.

The basic idea of my improvement is that the lookup table can be pre-computed in O(n) time rather than O(|Σ|), where n is the length of the strings and |Σ| is the size of the alphabet. I found this technique adds a good speedup to the algorithm.
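My exact technique is not published, so the sketch below should not be read as that optimization. It only illustrates the general shape of the distance variant of Myers' algorithm for byte strings of up to 64 characters. Note that the Peq table receives nonzero entries only for the bytes that actually occur in the input string, which is the essence of building the table from the string itself rather than from the whole alphabet (for large alphabets, the zeroed 256-entry array would be replaced by a sparse map).

    #include <stdint.h>

    /* Distance variant of Myers' bit-parallel algorithm for byte
     * strings with m <= 64. Peq[c] marks the positions of byte c in
     * `a`; only bytes occurring in `a` get a nonzero entry. */
    int myers(const char *a, int m, const char *b, int n)
    {
        if (m == 0)
            return n;

        uint64_t Peq[256] = {0};
        for (int i = 0; i < m; i++)
            Peq[(unsigned char)a[i]] |= (uint64_t)1 << i;

        uint64_t Pv = ~(uint64_t)0;          /* positive vertical deltas */
        uint64_t Mv = 0;                     /* negative vertical deltas */
        uint64_t last = (uint64_t)1 << (m - 1);
        int score = m;

        for (int j = 0; j < n; j++) {
            uint64_t Eq = Peq[(unsigned char)b[j]];
            uint64_t Xv = Eq | Mv;
            uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
            uint64_t Ph = Mv | ~(Xh | Pv);
            uint64_t Mh = Pv & Xh;
            if (Ph & last) score++;
            if (Mh & last) score--;
            Ph = (Ph << 1) | 1;              /* boundary: D[0][j] grows by 1 */
            Pv = (Mh << 1) | ~(Xv | Ph);
            Mv = Ph & Xv;
        }
        return score;
    }

For strings longer than 64 characters, the bit vectors are split across multiple 64-bit words, as described in [Myers 1999].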

References

  1. Wagner, Robert A., and Michael J. Fischer. "The string-to-string correction problem." Journal of the ACM 21.1 (1974): 168-173.
    https://dl.acm.org/doi/pdf/10.1145/321796.321811
  2. Myers, Gene. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM 46.3 (1999): 395-415.
    https://core.ac.uk/download/pdf/189740935.pdf

License

I hereby place this article and the accompanying software into the public domain. All my copyrights, including related and neighbouring rights, are abandoned.

2020 Fujimoto Seiji <fujimoto@ceptord.net>