Version 2019.06.02

TinyFastSS

An efficient data structure for Levenshtein Distance

FastSS is a data structure for approximate string matching that allows you to retrieve a set of "similar" words very quickly. FastSS was invented by researchers at Zurich University in 2007.

TinyFastSS is a simple Python implementation of FastSS, written in less than 300 LoC. The source code is available on GitHub.

Usage

Create an index file.

import fastss

with fastss.open('fastss.dat') as index:
    index.add("test")
    index.add("text")

Perform a fuzzy string search.

import fastss

with fastss.open('fastss.dat') as index:
    print(index.query('test'))

Command-Line Usage

Also you can use TinyFastSS from the command line. Here is the minimum example.

$ cat dictionary.txt | head -n 3
aardvark
abacus
aerial
$ python -m fastss -c index.dat dictionary.txt
$ python -m fastss -q index.dat adaptive
{"0": ["adaptive"], "1": ["adoptive"], "2": ["additive"]}

Performance

I used a dictionary retrieved from SCOWL v2015-08-24 (english-50) to measure the performance of TinyFastSS. The dictionary file contained 98,986 words and was 909 KB in disk size.

The machine used in this test was Intel Core i3-4010U (1.70GHz) with 4GB mem.

Index creation

Roughly it took 3 minutes to complete the index creation for 100k words.

$ time python -m fastss -c fastss.dat dictonary.txt
3m0.71s real     2m44.35s user     0m16.43s system

The size of the resulting index file was 161 MB.

Query

I choose a handful of words and measure the time it takes to lookup using the following command.

$ python -m timeit -s 'import fastss; index=fastss.open("fastss.dat")' 'index.query("sterner")'
100 loops, best of 3: 7.67 msec per loop

Here is the summary of the result.

#           TIME [msec]  SPEED [query/sec]
sterner            7.57              132.1
spotlighted        2.43              411.5
burn              11.70               85.4
nirvana            1.16              862.0
conveyor           1.99              502.5
----------- -----------  -----------------
AVERAGE            4.97              201.2

So it took about 5ms to perform a single query (in other words, 200 queries per sec).

License

I hereby place my work TinyFastSS into the public domain. All my copyrights, including related and neighbouring rights, of this program and the accompanied documents are abandoned.

2012-2019 Fujimoto Seiji <fujimoto@ceptord.net>


Back to ceptord.net