TinyFastSS – An Implementatin of FastSS¶

Author:: Fujimoto Seiji
Published:: 2012-03-01
Copyright:: This document has been placed in the public domain.

1. Introduction¶

FastSS is a data structure for approximate string matching that allows you to retrieve a set of similar words very quickly. FastSS was invented by researchers at Zurich University in 2007 (See [Bocek2007] for more details).

TinyFastSS is a simple Python implementation of FastSS, written in less than 300 LoC.

2. Installation¶

You can install TinyFastSS from PyPI:

$ pip install TinyFastSS

The source code is available on https://github.com/fujimotos/TinyFastSS

3. Usage¶

3.1. Basic Usage¶

Create an index from your word sets.

import fastss

with fastss.open('fastss.dat') as index:
    index.add("test")
    index.add("text")

Perform a similarity search using index.query().

with fastss.open('fastss.dat') as index:
    print(index.query('test'))

3.2. Use TinyFastSS from Command Line¶

You also can use TinyFastSS from console. Here is an example:

$ head -n 3 /usr/share/dict/words
aardvark
abacus
aerial
$ python -m fastss -c index.dat /usr/share/dict/words
$ python -m fastss -q index.dat adaptive
{"0": ["adaptive"], "1": ["adoptive"], "2": ["additive"]}

4. Benchmark¶

4.1. Dataset¶

I benchmarked TinyFastSS using SCOWL v2015-08-24 (english-50).

The word set contained 98,986 words.
The test machine was Intel Core i3-4010U (1.70GHz) with 4GB mem.

4.2. Index Creation Perrmance¶

It took 3 minutes to create an index file:

$ time python -m fastss -c fastss.dat dictonary.txt
3m0.71s real     2m44.35s user     0m16.43s system

The resulting index was 161 MB in size.

4.3. Query Perrmance¶

I tested the query performance using randomly choosen words.

WORD	TIME [msec]	SPEED [query/sec]
sterner	7.57	132.1
spotlighted	2.43	411.5
burn	11.70	85.4
nirvana	1.16	862.0
conveyor	1.99	502.5

It took ~5ms on average to perform a query.