TinyFastSS – An Implementatin of FastSS

Author:

Fujimoto Seiji

Published:

2012-03-01

Copyright:

This document has been placed in the public domain.

1. Introduction

FastSS is a data structure for approximate string matching that allows you to retrieve a set of similar words very quickly. FastSS was invented by researchers at Zurich University in 2007 (See [Bocek2007] for more details).

TinyFastSS is a simple Python implementation of FastSS, written in less than 300 LoC.

2. Installation

You can install TinyFastSS from PyPI:

$ pip install TinyFastSS

The source code is available on https://github.com/fujimotos/TinyFastSS

3. Usage

3.1. Basic Usage

Create an index from your word sets.

import fastss

with fastss.open('fastss.dat') as index:
    index.add("test")
    index.add("text")

Perform a similarity search using index.query().

with fastss.open('fastss.dat') as index:
    print(index.query('test'))

3.2. Use TinyFastSS from Command Line

You also can use TinyFastSS from console. Here is an example:

$ head -n 3 /usr/share/dict/words
aardvark
abacus
aerial
$ python -m fastss -c index.dat /usr/share/dict/words
$ python -m fastss -q index.dat adaptive
{"0": ["adaptive"], "1": ["adoptive"], "2": ["additive"]}

4. Benchmark

4.1. Dataset

I benchmarked TinyFastSS using SCOWL v2015-08-24 (english-50).

  • The word set contained 98,986 words.

  • The test machine was Intel Core i3-4010U (1.70GHz) with 4GB mem.

4.2. Index Creation Perrmance

It took 3 minutes to create an index file:

$ time python -m fastss -c fastss.dat dictonary.txt
3m0.71s real     2m44.35s user     0m16.43s system

The resulting index was 161 MB in size.

4.3. Query Perrmance

I tested the query performance using randomly choosen words.

WORD

TIME [msec]

SPEED [query/sec]

sterner

7.57

132.1

spotlighted

2.43

411.5

burn

11.70

85.4

nirvana

1.16

862.0

conveyor

1.99

502.5

It took ~5ms on average to perform a query.