# TinyFastSS

Efficient Data Structure for Fuzzy String Search.

2014-12-08
Now the development moved to GitHub! This page is retained as a historical archive, and not reflect the current state of TinyFastSS library.

## 1.Abstract

FastSS is an extremely efficient data structure for approximate string search, invented by researchers at Zurich University. Given a set of words, it allows exhaustive retrieval of entries whose edit distance to query string is equal to or less than k. As shown in the original paper, the amount of time it requires to lookup is quite smaller than any other approaches.

TinyFastSS is a simple Python implementation of FastSSwC (variant of FastSS) with fixed bound k=2 , written in less than 80 lines.

## 2.Usage

#### 2.1.Basic Usage

The tinyfss.py has a class named FastSS. First you need to instantiate the class, add some words with add() method, and generate the index with makeindex() method.

Example: Preparation

> from tinyfss import FastSS
> d = FastSS()
> d.makeindex()

Then you can do a query with search() method. It returns a dictionary, of which key and value are edit distance (0, 1 and 2) and the list of words which have the correspond edit distance from the query.

Example: Searching
> d.search("maker")
{0: ['maker'], 1: [], 2: ['talker']}
> d.search("talk")
{0: [], 1: ['tall'], 2: ['talker']}
> d.search("darker")
{0: [], 1: [], 2: ['talker', 'maker']}

You can modify the stored data and index further with add(), remove() and makeindex() method. Also you can access them directly via words and index attributes.

Example: Removing / Viewing Data
> d.remove("maker")
> d.remove("talker")
> d.makeindex()
> d.words
{'tall', 'baker'}
> len(d.index)
24

Note: Don't forget to call makeindex() after adding/removing words, since TinyFastSS doesn't support dynamic modification of index.

## 4.FAQ

#### 4.1. What is FastSS / FastSSwC?

FastSS is an index-based data structure for fuzzy string search, which enables exhaustive search of similar words in a large word set. The characteristics of FastSS is the very efficient lookup, and rather large index (at least, 10-100 times bigger than original data in size).

FastSSwC is a variant of FastSS (there are three versions of FastSS: original FastSS, FastSSwC, and FastBlockSS). The whole point of FastSSwC (and FastBlockSS, too) is to reduce the index size by allowing the lookup process to consume additional time. In general, FastSSwC reduces the index size by half while making search process 1.5-2 times slower.

For more details, visit the official website.

#### 4.2. How does it work?

The original paper explains it in details.

#### 4.3. Does TinyFastSS store the data and index in-memory?

Yes, it does. If you find it difficult to store the whole data and index in memory, you should implement an associative array on disk using DBMS. (For the purpose, the dbm package would be handy)

## 5.Performance

With English dictionaries of varying sizes (derived from SCOWL), the amount of time it takes to generate index and query randomly-chosen 1000 words was measured.

(Test Machine - CPU: Celeron T3000 1.80GHz / Memory: 2GB)

Dictionary SizeTime to Generate Index (size of index)Time to Query 1000 Words
Small (12356 words)1.80sec (328,512 keys)0.727sec
Medium (48834 words)8.45sec (1,445,828 keys)1.23sec
Large (94169 words)17.7sec (3,060,453 keys)1.44sec

Note: In this test, tinyfss.py uses the compare() function of Fast String Comparator, instead of original editdist() function which is based on well-known Wagnerâ€“Fischer algorithm. This makes search process 5-10 times faster.