TinyFastSS

Efficient Data Structure for Fuzzy String Search.

2014-12-08
Now the development moved to GitHub! This page is retained as a historical archive, and not reflect the current state of TinyFastSS library.

1.Abstract

FastSS is an extremely efficient data structure for approximate string search, invented by researchers at Zurich University. Given a set of words, it allows exhaustive retrieval of entries whose edit distance to query string is equal to or less than k. As shown in the original paper, the amount of time it requires to lookup is quite smaller than any other approaches.

TinyFastSS is a simple Python implementation of FastSSwC (variant of FastSS) with fixed bound k=2 , written in less than 80 lines.

2.Usage

2.1.Basic Usage

The tinyfss.py has a class named FastSS. First you need to instantiate the class, add some words with add() method, and generate the index with makeindex() method.

Example: Preparation

> from tinyfss import FastSS
> d = FastSS()
> d.add("tall")
> d.add("talker")
> d.add("maker")
> d.makeindex()

Then you can do a query with search() method. It returns a dictionary, of which key and value are edit distance (0, 1 and 2) and the list of words which have the correspond edit distance from the query.

Example: Searching
> d.search("maker")
{0: ['maker'], 1: [], 2: ['talker']}
> d.search("talk")
{0: [], 1: ['tall'], 2: ['talker']}
> d.search("darker")
{0: [], 1: [], 2: ['talker', 'maker']}

2.2.Advanced Usage

You can modify the stored data and index further with add(), remove() and makeindex() method. Also you can access them directly via words and index attributes.

Example: Removing / Viewing Data
> d.remove("maker")
> d.remove("talker")
> d.add("baker")
> d.makeindex()
> d.words
{'tall', 'baker'}
> len(d.index)
24

Note: Don't forget to call makeindex() after adding/removing words, since TinyFastSS doesn't support dynamic modification of index.

3.Download

Download from the link: tinyfss.py (Python 3.0 or later Recommended / Python 2 Compatible)

4.FAQ

4.1. What is FastSS / FastSSwC?

FastSS is an index-based data structure for fuzzy string search, which enables exhaustive search of similar words in a large word set. The characteristics of FastSS is the very efficient lookup, and rather large index (at least, 10-100 times bigger than original data in size).

FastSSwC is a variant of FastSS (there are three versions of FastSS: original FastSS, FastSSwC, and FastBlockSS). The whole point of FastSSwC (and FastBlockSS, too) is to reduce the index size by allowing the lookup process to consume additional time. In general, FastSSwC reduces the index size by half while making search process 1.5-2 times slower.

For more details, visit the official website.

4.2. How does it work?

The original paper explains it in details.

4.3. Does TinyFastSS store the data and index in-memory?

Yes, it does. If you find it difficult to store the whole data and index in memory, you should implement an associative array on disk using DBMS. (For the purpose, the dbm package would be handy)

5.Performance

With English dictionaries of varying sizes (derived from SCOWL), the amount of time it takes to generate index and query randomly-chosen 1000 words was measured.

(Test Machine - CPU: Celeron T3000 1.80GHz / Memory: 2GB)

Dictionary SizeTime to Generate Index (size of index)Time to Query 1000 Words
Small (12356 words)1.80sec (328,512 keys)0.727sec
Medium (48834 words)8.45sec (1,445,828 keys)1.23sec
Large (94169 words)17.7sec (3,060,453 keys)1.44sec

Note: In this test, tinyfss.py uses the compare() function of Fast String Comparator, instead of original editdist() function which is based on well-known Wagner–Fischer algorithm. This makes search process 5-10 times faster.

6.License

The TinyFastSS is released under the MIT License. See below for details.

Copyright (c) 2012 Fujimoto<fujimoto@writingarchives.sakura.ne.jp>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Top

Report to: Fujimoto <fujimoto@writingarchives.sakura.ne.jp>