CEPTORD.NET                                               Fujimoto Seiji
                                                              March 2012
                                                     Updated: 2020-05-31

                               TinyFastSS


Table of Contents

    1.  Introduction
    2.  Installation
    3.  Usage
    4.  Benchmark
    5.  References
    6.  License

1.  Introduction

    FastSS is a data structure for approximate string matching that
    allows you to retrieve a set of "similar" words very quickly.
    FastSS was invented by researchers at Zurich University in 2007.

    http://fastss.csg.uzh.ch/

    TinyFastSS is a simple Python implementation of FastSS, written in
    less than 300 LoC.

2.  Installation

    You can install TinyFastSS from PyPI.

    $ pip install TinyFastSS

    The source code is available on GitHub.

3.  Usage

3.1. Basic Usage

    Create an index from your word sets.

        import fastss

        with fastss.open('fastss.dat') as index:
            index.add("test")
            index.add("text")

    Perform a similarity search using index.query().

        with fastss.open('fastss.dat') as index:
            print(index.query('test'))

3.2. Use TinyFastSS from Command Line

    You also can use TinyFastSS from console. Here is an example:

    $ head -n 3 /usr/share/dict/words
    aardvark
    abacus
    aerial
    $ python -m fastss -c index.dat /usr/share/dict/words
    $ python -m fastss -q index.dat adaptive
    {"0": ["adaptive"], "1": ["adoptive"], "2": ["additive"]}

4.  Benchmark

4.1. Dataset

    I benchmarked TinyFastSS against SCOWL v2015-08-24 (english-50). The
    word set contained 98,986 words (909 KB).

    http://wordlist.aspell.net/

    The machine used was Intel Core i3-4010U (1.70GHz) with 4GB mem.

4.2. Index Creation Perrmance

    Roughly it took 3 minutes to create an index file (161 MB).

    $ time python -m fastss -c fastss.dat dictonary.txt
    3m0.71s real     2m44.35s user     0m16.43s system

4.3. Query Perrmance

    I tested the query performance using randomly choosen words.

    #           TIME [msec]  SPEED [query/sec]
    sterner            7.57              132.1
    spotlighted        2.43              411.5
    burn              11.70               85.4
    nirvana            1.16              862.0
    conveyor           1.99              502.5
    ----------- -----------  -----------------
    AVERAGE            4.97              201.2

    It took about ~5ms on average to perform a single query.

5.  References

    1. Bocek, Thomas, et al. Fast similarity search in large
       dictionaries. 2007.
       https://fastss.csg.uzh.ch/ifi-2007.02.pdf

6.  License

    I hereby place my work TinyFastSS into the public domain.
    All my copyrights, including related and neighbouring rights,
    of this program and the accompanied documents are abandoned.

    2012-2020 Fujimoto Seiji <fujimoto@ceptord.net>

                              Back to index page