Permutation Indexing: Fast Approximate Retrieval from Large Corpora
Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques:
- partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and
- a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids.