Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System

Ethan Millar, Dan Shen, Junli Liu, Charles K. Nicholas
2000 Journal of Digital Information  
Information retrieval has become increasingly important as the amount of information of all kinds grows rapidly. However, few suitable systems are available. This paper presents several approaches that enable large-scale information retrieval in the TELLTALE system. TELLTALE is an information retrieval system that provides full-text search over text corpora in which documents may be garbled by OCR (Optical Character Recognition) or transmission errors, and that may contain ... e languages. Given a one-kilobyte query document, TELLTALE can find similar documents within a gigabyte of text data in 45 seconds on an ordinary PC-class machine. This remarkable performance is achieved by integrating new data structures and gamma compression into TELLTALE. The paper also compares several query methods, such as tf.idf and incremental similarity, with the original technique of centroid subtraction. The new similarity techniques give better execution-time performance, but at some cost in retrieval effectiveness.
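The abstract does not say how gamma compression is applied inside TELLTALE. As a minimal sketch, assuming it refers to Elias gamma coding of the gaps between sorted document ids in an index entry (a common use in retrieval systems), the idea can be illustrated as follows. The function names and the bit-string representation are hypothetical and chosen for readability, not taken from the paper.

```python
from __future__ import annotations


def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary digits, most significant bit first
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits: str) -> list[int]:
    """Decode a well-formed concatenation of gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_postings(doc_ids: list[int]) -> str:
    """Gamma-code the gaps of a strictly increasing list of ids (ids >= 1).

    Assumption: an index entry is a sorted list of document ids.
    """
    out = []
    prev = 0
    for doc_id in doc_ids:
        out.append(gamma_encode(doc_id - prev))
        prev = doc_id
    return "".join(out)


def decode_postings(bits: str) -> list[int]:
    """Rebuild the document ids by summing the decoded gaps."""
    ids = []
    total = 0
    for gap in gamma_decode_all(bits):
        total += gap
        ids.append(total)
    return ids
```

Small gaps get short codes (a gap of 1 costs a single bit), which is why this kind of coding suits long lists of closely spaced ids. A real implementation would pack the bits into bytes instead of using character strings.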
dblp:journals/jodi/MillarSLN00 fatcat:czg3oukz2vfwjl7nhjmw6rq6fa