Self-indexing Natural Language [chapter]

Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, Angeles S. Places, Eduardo Rodríguez
2008 Lecture Notes in Computer Science  
Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. Self-indexes represent a string in space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases.
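
To make the idea of "text as a string of words" concrete, the following is a minimal sketch, not the paper's method: it builds a plain, uncompressed suffix array over word identifiers, whereas the chapter uses a compressed self-index. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) and the rule for separating words from presentation (lowercasing, dropping punctuation and spacing) are assumptions made for illustration only.

```python
import re


def tokenize(text):
    # Assumed normalization: keep lowercase words as the searchable content
    # and drop separators (punctuation, spacing) as presentation.
    return re.findall(r"\w+", text.lower())


def build_word_suffix_array(text):
    words = tokenize(text)
    # The "alphabet" is the vocabulary: one integer symbol per distinct word.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction: sort word positions by the suffix starting there.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def find_phrase(phrase, vocab, ids, sa):
    query_words = tokenize(phrase)
    if not query_words or any(w not in vocab for w in query_words):
        return []
    q = [vocab[w] for w in query_words]
    m = len(q)

    # Lower bound: first suffix whose first m symbols are >= q.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < q:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m symbols are > q.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= q:
            lo = mid + 1
        else:
            hi = mid

    # Word positions (0-based) where the phrase starts, in text order.
    return sorted(sa[start:lo])
```

A phrase query is a single binary search over the suffix array regardless of how many words the phrase has, which is the intuition behind why a word-level index of this kind can be attractive for phrase search; an inverted index would instead intersect the position lists of each word.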
doi:10.1007/978-3-540-89097-3_13 fatcat:uk37adf4f5bfndj5c35z5a7c2e