Assigning document identifiers to enhance compressibility of Web Search Engines indexes

Fabrizio Silvestri, Raffaele Perego, Salvatore Orlando
2004 Proceedings of the 2004 ACM symposium on Applied computing - SAC '04  
Efficient access to the index is a key issue for the performance of Web Search Engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use Inverted File (IF) indexes in which posting lists are stored as sequences of d gaps (i.e., differences between successive document identifiers) compressed with variable-length encoding methods. This paper describes the use of a lightweight clustering algorithm that assigns identifiers to documents in a way that minimizes the average value of the d gaps. Simulations performed on a real dataset, the Google contest collection, show that our approach yields an IF index that is, depending on the d gap encoding chosen, up to 23% smaller than one built over randomly assigned document identifiers. Moreover, we show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
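
To illustrate why smaller d gaps compress better, here is a minimal sketch in Python. It is not the paper's algorithm or encoder: variable-byte coding is assumed here as one common variable-length scheme, and the function names (d_gaps, vbyte_encode, vbyte_decode) and the two sample posting lists are hypothetical, chosen only for illustration.

```python
def d_gaps(doc_ids):
    """Turn a strictly increasing posting list into d gaps.

    The first gap is the first document identifier itself.
    """
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def vbyte_encode(numbers):
    """Variable-byte encode positive integers (an assumed example scheme).

    Each byte carries 7 payload bits; the high bit marks the last byte
    of a number.
    """
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()          # most significant 7-bit group first
        chunk[-1] |= 0x80        # flag the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers = []
    n = 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


# Hypothetical posting lists for the same term under two ID assignments.
scattered = [3, 5000, 90000, 2000000]   # documents spread over the ID space
clustered = [1000, 1001, 1003, 1004]    # similar documents given nearby IDs

scattered_bytes = vbyte_encode(d_gaps(scattered))
clustered_bytes = vbyte_encode(d_gaps(clustered))
print(len(scattered_bytes), len(clustered_bytes))
```

Both lists hold four postings, but the list whose identifiers sit close together produces small gaps and therefore fewer encoded bytes. That is the effect an identifier assignment that minimizes average d gaps is meant to exploit.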
doi:10.1145/967900.968024 dblp:conf/sac/SilvestriPO04 fatcat:xyemltmfynaobay5xlpgzyllwe