Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

David Talbot, Miles Osborne
2007 Conference on Empirical Methods in Natural Language Processing  
A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds, but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF; here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.
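
As a rough illustration of the two properties the abstract relies on, the following minimal Python sketch stores n-grams in a Bloom filter and then uses sub-n-gram checks to reject some false positives. It is not the authors' implementation: the class `BloomFilter`, the helper `consistent_member`, the SHA-256-based hashing, and the parameter values are all assumptions made for this example. The sub-n-gram check reflects one simple inequality of the kind the abstract alludes to, namely that an n-gram cannot occur in a corpus more often than any of its contiguous sub-sequences, so if a sub-n-gram is reported absent the full n-gram must be absent as well.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter; not the structure used in the paper."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        # (An assumption for this sketch; any family of hash functions would do.)
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # One-sided error: an added item is always reported present;
        # an item that was never added may occasionally be reported present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def all_ngrams(tokens, max_order):
    """Return every n-gram of order 1..max_order as a space-joined string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_order + 1)
        for i in range(len(tokens) - n + 1)
    ]


def consistent_member(bf, tokens):
    """Report an n-gram as present only if its two (n-1)-gram
    sub-sequences are also reported present.

    Assumes every sub-n-gram of each stored n-gram was also stored.
    """
    if " ".join(tokens) not in bf:
        return False
    if len(tokens) > 1:
        prefix, suffix = tokens[:-1], tokens[1:]
        return " ".join(prefix) in bf and " ".join(suffix) in bf
    return True


sentence = "the cat sat on the mat".split()
stored = all_ngrams(sentence, max_order=3)

bf = BloomFilter(num_bits=4096, num_hashes=3)
for ngram in stored:
    bf.add(ngram)
```

Because the filter never yields false negatives, every stored n-gram passes both `in bf` and `consistent_member`. A query for an unseen n-gram can still be a false positive, but it must now collide on the full n-gram and on both sub-n-grams, which is less likely than a single collision.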
dblp:conf/emnlp/TalbotO07 fatcat:l7dt4hvvw5bu3acxycu3iqi3hq