Lucene for n-grams using the CLUEWeb Collection

Gregory B. Newby, Christopher T. Fallen, Kylie McCormick
2009 Text Retrieval Conference  
The ARSC team modified the Apache Lucene engine to accommodate "go words" taken from the Google Gigaword vocabulary of n-grams. The Category "B" subset of the ClueWeb collection was indexed with a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams.
dblp:conf/trec/NewbyFM09 fatcat:ncth6q4xu5fcrp5wm7ykc7ojly
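
To illustrate the "go word" idea in the abstract above, here is a minimal sketch in plain Python. It is not the ARSC team's Lucene modification: the function names `ngrams` and `go_word_terms`, the whitespace tokenizer, the lowercasing, and the tiny vocabulary are all assumptions made for illustration. It shows one plausible reading of the approach, in which only n-grams that appear in an allow-list (the opposite of a stop-word list) are kept as index terms, grouped by n so that 1-, 2-, and 3-grams could be indexed separately.

```python
from typing import Dict, List, Set


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams of `tokens`, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def go_word_terms(text: str, go_words: Set[str], max_n: int = 3) -> Dict[int, List[str]]:
    """Keep only n-grams found in the go-word vocabulary, grouped by n.

    Assumption: tokens are lowercased and split on whitespace; a real
    analyzer would normalize text much more carefully.
    """
    tokens = text.lower().split()
    return {
        n: [gram for gram in ngrams(tokens, n) if gram in go_words]
        for n in range(1, max_n + 1)
    }


# Hypothetical go-word vocabulary; a real one would be loaded from an n-gram corpus.
GO_WORDS = {"text", "text retrieval", "text retrieval conference"}

if __name__ == "__main__":
    for n, terms in go_word_terms("The Text Retrieval Conference", GO_WORDS).items():
        print(n, terms)
```

Grouping the surviving terms by n mirrors the divide-and-conquer step only loosely: each group could be fed to its own index and the indexes combined at query time, but how the original system partitioned and merged its indexes is not stated in the abstract.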