Choosing the Right Bigrams for Information Retrieval [chapter]

Maojin Jiang, Eric Jensen, Steve Beitzel, Shlomo Argamon
2004 Classification, Clustering, and Data Mining Applications  
After more than 30 years of research in information retrieval, the dominant paradigm remains the "bag-of-words", in which query terms are considered independent of their co-occurrences with each other. Although there has been some work on incorporating phrases or other syntactic information into IR, such attempts have given modest and inconsistent improvements, at best. This paper is a first step toward investigating more deeply the question of using bigrams for information retrieval. Our results indicate that only certain kinds of bigrams are likely to aid retrieval. We used linear regression methods on data from TREC 6, 7, and 8 to identify which bigrams are able to help retrieval at all. Our characterization was then tested through retrieval experiments using our information retrieval engine, AIRE, which implements many standard ranking functions and retrieval utilities.
doi:10.1007/978-3-642-17103-1_50 fatcat:i5p5uksdk5dyvdrjvh4kp4o2nu