Modeling and solving term mismatch for full-text retrieval

Le Zhao
2012 SIGIR Forum  
Even though modern retrieval systems typically use a multitude of features to rank documents, the backbone of search ranking is usually a standard tf.idf retrieval model. This thesis addresses a limitation of the fundamental retrieval models, the term mismatch problem, which occurs when query terms fail to appear in the documents that are relevant to the query. The term mismatch problem is a long-standing problem in information retrieval. However, it was not well understood how often term mismatch happens in retrieval, how important it is for retrieval, or how it affects retrieval performance. This thesis answers the above questions and proposes principled solutions to address this limitation. The new understanding of the retrieval models will benefit their users, as well as inform the development of software applications built on top of them. This new direction of research is enabled by the formal definition of the probability of term mismatch, and quantitative data analyses around it. In this thesis, term mismatch is defined as the probability of a term not appearing in a document that is relevant to the query. The complement of term mismatch is term recall, the probability of a term appearing in relevant documents. Even though the term recall probability is known to be a fundamental quantity in the theory of probabilistic information retrieval, prior research in ad hoc retrieval provided few clues about how to estimate term recall reliably.

[...] word expansion that may use the same set of high-quality manual expansion terms. Promising problems for future research are identified, together with research areas where the term mismatch research may make an impact.

Acknowledgments

Jamie Callan, my thesis advisor, has constantly contributed ideas to the research, has given me the freedom to explore, and has provided full support for this research even at the beginning, when it did not yet seem promising and was not well aligned with the original plan of working on structured retrieval. Looking back on my last 6 years at Carnegie Mellon, Jamie provided plentiful and careful guidance at the beginning, but gradually gave me more and more freedom to try new ideas. Jamie has given me the opportunity to work on a large number of tasks related to my interest in core retrieval modeling.
Jamie's advice and encouragement always came at the appropriate moment and place and in the right amount, keeping me busy and focused. Without Jamie, this journey would have seemed endless; it would have been very easy to get lost, and it would not have been fun. I am very glad that all our efforts were not wasted. A part of the work has turned into this dissertation, and the rest prepared me well for my future adventures. Up to the writing of this document, anonymous reviewers from 4 venues (conferences and NSF) contributed many helpful suggestions and comments. My officemates Ni Lao and Frank Lin contributed through constant office discussions. Interactions with many other people, either face to face or through email, have made the work better in various ways; they include
doi:10.1145/2422256.2422277 fatcat:iboh56u5kvdrhcnt4uqyorwvp4