Data Fusion for Japanese Term and Character N-gram Search

Michiko Yasukawa, J. Shane Culpepper, Falk Scholer
2015 Proceedings of the 20th Australasian Document Computing Symposium - ADCS '15
Term segmentation plays a vital role in building effective information retrieval systems. In particular, languages such as Japanese and Chinese require a morphological analyzer or a word segmenter to identify potential terms. The alternative to indexing a segmented collection is n-gram search, where every n-length sequence of symbols is indexed. Both approaches have strengths and weaknesses when applied to non-English collections. In this study, we explore data fusion techniques to answer the following question: if there are multiple ranked lists of documents from both word and n-gram indexes, can we improve overall effectiveness by combining them? We consider three empirical methods for combining search results using eight different search indexes and twenty-one different search models, with and without automatic query expansion. Our approach is language independent; however, we focus on Japanese test collections (NTCIR IR4QA) as our testbed for the current experiments. Our experimental results demonstrate that the combination of the two different segmentation approaches has the potential to significantly outperform the best word-segmented search methods.
doi:10.1145/2838931.2838939 dblp:conf/adcs/YasukawaCS15 fatcat:a3bdwzns2rh45brzojnzvoiysu
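
The two ideas named in the abstract, character n-gram indexing and fusion of ranked lists, can be illustrated with a minimal sketch. The abstract does not say which three fusion methods the authors used, so the CombSUM rule with min-max score normalization shown here is an assumption chosen as a common baseline, not the paper's method. The function names (char_ngrams, min_max, comb_sum), the document identifiers, and the scores are all hypothetical.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return every overlapping n-length character sequence in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def min_max(scores):
    """Rescale one run's scores to [0, 1] so different runs are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse runs by summing normalized scores (CombSUM, an assumed baseline)."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max(run).items():
            fused[doc] += score
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Bigrams of a four-character Japanese string ("information retrieval").
print(char_ngrams("情報検索", 2))

# Hypothetical scores from a word-based index and an n-gram index.
word_run = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
ngram_run = {"d2": 9.0, "d3": 8.0, "d4": 5.0}
print(comb_sum([word_run, ngram_run]))
```

In this toy input, d3 ranks first after fusion because both indexes score it reasonably well, even though neither ranks it at the top on its own. That is the kind of effect a combination of word and n-gram results aims to exploit.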