Cross-Language High Similarity Search: Why No Sub-linear Time Bound Can Be Expected [chapter]

Maik Anderka, Benno Stein, Martin Potthast
2010 Lecture Notes in Computer Science  
This paper contributes to an important variant of cross-language information retrieval, called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve highly similar documents with respect to q. Use cases for this task include cross-language plagiarism detection and translation search. The current line of research in cross-language high similarity search resorts to the comparison of q and the documents in D in a multilingual concept space, which, however, requires a linear scan of D. Monolingual high similarity search can be tackled in sub-linear time, either by fingerprinting or by "brute force n-gram indexing", as it is done by Web search engines. We argue that neither fingerprinting nor brute force n-gram indexing can be applied to tackle cross-language high similarity search, and that a linear scan is inevitable. Our findings are based on theoretical and empirical insights.
C. Gurrin et al. (Eds.): Advances in Information Retrieval
doi:10.1007/978-3-642-12275-0_66 fatcat:pqn4y4dzbbb4dgbj5yisdj6wgq