Exploiting semantic annotations in math information retrieval

Petr Sojka
2012 Proceedings of the fifth workshop on Exploiting semantic annotations in information retrieval - ESAIR '12  
The design and architecture of MIaS (Math Indexer and Searcher), a system for mathematics retrieval, are presented, and design decisions are discussed. We argue for an approach that combines Presentation and Content MathML using a similarity of math subformulae, semantic annotations by Mathematical Subject Classification code expansions, statistical semantics keywords generated by topic modelling (LDA), and math corpus preprocessing to disambiguate the content and find the domain ... s. The whole system is being implemented as a math-aware search engine based on the state-of-the-art system Apache Lucene. Scalability issues were checked against more than 400,000 arXiv documents with 158 million mathematical formulae. Almost three billion MathML subformulae were indexed using a Solr-compatible Lucene.
I do not seek. I find. (Pablo Picasso)
doi:10.1145/2390148.2390157 dblp:conf/cikm/Sojka12 fatcat:ijtjrupujbdnleiq3ipp7moxd4