Searching web data: An entity retrieval and high-performance indexing model

Renaud Delbru, Stephane Campinas, Giovanni Tummarello
2012, Journal of Web Semantics (Elsevier BV)
More and more (semi) structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large scale systems providing effective means of searching and retrieving this semi-structured information with the ultimate goal of making it exploitable by humans and machines alike. This article examines the ... t from the traditional web document model to a web data object (entity) model and studies the challenges faced in implementing a scalable and high performance system for searching semi-structured data objects over a large heterogeneous and decentralised infrastructure. Towards this goal, we define an entity retrieval model, develop novel methodologies for supporting this model and show how to achieve a high-performance entity retrieval system. We introduce an indexing methodology for semi-structured data which offers a good compromise between query expressiveness, query processing and index maintenance compared to other approaches. We address high-performance by optimisation of the index data structure using appropriate compression techniques. Finally, we demonstrate that the resulting system can index billions of data objects and provides keyword-based as well as more advanced search interfaces for retrieving relevant data objects in sub-second time. This work has been part of the Sindice search engine project at the Digital Enterprise Research Institute (DERI), NUI Galway. The Sindice system currently maintains more than 200 million pages downloaded from the Web and is being used actively by many researchers within and outside of DERI.
doi:10.1016/j.websem.2011.04.004 (https://doi.org/10.1016/j.websem.2011.04.004), fatcat:ts4tyui34nf7ldub2l7tcavdpy (https://fatcat.wiki/release/ts4tyui34nf7ldub2l7tcavdpy)
Fulltext PDF (Web Archive): https://web.archive.org/web/20160929022452/http://www.websemanticsjournal.org:80/index.php/ps/article/download/223/220