Entity Extraction and Consolidation for Social Web Content Preservation

Stefan Dietze, Diana Maynard, Elena Demidova, Thomas Risse, Wim Peters, Katerina Doka, Yannis Stavrakas
2012 International Conference on Theory and Practice of Digital Libraries  
With the rapidly increasing pace at which Web content is evolving, particularly social media, preserving the Web and its evolution over time becomes an important challenge. Meaningful analysis of Web content lends itself to an entity-centric view to organise Web resources according to the information objects related to them. Therefore, the crucial challenge is to extract, detect and correlate entities from a vast number of heterogeneous Web resources where the nature and quality of the content may vary heavily. While a wealth of information extraction tools aid this process, we believe that the consolidation of automatically extracted data has to be treated as an equally important step in order to ensure high quality and non-ambiguity of generated data. In this paper we present an approach which is based on an iterative cycle exploiting Web data for (1) targeted archiving/crawling of Web objects, (2) entity extraction and detection, and (3) entity correlation. The long-term goal is to preserve Web content over time and allow its navigation and analysis based on well-formed structured RDF data about entities.
dblp:conf/ercimdl/DietzeMDRPDS12 fatcat:uw7sscgrpvbipaoyrz3vlijpkm