Language-Agnostic Relation Extraction from Abstracts in Wikis

Nicolas Heist, Sven Hertling, Heiko Paulheim
2018 Information  
Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6 M new relations in DBpedia at a precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph.

Information 2018, 9, 75

[...] content pages that are interlinked, where each page has a specific topic. Practically, we limit ourselves to installations of the MediaWiki platform [13], which is the most widespread Wiki platform [14], although implementations for other platforms would be possible. As an abstract, we consider the contents of a Wiki page that appear before the first structuring element (e.g., a headline or a table of contents), as depicted in Figure 1.

Figure 1. An example Wikipedia page. As the abstract, we consider the beginning of a Web page before the first structuring element (here: the table of contents). (Labels in the figure: Abstract, Infobox.)
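The definition of an abstract given above (everything before the first structuring element) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the input is raw MediaWiki wikitext and that a structuring element is either a section heading line (`== Title ==`) or an explicit `__TOC__` marker. The function name `extract_abstract` and the regular expression are hypothetical choices made for illustration.

```python
import re

# Assumption: in raw wikitext, the first structuring element is either a
# section heading line of level 2 to 6 (== Title ==) or an explicit __TOC__.
_STRUCTURING_ELEMENT = re.compile(
    r"^={2,6}[^=\n].*?={2,6}[ \t]*$|__TOC__",
    re.MULTILINE,
)


def extract_abstract(wikitext: str) -> str:
    """Return the text that precedes the first structuring element.

    If the page has no heading and no __TOC__ marker, the whole page
    is treated as the abstract.
    """
    match = _STRUCTURING_ELEMENT.search(wikitext)
    end = match.start() if match else len(wikitext)
    return wikitext[:end].strip()


if __name__ == "__main__":
    page = (
        "'''Foo''' is a town in [[Bar]].\n\n"
        "It has 10 inhabitants.\n\n"
        "== History ==\n"
        "Founded in 1900.\n"
    )
    print(extract_abstract(page))
```

The sketch does not remove infobox templates or other markup that may precede the running text. A real pipeline would presumably need to handle those separately.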
doi:10.3390/info9040075 fatcat:juilvyj47bgclpax63dv4wimw4