Combining linguistic and statistical analysis to extract relations from web documents

Fabian M. Suchanek, Georgiana Ifrim, Gerhard Weikum
2006 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '06  
The World Wide Web provides a nearly endless source of knowledge, most of it expressed in natural language. A first step towards exploiting this data automatically could be to extract pairs of a given semantic relation from text documents, for example all pairs of a person and her birthdate. One strategy for this task is to find text patterns that express the semantic relation, to generalize these patterns, and to apply them to a corpus to find new pairs. In this paper, we show that this approach profits significantly when deep linguistic structures are used instead of surface text patterns. We demonstrate how linguistic structures can be represented for machine learning, and we provide a theoretical analysis of the pattern matching approach. We show the practical relevance of our approach by extensive experiments with our prototype system Leila.
[...] might be interested in extracting all pairs of a person and her birth date (the birthdate-relation), all pairs of a company and the city of its headquarters (the headquarters-relation), or all pairs of an entity and the class it belongs to (the instanceOf-relation). The most promising techniques to extract information from unstructured text seem to be natural language processing (NLP) techniques. Most approaches, however, have limited the NLP part to part-of-speech tagging. This paper demonstrates that information extraction can profit significantly from deep natural language processing. It shows how deep syntactic structures can be represented suitably, and it provides a statistical analysis of the pattern matching approach.
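The pattern-based strategy named in the abstract (find patterns that express a relation, generalize them, apply them to a corpus) can be illustrated with a minimal sketch. It uses surface text patterns only, which is the baseline the paper argues against, not Leila's deep linguistic structures. The function names, the capitalized-name and four-digit-year matchers, and the sample sentences are all assumptions made for illustration.

```python
import re

# Hypothetical entity matchers (assumptions, not from the paper):
# a person is a run of capitalized words, a birthdate is a 4-digit year.
PERSON = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
YEAR = r"(\d{4})"


def learn_patterns(sentences, seed_pairs):
    """Step 1: collect the text between the two members of a known pair."""
    patterns = set()
    for sentence in sentences:
        for x, y in seed_pairs:
            i = sentence.find(x)
            j = sentence.find(y)
            # Keep only cases where x occurs before y in the sentence.
            if i != -1 and j != -1 and i + len(x) <= j:
                between = sentence[i + len(x):j]
                if between.split():  # skip patterns with no words
                    patterns.add(between)
    return patterns


def generalize(pattern):
    """Step 2: a very mild generalization, tolerating any whitespace."""
    tokens = [re.escape(token) for token in pattern.split()]
    return r"\s+" + r"\s+".join(tokens) + r"\s+"


def extract_pairs(sentences, patterns):
    """Step 3: apply the generalized patterns to find new pairs."""
    pairs = set()
    for pattern in patterns:
        regex = re.compile(PERSON + generalize(pattern) + YEAR)
        for sentence in sentences:
            for match in regex.finditer(sentence):
                pairs.add((match.group(1), match.group(2)))
    return pairs


# Made-up example data for the birthdate-relation.
training = ["Albert Einstein was born in 1879."]
seeds = [("Albert Einstein", "1879")]
corpus = [
    "Marie Curie was  born in 1867 in Warsaw.",
    "Alan Turing studied at Cambridge.",
]

patterns = learn_patterns(training, seeds)
new_pairs = extract_pairs(corpus, patterns)
print(sorted(new_pairs))
```

A sketch like this only matches sentences whose wording is close to the training sentence; a rephrasing such as "Marie Curie, a physicist, was born in 1867" would be missed, which is the kind of limitation that motivates using deeper linguistic structures.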
doi:10.1145/1150402.1150492 dblp:conf/kdd/SuchanekIW06 fatcat:n34emqgpjraspefbbv3un4fiiu