Automatic ontology-based knowledge extraction from Web documents
IEEE Intelligent Systems
To bring the Semantic Web to life and provide advanced knowledge services, we need efficient ways to access and extract knowledge from Web documents. Although Web page annotations could facilitate such knowledge gathering, annotations are rare and will probably never be rich or detailed enough to cover all the knowledge these documents contain. Manual annotation is impractical and unscalable, and automatic annotation tools remain largely undeveloped. Specialized knowledge services therefore require tools that can search and extract specific knowledge directly from unstructured text on the Web, guided by an ontology that details what type of knowledge to harvest.
An ontology uses concepts and relations to classify domain knowledge. Other researchers have used ontologies to support knowledge extraction [1, 2], but few have explored their full potential in this domain. The Artequakt project links a knowledge-extraction tool with an ontology to achieve continuous knowledge support and guide information extraction. The extraction tool searches online documents and extracts knowledge that matches the given classification structure. It provides this knowledge in a machine-readable format that will be automatically maintained in a knowledge base (KB). A lexicon-based term expansion mechanism further enhances knowledge extraction by providing extended ontology terminology.
Artequakt
Many information extraction (IE) systems can recognize entities in documents, for example, that "Rembrandt" is a person or "15 July 1606" is a date. However, such information isn't very useful without the relation between these entities, that is, that Rembrandt was born on 15 July 1606. Extracting such relations automatically lets us acquire more complete knowledge to populate the ontology. Artequakt attempts to identify entity relationships using ontology relation declarations and lexical information.
Storing information in a structured KB supports diverse knowledge services, such as reconstructing the original source material to produce a dynamic presentation tailored to user needs. Previous work in this area has highlighted the difficulties of maintaining a rhetorical structure across a dynamically assembled sequence [3]. Most dynamic narrative techniques have used robust story schemas such as the typical news program format (a sequence of atomic bulletins) [4]. Building a story-schema layer over an ontology lets us create dynamic stories within specific domains.
By populating the ontology through automatic-knowledge-acquisition software, we let users construct those stories from the Web's vast wealth of information.
Artequakt combines expertise and experience from three separate projects:
• Artiste: A European project to develop a distributed database of art images. This has recently been succeeded by Sculpteur, which will extend the database to 3D objects and integrate with the Semantic Web.
• The Equator IRC: An Engineering and Physical Sciences Research Council (EPSRC)-funded Interdisciplinary Research Collaboration (IRC) that uses narrative techniques to structure and present information.
• The AKT IRC: An EPSRC-funded IRC examining all aspects of the knowledge life cycle.
During Artequakt's first stage, we created an ontology for the artists and paintings domain. We developed ...
Artequakt automatically extracts knowledge about artists from the Web, populates a knowledge base, and uses it to generate personalized biographies.
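To make the entity-versus-relation distinction concrete, the following minimal Python sketch shows one way an ontology's relation declarations and a small lexicon could guide relation extraction for the Rembrandt example. It is an illustration under stated assumptions, not Artequakt's implementation: the `Relation` class, the `ONTOLOGY` and `LEXICON` contents, and the `expand` and `extract` functions are all hypothetical names, and the typed entities are assumed to come from an upstream entity recognizer.

```python
import re
from dataclasses import dataclass

# Hypothetical ontology fragment (not taken from Artequakt): each relation
# declares the entity types it links and a few lexical trigger terms.
@dataclass(frozen=True)
class Relation:
    name: str
    subject_type: str
    object_type: str
    triggers: frozenset

ONTOLOGY = [
    Relation("date_of_birth", "Person", "Date", frozenset({"born"})),
    Relation("place_of_birth", "Person", "Place", frozenset({"born"})),
]

# Toy stand-in for a lexicon: extra terms used to expand each trigger.
LEXICON = {"born": {"birth"}}

def expand(triggers):
    """Return the triggers plus any lexicon entries listed for them."""
    expanded = set(triggers)
    for term in triggers:
        expanded |= LEXICON.get(term, set())
    return expanded

def extract(sentence, entities):
    """Return (subject, relation, object) triples licensed by the ontology.

    `entities` is a list of (text, type) pairs for the sentence, assumed to
    have been produced by a separate entity recognizer.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    triples = []
    for rel in ONTOLOGY:
        # A relation is considered only if one of its expanded trigger
        # terms occurs in the sentence.
        if not (expand(rel.triggers) & words):
            continue
        # Keep entity pairs whose types match the relation declaration.
        for subj, subj_type in entities:
            for obj, obj_type in entities:
                if subj_type == rel.subject_type and obj_type == rel.object_type:
                    triples.append((subj, rel.name, obj))
    return triples

sentence = "Rembrandt was born on 15 July 1606."
entities = [("Rembrandt", "Person"), ("15 July 1606", "Date")]
print(extract(sentence, entities))
```

In this sketch the type constraints do most of the work: the trigger "born" matches both relations, but only `date_of_birth` has a Date on the object side, so the place relation yields nothing. A real system would need syntactic analysis to handle sentences that mention several people or dates.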