From Text to Knowledge [chapter]

M. Fernández, E. Villemonte de la Clergerie, M. Vilares
2007 Lecture Notes in Computer Science  
Nowadays, in spite of the increasing amount of information available in electronic format, most human knowledge is still only available in textual form, which does not lend itself directly to automatic management tasks. This makes practical knowledge acquisition a topic of great interest, in particular for technical and/or scientific documents whose highly structured wording could simplify their computational treatment. In this context, we focus on
... data extraction from text. The goal is to generate a knowledge structure on which to develop question-answering facilities over textual documents. To aid understanding, we introduce the proposal using a botanical corpus describing West African flora. It is composed of about forty volumes in French, organized as a sequence of sections, each one dedicated to one species and following a systematic structural schema. For example, sections include a descriptive part enumerating morphological aspects such as color, texture, size or form. This implies the presence of nominal phrases and adjectives, as well as adverbs to express frequency and intensity, and named entities to denote dimensions. A first phase, consisting of performing such a translation, has been carried out using an OCR platform and a complementary error correction technique [5], although the description of this initial task is outside the scope of this paper. The next step consists of capturing the structure of the text using a combination of a mark-up language, such as XML, and chunking tasks. The goal is to establish the linguistic context the analyzer will work with, which serves as a guideline for the later knowledge acquisition process. As a side effect, the document also becomes browsable. We are now ready to introduce knowledge acquisition, which extracts and later connects terms in order to detect pertinent relations and eliminate nondeterministic interpretations. To deal with this, two principles are considered: the distributional semantics model [4], which establishes that words whose meanings are close often appear in similar syntactic contexts; and the assumption that terms shared by these contexts are usually nouns and adjectives [1]. As a starting point, we parse the text on the basis of the meta-grammar concept [2], providing both
doi:10.1007/978-3-540-75867-9_34 fatcat:f7izph7hmfe2vgz7irhvufoo2q
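
The distributional principle mentioned in the abstract (words with close meanings tend to occur in similar syntactic contexts) can be illustrated with a minimal sketch. This is not the authors' method: the word/context pairs, the `mod:` context labels, and the function names below are hypothetical, and cosine similarity over raw counts is just one common way to compare contexts.

```python
from collections import Counter
from math import sqrt

# Hypothetical (word, syntactic context) pairs, such as a parser might emit
# for noun-adjective modification. These are illustrative, not corpus data.
PAIRS = [
    ("leaf", "mod:green"), ("leaf", "mod:oval"), ("leaf", "mod:smooth"),
    ("leaflet", "mod:green"), ("leaflet", "mod:oval"),
    ("fruit", "mod:red"), ("fruit", "mod:fleshy"),
]

def context_vectors(pairs):
    """Count, for each word, how often it occurs in each syntactic context."""
    vectors = {}
    for word, context in pairs:
        vectors.setdefault(word, Counter())[context] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = context_vectors(PAIRS)
sim_close = cosine(vectors["leaf"], vectors["leaflet"])  # share two contexts
sim_far = cosine(vectors["leaf"], vectors["fruit"])      # share no context
```

Under these toy assumptions, "leaf" and "leaflet" score higher than "leaf" and "fruit" because they share adjectival contexts, which is the intuition the distributional model relies on.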