Corpus Linguistics for establishing the natural language content of Digital Library documents [chapter]

Robert P. Futrelle, Xiaolan Zhang, Yumiko Sekiya
1995 Lecture Notes in Computer Science  
Digital Libraries will hold huge amounts of text and other forms of information. For the collections to be maximally useful, they must be highly organized with useful indexes and intra- and inter-document linkages. This brings with it a demand for ever-better methods for automated analysis of text to build the indexes and links. It requires turning implicit information, "encrypted in natural language", into explicit information. We discuss approaches to the automation task built on the techniques of corpus linguistics. This paper focuses on word classification as an example of the utility of corpus methods. Results are presented for the syntactic and semantic classification of words from a biological corpus. The word classes identified can then be used for indexing, query expansion, syntactic analysis and for linking separate library collections by aligning word senses. The paper also discusses derivative objects, diagram analysis and authoring tools. Finally, we outline a new approach to word classification and other language structure analyses based on the minimal complexity principle, in turn based on the theory of Kolmogorov complexity.
doi:10.1007/bfb0026855 fatcat:ebzlatifdzbapdlk5wmdfgfsxq