Extraction of Semantic XML DTDs from Texts Using Data Mining Techniques

Karsten Winkler, Myra Spiliopoulou
2001 International Conference on Knowledge Capture  
Although composed of unstructured texts, documents contained in textual archives such as public announcements, patient records and annual reports to shareholders often share an inherent though undocumented structure. In order to facilitate efficient, structure-based search in archives and to enable information integration of text collections with related data sources, this inherent structure should be made explicit in as much detail as possible. Inferring a semantic and structured XML document type definition (DTD) for an archive and subsequently transforming the corresponding texts into XML documents is a successful method to achieve this objective. The main contribution of this paper is a new method to derive structured XML DTDs in order to extend previously derived flat DTDs. We use the DIAsDEM framework to derive a preliminary, unstructured XML DTD whose components are supported by a large number of documents. However, not all XML tags contained in this preliminary DTD can be assumed a priori to be mandatory. Additionally, there is no fixed order of XML tags, and automatically tagging an archive using a derived DTD inevitably introduces tagging errors. Hence, we introduce the notion of probabilistic XML DTDs whose components are assigned probabilities of being semantically and structurally correct. Our method for establishing a probabilistic XML DTD is based on discovering associations between XML tags and frequent sequences of XML tags.
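The idea of attaching probabilities to DTD components can be illustrated with a minimal Python sketch. It is not the authors' algorithm or the DIAsDEM implementation, only a simplified stand-in based on plain counting. The function name `tag_statistics`, the tag names and the toy archive are hypothetical. For each tag it estimates how often the tag occurs (a hint at whether it is mandatory), for each tag pair how strongly one tag implies the other (an association), and how often one tag precedes the other (a length-two sequence).

```python
from collections import Counter
from itertools import combinations


def tag_statistics(docs):
    """Estimate simple tag probabilities from tagged documents.

    docs is a list of tag sequences, one per document (hypothetical input
    format). Returns three dicts: support, confidence and precedence.
    """
    n = len(docs)
    tag_count = Counter()    # number of documents containing a tag
    pair_count = Counter()   # number of documents containing both tags
    order_count = Counter()  # number of documents where a first occurs before b

    for seq in docs:
        first_pos = {}
        for i, tag in enumerate(seq):
            first_pos.setdefault(tag, i)  # keep the first occurrence only
        tags = sorted(first_pos)
        tag_count.update(tags)
        for a, b in combinations(tags, 2):
            pair_count[(a, b)] += 1
            if first_pos[a] < first_pos[b]:
                order_count[(a, b)] += 1
            else:
                order_count[(b, a)] += 1

    # Share of documents containing the tag.
    support = {t: c / n for t, c in tag_count.items()}

    # confidence[(a, b)]: share of documents with a that also contain b.
    confidence = {}
    for (a, b), c in pair_count.items():
        confidence[(a, b)] = c / tag_count[a]
        confidence[(b, a)] = c / tag_count[b]

    # precedence[(a, b)]: among documents with both tags, share where a comes first.
    precedence = {
        (a, b): c / pair_count[tuple(sorted((a, b)))]
        for (a, b), c in order_count.items()
    }
    return support, confidence, precedence


# Toy archive with hypothetical tag names.
docs = [
    ["Company", "Capital", "Manager"],
    ["Company", "Manager"],
    ["Company", "Capital"],
    ["Manager", "Company"],
]
support, confidence, precedence = tag_statistics(docs)
print(support)
print(confidence[("Capital", "Company")])
print(precedence[("Company", "Manager")])
```

In this toy archive, a tag with support 1.0 is a candidate for a mandatory element, while a precedence value well below 1.0 suggests that the order of the two tags should not be fixed in the DTD. A realistic setting would use proper association-rule and sequence-mining algorithms with minimum-support thresholds rather than exhaustive pair counting.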
dblp:conf/kcap/WinklerS01 fatcat:exsiw477zvej3jt3bjc3t4p7cy