Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD [chapter]

Karsten Winkler, Myra Spiliopoulou
2002 Lecture Notes in Computer Science  
Domain-specific documents often share an inherent, though undocumented, structure. This structure should be made explicit to facilitate efficient, structure-based search in archives as well as information integration. Inferring a semantically structured XML DTD for an archive and subsequently transforming its texts into XML documents is a promising way to reach these objectives. Based on the KDD-driven DIAsDEM framework, we propose a new method to derive an archive-specific structured XML document type definition (DTD). Our approach uses association rule discovery and sequence mining techniques to structure a previously derived flat, i.e., unstructured, DTD. We introduce the notion of a probabilistic DTD, which is derived by discovering associations among XML tags and frequent sequences of XML tags.
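As a rough illustration of the kind of statistics a probabilistic DTD could carry, the sketch below computes, for a toy archive, the relative frequency of each tag and the confidence that one tag is followed by another within a document. It is not the DIAsDEM algorithm: the function name `tag_statistics`, the tag names, and the choice of "first occurrence order" as the sequence criterion are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def tag_statistics(docs):
    """Hypothetical helper: docs is a list of tag sequences, one per document.

    Returns (support, confidence), where
      support[t]         = fraction of documents containing tag t
      confidence[(a, b)] = fraction of documents containing a in which
                           b first occurs after the first occurrence of a
    """
    n = len(docs)
    tag_count = Counter()
    pair_count = Counter()
    for tags in docs:
        # Distinct tags in order of first occurrence.
        seen = list(dict.fromkeys(tags))
        tag_count.update(seen)
        # Ordered pairs (a, b) with a first occurring before b.
        pair_count.update(combinations(seen, 2))
    support = {t: c / n for t, c in tag_count.items()}
    confidence = {
        (a, b): c / tag_count[a] for (a, b), c in pair_count.items()
    }
    return support, confidence


# Toy archive with made-up tag names (assumption, not from the paper).
docs = [
    ["Person", "Date", "Place"],
    ["Person", "Place"],
    ["Date", "Person"],
]
support, confidence = tag_statistics(docs)
```

Pairs with high confidence would be candidates for ordered element sequences in a structured DTD, with the probabilities kept as annotations.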
doi:10.1007/3-540-45681-3_38 fatcat:n2mxpwd2wvfl7ghdqgfuybzlga