Reconstructing Semantic Structures in Technical Documentation with Vector Space Classification

Jan Oevermann
2016 International Conference on Semantic Systems  
With the increasing popularity of component content management systems, a large part of technical documentation in manufacturing and mechanical engineering is written in semantically structured, XML-based information models. Content delivery portals can use this information to provide users with advanced retrieval or filtering functions. However, legacy content is often excluded from such granular access because archival file formats lack semantic structures, as, for instance, ... ed PDF documents. In this paper we introduce an approach that uses the classification knowledge present in available content components to reconstruct document structures in text extracted from legacy files. The method leverages transitions in classification confidence for distributed text chunks to detect boundaries between content components of different semantic classes. Classification is done using a modified vector space model for technical documentation. To quantify confidence, we derive a measure based on properties of cosine similarity in multiclass scenarios. We present initial results that show a strong correlation between predicted semantic structures and original document outlines, and we propose further improvements.
dblp:conf/i-semantics/Oevermann16 fatcat:xzdxewws65dcfedm2thz3fff7a
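
The idea described in the abstract (classify text chunks with a vector space model, then look for transitions between classes) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's implementation: plain term-frequency vectors stand in for the modified vector space model, the confidence value is a hypothetical margin between the two highest cosine similarities, boundaries are placed wherever the predicted class changes, and the class names, prototype texts, and function names are invented.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (simplified stand-in for the paper's model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(chunk, prototypes):
    """Return (best_class, confidence) for one text chunk.

    Confidence here is an assumed measure: the gap between the best and
    the second-best cosine similarity across all classes.
    """
    vec = tf_vector(chunk)
    scores = sorted(
        ((cosine(vec, proto), label) for label, proto in prototypes.items()),
        reverse=True,
    )
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def find_boundaries(chunks, prototypes):
    """Indices at which the predicted class differs from the previous chunk."""
    labels = [classify(chunk, prototypes)[0] for chunk in chunks]
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Hypothetical class prototypes built from already-classified content components.
prototypes = {
    "warning": tf_vector("danger risk of injury wear protective gloves warning hot surface"),
    "task": tf_vector("open the cover remove the filter insert the new filter close the cover"),
}

# Hypothetical chunks of text extracted from a legacy file.
chunks = [
    "warning hot surface risk of injury",
    "wear protective gloves",
    "remove the filter",
    "insert the new filter and close the cover",
]

print(find_boundaries(chunks, prototypes))  # [2]
```

In this toy input the first two chunks match the "warning" prototype and the last two match "task", so a single boundary is reported before the third chunk. A closer approximation of the described method would also track how the confidence value changes across overlapping chunks, which this sketch does not attempt.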