Combining Knowledge about Text Types and Document Structures for Enhanced Content Curation

Karolina Zaczynska, Florian Kintzel, Julián Moreno Schneider, Georg Rehm
2021 Conference on Digital Curation Technologies  
We present the conceptual design of a language technology (LT) system that enables enhanced document curation and processing of different document types by providing customized NLP workflows that respond and adapt to the extracted characteristics of the input documents. To optimize document and text understanding, the processing steps will incorporate not only textual features but also layout and document-type-related features such as document structure and the communicative function of specific parts or constituents of a document (e.g., header, subtitle, paragraph, footer). We tackle the lack of standardized representation formats for many of these document features by presenting the first draft of an ontology (QOntology) that we plan to incorporate into the overall workflow manager. Since the work is still in progress, we present the theoretical background and conceptual design decisions of the approach, which will be the basis of experiments in future work.
dblp:conf/qurator/ZaczynskaKSR21 fatcat:xjax2mnrobe4xkcg6ighfvwsci