Seminar 06491: Digital Historical Corpora
The seminar brought together scholars from (historical) linguistics, (historical) philology, computational linguistics and computer science who work with collections of historical texts. These texts, digital libraries or corpora are collected for a number of different purposes, such as lexicography, history, linguistics and philology. This naturally leads to different decisions in their design and architecture. However, many issues are common to many projects working with
historical texts. These include:

Standards and methods of digitization: Historical texts have to be digitized from different sources. Sometimes it is necessary to digitize directly from a manuscript or early print. In these cases it is not possible to use current OCR technology, and the texts have to be double keyed (for example according to the standards developed in the Kompetenzzentrum Retrodigitalisierung in Trier). Newer texts can sometimes be scanned and OCRed, although even the relatively 'clean' 19th-century newspaper texts are often problematic. Fraktur and some other scripts (e.g. old Cyrillic scripts) also pose problems for OCR. For some research questions it is possible to work with editions. In these cases the digitization itself is not an issue (if the editions are new). It has to be decided, however, how to deal with a critical apparatus.

Design (composition) of corpora: While literary scholars often work on one text (or a small number of related texts), many research questions in linguistics and lexicography require a collection of several texts. Corpus design is, of course, always an issue in corpus construction. Ideally a matrix of the necessary parameters (text type, author, time etc.) is constructed and all 'cells' are filled with the appropriate texts. For older time periods this is often not possible since the texts might not have survived. A 'skewed' corpus, of course, only permits certain research questions.

Standards and methods of annotation: For many research questions it is not sufficient to have the 'naked' text. The texts need to be annotated with further information: (a) header annotation (information about the whole text), (b) positional annotation (annotation for each token), and (c) structural annotation. The Text Encoding Initiative and other groups have developed suggestions for historical texts (the most detailed suggestions pertain to the header annotation).
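The three levels of annotation listed above can be pictured as a small data structure. The following Python sketch is only an illustration under stated assumptions: the class names, the field names, the tag labels and the sample tokens and header values are all hypothetical, and none of them is taken from a seminar project or from the TEI guidelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """One token plus its positional annotation (hypothetical layout)."""
    form: str                                   # spelling as found in the source
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span:
    """Structural annotation: a labelled range of token indices."""
    label: str   # e.g. "line" or "verse"
    start: int   # index of the first token
    end: int     # index one past the last token


@dataclass
class AnnotatedText:
    header: Dict[str, str]   # (a) header annotation: facts about the whole text
    tokens: List[Token]      # (b) positional annotation: one entry per token
    structure: List[Span]    # (c) structural annotation: spans over tokens

    def tokens_in(self, span: Span) -> List[Token]:
        """Return the tokens covered by a structural span."""
        return self.tokens[span.start:span.end]


# Invented sample data, for illustration only.
text = AnnotatedText(
    header={"title": "hypothetical letter", "date": "c. 1500", "text_type": "letter"},
    tokens=[
        Token("vnto", {"norm": "unto", "pos": "ADP"}),
        Token("the", {"norm": "the", "pos": "DET"}),
        Token("kynge", {"norm": "king", "pos": "NOUN"}),
    ],
    structure=[Span("line", 0, 3)],
)

for token in text.tokens_in(text.structure[0]):
    print(token.form, token.annotations["norm"], token.annotations["pos"])
```

Keeping the original spelling in `form` and the normalized spelling as one annotation among others is a design choice of this sketch: the 'naked' text stays untouched, and every interpretation of it is stored separately.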
Annotation often cannot be done automatically because older texts are less standardized than newer texts, which makes it difficult to develop statistical or rule-based methods. It is necessary to discuss possible automation. It is also necessary to develop good annotation tools for manual or semi-automatic annotation.

Corpus architecture: Most large modern corpora are stored in some table or tree format. Such architectures might not be the best option for historical corpora since they cannot accommodate conflicting annotation. Therefore one has to think about alternatives like multi-layer models or database models.
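To make the limitation of tree formats concrete, here is a minimal Python sketch of one possible multi-layer (stand-off) model. It is an assumption about how such a model could look, not a description of any existing corpus system: `Span`, `MultiLayerCorpus`, `crosses` and the layer names are hypothetical, and the tokens are placeholders. Each layer stores its own spans over a shared token sequence, so a sentence that runs across a manuscript line break can coexist with the line structure, although the two could not be nested in a single tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Span:
    label: str
    start: int   # index of the first token
    end: int     # index one past the last token


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap without one containing the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


class MultiLayerCorpus:
    """Shared tokens plus independent annotation layers (hypothetical model)."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = list(tokens)
        self.layers: Dict[str, List[Span]] = {}

    def add(self, layer: str, span: Span) -> None:
        """Store a span in the named layer; layers never constrain each other."""
        if not (0 <= span.start < span.end <= len(self.tokens)):
            raise ValueError("span outside token range")
        self.layers.setdefault(layer, []).append(span)

    def crossing_pairs(self, layer_a: str, layer_b: str) -> List[Tuple[Span, Span]]:
        """List span pairs that could not be nested in one tree."""
        return [
            (a, b)
            for a in self.layers.get(layer_a, [])
            for b in self.layers.get(layer_b, [])
            if crosses(a, b)
        ]


# Placeholder tokens; two manuscript lines and two sentences over six tokens.
corpus = MultiLayerCorpus(["t0", "t1", "t2", "t3", "t4", "t5"])
corpus.add("layout", Span("line", 0, 3))
corpus.add("layout", Span("line", 3, 6))
corpus.add("syntax", Span("sentence", 0, 4))   # runs past the first line break
corpus.add("syntax", Span("sentence", 4, 6))

pairs = corpus.crossing_pairs("syntax", "layout")
print(pairs)
```

In this sketch the first sentence crosses the second line, which a single tree could only represent by splitting one of the two units. The same layered idea would also allow two competing analyses of the same tokens to be stored side by side, each in its own layer.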