From a collection of documents to a published edition: how to use an end-to-end publication pipeline [article]

Floriane Chiffoleau, Hugo Scheithauer
2022 Zenodo  
In 2021, during the last edition of the TEI Conference, "Next Gen TEI", I took part in a session where I presented a project I had been working on for a year and a half. This project, which both relies heavily on the Text Encoding Initiative and benefits its community, focusses on the creation of a pipeline for the publication of digital scholarly editions. This pipeline, which was still a work in progress at the time of the 2021 Conference but is now complete, aims at providing open-source, easy-to-use and interoperable tools; its goal is to support the editorial process from the digitization of a collection of documents to its publication in a machine-readable standard. In the following, I will succinctly describe the six steps that compose this pipeline, and then move on to the way I intend to conduct the workshop based on them. First, the collection of images that composes the corpus has to be stored and curated somewhere online, so that the images remain available both to researchers and for publication. For this task, we rely on IIIF, to ensure sustainability and interoperability. The three following steps, segmentation, transcription and post-OCR correction, are performed with eScriptorium, an open-source transcription application. It offers various features: uploading images, producing ground truth, manual or automatic segmentation and transcription, using custom models, and training segmentation and transcription models, to name a few. If any errors remain in the transcription (for instance after an automatic transcription), it is possible either to correct them manually in eScriptorium or to export the files and correct them with the help of specifically designed scripts. Once the transcription is complete, we encode it in TEI XML. For this step, we provide various solutions, depending on the transcription file format (PAGE XML, ALTO XML, plain text) chosen when exporting the transcription from eScriptorium. We also propose documented scripts that help automate and speed up this process. Encoded files [...]
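To make the encoding step more concrete, here is a minimal sketch of how an ALTO XML export could be turned into a bare TEI XML skeleton. It is not one of the project's documented scripts: the function names `alto_lines` and `lines_to_tei`, the placeholder header text and the choice of one `<lb/>` per transcribed line are all assumptions made for illustration, and a real edition would need a far richer `teiHeader` and structural markup. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def local(tag: str) -> str:
    """Return an element name without its namespace, so any ALTO version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str) -> list[str]:
    """Collect the text of each TextLine from the CONTENT attributes of its String children."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element.iter()
            if local(child.tag) == "String"
        ]
        text = " ".join(word for word in words if word)
        if text:
            lines.append(text)
    return lines


def q(name: str) -> str:
    """Qualify an element name with the TEI namespace."""
    return f"{{{TEI_NS}}}{name}"


def lines_to_tei(lines: list[str], title: str = "Untitled") -> str:
    """Wrap transcribed lines in a minimal TEI document (hypothetical skeleton)."""
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(q("TEI"))

    # Minimal header: title, publication statement, source description.
    file_desc = ET.SubElement(ET.SubElement(tei, q("teiHeader")), q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Unpublished draft"
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Transcribed from an ALTO XML export"

    # Body: one paragraph, with an <lb/> marking the start of each source line.
    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    paragraph = ET.SubElement(body, q("p"))
    for line in lines:
        line_break = ET.SubElement(paragraph, q("lb"))
        line_break.tail = line
    return ET.tostring(tei, encoding="unicode")
```

A call such as `lines_to_tei(alto_lines(exported_alto), title="Letter 1")`, where `exported_alto` holds the content of one exported file, would return the TEI document as a string. Matching on local element names rather than on a fixed namespace is a deliberate simplification, so that the sketch does not depend on which ALTO schema version the export uses.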
doi:10.5281/zenodo.7097126 fatcat:4v7l3m7zrnba7i5fnir6tn4f3u