New Approaches to OCR for Early Printed Books

Nikolaus Weichselbaumer, Mathias Seuret, Saskia Limbach, Rui Dong, Manuel Burghardt, Vincent Christlein
2020 DigItalia  
In a second step, we created an online training infrastructure (okralact), which allows for the use of various open source OCR engines such as Tesseract, OCRopus, Kraken and Calamari.  ...  Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data.  ...  Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann, OCR-D: An end-to-end open source OCR framework for historical printed documents  ... 
doi:10.36181/digitalia-00015 fatcat:rukrigshsjddxcvhdc6bcyxvpi

Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs [article]

Matteo Romanello, Sven Najem-Meyer, Bruce Robertson
2021 arXiv   pre-print
In this paper, we evaluate the performances of two pipelines suitable for the OCR of historical classical commentaries.  ...  Yet, the exploitation of thousands of digitized historical commentaries was hitherto hindered by the poor quality of Optical Character Recognition (OCR), especially on commentaries to Greek texts.  ...  Second, OCR4all [14] is an open source OCR tool explicitly developed for users with no prior technical background, and especially those working on the earliest printed books.  ... 
arXiv:2110.06817v1 fatcat:jdcursc7vjcitpgnh6p74id32e

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis [article]

Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li
2021 arXiv   pre-print
This paper introduces layoutparser, an open-source library for streamlining the usage of DL in DIA research and applications.  ...  To promote extensibility, layoutparser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines.  ...  Zejiang Shen thanks Doug Downey for suggestions.  ... 
arXiv:2103.15348v2 fatcat:7pz575jey5g63odk7axh7dpjzm

OCR4all – An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings [article]

Christian Reul, Dennis Christ, Alexander Hartelt, Nico Balbach, Maximilian Wehner, Uwe Springmann, Christoph Wick, Christine Grundig, Andreas Büttner, Frank Puppe
2019 arXiv   pre-print
for historical printings.  ...  In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow.  ...  for OCR4all.  ... 
arXiv:1909.04032v1 fatcat:czzg6o6i5baxdcnsc2cacm5xmy

Optical character recognition with neural networks and post-correction with finite state methods

Senka Drobac, Krister Lindén
2020 International Journal on Document Analysis and Recognition  
There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract),  ...  but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus.  ...  as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.  ... 
doi:10.1007/s10032-020-00359-9 fatcat:cjonawcrebec7n34iapdg7frqu

Transforming scholarship in the archives through handwritten text recognition

Guenter Muehlberger, Louise Seaward, Melissa Terras, Sofia Ares Oliveira, Vicente Bosch, Maximilian Bryan, Sebastian Colutto, Hervé Déjean, Markus Diem, Stefan Fiel, Basilis Gatos, Albert Greinoecker (+42 others)
2019 Journal of Documentation  
Social implications The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.  ...  Practical implications Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and  ...  This project received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under Grant Agreement No. 600707.  ... 
doi:10.1108/jd-07-2018-0114 fatcat:ibaltwl2nnccpajqdvz2sgakx4

Text+: Language- and text-based Research Data Infrastructure

Erhard Hinrichs, Peter Leinen, Alexander Geyken, Andreas Speer, Regine Stein
2022 Zenodo  
Text+ will be flexible, scalable, and thus open for different discipline-specific requirements.  ...  Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text.  ...  For OCR and structural detection, the OCR-D initiative offers specialised tools and reference data from the 16th to the 19th century.  ... 
doi:10.5281/zenodo.6451707 fatcat:fu5dams5jvcinjyrlv5bk2inkm

Text+: Language- and text-based Research Data Infrastructure

Erhard Hinrichs, Peter Leinen, Alexander Geyken, Andreas Speer, Regine Stein
2022 Zenodo  
Text+ will be flexible, scalable, and thus open for different discipline-specific requirements.  ...  Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text.  ...  For OCR and structural detection, the OCR-D initiative offers specialised tools and reference data from the 16th to the 19th century.  ... 
doi:10.5281/zenodo.6452002 fatcat:4vf2jmx7pzfy7d2dnv2txicyyi

Ein Wolpertinger für die Vormoderne: Zu Nutzungs- und Forschungsmöglichkeiten von Transkribus bei der Arbeit mit mittelalterlichen und frühneuzeitlichen Handschriften und Drucken

Ina Serif, Thüringer Universitäts- Und Landesbibliothek Jena
2019
The paper provides an introduction to 'Transkribus', a software for the transcription of handwritten and printed documents.  ...  It explains the single steps from a digitized image to a document with manually or (semi)automated recognized layout and text, enabling the searchability of large bodies of sources.  ... 
doi:10.22032/dbt.39516 fatcat:wnspvsrcjvdczdjiklkfcikhxa

Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning [article]

Christian Reul, Stefan Tomasek, Florian Langhanki, Uwe Springmann
2022
This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts.  ...  We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training  ...  Acknowledgements The authors would like to thank our student research assistants Lisa Gugel, Kiara Hart, Ursula Heß, Annika Müller, and Anne Schmid for their extensive segmentation and transcription work  ... 
doi:10.48550/arxiv.2201.07661 fatcat:6maag6ppdnhsnmmbpmgfo76qfe

DHd2022: Kulturen des digitalen Gedächtnisses. Konferenzabstracts [article]

Michaela Geierhos, Peer Trilcke, Ingo Börner, Sabine Seifert, Anna Busch, Ulrike Wuttke, Melanie Seltmann, Kristina Genzel
2022 Zenodo  
"DHd2022: Kulturen des digitalen Gedächtnisses" at the University of Potsdam and the Fachhochschule Potsdam, 7-11 March 2022  ...  ., 2017), which was the first end-to-end neural network-based architecture for coreference resolution. e2e begins by building span representations for all spans up to a pre-defined length.  ...  for their consequences for historical research.  ... 
doi:10.5281/zenodo.6304590 fatcat:lhep7j4sk5g45fou6ls2bccgem

Globalisation, Entrepreneurship and the South Pacific: Reframing Australian Colonial Architecture, 1800-1850

William Cartwright, Harriet Edquist, Stuart King, Stephen Loo, Bernard Mees, Philippa Mein Smith, Paul Turnbull, Laurene Vaughan, Imogen Wegman
2017
Essentially, the National Library of Australia provided us with a copy of the OCR'd text of Trove's digital corpus of Australian Newspapers.  ...  The corpus was analysed using open source data mining and text analytics software so as to create indexes of concepts (Figure 3).  ... 
doi:10.17613/m6dz5d fatcat:jedwn5roxjhbpo7b22kydlu2zm

From MARC silos to Linked Data silos?

Osma Suominen
2017 unpublished
Libraries are opening up their bibliographic metadata as Linked Data. However, they have all used different data models for structuring their bibliographic data.  ...  In effect, libraries have moved from MARC silos to Linked Data silos of incompatible data models. Data sets can be difficult to combine and reuse.  ...  documents may choose which document to read first, when to interrupt their reading of that document, where to go next, and so forth.  ... 
fatcat:ccwtxnhof5hzhkwkq5c5wm6l6q