New Approaches to OCR for Early Printed Books
2020
DigItalia
In a second step, we created an online training infrastructure (okralact), which allows for the use of various open source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. ...
Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. ...
Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann, OCR-D: An end-to-end open source OCR framework for historical printed documents ...
doi:10.36181/digitalia-00015
fatcat:rukrigshsjddxcvhdc6bcyxvpi
Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs
[article]
2021
arXiv
pre-print
In this paper, we evaluate the performances of two pipelines suitable for the OCR of historical classical commentaries. ...
Yet, the exploitation of thousands of digitized historical commentaries was hitherto hindered by the poor quality of Optical Character Recognition (OCR), especially on commentaries to Greek texts. ...
Second, OCR4all [14] is an open source OCR tool explicitly developed for users with no prior technical background, and especially those working on the earliest printed books. ...
arXiv:2110.06817v1
fatcat:jdcursc7vjcitpgnh6p74id32e
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
[article]
2021
arXiv
pre-print
This paper introduces layoutparser, an open-source library for streamlining the usage of DL in DIA research and applications. ...
To promote extensibility, layoutparser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. ...
Zejiang Shen thanks Doug Downey for suggestions. ...
arXiv:2103.15348v2
fatcat:7pz575jey5g63odk7axh7dpjzm
OCR4all – An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings
[article]
2019
arXiv
pre-print
for historical printings. ...
In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. ...
for OCR4all. ...
arXiv:1909.04032v1
fatcat:czzg6o6i5baxdcnsc2cacm5xmy
Optical character recognition with neural networks and post-correction with finite state methods
2020
International Journal on Document Analysis and Recognition
There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), ...
but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. ...
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. ...
doi:10.1007/s10032-020-00359-9
fatcat:cjonawcrebec7n34iapdg7frqu
Transforming scholarship in the archives through handwritten text recognition
2019
Journal of Documentation
Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals. ...
Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and ...
This project received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under Grant Agreement No. 600707. ...
doi:10.1108/jd-07-2018-0114
fatcat:ibaltwl2nnccpajqdvz2sgakx4
Text+: Language- and text-based Research Data Infrastructure
2022
Zenodo
Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. ...
Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. ...
For OCR and structural detection, the OCR-D initiative offers specialised tools and reference data from the 16th to the 19th century. ...
doi:10.5281/zenodo.6451707
fatcat:fu5dams5jvcinjyrlv5bk2inkm
Text+: Language- and text-based Research Data Infrastructure
2022
Zenodo
Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. ...
Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. ...
For OCR and structural detection, the OCR-D initiative offers specialised tools and reference data from the 16th to the 19th century. ...
doi:10.5281/zenodo.6452002
fatcat:4vf2jmx7pzfy7d2dnv2txicyyi
Ein Wolpertinger für die Vormoderne: Zu Nutzungs- und Forschungsmöglichkeiten von Transkribus bei der Arbeit mit mittelalterlichen und frühneuzeitlichen Handschriften und Drucken
2019
The paper provides an introduction to 'Transkribus', a software for the transcription of handwritten and printed documents. ...
It explains the single steps from a digitized image to a document with manually or (semi)automated recognized layout and text, enabling the searchability of large bodies of sources. ...
doi:10.22032/dbt.39516
fatcat:wnspvsrcjvdczdjiklkfcikhxa
Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning
[article]
2022
This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. ...
We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training ...
Acknowledgements The authors would like to thank our student research assistants Lisa Gugel, Kiara Hart, Ursula Heß, Annika Müller, and Anne Schmid for their extensive segmentation and transcription work ...
doi:10.48550/arxiv.2201.07661
fatcat:6maag6ppdnhsnmmbpmgfo76qfe
DHd2022: Kulturen des digitalen Gedächtnisses. Konferenzabstracts
[article]
2022
Zenodo
"DHd2022: Kulturen des digitalen Gedächtnisses" an der Universität Potsdam und der Fachhochschule Potsdam vom 7.-11.3.2022 ...
., 2017), which was the first end-to-end neural network-based architecture for coreference resolution. e2e begins by building span representations for all spans up to a pre-defined length. ...
for their consequences for historical research. ...
doi:10.5281/zenodo.6304590
fatcat:lhep7j4sk5g45fou6ls2bccgem
Globalisation, Entrepreneurship and the South Pacific: Reframing Australian Colonial Architecture, 1800-1850
2017
Essentially, the National Library of Australia provided us with a copy of the OCR'd text of Trove's digital corpus of Australian Newspapers. ...
The corpus was analysed using open source data mining and text analytics software so as to create indexes of concepts (Figure 3). ...
doi:10.17613/m6dz5d
fatcat:jedwn5roxjhbpo7b22kydlu2zm
Suominen/Hyvönen, From MARC silos to Linked Data silos?
2017
unpublished
Libraries are opening up their bibliographic metadata as Linked Data. However, they have all used different data models for structuring their bibliographic data. ...
In effect, libraries have moved from MARC silos to Linked Data silos of incompatible data models. Data sets can be difficult to combine and reuse. ...
documents may choose which document to read first, when to interrupt their reading of that document, where to go next, and so forth. ...
fatcat:ccwtxnhof5hzhkwkq5c5wm6l6q