Recognition of historical Greek polytonic scripts using LSTM networks
2015
2015 13th International Conference on Document Analysis and Recognition (ICDAR)
This paper reports on high-performance Optical Character Recognition (OCR) experiments using Long Short-Term Memory (LSTM) Networks for Greek polytonic script. ...
performed baseline experiments using LSTM Networks. ...
For our experiments, we used the open-source OCR system OCRopus [13]. ...
doi:10.1109/icdar.2015.7333865
dblp:conf/icdar/SimistiraUPGKL15
fatcat:unjxqg34kffv5ijtnc33gy6d4e
Name the Name - Named Entity Recognition in OCRed 19th and Early 20th Century Finnish Newspaper and Journal Collection Data
2020
Digital Humanities in the Nordic Countries Conference
With re-OCRed Tesseract output the results are 0.79, 0.72, and 0.42, respectively. Results of LSTM-CRF are similar. ...
and LSTM-CRF NER model. ...
Conclusions We have reported in this paper usage of two standard statistical NER tools, Stanford NER and LSTM-CRF model, for annotation of OCRed Finnish historical newspaper and journal data. ...
dblp:conf/dhn/RuokolainenK20
fatcat:u64aqfbd7fea3nxenwqmx4wwtu
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography
[article]
2021
arXiv
pre-print
In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. ...
There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. ...
Bollmann & Søgaard (2016) have shown that a bi-directional long short-term memory (bi-LSTM) can be used to normalize historical German texts. ...
arXiv:2107.03266v1
fatcat:kdz3jws72fccbfn4tp5mf52unu
Optical Character Recognition for Printed Tamizhi Documents using Deep Neural Networks
2022
DESIDOC Journal of Library & Information Technology
The ancient historical documents are generally preserved as digitised texts using Optical Character Recognition (OCR) technique. ...
But the development of OCR for Tamizhi documents is highly challenging as many characters have similar shapes and structures with very small variations. ...
Tamizhi document images are given as training data for the network. The reason for using this architecture is, OCR is an image-based sequence recognition problem. ...
doi:10.14429/djlit.42.4.17742
fatcat:p6c73xgjbzg3noy7gqrncoy6hq
OCR of historical printings of Latin texts
2014
Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage - DATeCH '14
Using finite state tools and methods developed during the IMPACT program we show that efficient batch-oriented postcorrection can work for Latin as well, and that a lexicon of historical Latin spelling ...
This paper deals with the application of OCR methods to historical printings of Latin texts. ...
For Latin a historical orthography was used which changed slowly over time. ...
doi:10.1145/2595188.2595205
dblp:conf/datech/SpringmannNMSGF14
fatcat:y5alwyxxkvhkreg4elpv7vhrxa
A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus
[article]
2022
arXiv
pre-print
We combine two resources for the current work - an 80K word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish ...
We compute word embeddings on the YBC corpus, and these embeddings are used with a tagger model trained and evaluated on the PPCHY. ...
Acknowledgments We would like to thank Assaf Urieli and the Yiddish Book Center for making available the OCR'd texts of the book collection. ...
arXiv:2204.01175v1
fatcat:bjcp5e7zqbfvlcxd6kz2z7qaxi
Whole page recognition of historical handwriting
[article]
2020
arXiv
pre-print
Historical handwritten documents guard an important part of human knowledge only within reach of a few scholars and experts. ...
This work fits in the wider field of competitions on historical documents, document layout and processing, text-in-the-wild challenges and document segmentation. ...
OCR approaches, especially for older books with more artifacts and less standard fonts. ...
arXiv:2009.10634v1
fatcat:j5ivuqpp75dv5bkwq4dom4qdge
Time-Aware Word Embeddings for Three Lebanese News Archives
2020
International Conference on Language Resources and Evaluation
across news archives using an animated scatter plot. ...
To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used. ...
Alongside scanned newspaper archives, the AUB Archives collections feature numerous noteworthy historical documents. ...
dblp:conf/lrec/DoughmanSE20
fatcat:gkhr2nhjxfbrrmmh6kptqlrj6a
Namsel: An Optical Character Recognition System for Tibetan Text
2016
Himalayan Linguistics
The use of advanced computational methods for the analysis of large corpora of electronic texts is becoming increasingly popular in humanities and social science research. ...
The automated recognition of printed texts, known as Optical Character Recognition (OCR), offers a solution to this problem; however, until recently, robust OCR systems for the Tibetan language have not ...
Using either metric, Namsel has the option to decide whether to automatically re-OCR portions of text in order to improve accuracy and/or flag them for manual review. 12 For a discussion of normalization ...
doi:10.5070/h915129937
fatcat:7qlmdtmxhjhgdg6z5n7jqfwkwu
Digital Peter: Dataset, Competition and Handwriting Recognition Methods
[article]
2021
arXiv
pre-print
The new dataset may be useful for researchers to train handwriting text recognition models as a benchmark for comparing different models. ...
It consists of 9 694 images and text files corresponding to lines in historical documents. The open machine learning competition Digital Peter was held based on the considered dataset. ...
The article also analyses the work of models that have proven themselves in optical character recognition (OCR) tasks [16, 32, 35, 19] using both historical and modern handwritten texts. ...
arXiv:2103.09354v2
fatcat:rze2bzfojfh35gj6efvdttvdva
A SURVEY ON RECENT METHODOLOGIES IN MULTILINGUAL CHARACTER DETECTION AND RECOGNITION
2019
International Journal of Engineering Applied Sciences and Technology
Multilingual character detection and recognition from video subtitles, scenes and documents is also receiving considerable attention. ...
Different optical character recognition systems perform well for English characters but the accuracy for Hindi character recognition is not up to the mark. ...
Ahn, Ryu, Koo and Cho [14] proposed Binarization algorithm for text line detection in degraded historical documents. ...
doi:10.33564/ijeast.2019.v04i03.062
fatcat:6tswkhkmwbcfnne6tk6z2jm6jm
Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks
[article]
2020
arXiv
pre-print
In this work we investigate the possibility of assisting scholars and even automatically completing the breaks in ancient Akkadian texts from Achaemenid period Babylonia by modelling the language using ...
As the "LSTM (full)" model needs to run separately for each candidate missing word, we first picked the top 100 candidates using "LSTM (start)". ...
For cuneiform texts this is not the case, and one has to use limited manually transliterated texts or automatic optical character recognition (OCR) which is still far from perfect (7). ...
arXiv:2003.01912v1
fatcat:kdyxl7nbw5auhndlvtko2ldoxe
Lemmatization for variation-rich languages using deep learning
2016
Digital Scholarship in the Humanities
While this task has long been considered solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to solve, due to a lack of resources and ...
The proposed system combines two approaches: on the one hand, we apply temporal convolutions to model the orthography of input words at the character level; secondly, we use distributional word embeddings ...
Acknowledgements We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. ...
doi:10.1093/llc/fqw034
dblp:journals/lalc/KestemontPND17
fatcat:hvousb7z35dcxpayfcppcctf4u
QURATOR: Innovative Technologies for Content and Data Curation
[article]
2020
arXiv
pre-print
Berlin-Brandenburg, into a global centre of excellence for curation technologies. ...
In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing. ...
, layout, language and orthography. ...
arXiv:2004.12195v1
fatcat:qipc64rao5eozgayewsqji236i
Book of Abstracts of the Digital Humanities in the Nordic Countries 5th conference. Riga, 20–23 October 2020
[article]
2020
Zenodo
Acknowledgements We thank the Fritz Thyssen Foundation for their funding for the research project Distant Viewing. ...
We thank the Staatsbibliothek Berlin for providing access to the Wegehaupt collection. ...
Enclisis of any kind is hard to investigate in texts, since orthography, both in the past and the present, normally does not mark it. ...
doi:10.5281/zenodo.4107117
fatcat:6ongky6p5rab7gvtawnjmp2ofm