Recognition of historical Greek polytonic scripts using LSTM networks

Fotini Simistira, Adnan Ul-Hassan, Vassilis Papavassiliou, Basilis Gatos, Vassilis Katsouros, Marcus Liwicki
2015 2015 13th International Conference on Document Analysis and Recognition (ICDAR)  
This paper reports on high-performance Optical Character Recognition (OCR) experiments using Long Short-Term Memory (LSTM) Networks for Greek polytonic script.  ...  performed baseline experiments using LSTM Networks.  ...  For our experiments, we used the open-source OCR system OCRopus [13].  ... 
doi:10.1109/icdar.2015.7333865 dblp:conf/icdar/SimistiraUPGKL15 fatcat:unjxqg34kffv5ijtnc33gy6d4e

Name the Name - Named Entity Recognition in OCRed 19th and Early 20th Century Finnish Newspaper and Journal Collection Data

Teemu Ruokolainen, Kimmo Kettunen
2020 Digital Humanities in the Nordic Countries Conference  
With re-OCRed Tesseract output the results are 0.79, 0.72, and 0.42, respectively. Results of LSTM-CRF are similar.  ...  and LSTM-CRF NER model.  ...  Conclusions We have reported in this paper usage of two standard statistical NER tools, Stanford NER and LSTM-CRF model, for annotation of OCRed Finnish historical newspaper and journal data.  ... 
dblp:conf/dhn/RuokolainenK20 fatcat:u64aqfbd7fea3nxenwqmx4wwtu

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography [article]

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar
2021 arXiv   pre-print
In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling.  ...  There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges.  ...  Bollmann & Søgaard (2016) have shown that a bi-directional long short-term memory (bi-LSTM) can be used to normalize historical German texts.  ... 
arXiv:2107.03266v1 fatcat:kdz3jws72fccbfn4tp5mf52unu

Optical Character Recognition for Printed Tamizhi Documents using Deep Neural Networks

Monisha Munivel, V. S. Felix Enigo
2022 DESIDOC Journal of Library & Information Technology  
The ancient historical documents are generally preserved as digitised texts using Optical Character Recognition (OCR) technique.  ...  But the development of OCR for Tamizhi documents is highly challenging as many characters have similar shapes and structures with very small variations.  ...  Tamizhi document images are given as training data for the network. The reason for using this architecture is, OCR is an image-based sequence recognition problem.  ... 
doi:10.14429/djlit.42.4.17742 fatcat:p6c73xgjbzg3noy7gqrncoy6hq

OCR of historical printings of Latin texts

Uwe Springmann, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek, Florian Fink
2014 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage - DATeCH '14  
Using finite state tools and methods developed during the IMPACT program we show that efficient batch-oriented postcorrection can work for Latin as well, and that a lexicon of historical Latin spelling  ...  This paper deals with the application of OCR methods to historical printings of Latin texts.  ...  For Latin a historical orthography was used which changed slowly over time.  ... 
doi:10.1145/2595188.2595205 dblp:conf/datech/SpringmannNMSGF14 fatcat:y5alwyxxkvhkreg4elpv7vhrxa

A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus [article]

Seth Kulick, Neville Ryant, Beatrice Santorini, Joel Wallenberg
2022 arXiv   pre-print
We combine two resources for the current work - an 80K word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish  ...  We compute word embeddings on the YBC corpus, and these embeddings are used with a tagger model trained and evaluated on the PPCHY.  ...  Acknowledgments We would like to thank Assaf Urieli and the Yiddish Book Center for making available the OCR'd texts of the book collection.  ... 
arXiv:2204.01175v1 fatcat:bjcp5e7zqbfvlcxd6kz2z7qaxi

Whole page recognition of historical handwriting [article]

Hans J.G.A. Dolfing
2020 arXiv   pre-print
Historical handwritten documents guard an important part of human knowledge only within reach of a few scholars and experts.  ...  This work fits in the wider field of competitions on historical documents, document layout and processing, text-in-the-wild challenges and document segmentation.  ...  OCR approaches, especially for older books with more artifacts and less standard fonts.  ... 
arXiv:2009.10634v1 fatcat:j5ivuqpp75dv5bkwq4dom4qdge

Time-Aware Word Embeddings for Three Lebanese News Archives

Jad Doughman, Fatima Abu Salem, Shady Elbassuoni
2020 International Conference on Language Resources and Evaluation  
across news archives using an animated scatter plot.  ...  To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used.  ...  Alongside scanned newspaper archives, the AUB Archives collections feature numerous noteworthy historical documents.  ... 
dblp:conf/lrec/DoughmanSE20 fatcat:gkhr2nhjxfbrrmmh6kptqlrj6a

Namsel: An Optical Character Recognition System for Tibetan Text

Zach Rowinski, Kurt Keutzer
2016 Himalayan Linguistics  
The use of advanced computational methods for the analysis of large corpora of electronic texts is becoming increasingly popular in humanities and social science research.  ...  The automated recognition of printed texts, known as Optical Character Recognition (OCR), offers a solution to this problem; however, until recently, robust OCR systems for the Tibetan language have not  ...  Using either metric, Namsel has the option to decide whether to automatically re-OCR portions of text in order to improve accuracy and/or flag them for manual review. 12 For a discussion of normalization  ... 
doi:10.5070/h915129937 fatcat:7qlmdtmxhjhgdg6z5n7jqfwkwu

Digital Peter: Dataset, Competition and Handwriting Recognition Methods [article]

Mark Potanin, Denis Dimitrov, Alex Shonenkov, Vladimir Bataev, Denis Karachev, Maxim Novopoltsev
2021 arXiv   pre-print
The new dataset may be useful for researchers to train handwriting text recognition models as a benchmark for comparing different models.  ...  It consists of 9 694 images and text files corresponding to lines in historical documents. The open machine learning competition Digital Peter was held based on the considered dataset.  ...  The article also analyses the work of models that have proven themselves in optical character recognition (OCR) tasks [16, 32, 35, 19] using both historical and modern handwritten texts.  ... 
arXiv:2103.09354v2 fatcat:rze2bzfojfh35gj6efvdttvdva


Snehal S Gaikwad, S. L. Nalbalwar
2019 International Journal of Engineering Applied Sciences and Technology  
Multilingual character detection and recognition from video subtitles, scenes and documents is additionally getting high consideration on this subject.  ...  Different optical character recognition systems perform good for English characters but the accuracy for Hindi character recognition is not up to the mark.  ...  Ahn, Ryu, Koo and Cho [14] proposed Binarization algorithm for text line detection in degraded historical documents.  ... 
doi:10.33564/ijeast.2019.v04i03.062 fatcat:6tswkhkmwbcfnne6tk6z2jm6jm

Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks [article]

Ethan Fetaya, Yonatan Lifshitz, Elad Aaron, Shai Gordin
2020 arXiv   pre-print
In this work we investigate the possibility of assisting scholars and even automatically completing the breaks in ancient Akkadian texts from Achaemenid period Babylonia by modelling the language using  ...  As the "LSTM (full)" model needs to run separately for each candidate missing word, we first picked the top 100 candidates using "LSTM (start)".  ...  For cuneiform texts this is not the case, and one has to use limited manually transliterated texts or automatic optical character recognition (OCR) which is still far from perfect (7).  ... 
arXiv:2003.01912v1 fatcat:kdyxl7nbw5auhndlvtko2ldoxe

Lemmatization for variation-rich languages using deep learning

Mike Kestemont, Guy de Pauw, Renske van Nie, Walter Daelemans
2016 Digital Scholarship in the Humanities  
While this task has long been considered solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to solve, due to a lack of resources and  ...  The proposed system combines two approaches: on the one hand, we apply temporal convolutions to model the orthography of input words at the character level; secondly, we use distributional word embeddings  ...  Acknowledgements We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research.  ... 
doi:10.1093/llc/fqw034 dblp:journals/lalc/KestemontPND17 fatcat:hvousb7z35dcxpayfcppcctf4u

QURATOR: Innovative Technologies for Content and Data Curation [article]

Georg Rehm, Peter Bourgonje, Stefanie Hegele, Florian Kintzel, Julián Moreno Schneider, Malte Ostendorff, Karolina Zaczynska, Armin Berger, Stefan Grill, Sören Räuchle, Jens Rauenbusch, Lisa Rutenburg (+28 others)
2020 arXiv   pre-print
Berlin-Brandenburg, into a global centre of excellence for curation technologies.  ...  In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing.  ...  , layout, language and orthography.  ... 
arXiv:2004.12195v1 fatcat:qipc64rao5eozgayewsqji236i

Book of Abstracts of the Digital Humanities in the Nordic Countries 5th conference. Riga, 20–23 October 2020 [article]

Sanita Reinsone, Anda Baklāne, Jānis Daugavietis
2020 Zenodo  
Acknowledgements We thank the Fritz Thyssen Foundation for their funding for the research project Distant Viewing.  ...  We thank the Staatsbibliothek Berlin for providing access to the Wegehaupt collection.  ...  Enclisis of any kind is hard to investigate in texts, since orthography, both in the past and the present, normally does not mark it.  ... 
doi:10.5281/zenodo.4107117 fatcat:6ongky6p5rab7gvtawnjmp2ofm