An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization [article]

Gongbo Tang and Fabienne Cap and Eva Pettersson and Joakim Nivre
2018 arXiv   pre-print
In this paper, we apply different NMT models to the problem of historical spelling normalization for five languages: English, German, Hungarian, Icelandic, and Swedish.  ...  In addition, we propose a hybrid method which further improves the performance of historical spelling normalization.  ...  We also thank the machine translation group at the University of Edinburgh for providing computational resources. Gongbo Tang is funded by Chinese Scholarship Council (NO. 201607110016).  ... 
arXiv:1806.05210v2 fatcat:lhfl7o3vpfhmvcedz6fukxjaty

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning [article]

Marcel Bollmann, Anders Søgaard
2016 arXiv   pre-print
Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German.  ...  A common approach is to normalize the spelling of historical words to modern forms.  ...  Sec. 5), but we are not aware of any experiments with neural machine translation (Cho et al., 2014) on this domain.  ... 
arXiv:1610.07844v1 fatcat:eg3opbfwevcubhoy6svc2xpgyy

Revisiting

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, Eetu Mäkelä
2019 Proceedings of the 3rd Joint  
This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus.  ...  This paper discusses different methods for improving the normalization of these deviant forms by using different approaches.  ...  Research in this vein has existed already before the dawn of neural machine translation (NMT), during the era of statistical machine translation (SMT).  ... 
doi:10.18653/v1/w19-2509 dblp:conf/latech/HamalainenSRTM19 fatcat:3uu22nlyvbgurho3cfb4ged6aq

A Large-Scale Comparison of Historical Text Normalization Systems

Marcel Bollmann
2019 Proceedings of the 2019 Conference of the North  
There is no consensus on the state-of-the-art approach to historical text normalization.  ...  Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets  ...  machine translation (SMT) and its neural equivalent (NMT).  ... 
doi:10.18653/v1/n19-1389 dblp:conf/naacl/Bollmann19 fatcat:zpaijg6qw5b7xjnudab5rfwhvq

Two Demonstrations of the Machine Translation Applications to Historical Documents [article]

Miguel Domingo, Francisco Casacuberta
2021 arXiv   pre-print
We present our demonstration of two machine translation applications to historical documents.  ...  It adapts the document's spelling to modern standards in order to achieve an orthography consistency and accounting for the lack of spelling conventions.  ...  We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for part of this research, and Andrés Trapiello and Ediciones Destino for granting us permission to use their  ... 
arXiv:2102.01417v1 fatcat:4v25wrtd7narbi7wo5wpdnadgy

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction [article]

Mika Hämäläinen, Simon Hengchen
2019 arXiv   pre-print
We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.  ...  Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning.  ...  This paper focuses on correcting the OCR errors in ECCO. We present an unsupervised method based on the advances neural machine translation (NMT) in historical text normalization.  ... 
arXiv:1910.05535v1 fatcat:xjgdymji7bh4tgfia26vyp2l7e

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Mika Hämäläinen (Department of Digital Humanities, University of Helsinki, Finland), Simon Hengchen (COMHIS, University of Helsinki, Finland)
2019 Proceedings - Natural Language Processing in a Deep Learning World  
We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.  ...  Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning.  ...  This paper focuses on correcting the OCR errors in ECCO. We present an unsupervised method based on the advances neural machine translation (NMT) in historical text normalization.  ... 
doi:10.26615/978-954-452-056-4_051 dblp:conf/ranlp/HamalainenH19 fatcat:42eiup5dnzbjhdx2frqiruzfhe

A Large-Scale Comparison of Historical Text Normalization Systems [article]

Marcel Bollmann
2019 arXiv   pre-print
There is no consensus on the state-of-the-art approach to historical text normalization.  ...  Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different  ...  machine translation (SMT) and its neural equivalent (NMT).  ... 
arXiv:1904.02036v1 fatcat:d3dtc25gcngvjgv6no7c7a3pi4

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography [article]

Mika Hämäläinen, Niko Partanen, Khalid Alnajjar
2021 arXiv   pre-print
In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling.  ...  Our best model reaches to 96.3% accuracy in texts written by Agricola and 87.7% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and Github.  ...  Conducting an evaluation for assessing the performance of the model on historical data from 1) the same source of the data used in building the model and 2) external out-of-domain historical data.  ... 
arXiv:2107.03266v1 fatcat:kdz3jws72fccbfn4tp5mf52unu

Survey of Automatic Spelling Correction

Daniel Hládek, Ján Staš, Matúš Pleva
2020 Electronics  
The second group uses an additional model of context. The third group of automatic spelling correction systems in the survey can adapt its model to the given problem.  ...  The survey describes selected approaches in a common theoretical framework based on Shannon's noisy channel. A separate section describes evaluation methods and benchmarks.  ...  Statistical machine-translation models based on string alignment, translation phrases, and n-gram language models are replaced by neural machine-translation systems.  ... 
doi:10.3390/electronics9101670 fatcat:pgf65dpwp5b2xc2hc6xxf5pplm

Learning attention for historical text normalization by learning to pronounce

Marcel Bollmann, Joachim Bingel, Anders Søgaard
2017 Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
Automated processing of historical texts often relies on pre-normalization to modern word forms.  ...  We analyze the induced models across 44 different texts from Early New High German.  ...  machine translation.  ... 
doi:10.18653/v1/p17-1031 dblp:conf/acl/BollmannBS17 fatcat:7bkzauziknenrpcx5toz7wxndm

Evaluating historical text normalization systems: How well do they generalize? [article]

Alexander Robertson, Sharon Goldwater
2018 arXiv   pre-print
We show that the neural models generalize well to unseen words in tests on five languages; nevertheless, they provide no clear benefit over the naïve baseline for downstream POS tagging of an English  ...  We highlight several issues in the evaluation of historical text normalization systems that make it hard to tell how well these systems would actually work in practice, i.e., for new datasets or languages  ...  This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of  ... 
arXiv:1804.02545v2 fatcat:z5l4kludarfhljwy4ur3zalzdy

Evaluating Historical Text Normalization Systems: How Well Do They Generalize?

Alexander Robertson, Sharon Goldwater
2018 Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)  
We show that the neural models generalize well to unseen words in tests on five languages; nevertheless, they provide no clear benefit over the naïve baseline for downstream POS tagging of an English historical  ...  We highlight several issues in the evaluation of historical text normalization systems that make it hard to tell how well these systems would actually work in practice, i.e., for new datasets or languages  ...  This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of  ... 
doi:10.18653/v1/n18-2113 dblp:conf/naacl/RobertsonG18 fatcat:ejwndpsp6nfgnoo5plhgbs7fai

A Survey of Orthographic Information in Machine Translation [article]

Bharathi Raja Chakravarthi, Priya Rani, Mihael Arcan, John P. McCrae
2020 arXiv   pre-print
Machine translation is one of the applications of natural language processing which has been explored in different languages.  ...  Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey.  ...  and normalization for automatic evaluation of machine translation.  ... 
arXiv:2008.01391v1 fatcat:dlpliyatkrgbhcablntktae2we

Few-Shot and Zero-Shot Learning for Historical Text Normalization

Marcel Bollmann, Natalia Korchagina, Anders Søgaard
2019 Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)  
Historical text normalization often relies on small training datasets.  ...  This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme  ...  Acknowledgments We would like to thank the anonymous reviewers of this as well as previous iterations of this paper for several helpful comments.  ... 
doi:10.18653/v1/d19-6112 dblp:conf/acl-deeplo/BollmannKS19 fatcat:kl666d27svc45percyzikfmgti