
Robust Open-Vocabulary Translation from Visual Text Representations [article]

Elizabeth Salesky, David Etter, Matt Post
2021 arXiv   pre-print
Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.'  ...  Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created  ...  Embeddings typically refer to entries in a fixed size weight matrix, with the vocabulary ID as an index.  ... 
arXiv:2104.08211v3 fatcat:qg45dbuzejfbpopnsbf5vt7v2u

Open Vocabulary Arabic Handwriting Recognition Using Morphological Decomposition

Mahdi Hamdani, Amr El-Desoky Mousa, Hermann Ney
2013 2013 12th International Conference on Document Analysis and Recognition  
The use of Language Models (LMs) is a very important component in large and open vocabulary recognition systems. This paper presents an open-vocabulary approach for Arabic handwriting recognition.  ...  The recognition system is based on Hidden Markov Models (HMMs) with position and context dependent character models.  ...  Open vocabulary Korean word recognition is proposed in [4]. The lexicon is automatically selected using a dynamic Bayesian network language model.  ... 
doi:10.1109/icdar.2013.63 dblp:conf/icdar/HamdaniMN13 fatcat:zwac2g4okfedpebwfmmg2kmz5i

NF-SAVO: Neuro-Fuzzy system for Arabic Video OCR

Mohamed Ben, Hichem Karray, Adel. M., Ana Fernández
2012 International Journal of Advanced Computer Science and Applications  
In this paper we propose a robust approach for text extraction and recognition from video clips which is called Neuro-Fuzzy system for Arabic Video OCR.  ...  This type of text carries with it important information that helps in video referencing, indexing and retrieval.  ...  We noticed that in the case of a large or open vocabulary and in the context of recognition of a text for instance, the systems are often based on an analytical approach.  ... 
doi:10.14569/ijacsa.2012.031022 fatcat:exnrssn7orgxpi73pyr275b3ze

Named Entity Recognition in the Legal Domain using a Pointer Generator Network [article]

Stavroula Skylaki, Ali Oskooei, Omar Bari, Nadja Herger, Zac Kriegman
2020 arXiv   pre-print
/or OCR mistakes.  ...  The "gold standard" training data for NER systems provide annotation for each token of the text with the corresponding entity or non-entity label.  ...  Baseline NER models We compared the performance of our proposed approach with the following commonly used neural network architectures for NER: • spaCy, an open-source, NLP library for a variety of tasks  ... 
arXiv:2012.09936v1 fatcat:mqwiivtbi5fyngm2tqlwiwpq2a

Subword-based approaches for spoken document retrieval

Kenney Ng, Victor W. Zue
2000 Speech Communication  
We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition.  ...  The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language; and the use of subword units as indexing terms allows for the detection of new user-specified  ...  of words from text documents for language model training.  ... 
doi:10.1016/s0167-6393(00)00008-x fatcat:4jig4v5w25gqpjmbej6k2x2byq

Large vocabulary off-line handwriting recognition: A survey

A. L. Koerich, R. Sabourin, C. Y. Suen
2003 Pattern Analysis and Applications  
To illustrate some of the points raised, a large vocabulary off-line handwritten word recognition system will be described.  ...  The capability of dealing with large lexicons, however, opens up many more applications.  ...  acknowledge the CNPq-Brazil (grant refs 200276-98/0) and the MEQ-Canada for supporting this research, and the Service Technique de la Poste (SRTP-France) for providing us the database and the baseline system  ... 
doi:10.1007/s10044-002-0169-3 fatcat:axzlhu64mzg3zhpd3rfunpjyom

A survey on Arabic character segmentation

Yasser M. Alginahi
2012 International Journal on Document Analysis and Recognition  
This survey presents the description of the Arabic script characteristics with an overview on OCR systems and a comprehensive review mainly on off-line printed Arabic character segmentation techniques.  ...  Arabic character segmentation is a necessary step in Arabic Optical Character Recognition (OCR).  ...  Abuhaiba in [51] stated that, " … to produce an Arabic OCR system with performance comparable to that for OCR systems of other languages, we believe that breaking the cursive law of Arabic script is  ... 
doi:10.1007/s10032-012-0188-6 fatcat:w5hszp2ksbcb3kw627yw2cwehy

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research [article]

Nisansa de Silva
2022 arXiv   pre-print
Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European.  ...  However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic  ...  [295] used Tesseract 3 42 [296] for Sinhala OCR. An OCR and Text-to-Speech system for Sinhala named Bhashitha was proposed by De Zoysa et al. [241].  ... 
arXiv:1906.02358v13 fatcat:c522aedklbbhraw4fthg3yvweu

A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models [article]

Firoj Alam, Arid Hasan, Tanvirul Alam, Akib Khan, Janntatul Tajrin, Naira Khan, Shammur Absar Chowdhury
2021 arXiv   pre-print
Bangla -- ranked as the 6th most widely spoken language across the world, with 230 million native speakers -- is still considered as a low-resource language  ...  Our results show promising performance using transformer-based models while highlighting the trade-off with computational costs.  ...  Markov Models (MEMMs) [144], and hybrid approach [18].  ... 
arXiv:2107.03844v3 fatcat:hermrinleneercodguko6kwxhu

Book of Abstracts of the Digital Humanities in the Nordic Countries 5th conference. Riga, 20–23 October 2020 [article]

Sanita Reinsone, Anda Baklāne, Jānis Daugavietis
2020 Zenodo  
Acknowledgements This work has received financial support from the Latvian Language Agency through the grant agreement No. 4.6/2019-029.  ...  be a web optimized PDF also equipped with an OCR text-layer.  ...  : What Role Does the OCR System Play?  ... 
doi:10.5281/zenodo.4107117 fatcat:6ongky6p5rab7gvtawnjmp2ofm

Summarising Historical Text in Modern Languages [article]

Xutan Peng, Yi Zheng, Chenghua Lin, Advaith Siddharthan
2021 arXiv   pre-print
Based on cross-lingual transfer learning techniques, we propose a summarisation model that can be trained even with no cross-lingual (historical to modern) parallel data, and further benchmark it against  ...  We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.  ...  Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).  ... 
arXiv:2101.10759v2 fatcat:qzhmredoyzffdkydnid26ua4c4

DMoG : A Data-Based Morphological Guesser

Vojtěch Kovář, Pavel Rychlý
2021 Zenodo  
We present a prototype implementation and an initial evaluation on Czech, which shows promising results.  ...  We present a novel corpus-based approach to lemmatization of unknown words.  ...  Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification.  ... 
doi:10.5281/zenodo.6935329 fatcat:6jqt25fjcfe5fmcwfkmv2al4oe

Nonsymbolic Text Representation [article]

Hinrich Schuetze, Heike Adel, Ehsaneddin Asgari
2017 arXiv   pre-print
We show that our model performs better than prior work on an information extraction and a text denoising task.  ...  We introduce the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that attempts to identify words  ...  In addition, hybrid word/n-gram language models for out-of-vocabulary words have been applied to speech recognition (Hirsimäki et al., 2006; Kombrink et al., 2010; Parada et al., 2011; Shaik et al., 2011  ... 
arXiv:1610.00479v3 fatcat:u66vpobfwbbuzayf7wc7euwsiq

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [article]

Rowan Zellers and Jiasen Lu and Ximing Lu and Youngjae Yu and Yanpeng Zhao and Mohammadreza Salehi and Aditya Kusupati and Jack Hessel and Ali Farhadi and Yejin Choi
2022 arXiv   pre-print
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.  ...  We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research.  ...  Thanks to James Bradbury and Skye Wanderman-Milne for help with JAX on TPUs. Thanks to the AI2 ReVIZ team, including Jon Borchardt and M Kusold, for help with the demo.  ... 
arXiv:2201.02639v4 fatcat:deywuxyj45eqvacjwwns7kmbh4

Robust Open-Vocabulary Translation from Visual Text Representations

Elizabeth Salesky, David Etter, Matt Post
2021 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing   unpublished
Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.'  ...  Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created  ...  Embeddings typically refer to entries in a fixed size weight matrix, with the vocabulary ID as an index.  ... 
doi:10.18653/v1/2021.emnlp-main.576 fatcat:tayt2akcznbmnkvs3ktgaizkiu