Machine transliteration and transliterated text retrieval: a survey

Dinesh Kumar Prabhakar, Sukomal Pal
2018 Sadhana (Bangalore)  
With the advent of Web 2.0, user-generated content is increasing on the Web at a very rapid rate. A substantial proportion of this content is transliterated data.  ...  To leverage this huge information repository, there is a matching effort to process transliterated text. In this article, we survey the recent body of work in the field of transliteration.  ...  The experimental results showed that taking pronunciation variation into account did make extraction of paired cognates more effective [92].  ... 
doi:10.1007/s12046-018-0828-8 fatcat:dg3gwugmqrfevnzu3deuk5w67i

A machine transliteration model based on correspondence between graphemes and phonemes

Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara
2006 ACM Transactions on Asian Language Information Processing  
Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language.  ...  Three types of machine transliteration models (grapheme-based, phoneme-based, and hybrid) have been proposed.  ...  In candidate extraction, an algorithm tries to find transliteration pair candidates in bilingual corpora.  ... 
doi:10.1145/1194936.1194938 fatcat:vnaz3ca2wbbtzokxzelmj32bme

Combining probability models and web mining models: a framework for proper name transliteration

Yilu Zhou, Feng Huang, Hsinchun Chen
2007 Journal of Special Topics in Information Technology and Management  
a Web mining model that uses word frequency of occurrence information from the Web.  ...  In this research we propose a generic transliteration framework, which incorporates an enhanced Hidden Markov Model (HMM) and a Web mining model.  ...  The Web mining approach is applicable to any pairs of languages. No rules, dictionaries, or training corpora are needed.  ... 
doi:10.1007/s10799-007-0031-9 fatcat:diqx2xyhwzealptfce6shwh3zq

Predicting Word Pronunciation in Japanese [chapter]

Jun Hatori, Hisami Suzuki
2011 Lecture Notes in Computer Science  
Our experimental results show that our classifier for validating the word-pronunciation pairs harvested from unannotated text achieves over 98% precision and recall.  ...  of a bilingual dictionary using the web; (2) Building a decoder for the task of pronunciation prediction, for which we apply the state-of-the-art discriminative substring-based approach.  ...  the task of pronunciation modeling; we then use the model to harvest word-pronunciation pairs from the web in the task of pronunciation acquisition.  ... 
doi:10.1007/978-3-642-19437-5_40 fatcat:modmxrbre5d6hhfj6p3t4jr3mu

Survey on Machine Transliteration and Machine Learning Models

Dhore M L, Dhore R M, Rathod P H
2015 International Journal on Natural Language Computing  
Support of local languages can be given in all internet based applications by means of Machine Transliteration and Machine Translation.  ...  This paper provides the thorough survey on machine transliteration models and machine learning approaches used for machine transliteration over the period of more than two decades for internationally used  ...  Corpora are not available for most of the languages. Summary From the above survey, it is clear that the three approaches are most popular for machine transliteration.  ... 
doi:10.5121/ijnlc.2015.4202 fatcat:kegqa5k4abahvbno2setnkxwtq

Splitting Katakana Noun Compounds by Paraphrasing and Back-transliteration

Nobuhiro Kaji, Masaru Kitsuregawa
2014 Journal of Natural Language Processing  
Experiments in which paraphrases and back-transliterations from unlabeled textual data were extracted and used to construct splitting models improved splitting accuracy with statistical significance.  ...  Therefore, we propose using paraphrasing and back-transliteration of katakana noun compounds to split them.  ...  We observed that 5941 words (77.1%) are in NAIST-jdic or word-aligned transliteration pairs extracted from the Web text.  ... 
doi:10.5715/jnlp.21.897 fatcat:m5abanr7tffpreim4eox3m3yuu

Multilingual spoken language processing

P. Fung, T. Schultz
2008 IEEE Signal Processing Magazine  
When incorporating these features most spoken document summarization systems employ an extractive approach, in which salient sentences or segments of speech are extracted and compiled into a final summary  ...  The algorithm first identifies text data in a resource-rich language similar to the target language, then extracts useful statistics from those text corpora, and projects the statistics back into the target  ... 
doi:10.1109/msp.2008.918417 fatcat:ezye4rngebdpphtis3szqdhvce

A Survey of Orthographic Information in Machine Translation

Bharathi Raja Chakravarthi, Priya Rani, Mihael Arcan, John P. McCrae
2021 SN Computer Science  
It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation.  ...  Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from this.  ...  The authors developed a transliteration system trained on automatically-extracted likely cognates for Portuguese into Spanish using systematic spelling variation.  ... 
doi:10.1007/s42979-021-00723-4 pmid:34723204 pmcid:PMC8550410 fatcat:sd6ovquibzdpjlmgamsnklbi3q

A comprehensive survey on Indian regional language processing

B. S. Harish, R. Kasturi Rangan
2020 SN Applied Sciences  
The sources of dataset for the Indian regional languages are described.  ...  In recent information explosion, contents in internet are multilingual and majority will be in the form of natural languages.  ...  In [50], Lakshmi Preprocessing techniques Stemming It is a process of reducing morphologically variant terms into a single term, without performing complete morphological analysis.  ... 
doi:10.1007/s42452-020-2983-x fatcat:e3u5r5qo7ngapj5mbiwit7qlwi

A Clustering Framework for Lexical Normalization of Roman Urdu [article]

Abdul Rafae Khan, Asim Karim, Hassan Sajjad, Faisal Kamiran, Jia Xu
2020 arXiv   pre-print
UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.  ...  In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based  ...  This research was partially funded by the National Science Foundation (NSF) Award No. 1747728 and the National Science Foundation of China (NSFC) Award No. 61672524.  ... 
arXiv:2004.00088v1 fatcat:mdma3ccwo5aw7hgwzge4tldpki

Particle Swarm Optimization for Punjabi Text Summarization

Arti Jain, Divakar Yadav, Anuja Arora
2021 International Journal of Operations Research and Information Systems  
Two Punjabi datasets (monolingual Punjabi corpus from Indian Languages Corpora Initiative Phase-II and Punjabi-Hindi parallel corpus) are considered.  ...  Calculation within PSO is performed using fitness function which looks into various statistical and linguistic features of the Punjabi datasets.  ...  Punjabi text is mainly extracted from two web sources (Section 5.2): a monolingual Punjabi corpus and a bilingual Punjabi-Hindi corpus, which are converted into Unicode format with the help of Python libraries  ... 
doi:10.4018/ijoris.20210701.oa1 fatcat:irayrabzdze4bnxnrv3heb4tsi

An Information-Extraction System for Urdu---A Resource-Poor Language

Smruthi Mukund, Rohini Srihari, Erik Peterson
2010 ACM Transactions on Asian Language Information Processing  
This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration.  ...  Each of the new Urdu text processing modules has been integrated into a general text-mining platform.  ...  The same is true with the two NE annotated corpora provided by IJCNLP (2008) and Computing Research Laboratory (CRL). Apart from the EMILLE dataset, other datasets are very limited in terms of size  ... 
doi:10.1145/1838751.1838754 fatcat:ibmmwalmtfbfdpjufxccwolzgq

Translation techniques in cross-language information retrieval

Dong Zhou, Mark Truran, Tim Brailsford, Vincent Wade, Helen Ashman
2012 ACM Computing Surveys  
Cross-language information retrieval (CLIR) is an active sub-domain of information retrieval (IR).  ...  This paper presents an overview of those techniques, with a special emphasis on recent developments.  ...  ACKNOWLEDGMENTS This research was partially supported by a PhD scholarship from the University of Nottingham and funding from the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for  ... 
doi:10.1145/2379776.2379777 fatcat:mu5p5djufjghvn3xjppekmwnwu

Language Identification for Multilingual Sentiment Examination

2019 International journal of recent technology and engineering  
Text is then completely translated into English and POS (Parts of Speech) tagging is performed on the obtained text.  ...  Due to the increase in web users across the globe, users post their views freely over the internet.  ...  Data Cleaning Data cleaning filters out unwanted text and converts data into structured format. The comments extracted from D.  ... 
doi:10.35940/ijrte.b1444.0982s1119 fatcat:nfnyhsf2y5hcxg6fgjb4lt4554

Chinese OOV translation and post-translation query expansion in Chinese-English cross-lingual information retrieval

Ying Zhang, Phil Vines, Justin Zobel
2005 ACM Transactions on Asian Language Information Processing  
The OOV problem arises from the fact that some Chinese query terms are not found in translation resources, such as bilingual dictionaries and parallel corpora.  ...  We have developed a new segmentation-free technique for automatic translation of Chinese OOV terms using the web.  ...  Mend et al. [2004] developed a system that incorporates transliteration from English to Chinese to deal with English OOV terms.  ... 
doi:10.1145/1105696.1105697 fatcat:doae4glz75eyzawt56qnu6yit4