Machine transliteration and transliterated text retrieval: a survey
2018
Sadhana (Bangalore)
With the advent of Web 2.0, user-generated content is increasing on the Web at a very rapid rate. A substantial proportion of this content is transliterated data. ...
To leverage this huge information repository, there is a matching effort to process transliterated text. In this article, we survey the recent body of work in the field of transliteration. ...
The experimental results showed that taking pronunciation variation into account did make extraction of paired cognates more effective [92]. ...
doi:10.1007/s12046-018-0828-8
fatcat:dg3gwugmqrfevnzu3deuk5w67i
A machine transliteration model based on correspondence between graphemes and phonemes
2006
ACM Transactions on Asian Language Information Processing
Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. ...
Three types of machine transliteration models (grapheme-based, phoneme-based, and hybrid) have been proposed. ...
In candidate extraction, an algorithm tries to find transliteration pair candidates in bilingual corpora. ...
doi:10.1145/1194936.1194938
fatcat:vnaz3ca2wbbtzokxzelmj32bme
Combining probability models and web mining models: a framework for proper name transliteration
2007
Journal of Special Topics in Information Technology and Management
a Web mining model that uses word frequency of occurrence information from the Web. ...
In this research we propose a generic transliteration framework, which incorporates an enhanced Hidden Markov Model (HMM) and a Web mining model. ...
The Web mining approach is applicable to any pairs of languages. No rules, dictionaries, or training corpora are needed. ...
doi:10.1007/s10799-007-0031-9
fatcat:diqx2xyhwzealptfce6shwh3zq
Predicting Word Pronunciation in Japanese
[chapter]
2011
Lecture Notes in Computer Science
Our experimental results show that our classifier for validating the word-pronunciation pairs harvested from unannotated text achieves over 98% precision and recall. ...
of a bilingual dictionary using the web; (2) Building a decoder for the task of pronunciation prediction, for which we apply the state-of-the-art discriminative substring-based approach. ...
the task of pronunciation modeling; we then use the model to harvest word-pronunciation pairs from the web in the task of pronunciation acquisition. ...
doi:10.1007/978-3-642-19437-5_40
fatcat:modmxrbre5d6hhfj6p3t4jr3mu
Survey on Machine Transliteration and Machine Learning Models
2015
International Journal on Natural Language Computing
Support of local languages can be given in all internet based applications by means of Machine Transliteration and Machine Translation. ...
This paper provides a thorough survey of machine transliteration models and machine learning approaches used for machine transliteration over the period of more than two decades for internationally used ...
Corpora are not available for most of the languages.
Summary: From the above survey, it is clear that the three approaches are most popular for machine transliteration. ...
doi:10.5121/ijnlc.2015.4202
fatcat:kegqa5k4abahvbno2setnkxwtq
Splitting Katakana Noun Compounds by Paraphrasing and Back-transliteration
2014
Journal of Natural Language Processing
Experiments in which paraphrases and back-transliterations from unlabeled textual data were extracted and used to construct splitting models improved splitting accuracy with statistical significance. ...
Therefore, we propose using paraphrasing and back-transliteration of katakana noun compounds to split them. ...
We observed that 5941 words (77.1%) are in NAIST-jdic or word-aligned transliteration pairs extracted from the Web text. ...
doi:10.5715/jnlp.21.897
fatcat:m5abanr7tffpreim4eox3m3yuu
Multilingual spoken language processing
2008
IEEE Signal Processing Magazine
When incorporating these features most spoken document summarization systems employ an extractive approach, in which salient sentences or segments of speech are extracted and compiled into a final summary ...
The algorithm first identifies text data in a resource-rich language similar to the target language, then extracts useful statistics from those text corpora, and projects the statistics back into the target ...
doi:10.1109/msp.2008.918417
fatcat:ezye4rngebdpphtis3szqdhvce
A Survey of Orthographic Information in Machine Translation
2021
SN Computer Science
It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation. ...
Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from this. ...
The authors developed a transliteration system trained on automatically-extracted likely cognates for Portuguese into Spanish using systematic spelling variation. ...
doi:10.1007/s42979-021-00723-4
pmid:34723204
pmcid:PMC8550410
fatcat:sd6ovquibzdpjlmgamsnklbi3q
A comprehensive survey on Indian regional language processing
2020
SN Applied Sciences
The sources of dataset for the Indian regional languages are described. ...
In the recent information explosion, content on the internet is multilingual, and the majority of it is in the form of natural languages. ...
In [50], Lakshmi
Preprocessing techniques
Stemming: It is a process of reducing morphologically variant terms into a single term, without performing complete morphological analysis. ...
doi:10.1007/s42452-020-2983-x
fatcat:e3u5r5qo7ngapj5mbiwit7qlwi
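The stemming definition quoted in the entry above (collapsing morphological variants into one term without full morphological analysis) can be illustrated with a minimal suffix-stripping sketch. This is not the method of the surveyed paper: the suffix list, the `min_stem` threshold, and the function name `naive_stem` are assumptions chosen for illustration, and English suffixes stand in for those of any Indian regional language.

```python
# Hypothetical suffix list, checked in this order so that "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    # No suffix applied: return the lowercased word unchanged.
    return w


if __name__ == "__main__":
    for term in ("walking", "walked", "walks", "boxes", "is"):
        print(term, "->", naive_stem(term))
```

The length guard is what keeps short words such as "is" from being reduced to a single letter. A stemmer of this kind only approximates morphology and will over- or under-strip on irregular forms.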
A Clustering Framework for Lexical Normalization of Roman Urdu
[article]
2020
arXiv
pre-print
UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. ...
In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based ...
This research was partially funded by the National Science Foundation (NSF) Award No. 1747728 and the National Science Foundation of China (NSFC) Award No. 61672524. ...
arXiv:2004.00088v1
fatcat:mdma3ccwo5aw7hgwzge4tldpki
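The idea in the entry above of mapping spelling variants to a pronunciation-based representation and then grouping them can be sketched as follows. This is not UrduPhone: the encoding rule (keep the first letter, drop later vowels, collapse identical neighbouring consonants) is a deliberately crude assumption, and the names `phonetic_key` and `group_variants` and the sample words are illustrative only.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def phonetic_key(word):
    """Build a crude phonetic key: first letter, then consonants with repeats collapsed."""
    w = word.lower()
    if not w:
        return ""
    key = [w[0]]
    for ch in w[1:]:
        if ch in VOWELS:
            continue  # vowels after the first letter are ignored
        if ch != key[-1]:
            key.append(ch)  # skip a consonant identical to the last kept character
    return "".join(key)


def group_variants(words):
    """Group words that share the same phonetic key."""
    groups = defaultdict(list)
    for w in words:
        groups[phonetic_key(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    print(group_variants(["mohabbat", "muhabat", "kitab"]))
```

Grouping by an exact key is the simplest possible choice. A clustering framework like the one the abstract describes would combine such a key with string-similarity and other features rather than rely on key equality alone.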
Particle Swarm Optimization for Punjabi Text Summarization
2021
International Journal of Operations Research and Information Systems
Two Punjabi datasets—monolingual Punjabi corpus from Indian Languages Corpora Initiative Phase-II and Punjabi-Hindi parallel corpus—are considered. ...
Calculation within PSO is performed using fitness function which looks into various statistical and linguistic features of the Punjabi datasets. ...
Punjabi text is mainly extracted from two web sources (Section 5.2): a monolingual Punjabi corpus and a bilingual Punjabi-Hindi corpus, which are converted into Unicode format with the help of Python libraries ...
doi:10.4018/ijoris.20210701.oa1
fatcat:irayrabzdze4bnxnrv3heb4tsi
An Information-Extraction System for Urdu---A Resource-Poor Language
2010
ACM Transactions on Asian Language Information Processing
This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. ...
Each of the new Urdu text processing modules has been integrated into a general text-mining platform. ...
The same is true with the two NE annotated corpora provided by IJCNLP (2008) and Computing Research Laboratory (CRL). Apart from the EMILLE dataset, other datasets are very limited in terms of size ...
doi:10.1145/1838751.1838754
fatcat:ibmmwalmtfbfdpjufxccwolzgq
Translation techniques in cross-language information retrieval
2012
ACM Computing Surveys
Cross-language information retrieval (CLIR) is an active sub-domain of information retrieval (IR). ...
This paper presents an overview of those techniques, with a special emphasis on recent developments. ...
Acknowledgments: This research was partially supported by a PhD scholarship from the University of Nottingham and funding from the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for ...
doi:10.1145/2379776.2379777
fatcat:mu5p5djufjghvn3xjppekmwnwu
Language Identification for Multilingual Sentiment Examination
2019
International journal of recent technology and engineering
The text is then completely translated into English, and POS (part-of-speech) tagging is performed on the obtained text. ...
Due to the increase in web users across the globe, users post their views freely over the internet. ...
Data Cleaning: Data cleaning filters out unwanted text and converts data into a structured format. The comments extracted from ...
doi:10.35940/ijrte.b1444.0982s1119
fatcat:nfnyhsf2y5hcxg6fgjb4lt4554
Chinese OOV translation and post-translation query expansion in Chinese-English cross-lingual information retrieval
2005
ACM Transactions on Asian Language Information Processing
The OOV problem arises from the fact that some Chinese query terms are not found in translation resources, such as bilingual dictionaries and parallel corpora. ...
We have developed a new segmentation-free technique for automatic translation of Chinese OOV terms using the web. ...
Mend et al. [2004] developed a system that incorporates transliteration from English to Chinese to deal with English OOV terms. ...
doi:10.1145/1105696.1105697
fatcat:doae4glz75eyzawt56qnu6yit4