
Mostly-unsupervised statistical segmentation of Japanese kanji sequences

RIE KUBOTA ANDO, LILLIAN LEE
2003 Natural Language Engineering  
The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese.  ...  Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics.  ...  A preliminary version of this work was published in the proceedings of NAACL 2001; we thank the anonymous reviewers of that paper for their comments.  ... 
doi:10.1017/s1351324902002954 fatcat:xc4s66uuvfaglavck6odqlkskm

Japanese Pronunciation Prediction as Phrasal Statistical Machine Translation

Jun Hatori, Hisami Suzuki
2011 International Joint Conference on Natural Language Processing  
This paper addresses the problem of predicting the pronunciation of Japanese text.  ...  The difficulty of this task lies in the high degree of ambiguity in the pronunciation of Japanese characters and words.  ...  Table 1 shows the statistics of these corpora, with the OOV rate estimated using KyTea 5 manually-cleaned word-pronunciation pairs from Wikipedia, which consists mostly of proper nouns including names  ... 
dblp:conf/ijcnlp/HatoriS11 fatcat:jougmrpka5egfpsul33cgk6sxi

Predicting Word Pronunciation in Japanese [chapter]

Jun Hatori, Hisami Suzuki
2011 Lecture Notes in Computer Science  
This is an important task for many applications including text-to-speech and text input method, and is also challenging, because Japanese kanji (ideographic) characters typically have multiple possible  ...  This paper addresses the problem of predicting the pronunciation of Japanese words, especially those that are newly created and therefore not in the dictionary.  ...  In this model, n-gram statistics are learned over the sequences of pairs of letters and phonemes, instead of the sequences of phonemes.  ... 
doi:10.1007/978-3-642-19437-5_40 fatcat:modmxrbre5d6hhfj6p3t4jr3mu

Character Feature Engineering for Japanese Word Segmentation [article]

Mike Tian-Jian Jiang
2019 arXiv   pre-print
On word segmentation problems, machine learning architecture engineering often draws attention.  ...  The problem representation itself, however, has remained almost static as either word lattice ranking or character sequence tagging, for at least two decades.  ...  Japanese word segmentation (JWS) task has been mostly integrated within morphological analysis (MA) task, which not only splits an input sentence into words, but also jointly annotates morphemes with their  ... 
arXiv:1910.01761v1 fatcat:76454tassjd3hefnwvhu7c2kpu

A procedure for unsupervised lexicon learning [article]

Anand Venkataraman
2001 arXiv   pre-print
The algorithm is based on a conservative and traditional statistical model, and results of empirical tests show that it is competitive with other algorithms that have been proposed recently for this task  ...  We describe an incremental unsupervised procedure to learn words from transcribed continuous speech.  ...  Anonymous reviewers of an initial version helped significantly in improving its content and Judy Lee proof-read the final version carefully.  ... 
arXiv:cs/0111064v1 fatcat:gp46kicrhjev7biuvkugj2jmm4

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language [article]

Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
2019 arXiv   pre-print
Our results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, we show important stepping stones towards end-to-end Japanese speech synthesis  ...  Japanese could be one of the most difficult languages for which to achieve end-to-end speech synthesis, largely due to its character diversity and pitch accents.  ...  This study focuses on the Japanese language, which is among the most challenging languages. Japanese writing has three types of orthographical characters: Hiragana, Katakana, and Kanji (Chinese).  ... 
arXiv:1810.11960v2 fatcat:i7mp374z4natbb3wd6zxd7y25i

Survey on Machine Transliteration and Machine Learning Models

Dhore M L, Dhore R M, Rathod P H
2015 International Journal on Natural Language Computing  
Survey shows that linguistic approach provides better results for the closely related languages and probability based statistical approaches are good when one of the languages is phonetic and other is  ...  Support of local languages can be given in all internet based applications by means of Machine Transliteration and Machine Translation.  ...  Katakana and Korean Hangul and from the Japanese name to the Japanese Kanji language pairs[31].  ... 
doi:10.5121/ijnlc.2015.4202 fatcat:kegqa5k4abahvbno2setnkxwtq

Voting experts: An unsupervised algorithm for segmenting sequences

Paul Cohen, Niall Adams, Brent Heeringa
2007 Intelligent Data Analysis  
We describe a statistical signature of chunks and an algorithm for finding chunks.  ...  We show that the log frequency of a chunk is a measure of its internal entropy.  ...  The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied, of DARPA  ... 
doi:10.3233/ida-2007-11603 fatcat:3zrzvcuanfe4xgcutzb7sxr26m

Design and Structure of The Juman++ Morphological Analyzer Toolkit

Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi
2020 Journal of Natural Language Processing  
Modern morphological analyzers achieve high accuracy (a tokenwise segmentation F1 score of > .99 for Japanese) on established domains like newspaper texts.  ...  We must remark that partial lexicalization of bigram features is supported by MeCab and is frequently used, mostly for auxiliary words.  ...  The section on the morphological analysis is based on the paper presented at the meeting of North American Chapter of the Association for Computational Linguistics (Tolmachev et al. 2019).  ... 
doi:10.5715/jnlp.27.89 fatcat:i7r2s6r4sbagvplw2d4ztj4vsi

Machine transliteration and transliterated text retrieval: a survey

Dinesh Kumar Prabhakar, Sukomal Pal
2018 Sadhana (Bangalore)  
In this article, we survey the recent body of work in the field of transliteration.  ...  Finally, we study the performance of those techniques and present a comparative analysis of them.  ...  On the other hand, some written languages require multiple scripts (Japanese is written in the Hiragana, Katakana syllabaries and the Kanji ideographs).  ... 
doi:10.1007/s12046-018-0828-8 fatcat:dg3gwugmqrfevnzu3deuk5w67i

Splitting Katakana Noun Compounds by Paraphrasing and Back-transliteration

Nobuhiro Kaji, Masaru Kitsuregawa
2014 Journal of Natural Language Processing  
In the case of Japanese, noun compounds composed of katakana words are particularly difficult to split because katakana words are highly productive and are often out of vocabulary.  ...  Word boundaries within noun compounds in a number of languages, including Japanese, are not marked by white spaces. Thus, it is beneficial for various NLP applications to split such noun compounds.  ...  ., hiragana and kanji).  ... 
doi:10.5715/jnlp.21.897 fatcat:m5abanr7tffpreim4eox3m3yuu

A Survey of Orthographic Information in Machine Translation

Bharathi Raja Chakravarthi, Priya Rani, Mihael Arcan, John P. McCrae
2021 SN Computer Science  
Machine translation is one of the applications of natural language processing which has been explored in different languages.  ...  This article offers a survey of research regarding orthography's influence on machine translation of under-resourced languages.  ...  Like the history of writing in Korea, Japan to have two writing systems, Kana and Kanji, where Kanji is identified as Classical Chinese characters, and Kana represents sounds where each kana character  ... 
doi:10.1007/s42979-021-00723-4 pmid:34723204 pmcid:PMC8550410 fatcat:sd6ovquibzdpjlmgamsnklbi3q

CASIA Online and Offline Chinese Handwriting Databases

Cheng-Lin Liu, Fei Yin, Da-Han Wang, Qiu-Feng Wang
2011 2011 International Conference on Document Analysis and Recognition  
Each dataset is segmented and annotated at character level, and is partitioned into standard training and test subsets.  ...  The (either online or offline) datasets of isolated characters contain about 3.9 million samples of 7,356 classes (7,185 Chinese characters and 171 symbols), and the datasets of handwritten texts contain  ...  ACKNOWLEDGMENT This work is supported by the National Natural Science Foundation of China (NSFC) under grants no.60825301 and no.60933010.  ... 
doi:10.1109/icdar.2011.17 dblp:conf/icdar/LiuYWW11 fatcat:um6374emmnb3tlip6pldpqk4la

Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks

Oliver Hellwig, Sebastian Nehrdich
2018 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing  
Contrary to most previous systems, our models do not require feature engineering or extern linguistic resources, but operate solely on parallel versions of raw and segmented text.  ...  The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation.  ...  SWS is closely related to word segmentation for other Asian languages such as Thai (Haruechaiyasak et al., 2008), Chinese or Japanese (Kanji), with most research being done for Chinese and Japanese.  ... 
doi:10.18653/v1/d18-1295 dblp:conf/emnlp/HellwigN18 fatcat:ukccvdaedvdh5fqedhdi4jm6hm

A Statistical Model for Word Discovery in Transcribed Speech

Anand Venkataraman
2001 Computational Linguistics  
A statistical model for segmentation and word discovery in continuous speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described.  ...  Results are also presented of empirical tests showing that the algorithm is competitive with other models that have been used for similar tasks.  ...  A notable exception in this regard is the work by Ando and Lee (1999) which tries to infer word boundaries from character n-gram statistics of Japanese Kanji strings.  ... 
doi:10.1162/089120101317066113 fatcat:gzbh2n3htvejbeeefjdftoarsq