Incorporating Pronunciation Variation into Extraction of Transliterated-term Pairs from Web Corpora

Jin-Shea Kuo, Ying-Kuei Yang
2005 International Journal of Asian Language Processing  
This paper proposes a novel approach to automatically extracting transliterated-term pairs from Web corpora. One of the most important issues addressed is pronunciation variation, a form of pronunciation ambiguity that seriously affects term transliteration and hence the results produced by transliteration processes. Extracting transliterated-term pairs is a fundamental yet important task in natural language ... ing to collect large enough paired cognates for further studies on transliteration. Mitigating the problem of pronunciation variation when extracting paired cognates is not an easy task. The proposed method successfully exploits ASR (automated speech recognition)-generated confusion matrices as a basis both for alleviating pronunciation variation and for constructing cross-linguistic syllable-and-phoneme conversions. It improves extraction performance gradually by using cross-linguistic syllable-phoneme confusion matrices that are trained and refined progressively from the extracted term pairs. Many of the terms extracted in the experiment are new to existing lexicons. Experiments on mining information from the extracted pairs have also been conducted. The experimental results show that taking pronunciation variation into account makes the extraction of paired cognates more effective.
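As a rough illustration of how a confusion matrix can soften pronunciation mismatches when comparing candidate term pairs, the following minimal sketch scores two phonetic-unit sequences with a confusion-weighted alignment. It is not the authors' implementation: the matrix values, the `floor` and `gap` parameters, and the function names are all hypothetical choices made for this example.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical toy confusion probabilities P(heard | spoken).
# These numbers are invented for illustration and are NOT from the paper.
CONFUSION: Dict[Tuple[str, str], float] = {
    ("b", "b"): 0.8, ("b", "p"): 0.2,
    ("p", "p"): 0.7, ("p", "b"): 0.3,
    ("l", "l"): 0.6, ("l", "r"): 0.4,
    ("r", "r"): 0.9, ("r", "l"): 0.1,
}


def unit_similarity(a: str, b: str, floor: float = 0.01) -> float:
    """Look up the confusion probability, falling back to a small floor."""
    return CONFUSION.get((a, b), floor)


def sequence_similarity(src: Sequence[str], tgt: Sequence[str],
                        gap: float = 0.05) -> float:
    """Score the best alignment of two unit sequences in log space.

    Each aligned pair contributes its confusion probability and each
    skipped unit contributes the (assumed) gap probability. The result
    is length-normalized by the longer sequence.
    """
    if not src or not tgt:
        return 0.0
    n, m = len(src), len(tgt)
    log_gap = math.log(gap)
    # score[i][j] = best log-probability aligning src[:i] with tgt[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * log_gap
    for j in range(1, m + 1):
        score[0][j] = j * log_gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + math.log(
                unit_similarity(src[i - 1], tgt[j - 1]))
            skip_src = score[i - 1][j] + log_gap
            skip_tgt = score[i][j - 1] + log_gap
            score[i][j] = max(match, skip_src, skip_tgt)
    return math.exp(score[n][m] / max(n, m))


# Easily confused units score higher than unrelated ones.
close = sequence_similarity(["p", "l"], ["b", "r"])
far = sequence_similarity(["p", "l"], ["x", "y"])
```

In a progressive setup like the one the abstract describes, the matrix entries would presumably be re-estimated from newly extracted pairs between rounds; that refinement loop is not shown here.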
dblp:journals/jclc/KuoY05 fatcat:3fkuvlgk2vgs5knisf7ioqkd6m