Automatic Conversion of Dialectal Tamil Text to Standard Written Tamil Text using FSTs

Marimuthu K, Sobha Lalitha Devi
2014 Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM  
We present an efficient method to automatically transform spoken language text into standard written language text for various dialects of Tamil. Our work is novel in that it explicitly addresses the need to process dialectal and spoken Tamil. Written language equivalents of dialectal and spoken language forms are obtained using Finite State Transducers (FSTs), in which spoken language suffixes are replaced with the appropriate written language suffixes. Agglutination and compounding in the resulting text are handled by a word boundary identifier based on Conditional Random Fields (CRFs). The essential Sandhi corrections are carried out by a heuristic Sandhi Corrector, which normalizes the segmented words into simpler, sensible words. In experimental evaluations, the dialectal spoken to written transformer (DSWT) achieved an encouraging accuracy of over 85% on the transformation task and improved the translation quality of a Tamil-English machine translation system by 40%. To our knowledge, there is no published computational work on processing Tamil dialects; ours is the first attempt to study various dialects of Tamil from a computational point of view, and the work reported here is pioneering in that sense.
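The suffix-replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a trie over reversed suffixes as a stand-in for a compiled FST, and the rule table, the romanization, and the names `SUFFIX_RULES`, `build_reversed_trie`, and `to_written` are all assumptions introduced here rather than taken from the paper.

```python
# Illustrative spoken -> written suffix pairs in an ad hoc romanization.
# Assumption: these pairs are placeholders, not rules from the paper.
SUFFIX_RULES = {
    "aanga": "aarkal",
    "nga": "ngal",
}

OUT = "$out"  # marker key; cannot clash with single-character trie edges


def build_reversed_trie(rules):
    """Build a trie keyed on reversed spoken suffixes.

    Each accepting node stores (suffix_length, written_suffix), so the
    structure acts like a small suffix-rewriting transducer.
    """
    root = {}
    for spoken, written in rules.items():
        node = root
        for ch in reversed(spoken):
            node = node.setdefault(ch, {})
        node[OUT] = (len(spoken), written)
    return root


def to_written(word, trie):
    """Rewrite the longest matching spoken suffix of `word`.

    Words with no matching suffix are returned unchanged.
    """
    node = trie
    best = None
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
        if OUT in node:
            best = node[OUT]  # keep the longest match seen so far
    if best is None:
        return word
    length, written = best
    return word[:-length] + written


TRIE = build_reversed_trie(SUFFIX_RULES)
```

Calling `to_written(word, TRIE)` on each token gives the suffix-normalized form. Preferring the longest match keeps a short rule from firing when a more specific one applies. The CRF-based word boundary identification and the Sandhi correction described in the abstract are separate stages and are not covered by this sketch.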
doi:10.3115/v1/w14-2805 dblp:conf/sigmorphon/KD14 fatcat:ovmvd3iqijezpnsgp24fhl6il4