Corpus-based Sinhala lexicon

Ruvan Weerasinghe, Dulip Herath, Viraj Welgama
2009 Proceedings of the 7th Workshop on Asian Language Resources - ALR7   unpublished
The lexicon developed for Sinhala was based on the text obtained from a corpus of 10 million words drawn from diverse genres.  ...  A lexicon is an important resource in any kind of language processing application. Corpus-based lexica have several advantages over other traditional approaches.  ...  This paper presents a lexicon for Sinhala which has nearly 35,000 entries based on the text drawn from the UCSC Text Corpus of Contemporary Sinhala consisting of 10 million words from diverse genres.  ...
doi:10.3115/1690299.1690302 fatcat:34ma2f4gtjcoxi5xsmccq34uca

Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation [article]

Aloka Fernando, Surangika Ranathunga, Gihan Dias
2021 arXiv   pre-print
However, since bilingual lists contain words in the base form, they will not translate inflected forms for morphologically rich languages such as Sinhala and Tamil.  ...  This paper focuses on data augmentation techniques where bilingual lexicon terms are expanded based on case-markers with the objective of generating new words, to be used in Statistical Machine Translation  ...  Therefore we integrated bilingual lists as a static corpus [21], as a technique to address the OOV problem and improve the overall MT. The lexicons and lists contain nouns in their base singular forms.  ...
arXiv:2011.02821v3 fatcat:ud7optid2vfuhdjrwp6aw5tz4m

Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali

Oddur Kjartansson, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin Jansche, Linne Ha
2018 The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages  
We present speech corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali.  ...  Each corpus consists of an average of approximately 200k recorded utterances that were provided by native-speaker volunteers in the respective region.  ...  The grapheme-based lexicons are trivially generated placeholders that can and should be replaced with proper phonemic lexicon if and when those become available.  ... 
doi:10.21437/sltu.2018-11 dblp:conf/sltu/KjartanssonSPJH18 fatcat:pwiuegrvnjff5kldmj4kzlu3bm

Unknown Words Analysis in POS Tagging of Sinhala Language

Jayaweera A.J.P.M.P, Dias N.G.J
2014 Machine Learning and Applications An International Journal  
In this context, dealing with unknown words (words that do not appear in the lexicon are referred to as unknown words) is also an important task, since growing NLP systems are used in more and more new applications  ...  This experiment shows that the performance of the tagging process is enhanced when word class distinction is used together with syntactic rules to parse sentences containing unknown words in the Sinhala language  ...  ., the words that appear in sentences, but are not contained within the lexicon.  ...
doi:10.5121/mlaij.2014.1201 fatcat:u7jtgor7vbg55byul7qgbb26g4

Unknown Words Analysis in POS tagging of Sinhala Language [article]

A.J.P.M.P. Jayaweera, N.G.J. Dias
2015 arXiv   pre-print
In this context, dealing with unknown words (words that do not appear in the lexicon are referred to as unknown words) is also an important task, since growing NLP systems are used in more and more new applications  ...  This experiment shows that the performance of the tagging process is enhanced when word class distinction is used together with syntactic rules to parse sentences containing unknown words in the Sinhala language  ...  ., the words that appear in sentences, but are not contained within the lexicon.  ...
arXiv:1501.01254v1 fatcat:dcd47ctodnevbpgkffw7ncezbi

A Human Quality Text to Speech System for Sinhala

Lakshika Nanayakkara, Chamila Liyanage, Pubudu Tharaka Viswakula, Thilini Nadungodage, Randil Pushpananda, Ruvan Weerasinghe
2018 The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages  
This paper proposes an approach to implementing a Text to Speech system for the Sinhala language using the MaryTTS framework.  ...  User level evaluation was conducted with 20 candidates, where the intelligibility and the naturalness of the developed Sinhala TTS system received an approximate score of 70%.  ...  Therefore, as the project progressed, the initial lexicon was changed based on accuracy, and several versions were made as discussed in 2.4.  ...
doi:10.21437/sltu.2018-33 dblp:conf/sltu/NanayakkaraLVNP18 fatcat:6sbqkvyfzvdxvav4azm66dlnqa

SinSpell: A Comprehensive Spelling Checker for Sinhala [article]

Upuli Liyanapathirana, Kaumini Gunasinghe, Gihan Dias
2021 arXiv   pre-print
The errors in a corpus of Sinhala documents were analysed and commonly misspelled words and types of common errors were identified.  ...  To maintain accuracy, SinSpell was designed as a rule-based system based on Hunspell. A set of words was compiled from several sources and verified.  ...  The spelling checker uses an algorithm based on n-gram statistics computed from the UCSC Sinhala Corpus. It creates a unique word list and then a set of permutations of these words.  ... 
arXiv:2107.02983v1 fatcat:sq6xx733czeg7o7iqxgkedsk3m

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research [article]

Nisansa de Silva
2022 arXiv   pre-print
Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European.  ...  A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing.  ...  A methodology for constructing a sentiment lexicon for Sinhala Language in a semi-automated manner based on a given corpus was proposed by Chathuranga et al. [176]. Demotte et al.  ...
arXiv:1906.02358v12 fatcat:3bvyvces4zhzzhnvjgaqqacbzy

An Open-Source Data Driven Spell Checker for Sinhala

Ruwan Asanka Wasala, Ruwan Weerasinghe, Randil Pushpananda, Chamila Liyanage, Eranga Jayalatharachchi
2011 The International Journal on Advances in ICT for Emerging Regions  
Due to its morphological richness, the language is difficult to enumerate completely in a lexicon.  ...  The approach described is based on n-gram statistics and is relatively inexpensive to construct without deep linguistic knowledge.  ...  It is based on n-gram statistics computed from the UCSC Sinhala Corpus [23].  ...
doi:10.4038/icter.v3i1.2844 fatcat:upc3xw7manfchejf22gozns2nm

Assigning Polarity Scores to Facebook Myanmar Movie Comments

Win Win, Nyein Thwet, Su Su, Khine Khine, Kay Thi
2017 International Journal of Computer Applications  
Polarity scores are then assigned to each comment of the plain text movie corpus.  ...  We also make the comment polarity for 3-class evaluation and 5-class evaluation based on the scores of comments. General Terms: Sentiment analysis  ...  Corpus-based approaches can overcome these problems by learning a domain-specific lexicon using a domain corpus of labeled reviews.  ...
doi:10.5120/ijca2017915780 fatcat:3r6apecfnvf4velupsn4ks4amm

Sinhala G2P Conversion for Speech Processing

Thilini Nadungodage, Chamila Liyanage, Amathri Prerera, Randil Pushpananda, Ruvan Weerasinghe
2018 The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages  
The performance of our rule-based system shows that the rule-based sound patterns are effective for Sinhala G2P conversion.  ...  Sinhala must have a grapheme-to-phoneme conversion for speech processing because the Sinhala writing system does not always reflect its actual pronunciations.  ...  There is also a previously built G2P conversion system for the Sinhala language using the rule-based approach [10].  ...
doi:10.21437/sltu.2018-24 dblp:conf/sltu/NadungodageLPPW18 fatcat:d6geyde6m5gnvpg7e4bmaxsqlu

A Step-by-Step Process for Building TTS Voices Using Open Source Data and Frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese

Keshan Sodimana, Pasindu De Silva, Supheakmungkol Sarin, Oddur Kjartansson, Martin Jansche, Knot Pipatsrisawat, Linne Ha
2018 The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages  
The data sets consist of audio files, pronunciation lexicons, and phonology definitions for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese.  ...  However, some work has been done on building Sinhala [1] and Bangla [2] voices using Festival [3] . Both of these voices are based on the unit selection technique [4] .  ...  For example, to upload a lexicon file in Sinhala onto my_bucket at path si/lexicon, run $ gsutil cp si/festvox/lexicon.scm gs://my_bucket/si/lexicon.scm All uploaded files need to be made publicly accessible  ... 
doi:10.21437/sltu.2018-14 dblp:conf/sltu/SodimanaSSKJPH18 fatcat:shkr5ezbpzbkpjilawwtove23q

Defining the Gold Standard Definitions for the Morphology of Sinhala Words

Welgama Viraj, Weerasinghe Ruvan, Mahesan Niranjan
2015 Research in Computing Science  
We measured the coverage of the defined resource against three different Sinhala corpora and obtained over 70% coverage for each corpus.  ...  In this work, we describe the steps and strategies we carried out in defining morpheme segmentation boundaries of Sinhala words (which we called Gold Standard Definitions).  ...  current members at the Language Technology Research Laboratory of the University of Colombo School of Computing, Sri Lanka for their significant contribution to developing basic linguistic resources for Sinhala  ...
doi:10.13053/rcs-90-1-12 fatcat:zelurt3pnvcglall4m7djy6vlq

Sentiment Analysis for Sinhala Language using Deep Learning Techniques [article]

Lahiru Senevirathne, Piyumal Demotte, Binod Karunanayake, Udyogi Munasinghe, Surangika Ranathunga
2020 arXiv   pre-print
A data set of 15059 Sinhala news comments, annotated with these four classes, and a corpus consisting of 9.48 million tokens are publicly released.  ...  This is the largest sentiment annotated data set for Sinhala so far.  ...  Chathuranga et al. [2019] also presented a technique based on corpus-based sentiment analysis.  ...
arXiv:2011.07280v1 fatcat:47uivasos5gb5cplnw7fe2khx4

Webinterpret Submission to the WMT2019 Shared Task on Parallel Corpus Filtering

Jesús González-Rubio
2019 Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)  
This document describes the participation of Webinterpret in the shared task on parallel corpus filtering at the Fourth Conference on Machine Translation (WMT 2019).  ...  In our submissions, we used the probabilistic lexicons that can be obtained as a sub-product of the training of full statistical MT models.  ...  For Sinhala-English, we require Sinhala as source language: LangID(x) = "si". Length Ratio As our second heuristic filtering, we chose the ratio between the number of tokens of x and y.  ... 
doi:10.18653/v1/w19-5437 dblp:conf/wmt/Gonzalez-Rubio19 fatcat:aoseqgzkjnh7dcyau4f4qsf5su