Corpus-based Sinhala lexicon
2009
Proceedings of the 7th Workshop on Asian Language Resources - ALR7
unpublished
The lexicon developed for Sinhala was based on the text obtained from a corpus of 10 million words drawn from diverse genres. ...
A lexicon is an important resource in any kind of language processing application. Corpus-based lexica have several advantages over other traditional approaches. ...
This paper presents a lexicon for Sinhala which has nearly 35,000 entries based on the text drawn from the UCSC Text Corpus of Contemporary Sinhala consisting of 10 million words from diverse genres. ...
doi:10.3115/1690299.1690302
fatcat:34ma2f4gtjcoxi5xsmccq34uca
Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation
[article]
2021
arXiv
pre-print
However, since bilingual lists contain words in their base form, they do not cover the inflected forms of morphologically rich languages such as Sinhala and Tamil. ...
This paper focuses on data augmentation techniques where bilingual lexicon terms are expanded based on case-markers with the objective of generating new words, to be used in statistical machine translation ...
Therefore we integrated bilingual lists as a static corpus [21], as a technique to address OOV words and improve overall MT. The lexicons and lists contain nouns in their base singular forms. ...
arXiv:2011.02821v3
fatcat:ud7optid2vfuhdjrwp6aw5tz4m
Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali
2018
The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages
We present speech corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. ...
Each corpus consists of an average of approximately 200k recorded utterances that were provided by native-speaker volunteers in the respective region. ...
The grapheme-based lexicons are trivially generated placeholders that can and should be replaced with proper phonemic lexicons if and when those become available. ...
doi:10.21437/sltu.2018-11
dblp:conf/sltu/KjartanssonSPJH18
fatcat:pwiuegrvnjff5kldmj4kzlu3bm
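As a rough illustration of what a "trivially generated" grapheme-based lexicon could look like, the sketch below maps each word to its own characters separated by spaces. This is an assumption for illustration only, not the authors' procedure: the function name `grapheme_lexicon` is hypothetical, the sample words are placeholder strings, and splitting on Unicode code points ignores script-specific grapheme clustering.

```python
def grapheme_lexicon(words):
    """Map each distinct word to a space-separated sequence of its characters.

    Assumption: one "grapheme" per Unicode code point. A real system may
    need script-aware clustering (e.g. for combining vowel signs).
    """
    lexicon = {}
    for word in words:
        token = word.strip()
        if token:  # skip empty entries
            lexicon[token] = " ".join(token)
    return lexicon


# Placeholder input; repeated words collapse into a single entry.
entries = grapheme_lexicon(["kata", "basa", "kata"])
for word, pronunciation in sorted(entries.items()):
    print(f"{word}\t{pronunciation}")
```

Because the "pronunciation" is just the spelling, such a lexicon needs no linguistic knowledge, which is why it serves only as a placeholder until a phonemic lexicon exists.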
Unknown Words Analysis in POS Tagging of Sinhala Language
2014
Machine Learning and Applications An International Journal
In this context, dealing with unknown words (words that do not appear in the lexicon are referred to as unknown words) is also an important task, since growing NLP systems are used in more and more new applications ...
This experiment shows that the performance of the tagging process is enhanced when word class distinction is used together with syntactic rules to parse sentences containing unknown words in the Sinhala language ...
... the words that appear in sentences, but are not contained within the lexicon. ...
doi:10.5121/mlaij.2014.1201
fatcat:u7jtgor7vbg55byul7qgbb26g4
Unknown Words Analysis in POS tagging of Sinhala Language
[article]
2015
arXiv
pre-print
In this context, dealing with unknown words (words that do not appear in the lexicon are referred to as unknown words) is also an important task, since growing NLP systems are used in more and more new applications ...
This experiment shows that the performance of the tagging process is enhanced when word class distinction is used together with syntactic rules to parse sentences containing unknown words in the Sinhala language ...
... the words that appear in sentences, but are not contained within the lexicon. ...
arXiv:1501.01254v1
fatcat:dcd47ctodnevbpgkffw7ncezbi
A Human Quality Text to Speech System for Sinhala
2018
The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages
This paper proposes an approach to implementing a text-to-speech system for the Sinhala language using the MaryTTS framework. ...
User level evaluation was conducted with 20 candidates, where the intelligibility and the naturalness of the developed Sinhala TTS system received an approximate score of 70%. ...
Therefore, as the project progressed, the initial lexicon was revised based on accuracy, producing several versions as discussed in 2.4. ...
doi:10.21437/sltu.2018-33
dblp:conf/sltu/NanayakkaraLVNP18
fatcat:6sbqkvyfzvdxvav4azm66dlnqa
SinSpell: A Comprehensive Spelling Checker for Sinhala
[article]
2021
arXiv
pre-print
The errors in a corpus of Sinhala documents were analysed and commonly misspelled words and types of common errors were identified. ...
To maintain accuracy, SinSpell was designed as a rule-based system based on Hunspell. A set of words was compiled from several sources and verified. ...
The spelling checker uses an algorithm based on n-gram statistics computed from the UCSC Sinhala Corpus. It creates a unique word list and then a set of permutations of these words. ...
arXiv:2107.02983v1
fatcat:sq6xx733czeg7o7iqxgkedsk3m
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
[article]
2022
arXiv
pre-print
Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. ...
A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. ...
A methodology for constructing a sentiment lexicon for the Sinhala language in a semi-automated manner based on a given corpus was proposed by Chathuranga et al. [176]. Demotte et al. ...
arXiv:1906.02358v12
fatcat:3bvyvces4zhzzhnvjgaqqacbzy
An Open-Source Data Driven Spell Checker for Sinhala
2011
The International Journal on Advances in ICT for Emerging Regions
Due to its morphological richness, the language is difficult to enumerate completely in a lexicon. ...
The approach described is based on n-gram statistics and is relatively inexpensive to construct without deep linguistic knowledge. ...
It is based on n-gram statistics computed from the UCSC Sinhala Corpus [23]. ...
doi:10.4038/icter.v3i1.2844
fatcat:upc3xw7manfchejf22gozns2nm
Assigning Polarity Scores to Facebook Myanmar Movie Comments
2017
International Journal of Computer Applications
Polarity scores are then assigned to each comment in the plain-text movie corpus. ...
We also derive comment polarity for 3-class and 5-class evaluation based on the comment scores. General Terms: Sentiment analysis ...
Corpus-based approaches can overcome these problems by learning a domain-specific lexicon using a domain corpus of labeled reviews. ...
doi:10.5120/ijca2017915780
fatcat:3r6apecfnvf4velupsn4ks4amm
Sinhala G2P Conversion for Speech Processing
2018
The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages
The performance of our rule-based system shows that the rule-based sound patterns are effective for Sinhala G2P conversion. ...
Sinhala requires grapheme-to-phoneme conversion for speech processing because the Sinhala writing system does not always reflect actual pronunciation. ...
There is also a previously built G2P conversion system for the Sinhala language using a rule-based approach [10]. ...
doi:10.21437/sltu.2018-24
dblp:conf/sltu/NadungodageLPPW18
fatcat:d6geyde6m5gnvpg7e4bmaxsqlu
A Step-by-Step Process for Building TTS Voices Using Open Source Data and Frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese
2018
The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages
The data sets consist of audio files, pronunciation lexicons, and phonology definitions for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese. ...
However, some work has been done on building Sinhala [1] and Bangla [2] voices using Festival [3] . Both of these voices are based on the unit selection technique [4] . ...
For example, to upload a lexicon file in Sinhala onto my_bucket at path si/lexicon, run
$ gsutil cp si/festvox/lexicon.scm gs://my_bucket/si/lexicon.scm
All uploaded files need to be made publicly accessible ...
doi:10.21437/sltu.2018-14
dblp:conf/sltu/SodimanaSSKJPH18
fatcat:shkr5ezbpzbkpjilawwtove23q
Defining the Gold Standard Definitions for the Morphology of Sinhala Words
2015
Research in Computing Science
We measured the coverage of the defined resource against three different Sinhala corpora and obtained over 70% coverage for each corpus. ...
In this work, we describe the steps and strategies we followed in defining morpheme segmentation boundaries of Sinhala words (which we called Gold Standard Definitions). ...
current members at the Language Technology Research Laboratory of the University of Colombo School of Computing, Sri Lanka for their significant contribution to developing basic linguistic resources for Sinhala ...
doi:10.13053/rcs-90-1-12
fatcat:zelurt3pnvcglall4m7djy6vlq
Sentiment Analysis for Sinhala Language using Deep Learning Techniques
[article]
2020
arXiv
pre-print
A data set of 15059 Sinhala news comments, annotated with these four classes, and a corpus consisting of 9.48 million tokens are publicly released. ...
This is the largest sentiment-annotated data set for Sinhala so far. ...
Chathuranga et al. [2019] also presented a technique based on corpus-based sentiment analysis. ...
arXiv:2011.07280v1
fatcat:47uivasos5gb5cplnw7fe2khx4
Webinterpret Submission to the WMT2019 Shared Task on Parallel Corpus Filtering
2019
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
This document describes the participation of Webinterpret in the shared task on parallel corpus filtering at the Fourth Conference on Machine Translation (WMT 2019). ...
In our submissions, we used the probabilistic lexicons that can be obtained as a sub-product of the training of full statistical MT models. ...
For Sinhala-English, we require Sinhala as source language: LangID(x) = "si".
Length Ratio: As our second heuristic filter, we chose the ratio between the number of tokens of x and y. ...
doi:10.18653/v1/w19-5437
dblp:conf/wmt/Gonzalez-Rubio19
fatcat:aoseqgzkjnh7dcyau4f4qsf5su
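The two heuristic filters mentioned in the snippets above (a source-language check, LangID(x) = "si", and a token length ratio between x and y) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the submission's implementation: `toy_lang_id` is a hypothetical stand-in for a real language identifier that only checks the Sinhala Unicode block (U+0D80 to U+0DFF), the ratio is taken symmetrically, and the threshold `max_ratio=3.0` is an illustrative value that does not come from the paper.

```python
def toy_lang_id(text):
    """Hypothetical stand-in for a real language identifier.

    Returns "si" when most alphabetic characters fall in the Sinhala
    Unicode block (U+0D80 to U+0DFF); otherwise "other" or "unknown".
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    sinhala = sum(1 for ch in letters if "\u0d80" <= ch <= "\u0dff")
    return "si" if sinhala / len(letters) > 0.5 else "other"


def keep_pair(x, y, lang_id=toy_lang_id, max_ratio=3.0):
    """Keep a sentence pair (x, y) only if it passes both heuristics.

    max_ratio is an assumed threshold, not a value from the source.
    """
    # Heuristic 1: the source side must be identified as Sinhala.
    if lang_id(x) != "si":
        return False
    # Heuristic 2: token counts of x and y must not differ too much.
    len_x, len_y = len(x.split()), len(y.split())
    if len_x == 0 or len_y == 0:
        return False
    ratio = max(len_x, len_y) / min(len_x, len_y)
    return ratio <= max_ratio


pairs = [
    ("සිංහල භාෂාව", "Sinhala language"),
    ("hello world", "Sinhala language"),
    ("සිංහල", "one two three four five"),
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
print(len(filtered))
```

Cheap checks like these are typically run before any model-based scoring, so that obviously mismatched pairs never reach the more expensive stage.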