Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language

Rajat Pandit, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, Mohini Mohan Sardar
2019 Informatics  
In this paper, semantic similarity is explored in Bangla, a less resourced language.  ...  Semantic similarity is a long-standing problem in natural language processing (NLP).  ...  The funding agency had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.  ... 
doi:10.3390/informatics6020019 fatcat:pp42ofxrlfaztaqli7zcqicura

Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages [article]

Saurav Jha, Akhilesh Sudhakar, Anil Kumar Singh
2019 arXiv   pre-print
Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, and in particular, for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel  ...  Our work can be seen as an important step in the process of: (i) resolving the OOV words problem arising in MT tasks, (ii) creating effective parallel corpora for resource-constrained languages, and (iii  ...  In the case of Low Resource Languages (LRL), which lack in linguistic resources such as parallel corpora, the problem promptly comes within sight, with most words being OOV words.  ... 
arXiv:1811.08816v2 fatcat:h2ixdn7lyngqzfwtqipywwj2dy

Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages

Saurav Jha, Akhilesh Sudhakar, Anil Kumar Singh
2019 Journal of Language Modelling  
For example, some other sub-languages like Rajasthani, Maithili and Magahi are also often included in the Hindi spectrum.  ...  However, the usual meaning of the word 'Hindi' in literature refers to standard Hindi, whose base is Khari Boli and which is an official language of India.  ...  Replacing the translation of OOV words with that of their transductions leads to an improvement of 6.3 points in the BLEU score, which is substantial considering that we are translating to a low-resource  ... 
doi:10.15398/jlm.v7i2.214 fatcat:dztyzz3iizf3vo4zktx7yj33s4

Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study [article]

Yash Khemchandani, Sarvesh Mehtani, Vaidehi Patil, Abhijeet Awasthi, Partha Talukdar, Sunita Sarawagi
2021 arXiv   pre-print
However, incorporating a new language in an LM still remains a challenge, particularly for languages with limited corpora and in unseen scripts.  ...  This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs.  ...  this study.  ... 
arXiv:2106.03958v2 fatcat:rkc22nqzhfdp7p4j3kdpiulsdy

DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced Bengali Language [article]

Md. Rezaul Karim and Sumon Kanti Dey and Tanhim Islam and Sagor Sarker and Mehadi Hasan Menon and Kabir Hossain and Bharathi Raja Chakravarthi and Md. Azam Hossain and Stefan Decker
2021 arXiv   pre-print
In this paper, we propose an explainable approach for hate speech detection from the under-resourced Bengali language, which we called DeepHateExplainer.  ...  However, some languages are under-resourced, e.g., South Asian languages like Bengali, that lack computational resources for accurate natural language processing (NLP).  ...  XLM-RoBERTa not only outperformed other transformer models on cross-lingual benchmarks but also performed better on various NLP tasks in a low-resourced language setting.  ... 
arXiv:2012.14353v4 fatcat:xpwnvfh2bnh2xbiewqzrrys2cu

Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT

Aisha Khatun, Anisur Rahman, Md Saiful Islam, Hemayet Ahmed Chowdhury, Ayesha Tasnim
2022 ACM Transactions on Asian and Low-Resource Language Information Processing  
problem and release six variations of pre-trained language models for use in any Bangla NLP downstream task.  ...  Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field due to its complex linguistic feature and sentence structure  ...  An in-depth analysis of the errors can help reveal more about what the models learn in terms of understanding the structure and semantics of a language which remains a scope for future study.  ... 
doi:10.1145/3530691 fatcat:dpvjpaiurzcudcvovifmwwyh7a

Handwriting Recognition in Low-resource Scripts using Adversarial Learning [article]

Ayan Kumar Bhunia, Abhirup Das, Ankan Kumar Bhunia, Perla Sai Raj Kishore, Partha Pratim Roy
2019 arXiv   pre-print
low-resource scripts.  ...  We record results for varying training data sizes, and observe that our enhanced network generalizes much better in the low-data regime; the overall word-error rates and mAP scores are observed to improve  ...  [3] proposed a cross-lingual framework for Indic scripts where training is performed using a script that is abundantly available and testing is done on the low-resource script using character-mapping  ... 
arXiv:1811.01396v5 fatcat:xp3emb4whrh7jasluh3wv3ffce

Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications [article]

Rajesh Kumar Mundotiya, Manish Kumar Singh, Rahul Kapur, Swasti Mishra, Anil Kumar Singh
2021 arXiv   pre-print
They are closely related to Hindi, which is a relatively high-resource language, which is why we compare with Hindi.  ...  Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family.  ...  For example, when the language model is trained as well as tested on Bhojpuri, it gives a cross-lingual similarity of 1.  ... 
arXiv:2004.13945v2 fatcat:gjtvhkukunb7xcybh3akvfkvhm

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation [article]

Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Leylia Khodra, Ayu Purwarianti, Pascale Fung
2021 arXiv   pre-print
Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data.  ...  Here we introduce IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource -- yet widely spoken -- languages of Indonesia: Indonesian, Javanese, and Sundanese  ...  The model is first pretrained with denoising in 25 languages using a masked language modelling framework, and then fine-tuned on another 25 languages covering low and medium-resource languages, including  ... 
arXiv:2104.08200v3 fatcat:txxm4lltvvhp3dwhxmejj7yjaq

Bangla Text Classification using Transformers [article]

Tanvirul Alam, Akib Khan, Firoj Alam
2020 arXiv   pre-print
Models designed with this type of network and its variants recently showed their success in many downstream natural language processing tasks, especially for resource-rich languages, e.g., English.  ...  In this work, we fine-tune multilingual transformer models for Bangla text classification tasks in different domains, including sentiment analysis, emotion detection, news categorization, and authorship  ...  Word counts are also sampled in a similar manner so that low resource languages have sufficient words in the vocabulary. b) XLM-RoBERTa: RoBERTa [14] improves upon BERT by training on larger datasets  ... 
arXiv:2011.04446v1 fatcat:2l7qbtqntvcd3mbzo3njds2gde

CogNet: A Large-Scale Cognate Database

Khuyagbaatar Batsuren, Gabor Bella, Fausto Giunchiglia
2019 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics  
This paper introduces CogNet, a new, large-scale lexical database that provides cognates (words of common origin and meaning) across languages.  ...  Finally, statistics and early insights about the cognate data are presented, hinting at a possible future exploitation of the resource by various fields of linguistics.  ...  Acknowledgments: This paper was partly supported by the InteropEHRate project, co-funded by the European Union (EU) Horizon 2020 programme under grant number 826106.  ... 
doi:10.18653/v1/p19-1302 dblp:conf/acl/BatsurenBG19 fatcat:7wudx56nt5dk5kqxqufod7cvz4

A comprehensive survey on cross-language information retrieval system

Gouranga Charan Jena, Siddharth Swarup Rautaray
2019 Indonesian Journal of Electrical Engineering and Computer Science  
Cross language information retrieval (CLIR) is a retrieval process in which the user fires queries in one language to retrieve information from another (different) language.  ...  This study is aimed at building an experimental CLIR system between one of the under-resourced languages (i.e. Odia) and one of the most commonly used online languages (i.e. English) in future.  ...  Cross-lingual information retrieval Cross-Language Information Retrieval is quickly becoming a mature area in the information retrieval world.  ... 
doi:10.11591/ijeecs.v14.i1.pp127-134 fatcat:bg3kk7o5sbcrbbxklsjbfw7aue

Design and Development of a Bangla Semantic Lexicon and Semantic Similarity Measure

Manjira Sinha, Tirthankar Dasgupta, Abhik Jana, Anupam Basu
2014 International Journal of Computer Applications  
In this paper, we have proposed a hierarchically organized semantic lexicon in Bangla and also a graph-based edge-weighting approach to measure semantic similarity between two Bangla words.  ...  As we have earlier discussed, this lexicon can be used in various applications like categorization, semantic web, and natural language processing applications like document clustering, word sense disambiguation  ...  Therefore, it will be a useful resource and tool to other psycholinguistic and NLP studies in Bangla.  ... 
doi:10.5120/16588-6297 fatcat:hyvxlze4vzdlbmnhcstyepyw2i

Deep learning based question answering system in Bengali

Tasmiah Tahsin Mayeesha, Abdullah Md Sarwar, Rashedur M. Rahman
2020 Journal of Information and Telecommunication  
Recent advances in the field of natural language processing have improved state-of-the-art performances on many tasks including question answering for languages like English.  ...  Finally, we compare our models with human children to set up a benchmark score using survey experiments.  ...  Rahman is working as a Professor in Electrical and Computer Engineering Department, North South University, Dhaka, Bangladesh.  ... 
doi:10.1080/24751839.2020.1833136 fatcat:ltwrsufie5hrrezjtv2tu56fjy

BANNER: A Cost-Sensitive Contextualized Model For Bangla Named Entity Recognition

Imranul Ashrafi, Muntasir Mohammad, Arani Shawkat Mauree, Galib Md. Azraf Nijhum, Redwanul Karim, Nabeel Mohammed, Sifat Momen
2020 IEEE Access  
Many architectures have produced good results on high resourced languages like English and Chinese. However, the NER task has not yet achieved much progress for Bangla, a low resource Language.  ...  Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that aims to classify words into a predetermined list of Named Entities (NE).  ...  For low resource languages like Bangla, for which there is a dearth of large annotated datasets, this naive approach is limited in its applicability.  ... 
doi:10.1109/access.2020.2982427 fatcat:ujdbt3urh5gzrkmo4yc66oputu