
Spoken dialect identification in Twitter using a multi-filter architecture [article]

Mohammadreza Banaei, Rémi Lebret, Karl Aberer
2020 arXiv   pre-print
Moreover, we do not use binary models (GSW vs. not-GSW) in our filters but rather a multi-class classifier with GSW being one of the possible labels.  ...  This paper presents our approach for SwissText & KONVENS 2020 shared task 2, which is a multi-stage neural model for Swiss German (GSW) identification on Twitter.  ...  Conclusion In this work, we propose an architecture for spoken dialect (Swiss German) identification by introducing a multi-filter architecture that is able to filter out non-GSW tweets during the inference  ... 
arXiv:2006.03564v1 fatcat:6n7i7bwttfhnvlrdr2eqwbirge

An Arabic Dialects Dictionary Using Word Embeddings

Azroumahli Chaimae, Yacine El Younoussi, Otman Moussaoui, Youssra Zahidi
2019 International Journal of Rough Sets and Data Analysis  
Facebook, Twitter, etc.). This is to create a vectorized dictionary for the crawled data using the word Embeddings.  ...  This work addresses this issue by firstly highlighting the steps and the issues related to building a multi Arabic dialect corpus using web data from blogs and social media platforms (i.e.  ...  It represents the identification, analysis and description of the different language unit's structures, that are used to render a much larger set of meaning variations in natural languages such as Arabic  ... 
doi:10.4018/ijrsda.2019070102 fatcat:fdeznqlg5rdfzmgzoygcxyhwem

Automatic Arabic Dialect Identification Systems for Written Texts: A Survey [article]

Maha J. Althobaiti
2020 arXiv   pre-print
Then, the survey extensively discusses in a critical manner many aspects related to Arabic dialect identification task.  ...  In this paper, we present a comprehensive survey of Arabic dialect identification research in written texts. We first define the problem and its challenges.  ...  The tweets were first collected using Twitter API, and then filtered using a set of distinctive words for each dialect.  ... 
arXiv:2009.12622v1 fatcat:ul32voarejenfdrfwsbxa46os4

Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus

Mahmoud El-Haj
2020 International Conference on Language Resources and Evaluation  
To experiment with the corpus I run extensive binary and multi-class experiments for dialect and country-of-origin identification.  ...  For the binary dialect identification task the best performing classifier achieved a testing accuracy of 93%.  ...  The dialect identification experiments are performed on different levels as follows: (a) top_2 dialects: a binary classification in which the top two dialects are used (i.e.  ... 
dblp:conf/lrec/El-Haj20 fatcat:cht24vhojbfslbfasppz4iozze

A Multilingual Encoding Method for Text Classification and Dialect Identification Using Convolutional Neural Network [article]

Amr Adel Helmy
2019 arXiv   pre-print
This thesis presents a language-independent text classification model by introducing two new encoding methods "BUNOW" and "BUNOC" used for feeding the raw text data into a new CNN spatial architecture with  ...  The proposed model can be classified as a hybrid word-character model in its work methodology because it consumes less memory space by using fewer neural network parameters as in character level representation  ...  Second, using CNN Architecture models described in Section 4.2 for text classification and Arabic dialects identification.  ... 
arXiv:1903.07588v1 fatcat:rw76tc6i5bb3xm3xsuauwzvc5u

Detecting Arabic textual threats in social media using artificial intelligence: An overview

Hossam Elzayady, Mohamed S. Mohamed, Khaled M. Badran, Gouda I. Salama
2022 Indonesian Journal of Electrical Engineering and Computer Science  
This article provides a thorough review of research studies that have made use of artificial intelligence (AI) for the identification of Arabic offensive language in various contexts.  ...  Therefore, there is a need to monitor and evaluate social media postings using automated methods and techniques.  ...  Arabic is spoken in a variety of dialects, including classical, modern standard, and numerous local dialects.  ... 
doi:10.11591/ijeecs.v25.i3.pp1712-1722 fatcat:yct6lvkemvdklotnhghuxdtvay

Toward Micro-Dialect Identification in Diaglossic and Code-Switched Environments [article]

Muhammad Abdul-Mageed and Chiyu Zhang and AbdelRahim Elmadany and Lyle Ungar
2020 arXiv   pre-print
Inspired by geolocation research, we propose the novel task of Micro-Dialect Identification (MDI) and introduce MARBERT, a new language model with striking abilities to predict a fine-grained variety (  ...  For modeling, we offer a range of novel spatially and linguistically-motivated multi-task learning models.  ...  Arap-tweet: A large multi-dialect Twitter corpus for gender, age and language variety identification.  ... 
arXiv:2010.04900v2 fatcat:ipa75b6xljbvtnovagwjcadfpe

Classification of Arabic Tweets: A Review

Meshrif Alruily
2021 Electronics  
Arabic is one of the world's most famous languages and it had a significant role in science, mathematics and philosophy in Europe in the middle ages.  ...  In this paper, a comparison of previous surveys is presented, elaborating the need for a comprehensive study on Arabic Tweets.  ...  Levantine Levantine is a dialect of Arabic spoken in Jordan, Lebanon, Palestine, and Syria and is spoken by more than 20 million speakers.  ... 
doi:10.3390/electronics10101143 fatcat:ipttevq6xbfqxksbmr3wof5s6q

Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT [article]

Anshul Wadhawan
2021 arXiv   pre-print
The task is aimed at developing a system that identifies the geographical location (country/province) from which an Arabic tweet in the form of modern standard Arabic or dialect comes.  ...  This paper presents our approach to address the EACL WANLP-2021 Shared Task 1: Nuanced Arabic Dialect Identification (NADI).  ...  ., 2021) is based on a multi-class classification problem where the aim is to recognize which country or province an Arabic tweet in the form of modern standard Arabic or dialect belongs to.  ... 
arXiv:2102.09749v2 fatcat:l636r4rh6jg3ldhnjgwrvw3wai

Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data [article]

Fahad AlGhamdi, Mona Diab
2019 arXiv   pre-print
Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages/dialects within a single conversation.  ...  We investigate the landscape in four CS language pairs, Spanish-English, Hindi-English, Modern Standard Arabic- Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic- Levantine Arabic dialect (  ...  Introduction Code Switching (CS) is a common linguistic behavior where two or more languages/dialects are used interchangeably in either spoken or written form.  ... 
arXiv:1905.13359v1 fatcat:g4wgma74cjaxhg6yq2lrdolqae

Detecting Arabic Offensive Language in Microblogs Using Domain-Specific Word Embeddings and Deep Learning

Khulood O. Aljuhani, Khaled H. Alyoubi, Fahd S. Alotaibi
2022 Tehnički glasnik  
The results showed the highest performance accuracy of 0.93% with the BiLSTM model trained using a combination of domain-specific and agnostic-domain word embeddings.  ...  In recent years, social media networks are emerging as a key player by providing platforms for opinions expression, communication, and content distribution.  ...  In this study, we built a multi-dialect and multi-domain Arabic dataset for detecting offensive language on Twitter.  ... 
doaj:a64f4b001e0246babbfb91c74079767f fatcat:yszynytvuneb7pez4yts7p4g4y

An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus

Hebah ElGibreen, Mohammed Faisal, Mansour Al Sulaiman, Sherif Abdou, Mohamed Amine Mekhtiche, Abdullah M. Moussa, Yousef Alohali, Wadood Abdul, Ghulam Muhammad, Mohsen Rashwan, Mohammed Algabri
2021 IEEE Access  
ACKNOWLEDGMENT This research was funded by Deputyship for research and Innovation, Ministry of Education in Saudi Arabia; project number DRI-KSU-1292.  ...  It can be used in applications such as dialect identification (DID) and machine translation (MT).  ...  Another corpus created based on Twitter is the Multi-Dialect Arabic Sentiment Twitter Dataset (MD-ArSenTD) [31] which is a multidialect Arabic corpus collected from tweets from 12 Arab countries (KW,  ... 
doi:10.1109/access.2021.3089924 fatcat:fzfvvwy5jbggpk6kjannu4wghu

Sentiment Analysis of Algerian Dialect Using Machine Learning and Deep Learning with Word2vec

Ahmed Cherif Mazari, Abdelhamid Djeffal
2022 Informatica (Ljubljana, Tiskana izd.)  
These comments concern the Algerian spoken language, written in Arabic and/or Latin characters, which could be either Modern Standard Arabic, French or local dialect.  ...  (CNN, RNN) to evaluate and compare the dataset in original version, in a transcribed to Latin character version and then in a semantically-enhanced version by word2vec models.  ...  Filtering comments The filtering process is based on topics of FB-pages, Hashtags on twitter and titles of YouTube videos. Then it starts by deleting all the comments written in other languages.  ... 
doi:10.31449/inf.v46i6.3340 fatcat:gpfrruq6bjd6zfx7ae5zep36zi

Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task

Shervin Malmasi, Marcos Zampieri, Nikola Ljubesic, Preslav Nakov, Ahmed Ali, Jörg Tiedemann
2016 Workshop on NLP for Similar Languages, Varieties and Dialects  
The challenge offered two subtasks: subtask 1 focused on the identification of very similar languages and language varieties in newswire texts, whereas subtask 2 dealt with Arabic dialect identification  ...  A total of 37 teams registered to participate in the task, 24 teams submitted test results, and 20 teams also wrote system description papers.  ...  Acknowledgments We would like to thank all participants in the DSL shared task for their valuable suggestions and comments.  ... 
dblp:conf/vardial/MalmasiZLNAT16 fatcat:vw3c5sgikfahnltdsha3qumloa

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [article]

Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder
2022 arXiv   pre-print
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects.  ...  Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's  ...  Lastly, we acknowledge the support of in this work.  ... 
arXiv:2203.13357v1 fatcat:v3klnli2ivbsfmtpdhi6pd4o5a