
Topic Identification Of Noisy Texts: Statistical Approaches

K. Abainia
2015 Zenodo  
Actually, there exist several works in this field based on statistical and machine learning approaches for different text categories.  ...  In this investigation, we carried out a comparative study between two different statistical approaches based on tf-idf.  ...  A comparison between two statistical approaches based on tf-idf was carried out to study the performance of each one in the case of noisy Arabic texts (Section 4).  ... 
doi:10.5281/zenodo.20362 fatcat:oicdjqsqenhojo6luhaz2w3i2q

Word Level Language Identification in Online Multilingual Communication

Dong Nguyen, A. Seza Dogruöz
2013 Conference on Empirical Methods in Natural Language Processing  
Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level.  ...  For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries.  ...  Implementation Language identification was not performed for texts within quotes.  ... 
dblp:conf/emnlp/NguyenD13 fatcat:262huabckja6phifrduezfaom4

Applying CNL Authoring Support to Improve Machine Translation of Forum Data [chapter]

Sabine Lehmann, Ben Gottesman, Robert Grabowski, Mayo Kudo, Siu Kei Pepe Lo, Melanie Siegel, Frederik Fouvry
2012 Lecture Notes in Computer Science  
Machine translation (MT) is most often used for texts of publishable quality. However, there is increasing interest in providing translations of user-generated content in customer forums.  ...  This paper describes research towards addressing this challenge by automatically improving the quality of community forum data to improve MT results.  ...  Methods in Machine Translation We look at two approaches to machine translation: the statistical approach and the rule-based approach.  ... 
doi:10.1007/978-3-642-32612-7_1 fatcat:crpbg67lxzdzfghof5nhsd25wy

Language Identification Strategies for Cross Language Information Retrieval

Alessio Bosca, Luca Dini
2010 Conference and Labs of the Evaluation Forum  
In our participation to the 2010 LogCLEF track we focused on the analysis of the European Library (TEL) logs and in particular we experimented with the identification of the natural language used in the  ...  Entities can be misleading for the correct identification of the language used in the query.  ...  Language Identification techniques traditionally (see [4], [5] or [6]) include models based on the statistical distribution of character sequences or the presence in the text of function words (grammar  ... 
dblp:conf/clef/BoscaD10 fatcat:7nfvj4x3wzg2nlgc7h2jcgpdhm

Concept Extraction to Identify Adverse Drug Reactions in Medical Forums: A Comparison of Algorithms [article]

Alejandro Metke-Jimenez, Sarvnaz Karimi
2015 arXiv   pre-print
set of annotated text.  ...  Specifically, we implement several dictionary-based methods popular in the relevant literature, as well as a method we suggest based on a state-of-the-art machine learning method for entity recognition  ...  Machine Learning Approaches Several machine learning approaches have been used successfully to do entity recognition in natural language text.  ... 
arXiv:1504.06936v1 fatcat:wneokxxv4na3hmqx5kzpppu4f4

Grammar Checker Features for Author Identification and Author Profiling Notebook for PAN at CLEF 2013

Roman Kern
2013 Conference and Labs of the Evaluation Forum  
In order to detect the grammatical errors we base our approach on the output of the open-source library Language-Tool.  ...  Our work on author identification and author profiling is based on the question: Can the number and the types of grammatical errors serve as indicators for a specific author or a group of people?  ...  Conclusions We studied the effectiveness of style and grammar errors for Authorship Identification and Author Profiling.  ... 
dblp:conf/clef/Kern13 fatcat:bfes2zz5jvfcnm4fx5cxhla5ki

Investigating Machine Learning & Natural Language Processing Techniques Applied for Predicting Depression Disorder from Online Support Forums: A Systematic Literature Review

Isuri Anuradha Nanomi Arachchige, Priyadharshany Sandanapitchai, Ruvan Weerasinghe
2021 Information  
Our objective is to undertake a systematic review of the literature on NLP and ML approaches used for depression identification on Online Support Forums (OSF).  ...  From this systematic review, we further analyse which combination of features extracted from NLP and ML techniques are effective and scalable for state-of-the-art Depression Identification.  ...  Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/info12110444 fatcat:buft2xvmaba65dmghesdljd7xq

Automatic Identification of Arabic Language Varieties and Dialects in Social Media

Fatiha Sadat, Farzindar Kazemi, Atefeh Farzindar
2014 Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP)  
Experimental results show that Naive Bayes classifier based on character bi-gram model can identify the 18 different Arabic dialects with a considerable overall accuracy of 98%.  ...  We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers with detailed examination of what models perform best under different conditions in social media  ...  The first approach is based on popular words or stop-words for each language, which score the text based on these words (Gotti and al., 2013). The second approach is more statistical oriented.  ... 
doi:10.3115/v1/w14-5904 dblp:conf/acl-socialnlp/SadatKF14 fatcat:2ipvhdytjbel5l5qnmontfbxii

Predicting Depression Symptoms In An Arabic Psychological Forum

Norah Saleh Alghamdi, Hanan A. Hosni Mahmoud, Ajith Abraham, Samar Awadh Alanazi, Laura Garcia-Hernandez
2020 IEEE Access  
Our research method is based on the collection of Arabic text from online forums and the application of either a lexicon-based approach or a machine-learning-based approach.  ...  Therefore, in this study, we investigate the application of natural language processing and machine learning on Arabic text for the prediction of depression, and we evaluate and compare the performance  ...  Three approaches have been used to create the lexicon based on a scoring system, which is used to explore the identification of the depression-related words that are shared by online forum users in their  ... 
doi:10.1109/access.2020.2981834 fatcat:t3ezer5i4bgsdb2cj7sduftwu4

A focused crawler for Dark Web forums

Tianjun Fu, Ahmed Abbasi, Hsinchun Chen
2010 Journal of the American Society for Information Science and Technology  
Experiments conducted to evaluate the effectiveness of the human-assisted accessibility approach and the recall improvement based incremental update procedure yielded favorable results.  ...  In this study we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums.  ...  The set of relevant URL tokens differs based on the forum software being used. Such tokens are language independent yet software specific.  ... 
doi:10.1002/asi.21323 fatcat:vqdvbnapgvhwbkjaucj3yr7mve

Optimizing Authorship Profiling of Online Messages

Adeola Opesade
2016 International Conference on Computing Research and Innovations  
Hence, a need for methods that can help improve on the success of authorship profiling undertakings.  ...  The present study sought through experiments, the writing features, analytical technique and number of class labels that can help improve the effectiveness of profiling the country of affiliation of authors  ...  The study achieved greater effectiveness but with a trade-off on efficiency.  ... 
dblp:conf/cori/Opesade16 fatcat:5l5baqefbvfajngvfyxufne5ka

Automatic identification of arabic dialects in social media

Fatiha Sadat, Farnazeh Kazemi, Atefeh Farzindar
2014 Proceedings of the first international workshop on Social media retrieval and analysis - SoMeRA '14  
Experimental results show that Naive Bayes classifier based on character bi-gram model can identify the 18 different Arabic dialects with a considerable overall accuracy of 98%.  ...  We present a set of experiments using the character n-gram Markov language model and Naive Bayes classifiers with detailed examination of what models perform best under different conditions in social media  ...  The first approach is based on popular words or stop-words for each language, which score the text based on these words [8]. The second approach is more statistical oriented.  ... 
doi:10.1145/2632188.2632207 dblp:conf/sigir/SadatKF14 fatcat:rmrj4okd5zblrm534rl24dnia4

Machine Learning for Classifying Authors of Anonymous Tweets, Blogs and Reviews

Seifeddine Mechti, Maher Jaoua, Lamia Hadrich Belguith
2014 Conference and Labs of the Evaluation Forum  
The proposed method is based on automatic classification, which uses some data extracted statistically from a source corpus.  ...  We present a hybrid method that combines the analysis of data in texts with a machine learning method.  ...  [2, 3, 4] Our method of author attribution is based on author identification and author profiling.  ... 
dblp:conf/clef/MechtiJB14 fatcat:hzxkk7iypfe3rbsvoqzgumzme4

A research framework for pharmacovigilance in health social media: Identification and evaluation of patient adverse drug event reports

Xiao Liu, Hsinchun Chen
2015 Journal of Biomedical Informatics  
To evaluate the proposed framework, a series of experiments were conducted on a test bed encompassing about postings from major diabetes and heart disease forums in the United States.  ...  The framework consists of medical entity extraction for recognizing patient discussions of drug and events, adverse drug event extraction with shortest dependency path kernel based statistical learning  ...  We thank Jing Liu for her analytical work on the MedHelp forum.  ... 
doi:10.1016/j.jbi.2015.10.011 pmid:26518315 fatcat:53nu3jrxfbdmxe2yqpaaqyrija

How Noisy Social Media Text, How Diffrnt Social Media Sources?

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, Li Wang
2013 International Joint Conference on Natural Language Processing  
the proportion of grammatical sentences in each, based on a linguistically-motivated parser.  ...  We first extract out various descriptive statistics from each data type (including the distribution of languages, average sentence length and proportion of out-of-vocabulary words), and then investigate  ...  Acknowledgements NICTA is funded by the Australian government as represented by Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT centre of  ... 
dblp:conf/ijcnlp/BaldwinCLMW13 fatcat:6xc7iu6u4zcwrjs5xniafs26xy