Towards non-toxic landscapes: Automatic toxic comment detection using DNN [article]

Ashwin Geet D'Sa, Irina Illina, Dominique Fohr
2020 arXiv   pre-print
The spectacular expansion of the Internet has led to the development of a new research problem in the field of natural language processing: automatic toxic comment detection, since many countries prohibit  ...  We compare different unsupervised word representations and different DNN based classifiers.  ...  This article aims at designing methods for automatic toxic speech detection on the Internet.  ... 
arXiv:1911.08395v2 fatcat:t5wv4wpscfgqrh627ch27wfwdq

Transfer Learning for Hate Speech Detection in Social Media [article]

Lanqin Yuan and Tianyu Wang and Gabriela Ferraro and Hanna Suominen and Marian-Andrei Rizoiu
2022 arXiv   pre-print
These methods and insights hold the potential for safer social media and reduce the need to expose human moderators and annotators to distressing online messaging.  ...  This paper uses a transfer learning technique to leverage two independent datasets jointly and builds a single representation of hate speech.  ...  The ELMo model is pre-trained for general purposes, and consequently, its constructed embeddings may have limited usefulness for the hate speech detection application.  ... 
arXiv:1906.03829v2 fatcat:3fp3z7ckgndl5h7aimxe3zhobu

Thai Spelling Correction and Word Normalization on Social Text using a Two-stage Pipeline with Neural Contextual Attention

Anuruth Lertpiya, Tawunrat Chalothorn, Ekapol Chuangsuwanich
2020 IEEE Access  
., spell checkers) have been used to improve the quality of computerized text by detecting and correcting errors.  ...  In this paper, we investigated how current text correction systems perform on correcting errors and word variances in Thai social texts and propose a method designed for this task.  ...  ACKNOWLEDGMENTS This work was supported in part by the joint research of Kasikorn Business Technology Group (KBTG) and the Faculty of Engineering, Chulalongkorn University.  ... 
doi:10.1109/access.2020.3010828 fatcat:7mdoiniof5eahmm32w4beefmgq

Unsupervised, low latency anomaly detection of algorithmically generated domain names by generative probabilistic modeling

Jayaram Raghuram, David J. Miller, George Kesidis
2014 Journal of Advanced Research  
We propose a fully generative model for the probability distribution of benign (white listed) domain names which can be used in an anomaly detection setting for identifying putative algorithmically generated  ...  Since these names are mostly assigned by humans, they are pronounceable, and tend to have a distribution of characters, words, word lengths, and number of words that are typical of some language (mostly  ...  We used the logarithm of the joint probability under this model as a test statistic for detection.  ... 
doi:10.1016/j.jare.2014.01.001 pmid:25685511 pmcid:PMC4294760 fatcat:lpxqtbssefgljiqexfphlaouj4

Codeword Detection, Focusing on Differences in Similar Words Between Two Corpora of Microblogs

Takuro Hada, Yuichi Sei, Yasuyuki Tahara, Akihiko Ohsuga
2021 Annals of Emerging Technologies in Computing  
We proposed new methods for detecting codewords based on differences in word usage and conducted experiments on concealed-word detection to evaluate the effectiveness of the method.  ...  Recently, the use of microblogs in drug trafficking has surged and become a social problem.  ...  Morphological analysis We focused on Twitter because of its use of short sentences, new words and slang, and limited character length.  ... 
doi:10.33166/aetic.2021.02.008 fatcat:t3kky4ifbjhrni2n72p7p4agju

Malicious Text Identification: Deep Learning from Public Comments and Emails

Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
2020 Information  
Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded to detect spam comments in social media or fraudulent emails.  ...  We designed a multi-label LSTM model and trained it on the joint datasets including text with common bigrams, extracted from each independent dataset.  ...  The model contains 300-dimensional vectors for 3 million words and sentences that can be used to create word embeddings for a specific dataset.  ... 
doi:10.3390/info11060312 fatcat:3cbqddu4sfhe3ai2w5g7ejff24

A ML and NLP based Framework for Sentiment Analysis on Bigdata

2020 International journal of recent technology and engineering  
In other words, social feedback on products and services are available.  ...  Usage of probabilistic topic model is a novel approach in sentiment analysis. In this paper, we proposed a framework for comprehensive analysis of overall and aspect-based sentiments.  ...  Three models are used for document vector generation. They are known as count model, TF-IDF model and word embeddings model. Word embeddings model is known as GoogleNews-vectors-negative300.  ... 
doi:10.35940/ijitee.d9062.029420 fatcat:nhyddtiqzradbpfhcgmiz5tg6m


2020 Philology matters  
The literary language and the language of the science and technology practically use the commonly-used words and scientific lexical units.  ...  In the modern English and Uzbek languages jargons are widely used in terms of many concepts related to computer and the Internet activities.  ...  The group of analyzed slang vocabulary consists of words that describe the process of working on the Internet: cobsite -an outdated, not updated site, spam -the names of types of advertising embedded in  ... 
doi:10.36078/987654465 fatcat:n3l2ttajczbbzb4xhucj447rvy

SocialNLP 2018 EmotionX Challenge Overview: Recognizing Emotions in Dialogues

Chao-Chun Hsu, Lun-Wei Ku
2018 Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media  
The best team achieves the unweighted accuracy 62.48 and 62.5 on EmotionPush and Friends, respectively.  ...  Organizers provide baseline results. 18 teams registered in this challenge and 5 of them submitted their results successfully.  ...  For the SmartDubai team, they use word and character TF-IDF independently with logistic regression.  ... 
doi:10.18653/v1/w18-3505 dblp:conf/acl-socialnlp/HsuK18 fatcat:vyfatxzycrbkhedmgm7cbrbqya

Authorship Attribution in Bangla literature using Character-level CNN [article]

Aisha Khatun, Anisur Rahman, Md. Saiful Islam, Marium-E-Jannat
2020 arXiv   pre-print
The time and memory efficiency of the proposed model is much higher than the word level counterparts but accuracy is 2-5% less than the best performing word-level models.  ...  Comparison of various word-based models is performed and shown that the proposed model performs increasingly better with larger datasets.  ...  This concept can be leveraged to use character embeddings to fit misspelled words, rare or new words, slangs or emoticons.  ... 
arXiv:2001.05316v1 fatcat:ulmqi25ozjh6red5qgah4mlc6y

Text Analysis and Machine Learning Approach to Phished Email Detection

Olasehinde Olayemi
2019 International Journal of Computer Applications  
(AI) that uses the method of data mining to find out new or existing characteristics from a set of gathered data which can be relevant for classification.  ...  Machine learning methods have been found to achieve much better results than other phished email detection techniques such as blacklists, visual similarity and heuristic techniques.  ...  vector created using word embedding discussed in section 3.3 was used for the training and testing of the classifiers using 10-fold cross validation.  ... 
doi:10.5120/ijca2019918354 fatcat:siwkv5n5izb7vco3dam57kvaie

Characterization of citizens using word2vec and latent topic analysis in a large set of tweets

Vladimir Vargas-Calderón, Jorge E. Camargo
2019 Cities  
With the increasing use of the Internet and mobile devices, social networks are becoming the most used media to communicate citizens' ideas and thoughts.  ...  Results show that the proposed method is an interesting tool to characterize a city population based on machine learning methods and text analytics.  ...  We selected this model because it has been the seed for all word embedding models, and it is the most widely used model, despite the existence of newer and very successful word embedding models such as  ... 
doi:10.1016/j.cities.2019.03.019 fatcat:2z4rzz32jrdn7lxuu3wd4ilxri

Aspect-Based Sentiment Analysis Using Hybrid CNN-SVM with Particle Swarm Optimization for Domain Independent Datasets

2020 International Journal of Emerging Trends in Engineering Research  
In this paper, we suggested novel intelligent framework based on hybrid convolutional neural network and support vector machine (SVM) for aspect-based sentiment detection and classification of online product  ...  However, building a powerful hybrid aspect-based sentiment analysis model utilizing CNN can be highly complex and expensive.  ...  Sentence level embedding A sentence x with m words is provided {w 1 ;w 2 ;...w m } which is then translated in to joints of word level and {u 1 , u 2 ;…,u n } are embedding level character.  ... 
doi:10.30534/ijeter/2020/628102020 fatcat:bixgm5d7fvbqraizwbc73a3q3q

Improving Adverse Drug Event Extraction with SpanBERT on Different Text Typologies [article]

Beatrice Portelli, Daniele Passabì, Edoardo Lenzi, Giuseppe Serra, Enrico Santus, Emmanuele Chersoni
2021 arXiv   pre-print
In recent years, Internet users are reporting Adverse Drug Events (ADE) on social media, blogs and health forums.  ...  We propose for the first time the use of the SpanBERT architecture for the task of ADE extraction: this new version of the popular BERT transformer showed improved capabilities with multi-token text spans  ...  In addition, in short and highly contextual language, such as the one used in social media -which is characterized by acronyms, slang, metaphors, etc.  ... 
arXiv:2105.08882v1 fatcat:gytfiem6u5dbdolqstu7wckcqq

Improving Named Entity Recognition in Tweets via Detecting Non-Standard Words

Chen Li, Yang Liu
2015 Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)  
Second, this paper investigates two methods using NSW detection results for named entity recognition (NER) in social media data.  ...  One adopts a pipeline strategy, and the other uses a joint decoding fashion. We also create a new data set with newly added normalization annotation beyond the existing named entity labels.  ...  Acknowledgments We thank the anonymous reviewers for their detailed and insightful comments on earlier drafts of this paper. The work is partially supported by DARPA Contract No. FA8750-13-2-0041.  ... 
doi:10.3115/v1/p15-1090 dblp:conf/acl/LiL15 fatcat:ycelfphrwzdhhlr235tfigonfu