Iterative Data Programming for Expanding Text Classification Corpora [article]

Neil Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta
2020 arXiv   pre-print
We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models with minimal supervision.  ...  The iterative data programming techniques improve newer weak models as more labeled data is confirmed with human-in-loop.  ...  Conclusion We present iterative applications, of a search-based selection strategy and a data programming strategy employing weak learning, for expanding text classification data that is independent of  ... 
arXiv:2002.01412v1 fatcat:soan2jxci5f2ddoank4bruvfya

Iterative Data Programming for Expanding Text Classification Corpora

Neil Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta
2020 Proceedings of the AAAI Conference on Artificial Intelligence  
We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models with minimal supervision.  ...  The iterative data programming techniques improve newer weak models as more labeled data is confirmed with human-in-loop.  ...  Conclusion We present iterative applications, of a search-based selection strategy and a data programming strategy employing weak learning, for expanding text classification data that is independent of  ... 
doi:10.1609/aaai.v34i08.7045 fatcat:l4zykyazfzhmfodhlmod46yjji

BioReader: a text mining tool for performing classification of biomedical literature

Christian Simon, Kristian Davidsen, Christina Hansen, Emily Seymour, Mike Bogetofte Barnkob, Lars Rønn Olsen
2019 BMC Bioinformatics  
We here present a tool that enables users to perform classification of scientific literature by text mining-based classification of article abstracts.  ...  BioReader supports data and information collection by implementing text mining-based classification of primary biomedical literature in a web interface, thus enabling curators and researchers to take advantage  ...  The training data set is expanded with each iteration of classification, thus improving the performance of the classification algorithm.  ... 
doi:10.1186/s12859-019-2607-x pmid:30717659 pmcid:PMC7394276 fatcat:le2hnzw3krcqpfif53qjyxyy2i

AraSenCorpus: A Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text Corpus

Ali Al-Laith, Muhammad Shahbaz, Hind F. Alaskar, Asim Rehmat
2021 Applied Sciences  
We evaluate our proposed framework on two external benchmark datasets to ensure the improvement of the Arabic sentiment classification.  ...  This paper presents a semi-supervised self-learning technique, to extend an Arabic sentiment annotated corpus with unlabeled data, named AraSenCorpus.  ...  programming interfaces (APIs) [19].  ... 
doi:10.3390/app11052434 fatcat:yxj6kpjgwfcvjmggympkfhtxxa

Porting Multilingual Subjectivity Resources across Languages

Carmen Banea, Rada Mihalcea, Janyce Wiebe
2013 IEEE Transactions on Affective Computing  
., a bilingual dictionary or a parallel corpus), the methods can be used to rapidly create tools for subjectivity analysis in the new language.  ...  In this paper, we explore methods for generating subjectivity analysis resources in a new language by leveraging on the tools and resources available in English.  ...  After each iteration, only candidates with a LSA score higher than 0.4 (determined empirically) are considered to be expanded in the next iteration.  ... 
doi:10.1109/t-affc.2013.1 fatcat:y4cwk57cvncahhvmxjovzkt53a

Deep Text Mining of Instagram Data without Strong Supervision

Kim Hammar, Shatha Jaradat, Nima Dokoohaki, Mihhail Matskin
2018 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI)  
This textual data can be analyzed for the purpose of improving user recommendations and detecting trends. Instagram is one of the largest social media platforms, containing both text and images.  ...  However, most of the prior research on text processing in social media is focused on analyzing Twitter data, and little attention has been paid to text mining of Instagram data.  ...  Deep Clothing Classification of Text using Data Programming This section presents a pipeline for weakly supervised classification that I have applied to our corpora of Instagram posts.  ... 
doi:10.1109/wi.2018.00-94 dblp:conf/webi/HammarJDM18 fatcat:2x3qskb7njhcfhqf6rqcd7ti6y

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Dragos Stefan Munteanu, Daniel Marcu
2005 Computational Linguistics  
We present a novel method for discovering parallel sentences in comparable, non-parallel corpora.  ...  Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora.  ...  We would like to thank Hal Daumé III, Alexander Fraser, Radu Soricut, as well as the anonymous reviewers, for their helpful comments. Any remaining errors are of course our own.  ... 
doi:10.1162/089120105775299168 fatcat:zlklfrmy5zd4rk5ye5l3f7y2ae

Domain Adaptation of a Broadcast News Transcription System for the Portuguese Parliament [chapter]

Luís Neves, Ciro Martins, Hugo Meinedo, João Neto
2008 Lecture Notes in Computer Science  
Acknowledgements The authors would like to thank Alberto Abad, for many helpful discussions. This work was funded by PRIME National Project TECNOVOZ number 03/165. References  ...  The cross-validation classification error defined the number of training iterations of the neural network.  ...  For both video programs the audio stream was extracted to mp3 format, using open source tools.  ... 
doi:10.1007/978-3-540-85980-2_17 fatcat:myd7nf43dbfwjieiepushsmuem

Proceedings 2002 IEEE International Conference on Data Mining. ICDM 2002

2002 2002 IEEE International Conference on Data Mining 2002 Proceedings ICDM-02  
L Hellerstein Mining Significant Associations in Large Scale Text Corpora 402 P. Raghavan and P.  ...  Lu Using Text Mining to Infer Semantic Attributes for Retail Data Mining 195 R. Ghani and A. E.  ... 
doi:10.1109/icdm.2002.1183878 fatcat:3iufo7cncbbzbn7cwjme73wrpm

Scary films good, scary flights bad

Scott Nowson
2009 Proceeding of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion - TSA '09  
This paper describes preliminary work on feature selection for classification of review text by both sentiment rating and topic.  ...  Following successful work on classification of texts by author demographics, a corpus of review texts labelled with attributed rating, topic area, and user demographics has been compiled.  ...  The second method involved looking for co-occurrence of new and known terms in unlabeled sentiment data. They found that an iterative combination of the two approaches worked best.  ... 
doi:10.1145/1651461.1651465 fatcat:45326p3uujcm3oalcyqxlbelna

Support Vector Machine with Ensemble Tree Kernel for Relation Extraction

Xiaoyong Liu, Hui Fu, Zhiguo Du
2016 Computational Intelligence and Neuroscience  
The new algorithm mainly uses two kinds of support vector machine classifiers based on tree kernel for integration and integrates the strategy of constrained extension seed set.  ...  The numerical experimental research based on two benchmark data sets (PropBank and AIMed) shows that the LXRE algorithm proposed in the paper is superior to other two common relation extraction methods  ...  Acknowledgments This work has been supported by grants from Program for Excellent Youth Scholars in Universities of Guangdong Province  ... 
doi:10.1155/2016/8495754 pmid:27118966 pmcid:PMC4826950 fatcat:4t4ej2viojcvjm7yauta7kqqdi

Designing an Extensible Domain-Specific Web Corpus for "Layfication" [chapter]

Marina Santini, Arne Jönsson, Wiktor Strandqvist, Gustav Cederblad, Mikael Nyström, Marjan Alirezaie, Leili Lind, Eva Blomqvist, Maria Lindén, Annica Kristoffersson
2019 Advances in Systems Analysis, Software Engineering, and High Performance Computing  
In the era of data-driven science, corpus-based language technology is an essential part of cyber physical systems.  ...  The main purpose of the corpus is to be used for building and training language technology applications for the "layfication" of the specialized medical jargon.  ...  Expanding the Corpus: eCare_Sv_01+ For expanding the corpus, the software BootCat was used.  ... 
doi:10.4018/978-1-5225-7879-6.ch006 fatcat:tgaorpe5fvepnhl7j66mkp2taa

Topic Modeling Technique for Text Mining Over Biomedical Text Corpora through Hybrid Inverse Documents Frequency and Fuzzy K-Means Clustering

Junaid Rashid, Syed Muhammad Adnan Shah, Aun Irtaza, Toqeer Mahmood, Muhammad Wasif Nisar, Muhammad Shafiq, Akber Gardezi
2019 IEEE Access  
In order to obtain the relevant data, the text documents pose a lot of challenging issues for data processing.  ...  Afterward, the classification and clustering for text mining are performed with a probability of topics in the documents.  ...  Therefore, text data is preprocessed through following steps. 1) CONVERT TEXT DATA INTO LOWER CASE Text datasets are converted into the lower case for preventing the various words differences. 2) TOKENIZATION  ... 
doi:10.1109/access.2019.2944973 fatcat:yqwkq6crgfc2jmvwip7fzxdoja

Predicting Good Configurations for GitHub and Stack Overflow Topic Models [article]

Christoph Treude, Markus Wagner
2019 arXiv   pre-print
To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies.  ...  In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to  ...  Topic modelling is a probabilistic technique to summarise large corpora of text documents by automatically discovering the semantic themes, or topics, hidden within the data.  ... 
arXiv:1804.04749v3 fatcat:xh5y3x2wvffh7hpupatc6tgesa

Textual resource acquisition and engineering

J. Chu-Carroll, J. Fan, N. Schlaefer, W. Zadrozny
2012 IBM Journal of Research and Development  
A key requirement for high-performing question-answering (QA) systems is access to high-quality reference corpora from which answers to questions can be hypothesized and evaluated.  ...  In this paper, we discuss the methodology that we developed for IBM Watson for performing acquisition, transformation, and expansion of textual resources.  ...  These text nuggets are then manually labeled on the basis of a binary classification of relevant or irrelevant with respect to the original seed document.  ... 
doi:10.1147/jrd.2012.2185901 fatcat:rk7qmv7umjh3znxojvmf72smgu