Automatic Term Extraction Using Log-Likelihood Based Comparison with General Reference Corpus [chapter]

Alexander Gelbukh, Grigori Sidorov, Eduardo Lavin-Villa, Liliana Chanona-Hernandez
2010 Lecture Notes in Computer Science  
The proposed method is based on comparison with a general reference corpus using log-likelihood similarity.  ...  In the paper we present a method that allows the extraction of single-word terms for a specific domain. At the next stage these terms can be used as candidates for multi-word term extraction.  ...  The proposed method is based on comparison with a general reference corpus using log-likelihood similarity that is used for corpus comparison.  ... 
doi:10.1007/978-3-642-13881-2_26 fatcat:wcitrfwuurb6bd74fb2asfcfom

Automatic labeling of software components and their evolution using log-likelihood ratio of word frequencies in source code

Adrian Kuhn
2009 2009 6th IEEE International Working Conference on Mining Software Repositories  
In this paper we present a lexical approach that uses the log-likelihood ratios of word frequencies to automatically provide labels for software components.  ...  We present a prototype implementation of our labeling/comparison algorithm and provide examples of its application.  ...  We thank Dominique Matter for his help with the parameters of log-likelihood ratios.  ... 
doi:10.1109/msr.2009.5069499 dblp:conf/msr/Kuhn09 fatcat:jqhojunm5nhfpfeibura7gnu2q

Taxonomy Extraction for Customer Service Knowledge Base Construction [chapter]

Bianca Pereira, Cecile Robin, Tobias Daudert, John P. McCrae, Pranab Mohanty, Paul Buitelaar
2019 Lecture Notes in Computer Science  
can improve the quality of automatically constructed taxonomic knowledge bases.  ...  In this paper we explore the use of automatic taxonomy extraction from text as a means to reconstruct a customer-agent taxonomic vocabulary.  ...  First we extract the terms that are most relevant to the domain, a task referred to as automatic term recognition (ATR).  ... 
doi:10.1007/978-3-030-33220-4_13 fatcat:2cio6ivakrbvnjwivtjvmln7xm

Clustering-based Approach to Multiword Expression Extraction and Ranking

Elena Tutubalina
2015 Proceedings of the 11th Workshop on Multiword Expressions  
We present a domain-independent clustering-based approach for automatic extraction of multiword expressions (MWEs).  ...  The method combines statistical information from a general-purpose corpus and texts from Wikipedia articles.  ...  For comparison, we use n-best lists that are ranked by popular association measures: t-score, log-likelihood, and MI.  ... 
doi:10.3115/v1/w15-0906 dblp:conf/mwe/Tutubalina15 fatcat:ylseuc3dc5hclcpuudb3ybum4q

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora [chapter]

Marina Santini, Wiktor Strandqvist, Mikael Nyström, Marjan Alirezai, Arne Jönsson
2018 Communications in Computer and Information Science  
We present a case study where we explore the effectiveness of different measures, namely the Mann-Whitney-Wilcoxon Test, Kendall correlation coefficient, Kullback-Leibler divergence, log-likelihood and  ...  Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora.  ...  medical web corpora, one bootstrapped with hand-picked term seeds, and the other one bootstrapped with automatically extracted term seeds.  ... 
doi:10.1007/978-3-319-99133-7_17 fatcat:ncso5ksl5vfwvkrdeqehfqbuze

TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment

Lieve Macken, Els Lefever, Veronique Hoste
2013 Terminology  
We report on TExSIS, a flexible bilingual terminology extraction system that uses a sophisticated chunk-based alignment method for the generation of candidate terms, after which the specificity of the  ...  A comparison of our system with the LUIZ approach described by Vintar (2010) reveals that TExSIS outperforms LUIZ both for monolingual and bilingual terminology extraction.  ...  (extraction corpus and general reference corpus).  ... 
doi:10.1075/term.19.1.01mac fatcat:fh5gxl2ksfhvhiz3v5mvgtgfci

Text de-identification for privacy protection: A study of its impact on clinical text information content

Stéphane M. Meystre, Óscar Ferrández, F. Jeffrey Friedlin, Brett R. South, Shuying Shen, Matthew H. Samore
2014 Journal of Biomedical Informatics  
Automated Natural Language Processing (NLP) methods can alleviate this process, but their impact on subsequent uses of the automatically de-identified clinical narratives has only barely been investigated  ...  To study this impact in more detail and assess how generalizable our findings were, we examined the overlap between select clinical information annotated in the 2010 i2b2 NLP challenge corpus and automatic  ...  Acknowledgments We thank Abhisek Trivedi and Matthew Maw for their help with these studies. Research supported by VA HSR HIR 08-374.  ... 
doi:10.1016/j.jbi.2014.01.011 pmid:24502938 fatcat:rso3rhsnsfdnnksstsmdn5bqe4

Speaker Verification Using Support Vector Machines and High-Level Features

William M. Campbell, Joseph P. Campbell, Terry P. Gleason, Douglas A. Reynolds, Wade Shen
2007 IEEE Transactions on Audio, Speech, and Language Processing  
We use support vector machine modeling of these n-gram frequencies for speaker verification. We derive a new kernel based upon linearizing a log likelihood ratio scoring system.  ...  We demonstrate that our methods produce results significantly better than standard log-likelihood ratio modeling.  ...  Both the standard TFIDF term weighting (9) and the log likelihood ratio (TFLLR) term weighting (6) methods were used.  ... 
doi:10.1109/tasl.2007.902874 fatcat:7hkkdid2a5dbboivvqxmikrs7y

A corpus of Australian contract language

Michael Curtotti, Eric C. McCreath
2011 Proceedings of the 13th International Conference on Artificial Intelligence and Law - ICAIL '11  
Profiling of the corpus is consistent with its suitability for use in language engineering applications.  ...  The corpus conforms to Zipf's law and comparative type to token ratios are consistent with lower term sparsity (an expectation for legal language).  ...  Applied to words, the method calculates the log likelihood ('LL') ratio of the frequency of a word in frequency lists extracted from each corpus.  ... 
doi:10.1145/2018358.2018387 dblp:conf/icail/CurtottiM11 fatcat:vvto626xzrdx7nlwjvn2q57bc4

A Corpus of Australian Contract Language: Description, Profiling and Analysis

Michael Curtotti, Eric McCreath
2011 Social Science Research Network  
Profiling of the corpus is consistent with its suitability for use in language engineering applications.  ...  The corpus conforms to Zipf's law and comparative type to token ratios are consistent with lower term sparsity (an expectation for legal language).  ...  Applied to words, the method calculates the log likelihood ('LL') ratio of the frequency of a word in frequency lists extracted from each corpus.  ... 
doi:10.2139/ssrn.2304652 fatcat:cjjgi5ytprgxvh4g7af3gcz6cq

Statistical termhood measurement for mono-word terms via corpus comparison

Xiao-Yue Liu, Chunyu Kit
2009 2009 International Conference on Machine Learning and Cybernetics  
This paper examines the performance of a number of statistical measures for mono-word termhood within a corpus comparison framework.  ...  These measures are defined in terms of the frequency, information, and rank of a term candidate in a domain and a background corpus.  ...  Rayson and Garside [8] identify key items to differentiate one corpus from another using the log-likelihood (LL) statistic.  ... 
doi:10.1109/icmlc.2009.5212765 fatcat:sxgbfsexdfecvi2e4ic44ogjkq

A Novel Method for Arabic Multi-Word Term Extraction

Hadni Meryem, Said Alaoui Ouatik, Abdelmonaime Lachkar
2014 International Journal of Database Management Systems  
These methods present some drawbacks that limit their use. In fact, they can only deal with bigram terms, and they do not yield good accuracy.  ...  To evaluate and illustrate the efficiency of our proposed method for AMWTs extraction, a comparative study has been conducted based on Kalimat Corpus and using nine experiment schemes: In the linguistic  ...  The aim of term extraction is to automatically extract relevant terms from a given corpus.  ... 
doi:10.5121/ijdms.2014.6304 fatcat:iv22zu7tkzd5dcnh3em7tjda3e

Automatic analysis of dialect/language sets

Mahnoosh Mehrabani, John H. L. Hansen
2015 International Journal of Speech Technology  
First, a method is proposed to measure spectral acoustic differences between dialects based on a volume space analysis within a 3D model using log likelihood score distributions derived from traditional  ...  The proposed dialect proximity measures are evaluated and compared on a corpus of Arabic dialects, as well as a corpus of South Indian languages, which are closely related languages.  ...  Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s)  ... 
doi:10.1007/s10772-014-9268-y fatcat:ibdzfhwtkjcyna7xpaimz5egfe

Evaluation of automatic collocation extraction methods for language learning

Vishal Bhalla, Klara Klimcikova
2019 Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications  
A number of methods have been proposed to automatically extract collocations, i.e., conventionalized lexical combinations, from text corpora.  ...  This paper compares three end-to-end resources for collocation learning, all of which used the same corpus but different methods.  ...  Wu for her insights on FLAX extraction, Aisulu for preparing the COCA list, Ivet for the help in large scale experiments and all the anonymous reviewers for their critical feedback.  ... 
doi:10.18653/v1/w19-4428 dblp:conf/bea/BhallaK19 fatcat:bvdmjypmxnfu5ac2y3efc5hbpu

Abstractive Summarization of Spoken and Written Conversations Based on Phrasal Queries

Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng
2014 Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)  
We rank and extract the utterances in a conversation based on the overall content and the phrasal query information.  ...  Automatic and manual evaluation results over meeting, chat and email conversations show that our approach significantly outperforms baselines and previous extractive models.  ...  We also would like to acknowledge the early discussions on the related topics with Frank Tompa.  ... 
doi:10.3115/v1/p14-1115 dblp:conf/acl/MehdadCN14 fatcat:f2dfsghiarf4bmcwgyabzomfqq