
Measuring LDA topic stability from clusters of replicated runs

Mika V. Mantyla, Maelick Claes, Umar Farooq
2018 Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement - ESEM '18  
Aim: We propose a method that relies on replicated LDA runs, clustering, and providing a stability metric for the topics.  ...  For the clusters, we try multiple stability metrics, out of which we recommend Rank-Biased Overlap, showing the stability of the topics inside the clusters.  ...  ACKNOWLEDGMENTS This work has been supported by Academy of Finland grant 298020.  ... 
doi:10.1145/3239235.3267435 dblp:conf/esem/MantylaCF18 fatcat:dycdnyah5ffg5dqajjm77bh734

Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs [article]

Jonas Rieger, Lars Koppers, Carsten Jentsch, Jörg Rahnenführer
2020 arXiv   pre-print
We aim to improve the reliability of LDA results. Therefore, we study the stability of LDA by comparing assignments from replicated runs.  ...  This approach leads to the new measure S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning) for quantifying the stability of LDA models.  ...  In this work we propose to assess the stability of LDA with clustering techniques applied to replicated LDA runs.  ... 
arXiv:2003.04980v1 fatcat:q7nqr47mnzcfbjvbnf2hdz5l5q

How Many Topics? Stability Analysis for Topic Models [article]

Derek Greene, Derek O'Callaghan, Pádraig Cunningham
2014 arXiv   pre-print
Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics.  ...  In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the  ...  This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.  ... 
arXiv:1404.4606v3 fatcat:uwznp7klqzaytop27kveikijua

How Many Topics? Stability Analysis for Topic Models [chapter]

Derek Greene, Derek O'Callaghan, Pádraig Cunningham
2014 Lecture Notes in Computer Science  
Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics.  ...  In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the  ...  This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.  ... 
doi:10.1007/978-3-662-44848-9_32 fatcat:6exdolamtbbvpe3yqjpecfcmku

Topic Model Stability for Hierarchical Summarization

John Miller, Kathleen McCoy
2017 Proceedings of the Workshop on New Frontiers in Summarization  
stability of the centroid model.  ...  We ran stability experiments for standard corpora and a development corpus of Global Warming articles.  ...  to achieve Results -Factorial Design Stability analysis was performed for each experimental group of replicates.  ... 
doi:10.18653/v1/w17-4509 dblp:conf/emnlp/MillerM17 fatcat:z4m7tkjzknavdgljx3urcylxca

Enhancing Digital Book Clustering by LDAC Model

Lidong WANG, Yuan JIE
2012 IEICE transactions on information and systems  
The main goal of LDAC topic modeling is to effectively extract topics from digital books. Subsequently, Gibbs sampling is applied for parameter inference.  ...  To do the correct clustering for digital books, a novel method based on probabilistic topic model is proposed. Firstly, we build a topic model named LDAC.  ...  The stability of the algorithm is also an essential factor. In Fig. 4, we depict the clustering performance with different category numbers under the run of LDA and LDAC, respectively.  ... 
doi:10.1587/transinf.e95.d.982 fatcat:7mouki55x5gx3mvc4d5w6argfa

Text Representation Using Multi-level Latent Dirichlet Allocation [chapter]

Amir H. Razavi, Diana Inkpen
2014 Lecture Notes in Computer Science  
The method applies Latent Dirichlet Allocation (LDA) on a corpus to infer its major topics, which will be used for document representation.  ...  The representation that we propose has multiple levels (granularities) by using different numbers of topics.  ...  By running the LDA topic estimation algorithm, we have a topical cluster membership distribution vector for each document in the corpus.  ... 
doi:10.1007/978-3-319-06483-3_19 fatcat:uhuw3ekz7baxbm6yf7vtplyede

Expert Refined Topic Models to Edit Topic Clusters in Image Analysis Applied to Welding Engineering

Theodore T. Allen, Hui Xiong, Shih-Hsien Tseng
2020 Informatics  
This paper proposes a new method to generate edited topics or clusters to analyze images for prioritizing quality issues.  ...  Numerical examples illustrate the benefits of the high-level data related to improving accuracy measured by Kullback–Leibler (KL) distance.  ...  Conflicts of Interest: The authors declare no conflicts of interest.  ... 
doi:10.3390/informatics7030021 fatcat:coqmxsjjenbbnd5esipruhfjxe

Topic Modelling of Empirical Text Corpora: Validity, Reliability, and Reproducibility in Comparison to Semantic Maps [article]

Tobias Hecking, Loet Leydesdorff
2018 arXiv   pre-print
Using the 6,638 case descriptions of societal impact submitted for evaluation in the Research Excellence Framework (REF 2014), we replicate the topic model (Latent Dirichlet Allocation or LDA) made in  ...  Removing a small fraction of documents from the sample, for example, has on average a much larger impact on LDA than on PCA-based models to the extent that the largest distortion in the case of PCA has  ...  An LDA-based model m differs from a PCA-based clustering because the words are not partitioned, but words can be part of multiple topics.  ... 
arXiv:1806.01045v1 fatcat:gpxpgdhrszbhfmjtjcaxasfdd4

Unsupervised identification of crime problems from police free-text data

Daniel Birks, Alex Coleman, David Jackson
2020 Crime Science  
Results of our analyses demonstrate that topic modelling algorithms are capable of clustering substantively different burglary problems without prior knowledge of such groupings.  ...  We present a novel exploratory application of unsupervised machine-learning methods to identify clusters of specific crime problems from unstructured modus operandi free-text data within a single administrative  ...  Availability of data and materials Source code for the project can be found at https://github.com/QuantCrim-Leeds/Police-Free-Text-LDA-Dashboard.  ... 
doi:10.1186/s40163-020-00127-4 fatcat:x524lo4evncl3fei36r3srwlii

Thematic Analysis of 18 Years of PERC Proceedings using Natural Language Processing [article]

Tor Ole B. Odden, Alessandro Marin, Marcos D. Caballero
2020 arXiv   pre-print
Based on these results, we suggest that unsupervised text analysis techniques like LDA may hold promise for providing quantitative, independent, and replicable analyses of educational research literature  ...  Research (PER) over time and to rate the distribution of these topics within each paper.  ...  Our first replication run featured an extremely similar set of 10 topics (with minor reshuffling of words); the second differed from our presented model by one topic (it had two topics focused on representations  ... 
arXiv:2001.10753v1 fatcat:txsymkvdh5eh3h3kpttm7xnrlu

A Systematic Comparison of Search-Based Approaches for LDA Hyperparameter Tuning

Annibale Panichella
2020 Information and Software Technology  
While LDA has been mostly used with default settings, previous studies showed that default hyperparameter values generate sub-optimal topics from software documents.  ...  Context: Latent Dirichlet Allocation (LDA) has been successfully used in the literature to extract topics from software documents and support developers in various software engineering tasks.  ...  Besides, raw score re-run LDA multiple times with the same configuration to measure the topic stability across the runs.  ... 
doi:10.1016/j.infsof.2020.106411 fatcat:zwhbpfu725gdfkne6edwhuppxu

Thematic analysis of 18 years of physics education research conference proceedings using natural language processing

Tor Ole B. Odden, Alessandro Marin, Marcos D. Caballero
2020 Physical Review Physics Education Research  
Based on these results, we suggest that unsupervised text analysis techniques like LDA may hold promise for providing quantitative, independent, and replicable analyses of educational research literature  ...  research (PER) over time and to rate the distribution of these topics within each paper.  ...  For example, by analyzing analogous publications to PERC from other fields, one might be able to compare the key topics of physics education research with those from other fields of discipline-based educational  ... 
doi:10.1103/physrevphyseducres.16.010142 fatcat:rmjtv276jndoxlnuokqewotpyi

User Ex Machina : Simulation as a Design Probe in Human-in-the-Loop Text Analytics [article]

Anamaria Crisan, Michael Correll
2021 arXiv   pre-print
In this paper we conduct a simulation-based analysis of human-centered interactions with topic models, with the objective of measuring the sensitivity of topic models to common classes of user actions.  ...  Topic models are widely used analysis techniques for clustering documents and surfacing thematic elements of text corpora.  ...  As such, we selected three designs as inspiration: TopicCheck [14] uses a matrix of small multiples to assess the stability of a topic model algorithm across runs.  ... 
arXiv:2101.02244v1 fatcat:dge3niwcpncjdbo2q3estntdj4

Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology

Daniel Maier, A. Waldherr, P. Miltner, G. Wiedemann, A. Niekler, A. Keinert, B. Pfetsch, G. Heyer, U. Reber, T. Häussler, H. Schmid-Petri, S. Adam
2018 Communication Methods and Measures  
Consequently, we develop a brief hands-on user guide for applying LDA topic modeling. We demonstrate the value of our approach with empirical data from an ongoing research project.  ...  parameters, including the number of topics to be generated; (c) evaluation of the model's reliability; and (d) the process of validly interpreting the resulting topics.  ...  More specifically, the top 30 words of the validated "issues" (from the k matrix) were clustered using the cosine-similarity measure and the "complete" clustering method, as implemented in the "hclust"  ... 
doi:10.1080/19312458.2018.1430754 fatcat:7cfaethx5vhc7e6iq6gm2x6cqq