A critical analysis of studies that address the use of text mining for citation screening in systematic reviews

Babatunde K. Olorisade, Ed de Quincey, Pearl Brereton, Peter Andras
2016 Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering - EASE '16  
Since the introduction of the systematic review process to Software Engineering in 2004, researchers have investigated a number of ways to mitigate the amount of effort and time taken to filter through large volumes of literature. Aim: This study aims to provide a critical analysis of text mining techniques used to support the citation screening stage of the systematic review process. Method: We critically re-reviewed papers included in a previous systematic review which addressed the use of
more » ... t mining methods to support the screening of papers for inclusion in a review. The previous review did not provide a detailed analysis of the text mining methods used. We focus on the availability in the papers of information about the text mining methods employed, including the description and explanation of the methods, parameter settings, assessment of the appropriateness of their application given the size and dimensionality of the data used, performance on training, testing and validation data sets, and further information that may support the reproducibility of the included studies. Results: Support Vector Machines (SVM), Naïve Bayes (NB) and Committee of classifiers (Ensemble) are the most used classification algorithms. In all of the studies, features were represented with Bag-of-Words (BOW) using both binary features (28%) and term frequency (66%). Five studies experimented with n-grams with n between 2 and 4, but mostly the unigram was used. χ 2 , information gain and tf-idf were the most commonly used feature selection techniques. Feature extraction was rarely used although LDA and topic modelling were used. Recall, precision, F and AUC were the most used metrics and cross validation was also well used. More than half of the studies used a corpus size of below 1,000 documents for their experiments while corpus size for around 80% of the studies was 3,000 or fewer documents. The major common ground we found for comparing performance assessment based on independent replication of studies was the use of the same dataset but a sound performance comparison could not be established because the studies had little else in common. In most of the studies, insufficient information was reported to enable independent replication. The studies analysed generally did not include any discussion of the statistical appropriateness of the text mining method that they applied. In the case of applications of SVM, none of the studies report the number of support vectors that they found to indicate the complexity of the prediction engine that they use, making it impossible to judge the extent to which over-fitting might account for the good performance results. Conclusions: There is yet to be concrete evidence about the effectiveness of text mining algorithms regarding their use in the automation of citation screening in systematic reviews. The studies indicate that options are still being explored, but there is a need for better reporting as well as more explicit process details and access to datasets to facilitate study replication for evidence strengthening. In general, the reader often gets the impression that text mining algorithms were applied as magic tools in the reviewed papers, relying on default settings or default optimization of available machine learning toolboxes without an in-depth understanding of the statistical validity and appropriateness of such tools for text mining purposes. CCS Concepts •Computing methodologies➝Artificial intelligence➝Machine learning •General and reference➝Cross-computing tools and techniques➝Experimentation •General and reference➝Document types➝Surveys and overviews
doi:10.1145/2915970.2915982 dblp:conf/ease/OlorisadeQBA16 fatcat:qpjigspd7nacrajfduz5x3vdd4