Threshold-free code clone detection for a large-scale heterogeneous Java repository

Iman Keivanloo, Feng Zhang, Ying Zou
2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)
Code clones are unavoidable entities in software ecosystems. A variety of clone-detection algorithms are available for finding code clones. For Type-3 clone detection at method granularity (i.e., similar methods with changes in statements), the dissimilarity threshold is one of the possible configuration parameters. Existing approaches use a single threshold to detect Type-3 clones across a repository. However, our study shows that to detect Type-3 clones at method granularity on a large-scale heterogeneous repository, multiple thresholds are often required. We find that the performance of clone detection improves when different thresholds are selected for various groups of clones in a heterogeneous repository (i.e., various applications). In this paper, we propose a threshold-free approach to detect Type-3 clones at method granularity across a large number of applications. Our approach uses an unsupervised learning algorithm, i.e., k-means, to determine true and false clones. We use a clone benchmark with 330,840 tagged clones from 24,824 open source Java projects for our study. We observe that our approach improves the performance significantly, by 12% in terms of F-measure. Furthermore, our threshold-free approach eliminates the concern of practitioners about possible misconfiguration of Type-3 clone detection tools.
doi:10.1109/saner.2015.7081830 dblp:conf/wcre/Keivanloo0Z15 fatcat:bagkmkeebjd3loxithm7f5tv6i
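
The abstract names k-means as the mechanism that separates true from false clones without a hand-picked threshold, but it does not say which features are clustered. The sketch below is only an illustration of that general idea, not the authors' implementation. It assumes each candidate clone pair within one application carries a single dissimilarity score in [0, 1], runs k-means with k = 2 on those scores, and treats the cluster with the lower centroid as the true clones. The function names `two_means_1d` and `label_candidates` and the sample data are hypothetical.

```python
from statistics import mean


def two_means_1d(scores, max_iter=100):
    """Split 1-D scores into two clusters with Lloyd's k-means (k = 2).

    Returns the (low, high) centroids. Assumption: one dissimilarity
    score per candidate pair; the paper's actual features may differ.
    """
    if len(set(scores)) < 2:
        raise ValueError("need at least two distinct scores")
    # Deterministic initialisation: start from the two extreme scores.
    low, high = min(scores), max(scores)
    for _ in range(max_iter):
        # Assignment step: each score joins its nearest centroid.
        near_low = [s for s in scores if abs(s - low) <= abs(s - high)]
        near_high = [s for s in scores if abs(s - low) > abs(s - high)]
        # Update step: move each centroid to the mean of its cluster.
        new_low, new_high = mean(near_low), mean(near_high)
        if (new_low, new_high) == (low, high):
            break  # centroids stopped moving, so the clustering converged
        low, high = new_low, new_high
    return low, high


def label_candidates(candidates):
    """Label candidate pairs of one application as true or false clones.

    candidates: list of (pair_id, dissimilarity) tuples.
    Returns {pair_id: True} for pairs in the low-dissimilarity cluster
    and {pair_id: False} for the rest.
    """
    low, high = two_means_1d([d for _, d in candidates])
    return {pid: abs(d - low) <= abs(d - high) for pid, d in candidates}
```

Calling `label_candidates` once per application lets the boundary between the two clusters fall wherever that application's score distribution puts it, which plays the role of a per-group threshold without anyone configuring one.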