Autorank: A Python package for automated ranking of classifiers
Journal of Open Source Software
Analyses to determine differences in the central tendency, e.g., mean or median values, are an important application of statistics. Often, such comparisons must be done with paired samples, i.e., populations that are not independent of each other. This is, for example, required if the performance of different machine learning algorithms should be compared on multiple data sets. The performance measures on each data set are then the paired samples, and the difference in the central tendency can be used
to rank the different algorithms. This problem is not new, and how such tests could be done was already described in the well-known article by Demšar (2006). Nevertheless, the correct use of Demšar's guidelines is hard for non-experts in statistics. The distribution of the populations must be analyzed with the Shapiro-Wilk test for normality and, depending on the normality, with Levene's test or Bartlett's test for homogeneity of the variances. Based on the results and the number of populations, researchers must decide whether the paired t-test, Wilcoxon's signed-rank test, repeated measures ANOVA with Tukey's HSD as post-hoc test, or Friedman's test with Nemenyi's post-hoc test is the suitable statistical framework. All this is already quite complex. Additionally, researchers must adjust the significance level for the number of tests to achieve the desired family-wise significance level and control the false-positive rate of the test results. Moreover, there are important aspects that go beyond Demšar's guidelines regarding best practice for the reporting of statistical results. Good reporting goes beyond simply stating the significance of findings. Additional aspects also matter, e.g., effect sizes, confidence intervals, and the decision whether it is appropriate to report the mean value and standard deviation, or whether the median value and the median absolute deviation are better suited.
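The decision procedure described above can be illustrated with a minimal Python sketch. It is not Autorank's API: the names `AnalysisPlan`, `all_normal`, and `plan_tests` are hypothetical, and the p-values are assumed to have been computed beforehand, for example with a statistics library. The sketch further assumes a Bonferroni correction for the normality tests, Bartlett's test for normal data and Levene's test otherwise, and the signed-rank variant of Wilcoxon's test, which is the one applicable to paired samples.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AnalysisPlan:
    """Hypothetical container for the selected tests (illustration only)."""
    homogeneity_test: str        # variance check that matches the normality result
    omnibus_test: str            # test for any difference in central tendency
    posthoc_test: Optional[str]  # pairwise follow-up, only for three or more populations


def all_normal(shapiro_pvalues: Sequence[float], alpha: float = 0.05) -> bool:
    """Return True if no Shapiro-Wilk test rejects normality.

    Assumption: the significance level is Bonferroni-corrected, i.e.
    divided by the number of populations that are tested.
    """
    corrected_alpha = alpha / len(shapiro_pvalues)
    return all(p >= corrected_alpha for p in shapiro_pvalues)


def plan_tests(shapiro_pvalues: Sequence[float],
               homogeneity_pvalue: float,
               alpha: float = 0.05) -> AnalysisPlan:
    """Select the statistical tests from precomputed p-values.

    shapiro_pvalues:    one Shapiro-Wilk p-value per population.
    homogeneity_pvalue: p-value of Bartlett's test (normal data) or
                        Levene's test (non-normal data).
    """
    k = len(shapiro_pvalues)
    if k < 2:
        raise ValueError("need at least two populations to compare")

    normal = all_normal(shapiro_pvalues, alpha)
    homogeneity_test = "Bartlett" if normal else "Levene"
    homoscedastic = homogeneity_pvalue >= alpha

    if k == 2:
        # Two paired populations: no post-hoc test is needed.
        omnibus = "paired t-test" if normal else "Wilcoxon signed-rank test"
        posthoc = None
    elif normal and homoscedastic:
        omnibus, posthoc = "repeated measures ANOVA", "Tukey HSD"
    else:
        omnibus, posthoc = "Friedman test", "Nemenyi test"

    return AnalysisPlan(homogeneity_test, omnibus, posthoc)


# Three populations; 0.03 is above the corrected level 0.05 / 3,
# so normality is not rejected for any of them.
plan = plan_tests([0.41, 0.03, 0.57], homogeneity_pvalue=0.22)
print(plan.omnibus_test, "+", plan.posthoc_test)
# prints "repeated measures ANOVA + Tukey HSD"
```

The example shows why the correction of the significance level matters: without it, the second population would be treated as non-normal at a level of 0.05, and the sketch would select Friedman's test with Nemenyi's post-hoc test instead.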