The Problem of Limited Inter-rater Agreement in Modelling Music Similarity

Arthur Flexer, Thomas Grill
2016 Journal of New Music Research  
One of the central tasks in the annual MIREX evaluation campaign is the "Audio Music Similarity and Retrieval (AMS)" task. Songs that algorithms rank as highly similar are graded by human evaluators according to their subjective judgment of similarity. By analyzing results from the AMS tasks of the years 2006 to 2013, we demonstrate that: (i) due to low inter-rater agreement, there exists an upper bound of performance in terms of subjective gradings; (ii) this upper bound was already achieved by participating algorithms in 2009 and has not been surpassed since. Based on this sobering result, we discuss ways to improve future evaluations of audio music similarity.
15th International Society for Music Information Retrieval Conference (ISMIR 2014)
doi:10.1080/09298215.2016.1200631 pmid:28190932 pmcid:PMC5256035 fatcat:h6s6h3hikjayhpzkcnsaxfvnoq