Automatic Music Transcription and Audio Source Separation

M. D. Plumbley, S. A. Abdallah, J. P. Bello, M. E. Davies, G. Monti, M. B. Sandler
2002, Cybernetics and Systems. doi:10.1080/01969720290040777
In this article, we give an overview of a range of approaches to the analysis and separation of musical audio. In particular, we consider the problems of automatic music transcription and audio source separation, which are of particular interest to our group. Monophonic music transcription, where only a single note sounds at any one time, can be tackled using an autocorrelation-based method. For polyphonic music transcription, where several notes may sound at any time, other approaches can be used, such as a
blackboard model or a multiple-cause/sparse coding method. The latter is based on ideas and methods related to independent component analysis (ICA), a method for sound source separation.

Keywords: Automatic music transcription, blind source separation, ICA, sparse coding, computational auditory scene analysis

Over the last decade or so, and particularly since the publication of Bregman's seminal book on Auditory Scene Analysis (Bregman 1990), there has been increasing interest in the problem of Computational Auditory Scene Analysis (CASA): how to design computer-based models that can analyze an auditory scene. Imagine you are standing in a busy street among a crowd of people. You can hear traffic noise, the footsteps of people nearby, the bleeping of a pedestrian crossing, your mobile phone ringing, and colleagues behind you having a conversation. Despite all these different sound sources, you have a pretty good idea of what is going on around you. It is more than just a mess of overlapping noise, and if you try hard you can concentrate on one of these sources if it is important to you (such as the conversation behind you).

This has proved to be a very difficult problem for machines: it requires both the separation of many sound sources and the analysis of the content of those sources. However, a few authors have begun to tackle the problem in recent years, with some success (see e.g. Ellis 1996).

One particular aspect of auditory scene analysis of interest to our group is automatic music transcription. Here, the sound sources are one or more instruments playing a piece of music, and we wish to analyze the audio to identify which instruments are playing, and when and for how long each note is played. From this analysis we should then be able to produce a written musical score that shows each note and its duration in conventional music notation (for conventional Western music, at least). In principle, this score could then be used to recreate the musical piece that was played.

We are also interested in the ability to separate sound sources based on their different locations in an auditory scene. This capability, known as blind source separation (Bell & Sejnowski 1995), can be useful on its own if we wish to eliminate some sound sources and concentrate on others. It also offers the potential to be combined with automatic music transcription systems in the future, to improve transcription performance based on the location of instruments as well as the different sounds they make.

Automatic music transcription

As we mentioned above, the aim of automatic music transcription is to analyze an audio signal to discover the instruments and notes that are playing, and so to be able to produce a written transcription of the piece of music.
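The paper itself contains no code, but the autocorrelation approach to monophonic transcription mentioned in the abstract can be sketched briefly. The following is a minimal illustration under our own assumptions (the function name, frame length, and pitch search range are ours, not the authors'): a frame's fundamental frequency is estimated from the strongest peak in its autocorrelation.

```python
import numpy as np

def detect_pitch_autocorr(frame, sample_rate, fmin=50.0, fmax=2000.0):
    """Estimate the fundamental frequency of a monophonic frame from
    the strongest peak of its autocorrelation."""
    frame = frame - np.mean(frame)            # remove any DC offset
    autocorr = np.correlate(frame, frame, mode="full")
    autocorr = autocorr[len(autocorr) // 2:]  # keep non-negative lags only

    # Search only lags that correspond to plausible musical pitches.
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + int(np.argmax(autocorr[lag_min:lag_max]))
    return sample_rate / lag

# A 440 Hz sine should be detected as approximately 440 Hz.
sr = 8000
t = np.arange(1024) / sr
print(detect_pitch_autocorr(np.sin(2 * np.pi * 440 * t), sr))
```

Mapping each frame's estimated pitch to the nearest note of the scale, and grouping consecutive frames carrying the same note, then yields a simple monophonic transcription.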
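For the polyphonic case, the abstract points to a multiple-cause/sparse coding model. As a loose stand-in rather than the authors' method, the sketch below greedily explains a magnitude spectrum as a sparse sum of harmonic note templates (a matching-pursuit-style decomposition); the template shape, bandwidth, and note range are illustrative assumptions.

```python
import numpy as np

def note_template(f0, sr, n_fft, n_harm=8, width=20.0):
    """Unit-norm magnitude-spectrum template with Gaussian peaks
    at the first few harmonics of f0."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    spec = np.zeros_like(freqs)
    for h in range(1, n_harm + 1):
        spec += np.exp(-0.5 * ((freqs - h * f0) / width) ** 2) / h
    return spec / np.linalg.norm(spec)

def transcribe_frame(spectrum, templates, max_notes=3):
    """Greedy sparse decomposition: pick, one at a time, the note
    templates that best explain the remaining spectrum."""
    residual = spectrum.astype(float).copy()
    active = []
    for _ in range(max_notes):
        scores = templates @ residual         # correlation with each template
        k = int(np.argmax(scores))
        if scores[k] <= 0:
            break
        active.append(k)
        residual -= scores[k] * templates[k]  # remove that note's contribution
    return active

# Hypothetical dictionary: one octave of semitones starting at A4 (440 Hz).
sr, n_fft = 16000, 4096
f0s = 440.0 * 2.0 ** (np.arange(13) / 12.0)
templates = np.array([note_template(f, sr, n_fft) for f in f0s])
```

The paper's multiple-cause model instead learns its dictionary and infers note activations probabilistically; the greedy decomposition here only conveys the underlying idea that a polyphonic spectrum is a sparse combination of note spectra.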
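Finally, blind source separation in the ICA framework can be demonstrated on instantaneous mixtures. The sketch below uses scikit-learn's FastICA rather than the Bell & Sejnowski (1995) infomax algorithm cited above; the toy sources and mixing matrix are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two toy "instruments": a sinusoid and a sawtooth, plus a little noise.
rng = np.random.default_rng(0)
n = 4000
t = np.linspace(0.0, 4.0, n)
s1 = np.sin(2 * np.pi * 5 * t)
s2 = 2 * ((3 * t) % 1.0) - 1
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((n, 2))

# Instantaneous mixing: each "microphone" hears a weighted sum of sources.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = S @ A.T                # observed mixtures, shape (n, 2)

# Recover the sources (up to permutation and scaling) with FastICA.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
```

Note that ICA recovers sources only up to permutation and scaling, and that real rooms produce convolutive (filtered and delayed) mixtures, which are substantially harder to separate than this instantaneous case.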