Online Audio-Visual Source Association for Chamber Music Performances

Bochen Li, Karthik Dinesh, Chenliang Xu, Gaurav Sharma, Zhiyao Duan
2019 Transactions of the International Society for Music Information Retrieval  
In audio-visual recordings of music performances, visual cues from instrument players exhibit good temporal correspondence with the audio signals and the music content. These correspondences provide useful information for estimating source associations, i.e., for identifying the affiliation between players and sound sources or score parts. In this paper, we propose a computational system that models audio-visual correspondences to achieve source association for Western chamber music ensembles, including strings, woodwind, and brass instruments. Through its three modules, the system models three typical types of correspondences between 1) body motion (e.g., bowing for string instruments and sliding for trombone) and note onsets, 2) finger motion (e.g., fingering for most woodwind and brass instruments) and note onsets, and 3) vibrato hand motion (e.g., fingering hand rolling for string instruments) and pitch fluctuations. Although the three modules are designed for estimating associations for different instruments, the overall system provides a universal framework for all common melodic instruments in Western chamber ensembles. The framework automatically and adaptively integrates the three modules, without requiring prior knowledge of the instrument types. The system operates in an online fashion, i.e., associations are updated as the audio-visual stream progresses. We evaluate the system on ensembles with different instruments and polyphony, ranging from duets to quintets. Results demonstrate that association accuracy increases as the duration of video excerpts increases. For string quintets, the accuracy exceeds 90% from just a 5-second video excerpt, while for woodwind, brass, and mixed-instrument quintets, a similar accuracy is reached after processing 30 seconds of video. The results of the proposed framework are promising and enable novel applications such as interactive audio-visual music editing and auto-whirling cameras in concerts.
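
To illustrate the general idea of association by audio-visual correspondence, the following minimal Python sketch scores each player-to-part pairing by correlating a per-player motion-magnitude signal with a per-part note-onset activity signal, then solves the assignment with the Hungarian method. This is an assumed illustration, not the authors' system: it presumes the motion signals and onset activations have already been extracted and resampled to a common frame rate, and it omits the paper's three specialized modules and online updating.

    # Illustrative sketch (assumption, not the published method): associate
    # players with score parts by correlating motion activity with onset activity.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def correspondence_score(motion, onsets):
        """Normalized correlation between a motion-magnitude signal and an
        onset-activity signal of equal length; higher means better match."""
        m = (motion - motion.mean()) / (motion.std() + 1e-8)
        o = (onsets - onsets.mean()) / (onsets.std() + 1e-8)
        return float(np.dot(m, o) / len(m))

    def associate(motions, onset_tracks):
        """motions: list of per-player signals; onset_tracks: list of per-part
        signals. Returns a dict mapping player index to part index."""
        scores = np.array([[correspondence_score(m, o) for o in onset_tracks]
                           for m in motions])
        players, parts = linear_sum_assignment(-scores)  # maximize total score
        return dict(zip(players.tolist(), parts.tolist()))

In an online setting, such scores could simply be recomputed over a growing window as the stream progresses, which matches the abstract's observation that association accuracy improves with longer video excerpts.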
doi:10.5334/tismir.25