Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music
EURASIP Journal on Audio, Speech, and Music Processing
AbstractMultiple predominant instrument recognition in polyphonic music is addressed using decision level fusion of three transformer-based architectures on an ensemble of visual representations. The ensemble consists of Mel-spectrogram, modgdgram, and tempogram. Predominant instrument recognition refers to the problem where the prominent instrument is identified from a mixture of instruments being played together. We experimented with two transformer architectures like Vision transformer
... and Shifted window transformer (Swin-T) for the proposed task. The performance of the proposed system is compared with that of the state-of-the-art Han's model, convolutional neural networks (CNN), and deep neural networks (DNN). Transformer networks learn the distinctive local characteristics from the visual representations and classify the instrument to the group where it belongs. The proposed system is systematically evaluated using the IRMAS dataset with eleven classes. A wave generative adversarial network (WaveGAN) architecture is also employed to generate audio files for data augmentation. We train our networks from fixed-length music excerpts with a single-labeled predominant instrument and estimate an arbitrary number of predominant instruments from the variable-length test audio file without any sliding window analysis and aggregation strategy as in existing algorithms. The ensemble voting scheme using Swin-T reports a micro and macro F1 score of 0.66 and 0.62, respectively. These metrics are 3.12% and 12.72% relatively higher than those obtained by the state-of-the-art Han's model. The architectural choice of transformers with ensemble voting on Mel-spectro-/modgd-/tempogram has merit in recognizing the predominant instruments in polyphonic music.