PodcastMix: A dataset for separating music and speech in podcasts

Nicolás Schmidt, Marius Miron, Jordi Pons
2021 Zenodo  
Over the last few years, the popularity of podcast shows on streaming services has increased considerably. These shows frequently use licensed music, but the precision of song identification services can be affected by the speakers' voices in the mix. This is a major problem both for musicians, who do not receive their royalty payments, and for broadcasters, who may face legal problems for non-compliance with international copyright laws. In this Master's thesis, two state-of-the-art music source separation models, ConvTasNet and UNet, were benchmarked on a novel podcast-like audio dataset called PodcastMix, with the objective of separating the speakers' voices from the background music of a podcast. In this way, the task of separating background music from foreground speech was formalized. The new dataset is composed of music from the Jamendo free music streaming service mixed with the VCTK speech dataset. The models were trained on this dataset and evaluated both on its test partition and on a dataset of real podcasts. The results show that UNet performs better than ConvTasNet at separating speakers and music from podcasts. The benchmark was performed using the Asteroid toolkit, and the evaluation metrics were computed using the BSSEval tool in order to measure the quality of the separations.
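As a rough illustration of the task described above, the following minimal sketch builds a podcast-like mixture by summing a speech signal with attenuated music and then scores an estimate against its reference. It is not the authors' pipeline: the function names, the `music_gain` value, and the synthetic signals are assumptions made for this example, and `simple_sdr` is a plain signal-to-distortion ratio, which is simpler than the metrics BSSEval computes.

```python
import numpy as np


def mix_speech_and_music(speech, music, music_gain=0.3):
    """Sum speech with attenuated music (hypothetical gain, for illustration).

    Returns the mixture and the two reference sources, trimmed to equal length.
    """
    n = min(len(speech), len(music))
    speech_ref = speech[:n]
    music_ref = music_gain * music[:n]
    return speech_ref + music_ref, speech_ref, music_ref


def simple_sdr(reference, estimate, eps=1e-8):
    """Plain signal-to-distortion ratio in dB (a simplification, not BSSEval)."""
    distortion = reference - estimate
    return 10 * np.log10((np.sum(reference**2) + eps) / (np.sum(distortion**2) + eps))


# Synthetic stand-ins for real recordings (assumption: 1 s at 8 kHz).
rng = np.random.default_rng(0)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2 * np.pi * 220 * t)            # stand-in for a speech clip
music = rng.standard_normal(sample_rate + 100)  # stand-in for a music clip

mixture, speech_ref, music_ref = mix_speech_and_music(speech, music)

# Baseline: using the unprocessed mixture as the "speech estimate".
baseline_sdr = simple_sdr(speech_ref, mixture)
```

A separation model would aim to produce speech and music estimates that score well above this do-nothing baseline.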
doi:10.5281/zenodo.5554789 fatcat:75wg7qrslnez5buxw46mublk54