F0 Modeling For Singing Voice Synthesizers with LSTM Recurrent Neural Networks

Serkan Özer, Merlijn Blaauw, Martí Umbert
2015 Zenodo  
In the singing voice synthesis process, the score and lyrics of a target song are converted into singing voice expression parameters such as F0, spectra, and dynamics. This study aims to model and automatically generate the F0 parameter while ensuring expressiveness and human-likeness in the final synthesized singing voice. Musical context is an important factor in the evolution of F0 over a singing performance. We therefore propose a machine-learning framework that learns the F0 of singing from a set of real human singing recordings with respect to musical contexts while capturing the expressiveness and naturalness of the human singer. Given the musical contexts of a score, the trained model can then generate the F0 parameter automatically. Recurrent neural networks with Long Short-Term Memory (LSTM) units are employed for the first time for this specific problem because of their flexibility and power in modeling complex sequences. Two recurrent neural networks are trained to learn the baseline and vibrato parts of F0 separately. F0 sequences are then generated from the trained networks and applied to a singing voice synthesizer. Finally, the synthesized songs are evaluated with AB preference tests.
doi:10.5281/zenodo.3755574 fatcat:44izjub7yjbivn7mrj6sf7h2ae
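
The abstract states that the baseline and vibrato parts of F0 are learned separately but does not say how a contour is split into those parts. The following is a minimal sketch of one possible decomposition, assuming a centered moving average as the baseline and the residual as the vibrato. The smoothing method, the function names (`moving_average`, `split_f0`, `combine_f0`), the frame rate, and the synthetic contour are all illustrative assumptions and are not taken from the paper.

```python
import math


def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def split_f0(f0, window):
    """Split an F0 contour into a slow baseline and a vibrato residual.

    Assumption: the baseline is a smoothed copy of the contour and the
    vibrato is whatever remains. The paper may use a different method.
    """
    baseline = moving_average(f0, window)
    vibrato = [f - b for f, b in zip(f0, baseline)]
    return baseline, vibrato


def combine_f0(baseline, vibrato):
    """Recombine the two parts into a single F0 contour."""
    return [b + v for b, v in zip(baseline, vibrato)]


# Synthetic contour (hypothetical values): a flat note at 6000 cents
# with a 5 Hz vibrato of 50 cents depth, sampled at 100 frames per second.
FRAME_RATE = 100
VIBRATO_HZ = 5.0
VIBRATO_DEPTH = 50.0
f0 = [
    6000.0 + VIBRATO_DEPTH * math.sin(2 * math.pi * VIBRATO_HZ * n / FRAME_RATE)
    for n in range(200)
]

# A window of about one vibrato period (20 frames) averages the vibrato out.
baseline, vibrato = split_f0(f0, window=21)
reconstructed = combine_f0(baseline, vibrato)
```

Under this assumption, each part would serve as the training target of its own network, and the two generated sequences would be summed again before being passed to the synthesizer.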