Voice Conversion Based Data Augmentation to Improve Children's Speech Recognition in Limited Data Scenario

S. Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, Waquar Ahmad
Interspeech 2020
Automatic recognition of children's speech is a challenging research problem for several reasons. One of these is the unavailability of large amounts of speech data from child speakers for developing automatic speech recognition (ASR) systems that employ deep learning architectures. Using a limited amount of training data limits the power of the learned system. To overcome this issue, we have explored means to effectively make use of adults' speech data for training an ASR system. For that purpose, generative adversarial network (GAN) based voice conversion (VC) is exploited to modify the acoustic attributes of adults' speech, making it perceptually similar to children's speech. The original and converted speech samples from adult speakers are then pooled together to learn the statistical model parameters. A significantly improved recognition rate for children's speech is noted due to VC-based data augmentation. To further enhance the recognition rate, a limited amount of children's speech data is also pooled into training. A large reduction in error rate is observed in this case as well. It is worth mentioning that GAN-based VC does not change the speaking rate. To demonstrate the need to deal with speaking-rate differences, we report the results of time-scale modification of children's speech test data.
doi:10.21437/interspeech.2020-1112 dblp:conf/interspeech/ShahnawazuddinA20 fatcat:rf2ijymv5bhflgbsch34fxsu4q