Using a Small Amount of Text-Independent Speech Data for a BiLSTM Large-Scale Speaker Identification Approach

Mohammad K. Nammous, Khalid Saeed, Paweł Kobojek
2020 Journal of King Saud University: Computer and Information Sciences  
Communication between people and machines has expanded over the last two decades. Corresponding techniques have been developed to meet the need for voice understanding, including large-scale speech and speaker recognition. In this paper, the authors propose a simplified deep-learning approach to the large-scale speaker identification task that uses as little training data as possible. The Fisher speech corpus was explored to select the recordings of unique speakers having sufficient data. The authors use the MFCC method to represent the feature vectors of a large set of more than 4,000 speakers with about 343 hours of speech signals. The solution omits pre-processing and considers longer segments of the voice signals. Various portions of the training dataset were tested, and larger percentages of the data were dedicated to testing. Bidirectional LSTM neural networks provided an accuracy rate of up to 76.9% for individual voice segments, and 99.5% when the segments of each speaker were considered as a bundle. Doubling the amount of training data yielded a perfect accuracy rate of 100%.
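The gap between the per-segment and per-bundle accuracy rates can be illustrated with a small sketch. The abstract does not say how the segments of a speaker are combined, so the majority vote used here is an assumption, and the function names and toy labels are hypothetical rather than taken from the paper.

```python
from collections import Counter, defaultdict


def segment_accuracy(true_ids, predicted_ids):
    """Fraction of individual segments assigned to the correct speaker."""
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return correct / len(true_ids)


def bundle_accuracy(true_ids, predicted_ids):
    """Fraction of speakers identified correctly when all of a speaker's
    segments are pooled. Pooling by majority vote is an assumption; ties
    are broken by whichever label was counted first."""
    votes = defaultdict(Counter)
    for t, p in zip(true_ids, predicted_ids):
        votes[t][p] += 1
    correct = sum(
        1 for speaker, counts in votes.items()
        if counts.most_common(1)[0][0] == speaker
    )
    return correct / len(votes)


# Toy data: two speakers, three segments each, one segment error per speaker.
true_ids = ["A", "A", "A", "B", "B", "B"]
predicted_ids = ["A", "B", "A", "B", "B", "A"]

print(segment_accuracy(true_ids, predicted_ids))
print(bundle_accuracy(true_ids, predicted_ids))
```

In this toy case four of six segments are classified correctly, yet both speakers are still identified once their segments are pooled, which is the kind of effect that could let a bundle-level rate sit well above a segment-level one.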
doi:10.1016/j.jksuci.2020.03.011 fatcat:3dgjpnz3l5dv5kqmhg7lrkxgau