On Acoustic Diversification Front-End for Spoken Language Identification
IEEE Transactions on Audio, Speech, and Language Processing
The parallel phone recognition followed by language model (PPRLM) architecture represents one of the state-of-the-art spoken language identification systems. A PPRLM system comprises multiple parallel subsystems, where each subsystem employs a phone recognizer with a different phone set for a particular language. The phone recognizer extracts phonotactic attributes from the speech input to characterize a language. The multiple parallel subsystems are devised to capture the phonetic diversification available in the speech input. As an alternative, this paper investigates a new approach for building a PPRLM system that aims at improving the acoustic diversification among its parallel subsystems by using multiple acoustic models. These acoustic models are trained on the same speech data with the same phone set, but with different model structures and training paradigms. We examine the use of various structured precision (inverse covariance) matrix modeling techniques, as well as the maximum likelihood and maximum mutual information training paradigms, to produce complementary acoustic models. The results show that acoustic diversification, which requires only one set of phonetically transcribed speech data, yields performance improvements similar to those of phonetic diversification. In addition, further improvements were obtained by combining both diversification factors. The best performing system reported in this paper combined phonetic and acoustic diversification to achieve equal error rates (EERs) of 4.71% and 8.61% on the 2003 and 2005 NIST LRE sets, respectively, compared to 5.77% and 9.94% using phonetic diversification alone.
Index Terms: Acoustic modeling, fusion, maximum mutual information (MMI), parallel phone recognition followed by language model (PPRLM), precision matrix modeling, spoken language identification.
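To make the PPRLM back-end concrete, the following is a minimal sketch of how parallel subsystems might score and fuse their phone streams. It is not the system described in this paper: the add-alpha bigram language model, the simple score averaging, the function names (`train_bigram`, `pprlm_scores`, `identify`), the subsystem labels (`ml`, `mmi`), and the toy phone sequences are all assumptions made for illustration. The two subsystems share one phone set and differ only in the decoded sequence they produce, which loosely mirrors the idea of acoustic diversification.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab_size, alpha=1.0):
    """Return a scorer for an add-alpha smoothed bigram model (illustrative)."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def logprob(seq):
        # Average log-probability per bigram, so utterance length does not dominate.
        total = 0.0
        for prev, cur in zip(seq, seq[1:]):
            num = bigrams[(prev, cur)] + alpha
            den = contexts[prev] + alpha * vocab_size
            total += math.log(num / den)
        return total / max(len(seq) - 1, 1)

    return logprob


def pprlm_scores(phone_streams, models):
    """Fuse per-subsystem language scores by simple averaging (an assumption).

    phone_streams: {subsystem: decoded phone sequence}
    models: {subsystem: {language: scorer returned by train_bigram}}
    """
    languages = next(iter(models.values())).keys()
    return {
        lang: sum(models[s][lang](phone_streams[s]) for s in models) / len(models)
        for lang in languages
    }


def identify(phone_streams, models):
    """Pick the language with the highest fused score."""
    fused = pprlm_scores(phone_streams, models)
    return max(fused, key=fused.get)


# Toy training data: hypothetical phone sequences for two target languages.
train = {
    "x": [["a", "b", "a", "b", "a", "b"]],
    "y": [["a", "a", "b", "b", "a", "a"]],
}

# Two subsystems with the same phone set; in a real system each would use its
# own acoustic model (e.g. ML- vs. MMI-trained) and its own phonotactic models.
models = {
    sub: {lang: train_bigram(seqs, vocab_size=2) for lang, seqs in train.items()}
    for sub in ("ml", "mmi")
}

# The subsystems decode the same utterance slightly differently.
streams = {"ml": ["a", "b", "a", "b"], "mmi": ["a", "b", "a", "a"]}
fused = pprlm_scores(streams, models)
best = identify(streams, models)
```

In practice the fusion step is usually a trained back-end classifier rather than a plain average; the average is used here only to keep the sketch short.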