Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-resource ASR

Bin Wu, Sakriani Sakti, Jinsong Zhang, Satoshi Nakamura
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022
Speech feature extraction is critical for ASR systems. Successful features such as MFCC and PLP use filterbank techniques to model log-scaled speech perception, but they fail to model the adaptation of human speech perception through hearing experience. Infant perception, which adapts by hearing speech without text, may cause permanent modifications of brain state (engrams) that serve as a fundamental physical basis for the lifelong formation of speech perception. This realization motivates us to model such an unsupervised adaptation process, where adaptation denotes perception that is affected or changed by the history of experiences, with the Dirichlet Process Gaussian Mixture Model (DPGMM) and the DPGMM-RNN hybrid model, and to extract perceptual features that improve ASR. Our proposed features extend MFCC features with posteriorgrams extracted from the DPGMM algorithm or the DPGMM-RNN hybrid model. Our analysis shows that the perplexities of the DPGMM and DPGMM-RNN models agree with infant auditory perplexity, supporting the claim that the proposed features are perceptual. Our ASR results verify the effectiveness of the proposed unsupervised features, compared with the supervised bottleneck features from Kaldi, on tasks such as LVCSR on WSJ and ASR on noisy low-resource telephone conversations.
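The feature-combination idea can be sketched in a few lines: fit a Dirichlet-process mixture on unlabeled speech frames, compute per-frame posteriorgrams, and append them to the MFCCs. The sketch below is not the authors' implementation; it substitutes scikit-learn's truncated Dirichlet-process mixture (BayesianGaussianMixture) for the DPGMM sampler used in the paper, and the file name, truncation level, and hyper-parameters are illustrative assumptions.

```python
# Minimal sketch of "MFCC + DPGMM posteriorgram" features, assuming a
# truncated DP mixture from scikit-learn in place of the paper's DPGMM
# sampler; file name and settings are placeholders.
import numpy as np
import librosa
from sklearn.mixture import BayesianGaussianMixture

# Load an (assumed) unlabeled training utterance and compute MFCC frames.
y, sr = librosa.load("train_utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T      # shape (T, 13)

# Fit a truncated Dirichlet-process Gaussian mixture on the unlabeled frames.
dpgmm = BayesianGaussianMixture(
    n_components=50,                                  # truncation level (assumed)
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
)
dpgmm.fit(mfcc)

# Posteriorgram: per-frame posterior over the discovered mixture components.
posteriorgram = dpgmm.predict_proba(mfcc)             # shape (T, 50)

# Proposed-style feature: MFCC frames extended with the posteriorgram.
features = np.hstack([mfcc, posteriorgram])           # shape (T, 13 + 50)
print(features.shape)
```

In the paper the posteriorgrams come from a DPGMM (or a DPGMM-RNN hybrid) trained without any transcripts, so the same concatenation step applies; only the model producing the per-frame posteriors differs from this sketch.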
doi:10.1109/taslp.2022.3150220