Feedback and imitation by a caregiver guides a virtual infant to learn native phonemes and the skill of speech inversion

Heikki Rasilo, Okko Räsänen, Unto K. Laine
2013 Speech Communication  
Despite large-scale research, development of robust machines for imitation and inversion of human speech into articulatory movements has remained an unsolved problem. We propose a set of principles that can partially explain real infants' speech acquisition processes and the emergence of imitation skills, and demonstrate a simulation where a learning virtual infant (LeVI) learns to invert and imitate a virtual caregiver's speech. Based on recent findings in infants' language acquisition, LeVI learns the phonemes of its native language in a babbling phase using only the caregiver's feedback as guidance, and learns to map the caregiver's acoustically differing speech into its own articulation in a phase where LeVI is imitated by the caregiver with similar, but not exact, utterances. After the learning stage, LeVI is able to recognize vowels from the virtual caregiver's VCVC utterances perfectly and all 25 Finnish phonemes with an average accuracy of 88.42%. The place of articulation of consonants is recognized with an accuracy of 96.81%. LeVI is also able to imitate the caregiver's speech, since the recognition occurs directly in the domain of articulatory programs for phonemes. The learned imitation ability (speech inversion) is strongly language dependent, since it is based on the phonemic programs learned from the caregiver. The findings suggest that caregivers' feedback can act as an important signal in guiding infants' articulatory learning, and that the speech inversion problem can be effectively approached from the perspective of early speech acquisition.
Abbreviations used: ATP = articulatory target position, CM = concept matrix, CG = caregiver, DM = direct mapping, IM = indirect mapping, LeVI = Learning Virtual Infant, MFCC = Mel-frequency cepstral coefficients, VQ = vector quantization.
doi:10.1016/j.specom.2013.05.002 fatcat:2afapc3xbja2fjjdsxm5ueselm