Multiple levels of linguistic and paralinguistic features contribute to voice recognition

Jean Mary Zarate, Xing Tian, Kevin J. P. Woods, David Poeppel
2015 Scientific Reports  
Voice or speaker recognition is critical in a wide variety of social contexts. In this study, we investigated the contributions of acoustic, phonological, lexical, and semantic information toward voice recognition. Native English-speaking participants were trained to recognize five speakers in five conditions: non-speech, Mandarin, German, pseudo-English, and English. We showed that voice recognition significantly improved as more information became available, from purely acoustic features in non-speech to additional phonological information varying in familiarity. Moreover, we found that recognition performance is transferable between training and testing in phonologically familiar conditions (German, pseudo-English, and English), but not in unfamiliar (Mandarin) or non-speech conditions. These results provide evidence suggesting that bottom-up acoustic analysis and top-down influence from phonological processing collaboratively govern voice recognition.

Voice recognition, irrespective of the speech content, is crucial in many social contexts, including distinguishing voices of one's kin from those of strangers. The social relevance of voice recognition is reinforced by evidence of fetal recognition of mothers' voices in utero 1 and increasing specialization of neural mechanisms for human voices over the first six months of development 2,3. These early voice-recognition mechanisms precede fully developed linguistic abilities 4 and may therefore rely principally on acoustic, paralinguistic characteristics of voice [e.g., average fundamental frequency (F0), F0 contour, etc.] that can exist outside of the speech domain 5. Voice timbre, the Gestalt of sound characteristics that make a voice unique and recognizable, is determined by the physical characteristics of the vocal folds that affect speaking fundamental frequency (F0), the vocal tract, and the articulators that modify the shape of the vocal tract and influence the higher harmonics or formant frequencies of the voice (i.e., lips, teeth, jaw, tongue, etc.) 6. Average speaking F0 (a key voice characteristic), higher-order characteristics of F0 contour, and vocalization rate or speed are important for voice recognition when phonological information is held constant and lexical semantic information is not available 7.
With reversed speech, which eliminates lexical and semantic information and distorts temporally based, consonant-related phonological information, listeners can still use voice timbre conveyed in the formant frequencies of vowels to distinguish between and recognize voices 8,9. Compared to sine-wave speech (which possesses a complex timbre with phonological features, but is ultimately devoid of vocal F0), the average speaking F0, F0 contour, and natural voice timbre in normal and reversed speech greatly enhanced listeners' ability to distinguish between voices 8,10. Besides the paralinguistic factors that contribute to voice recognition, in a recent paper Perrachione et al. 11 suggest that voice recognition depends on the integrity of the phonological representations of words. They argue that the unfamiliar phonology of a foreign language or, alternatively, a pre-existing deficit in phonological processing such as that described in dyslexia 12 can reduce voice recognition; this implies that intact phonological processing is critical to correctly identifying a speaker 11,13.
doi:10.1038/srep11475 pmid:26088739 pmcid:PMC4473599 fatcat:liffou44hnacbgpxaajxdyz72i