Evolving theories of vowel perception

Winifred Strange
1987 Journal of the Acoustical Society of America  
Research on the perception of vowels in the last several years has given rise to new conceptions of vowels as articulatory, acoustic, and perceptual events. Starting from a "simple" target model in which vowels were characterized articulatorily as static vocal tract shapes and acoustically as points in a first and second formant (F1/F2) vowel space, this paper briefly traces the evolution of vowel theory in the 1970s and 1980s in two directions. (1) Elaborated target models represent vowels
as target zones in perceptual spaces whose dimensions are specified as formant ratios. These models have been developed primarily to account for perceivers' solution of the "speaker normalization" problem. (2) Dynamic specification models emphasize the importance of formant trajectory patterns in specifying vowel identity. These models deal primarily with the problem of "target undershoot" associated with the coarticulation of vowels with consonants in natural speech and with the issue of "vowel-inherent spectral change" or diphthongization of English vowels. Perceptual studies are summarized that motivate these theoretical developments.
PACS numbers: 43.71.An, 43.71.Es
This paper is based on an invited address of the same title presented at the Spring 1987 Meeting of the Acoustical Society of America [J. Acoust. Soc. Am. Suppl. 1 81, 516 (1987)]. Requests for reprints should be sent to the author at the published address.
[...] Ladefoged, 1982; Pickett, 1980). Central to this theory is the notion of the vowel target as a unifying concept among articulatory, acoustic, and perceptual characterizations of vowels. Vowel targets are conceived of as the canonical forms of vowels, the context-free stored representations of the phonological segments (Daniloff and Hammarberg, 1973). Articulatorily, these canonical targets are best represented by the static vocal tract shapes assumed when a speaker produces sustained, monophthongal vowel sounds. In continuous speech, these static articulatory positions are considered the goal states when coarticulating vowels in syllabic contexts (e.g., MacNeilage, 1970). Acoustically, vowel targets are represented as points in a multidimensional acoustic space whose coordinates are the first two (F1/F2) or three (F1/F2/F3) oral formants. Formant frequencies are derived from a single spectral cross section through the steady-state portion of the acoustic signal (Joos, 1948; Peterson, 1952, 1961).
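The idea of vowel targets as points in an F1/F2 space can be made concrete with a minimal sketch. It is only one possible reading of the simple target model, not a procedure taken from the paper. The target values are rounded placeholders chosen for illustration (they are not measurements reported here), and both the function name `nearest_target` and the use of Euclidean distance in linear Hz are assumptions.

```python
import math

# Hypothetical (F1, F2) targets in Hz. These are rounded placeholder
# values for illustration only, not data from the paper.
TARGETS = {
    "i": (300.0, 2300.0),
    "a": (750.0, 1100.0),
    "u": (300.0, 900.0),
}


def nearest_target(f1, f2, targets=TARGETS):
    """Return the label of the target closest to (f1, f2).

    Distance is Euclidean in linear Hz, which is an assumption of this
    sketch; the source does not specify a distance measure.
    """
    return min(
        targets,
        key=lambda v: math.hypot(f1 - targets[v][0], f2 - targets[v][1]),
    )


# A token measured from a single steady-state spectral cross section
# is identified with whichever stored target it lies nearest to.
print(nearest_target(320.0, 2200.0))
print(nearest_target(700.0, 1200.0))
```

In a sketch like this every token is assigned to exactly one category, so the scheme works only to the extent that the categories occupy separate regions of the space.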
According to a simple target model of perception, the target frequencies of the first two formants constitute the primary and often sufficient acoustic information for the perceptual identity of the vowel (e.g., Delattre et al., 1952). Thus articulatory, acoustic, and perceptual descriptions of vowels are unified by this concept of a static, context-free target. Given this characterization of what vowels are in their canonical form, problems in explaining the perception of speakers' intended messages arise from two sources of variation in vowels as actually produced. First, as the classic work of Peterson and Barney (1952) showed, the target formant frequencies are not invariant with respect to intended (or perceived) vowels across men, women, and children, and different vowel categories often overlap in F1/F2 space even for speakers of the same age and gender. Thus, in a simple target model, formant frequency information must
