User-centered modeling for spoken language and multimodal interfaces

S. Oviatt
1996 IEEE Multimedia  
By modeling difficult sources of linguistic variability in spontaneous speech and language, interfaces can be designed that transparently guide human input to match system processing capabilities. Such work is yielding more user-centered and robust interfaces for next-generation spoken language and multimodal systems. Historically, the development of spoken language systems has been primarily a technology-driven phenomenon. However, successful processing of spontaneous speech and dialogue, especially in actual field settings, requires a considerably broader understanding of performance issues during human-computer spoken interactions. Research from this perspective currently represents a gap in our scientific knowledge, one widely recognized as having created a bottleneck in our ability to support robust speech in real commercial applications. The present article summarizes recent research on user-centered modeling of human language and performance during spoken and multimodal interaction, as well as interface design aimed at next-generation systems.

Why Can't Speakers Adapt?

Technologists developing speech- and language-oriented systems tend to think that "users can adapt" to whatever they build. Typically, they have relied on instruction, training, and practice with the system to encourage users to speak in a manner that matches system processing capabilities. However, human speech production involves a highly automatized set of skills organized in a modality-specific brain center. As a result, many of its features are not under full conscious control (e.g., disfluencies, prosody, timing), and there are limits to how far even the most cooperative user can adapt his or her speech production to suit system limitations, such as the need to articulate with artificial pauses between words for an isolated-word recognizer. Even when people can concentrate on changing some aspect of their speech, such as deliberate pausing, as soon as they become absorbed in a real task they quickly forget and slip back into their more natural and automatic style of delivery.
We believe that a more promising and enduring approach to the design of spoken language systems is:

- to model the user- and modality-centered speech upon which the system must be built
- to design spoken interface capabilities that leverage these existing, strongly engrained speech patterns

The potential impact of this approach is a substantial improvement in the commercial viability of next-generation systems for a wide range of real-world applications, especially communications devices and mobile technology.

User- and Modality-Centered Approach

Individual communication modalities, including spoken, written, and keyboard, are influential in shaping the language transmitted within them. In a sense, communication channels physically constrain the flow and shape of human language just as a river bed directs the river's current. To be successful, technology based on human language input needs to accommodate the unique landmark features of the input modality that the system is built upon. In the case of speech, especially interactive speech in which dialogue partners alternate turns, these landmark features include disfluencies, errors and repairs, confirmation requests and feedback, prosodic and nonverbal modulation of language, regulation of turn-taking and dialogue control, and grammatically run-on constructions, all of which differ from our prototype of formal, noninteractive textual language. Unfortunately, current spoken language systems typically are trained on samples of read or noninteractive speech, which fail to represent these unique landmark features of interactive spoken exchanges.
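As a loose illustration of why these landmark features matter to system processing, the sketch below shows a naive pre-processing pass that strips two of them, filled pauses and simple word repetitions, from a spontaneous-speech transcript before parsing. The function name, pause list, and repetition heuristic are our own illustrative assumptions, not a method described in the article; real repair detection is considerably harder.

```python
# Hypothetical sketch: remove filled pauses and immediate word repetitions
# from a spontaneous-speech transcript. The pause inventory and the
# repetition heuristic are illustrative assumptions only.

FILLED_PAUSES = {"uh", "um", "er", "mm"}

def strip_disfluencies(utterance: str) -> str:
    # Drop filled-pause tokens.
    tokens = [t for t in utterance.lower().split() if t not in FILLED_PAUSES]
    # Collapse immediate repetitions ("the the map" -> "the map"),
    # a crude stand-in for full self-repair detection.
    cleaned = []
    for tok in tokens:
        if not cleaned or cleaned[-1] != tok:
            cleaned.append(tok)
    return " ".join(cleaned)

print(strip_disfluencies("show me uh the the map of um portland"))
# -> show me the map of portland
```

Even this toy filter hints at the design choice the article argues for: rather than training users to avoid disfluencies, the system anticipates and accommodates them.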
doi:10.1109/93.556458