Human/Human Conversation Understanding
Spoken Language Understanding
While the term spoken language understanding mostly refers to the understanding of spoken utterances directed at machines, which are more constrained, there has been notable recent progress in recognizing and understanding human/human conversations and multi-party meetings. While there is a significant amount of previous work on discourse processing, especially in the social sciences (such as in the field of conversation analysis), processing human/human conversations is a relatively new area for spoken language processing. In this chapter we focus on two-party and multi-party human/human conversation understanding approaches, mainly discourse modeling, speech act modeling, and argument diagramming. We also try to point out bridging studies that use human/human conversations for building better human/machine conversational systems, or that apply human/machine understanding approaches to improve human/human understanding, and vice versa.

... which has access to extra information tries to help the speaker. With the start of another large-scale DARPA program, named EARS (Effective, Affordable, Reusable Speech-to-Text), a larger corpus, named Fisher, was collected, with 16,454 English conversations totaling 2,742 hours of speech. This data has been transcribed using the LDC "quick transcription" specifications, which include a single pass with some automated preprocessing.

The speech processing community has since studied extensions of two-party human/human conversations in a few directions: multi-party human/human conversations (or meetings), lectures, and broadcast conversations (such as talk shows, broadcast discussions, etc.). Projects initiated at CMU (Burger et al. 2002) and ICSI (Janin et al. 2004) in the late 1990s and early 2000s collected substantial meeting corpora and investigated many of the standard speech processing tasks on this genre. Subsequently, several large, interdisciplinary, multi-site government-funded research projects have investigated meetings of various kinds. The AMI (Augmented Multi-party Interaction) Consortium (AMI n.d.) and the DARPA-funded CALO (Cognitive Assistant that Learns and Organizes) (DARPA Cognitive Agent that Learns and Organizes (CALO) Project n.d.) projects concentrate on conference-room meetings with small numbers of participants. The CHIL (Computers in the Human Interaction Loop) project (CHIL n.d.)
collected a series of lectures dominated by a single presenter, with shorter question/answer portions, as well as some "interactive" lectures involving smaller groups. AMI and CHIL also produced corpora of time-synchronized media, generally including close-talking and far-field microphones, microphone arrays, individual and room-view video cameras, and output from slide projectors and electronic whiteboards.

Starting in 2002, the annual NIST Rich Transcription (RT) Evaluations have become a driving force for research in conversational speech processing technology, with substantial performance improvements in recent years. In order to promote robustness and domain independence, the NIST evaluations cover several genres and topics, ranging from largely open-ended, interactive chit-chat to topic-focused project meetings and technical seminars dominated by lecture-style presentations. NIST evaluates only speech recognition and speaker diarization systems, with a focus on recognition from multiple distant table-top microphones. However, the technology has advanced such that many other types of information can be detected and evaluated, not least dialog acts, topics, and action items. In the next sections, we cover the basic understanding tasks studied in the literature.
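To make one of these meeting-understanding tasks concrete, the toy sketch below tags each utterance of a (fabricated) meeting transcript with a coarse dialog act label. The cue phrases and the label set (`question`, `action-item`, `backchannel`, `statement`) are hypothetical simplifications for illustration only; real systems discussed later in the chapter learn such classifiers statistically from annotated corpora rather than from hand-written rules.

```python
# Toy dialog act tagger. All cue lists and labels here are illustrative
# assumptions, not drawn from any corpus or evaluation specification.

QUESTION_STARTS = {"who", "what", "where", "when", "why", "how",
                   "do", "did", "can", "could", "is", "are"}
ACTION_CUES = ("let's", "we should", "i will", "please send")
BACKCHANNELS = {"yeah", "right", "uh-huh", "okay", "ok"}

def tag_dialog_act(utterance: str) -> str:
    """Assign a coarse dialog act label to a single utterance."""
    text = utterance.lower().strip()
    if not text:
        return "statement"
    # Questions: final question mark or a typical wh-/aux-initial word.
    if text.endswith("?") or text.split()[0] in QUESTION_STARTS:
        return "question"
    # Action items: commitment or suggestion cue phrases.
    if any(cue in text for cue in ACTION_CUES):
        return "action-item"
    # Backchannels: short acknowledgment tokens.
    if text in BACKCHANNELS:
        return "backchannel"
    return "statement"

meeting = [
    "What did we decide about the prototype?",
    "Uh-huh",
    "We should ship the demo next week",
    "The microphone array data looks clean",
]
print([tag_dialog_act(u) for u in meeting])
# → ['question', 'backchannel', 'action-item', 'statement']
```

Rule-based taggers like this one are brittle, which is precisely why the dialog act modeling work surveyed below treats the task as supervised classification or sequence labeling over lexical and prosodic features.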