Human movement capture and analysis in intelligent environments

Mohan M. Trivedi
2003, Machine Vision and Applications (Springer Nature)
Investigators from multiple disciplines and application areas are interested in topics related to human body modeling, tracking, and synthesis. These topics offer fertile ground for challenging research problems as well as potential for a wide range of applications. Our own interest and involvement in human modeling, analysis, and synthesis come from a very specific need: that of developing "intelligent" environments or spaces [1] [2] [3]. An intelligent environment automatically derives and dynamically maintains an awareness of its composition as well as of the events and activities occurring within it. Moreover, these spaces should be responsive to specific events and triggers. Such spaces need not be limited to rooms in buildings, but extend to outdoor environments [4] and any other spaces that humans occupy, such as a performance on a stage or an automobile on a highway [5]. The spaces are monitored by multiple audio and video sensors, which can be unobtrusively embedded in the infrastructure. To avoid intrusion on the normal human activities in the space, all sensors, processors, and communication devices should remain "invisible" in the infrastructure. The system should also support natural and flexible interactions among the participants without specialized or encumbering devices.

In a conference room environment, multiple video cameras and microphones may be embedded in walls and furniture. Video and audio signals are analyzed in real time for a wide range of low-level tasks, including person identification, localization and tracking, and gesture and voice recognition [6]. Combining the analysis tasks with human face and body synthesis enables efficient interactions with remote observers, effectively merging disjoint spaces into a single intelligent environment.

We are currently embedding distributed video networks in rooms, laboratories, museums, and even outdoor public spaces in support of experimental research in this domain. This involves the development of new frameworks, architectures, and algorithms for audio and video processing as well as for the control of various functions associated with proper execution of a transaction within such intelligent spaces. These test beds are also helping to identify novel applications of such systems in distance learning, teleconferencing, entertainment, and smart homes.
There are several key elements to the development of intelligent spaces:

Multilevel interpretive analysis of body and movement
Intelligent environments need to support the wide array of natural interactions that their human inhabitants perform. This places great demands on the sensory system. Basically, these systems should be capable of providing multilevel descriptions of typical human activities so that semantic-level interpretation of the events can be achieved (Fig. 1). Some of the typical functionalities of such powerful sensory systems include tracking of people in 3-D, estimation of the poses and postures of the people, person recognition, event recognition, and body modeling and movement analysis [7, 8].

Integration of multiple video and audio streams, and multiple camera types
In order to achieve reliable and robust system performance in intelligent spaces where interactions among multiple people and other fixtures can be properly captured and analyzed, the systems require sensory information from multiple perspectives and at multiple resolutions. Researchers are beginning to consider these issues seriously. Our research is characterized by its emphasis on using large numbers of channels, both video and audio, to augment the precision and robustness of our algorithms. We believe that scene- and activity-analysis algorithms of the future will use similarly large numbers of channels as transducers become cheaper and the available I/O bandwidth of computers grows. There will be a need in the future for algorithms matched to systems offering large numbers of input channels. Current computer techniques for sensing the environment have not yet caught up to the abilities of humans, partly because of the lack of cross-modal information sharing in computer perception algorithms. Robust audiovisual (AV) signatures of the participants, gestures, and events are required in the development of intelligent rooms. Approaches based purely
doi:10.1007/s00138-002-0109-7 fatcat:g2vjhcf2yngnllcodoxo5hwhfa