Visual-Based Eye Contact Detection in Multi-Person Interactions

Mahmoud Qodseya, Franck Panta, Florence Sedes
2019 International Conference on Content-Based Multimedia Indexing (CBMI)
Nonverbal cues can be taxonomized into vocal and visual cues: (i) vocal cues include voice quality, silences, turn-taking patterns, nonlinguistic vocalizations, and linguistic vocalizations; and (ii) visual cues include physical appearance (e.g., gender, height, ethnicity, age), face and eye cues (e.g., facial expression, gaze direction, focus of attention), gesture and posture, and space and environment. As shown in Figure 1, a visual non-verbal behavioral analysis (VNBA) schema consists of five modules: (i) data acquisition; (ii) person detection and tracking; (iii) social cues extraction; (iv) contextual information identification; and (v) social cues analysis.

Different types of sensors and devices, e.g., cameras and proximity detectors, might be used in the data acquisition module to record social interactions. One or more dedicated computer vision and image processing methods (e.g., face detection) can then be leveraged to process the input data and detect and track person(s). The social cues extraction module takes the detected person(s) as input and extracts a feature vector (per person) describing social cues such as head pose. The social cues understanding module analyzes these primitive social cues in depth by modeling temporal dynamics and combining signals extracted from various modalities (e.g., head pose, facial expression) at different time scales, providing more useful information and conclusions at the behavioral level of the detected persons. This module might optionally leverage additional contextual information (e.g., type of event, location, restaurant menu) that describes the context in which the data is captured, enabling more precise social behavior prediction and analysis. Finally, the existence of a metadata repository decouples the analysis phase from the other components [3].

At the social cues extraction level, VNBA systems mainly adopt eye contact as an important social cue for performing a wide range of analyses and studies, such as dominant person detection [4]. Eye contact serves multiple functions in two-person interactions, such as information seeking, establishing and recognizing social relationships, and signaling that the "channel is open for communication" [5]. Extraction of this social cue must therefore be fully automated, accurate at the detection level, and compatible with simple capturing devices such as closed-circuit television (CCTV) cameras.
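The five-module schema described above can be sketched as a minimal pipeline. This is an illustrative skeleton only: the data structures, the stub detector, and the toy "attentive" analysis rule are assumptions for demonstration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class PersonCues:
    """Per-person feature vector of primitive social cues (illustrative fields)."""
    person_id: int
    head_yaw: float    # degrees; 0 = facing the camera
    head_pitch: float  # degrees

def detect_and_track(frame):
    # Stub detector: here a "frame" is already a list of (id, yaw, pitch) tuples.
    # A real system would run face detection/tracking on image data.
    return frame

def extract_cues(detections):
    # Social cues extraction: build one feature vector per detected person.
    return [PersonCues(pid, yaw, pitch) for pid, yaw, pitch in detections]

def analyze(cues, context=None):
    # Toy social cues analysis: call a person "attentive" if their head pose
    # is near frontal. Context (event type, location, ...) is optional.
    return {c.person_id: abs(c.head_yaw) < 15.0 for c in cues}

def vnba_pipeline(frames, context=None):
    # Chain the modules: acquisition -> detection/tracking -> extraction -> analysis.
    results = []
    for frame in frames:
        detections = detect_and_track(frame)
        cues = extract_cues(detections)
        results.append(analyze(cues, context))
    return results
```

For example, a single frame containing one near-frontal and one strongly turned head yields one "attentive" and one "inattentive" label.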
However, existing state-of-the-art methods require expensive special devices for detecting eye contact.

Abstract—Visual non-verbal behavior analysis (VNBA) methods mainly depend on extracting an important and essential social cue, eye contact, to perform a wide range of analyses such as dominant person detection. Beyond the major need for an automated eye-contact detection method, existing state-of-the-art methods require intrusive devices for detecting any contacts at the eye level. Such methods also depend entirely on supervised learning approaches to produce eye-contact classification models, raising the need for ground-truth datasets. To overcome the limitations of existing techniques, we propose a novel geometrical method to detect eye contact in natural multi-person interactions without the need for any intrusive eye-tracking device. We evaluated our method on 10 social videos, each 20 minutes long. Experiments demonstrate highly competitive classification performance compared to classical supervised eye-contact detection methods.

Index Terms—eye contact detection, visual nonverbal behavior analysis, social interaction
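The geometrical idea in the abstract — deciding eye contact from gaze geometry rather than from an eye tracker — can be illustrated with a generic mutual-gaze test in 2-D. The positions, gaze vectors, and the 10° threshold below are assumptions for the sketch, not the authors' exact formulation.

```python
import math

def unit(v):
    """Normalize a 2-D vector to unit length."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def angle_between(u, v):
    """Angle in degrees between two unit vectors."""
    dot = max(-1.0, min(1.0, u[0] * v[0] + u[1] * v[1]))
    return math.degrees(math.acos(dot))

def eye_contact(pos_a, gaze_a, pos_b, gaze_b, threshold_deg=10.0):
    """Mutual-gaze test: declare eye contact when each person's gaze
    direction points at the other's head position within an angular
    threshold. 2-D sketch of a geometric criterion (illustrative)."""
    to_b = unit((pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]))
    to_a = unit((pos_a[0] - pos_b[0], pos_a[1] - pos_b[1]))
    return (angle_between(unit(gaze_a), to_b) <= threshold_deg and
            angle_between(unit(gaze_b), to_a) <= threshold_deg)
```

Two people facing each other along the line joining them pass the test; if one looks away (e.g., perpendicular to that line), no eye contact is declared. In practice, the gaze vectors would come from a head-pose or gaze estimator applied to camera frames.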
doi:10.1109/cbmi.2019.8877471 dblp:conf/cbmi/QodseyaPS19 fatcat:becknm2wkfhfnb3cfbhd5vdpjy