Representation, Analysis, and Recognition of 3D Humans

Stefano Berretti, Mohamed Daoudi, Pavan Turaga, Anup Basu
2018 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
Computer Vision and Multimedia solutions now offer an increasing number of applications ready for use by end users in everyday life. Many of these applications are centered on the detection, representation, and analysis of the face and body. Methods based on 2D images and videos are the most widespread, but a recent trend successfully extends the study to 3D human data as acquired by a new generation of 3D acquisition devices. Based on these premises, in this survey we provide an overview of the newly designed techniques that exploit 3D human data and also highlight the most promising current and future research directions. In particular, we first propose a taxonomy of the representation methods, distinguishing between spatial and temporal modeling of the data. Then, we focus on the analysis and recognition of 3D humans from static and dynamic 3D data, considering many applications for the body and face.

For many years, 2D still images and videos were the only sources used to investigate methods for detecting, representing, and analyzing the human body and face [176]. Now, interest in non-Euclidean data is growing. First attempts to move from representations of 2D to 3D humans investigated how to automatically extract compact descriptors of the body and face, mostly using synthetic models generated by dedicated software tools. The use of such 3D modeling tools required experienced operators, was time consuming, and resulted in limited realism. In addition, operating with synthetic data hid most of the difficulties associated with manipulating actually acquired data. Indeed, only the recent advent of new 3D acquisition technologies at affordable cost, including consumer depth cameras like Kinect, has made it possible to capture real human bodies and faces in 3D. Such 3D scanners can be either static or dynamic (i.e., acquiring across time), with the resulting scans obtained at high or low resolution. In the last few years, this has also allowed the production of large repositories of human samples, which has opened the way to substantial research advancements and new application domains. Several surveys exist on 3D methods, but they focus more on individual tasks, such as 3D face recognition [121, 152], 3D emotion and expression recognition [42, 139], 3D action recognition [95], and 3D retrieval [156].
Instead, our effort here is to provide a comprehensive and updated overview of what has been done in tasks that have the human body and face as the main focus of analysis or recognition. The usual framework for the analysis and recognition of the 3D body and face comprises the following steps. 3D data is typically noisy and irregularly formed, so some preprocessing is applied first. Then, a representation is built on lower-level descriptors that model the information embedded in the data. Broadly, the main contrast is between hand-crafted and learned features: these can capture spatial information only or also account for the temporal dimension. Finally, such representations are the input to a classification stage that can rely on some classifier or be integrated into a (deep) learning framework. In the following, we start by focusing on the representations (see Section 2), distinguishing between methods that perform spatial or temporal modeling of human data (in Section 2.1 and Section 2.2, respectively). The former extract the representation from data acquired as individual 3D scans, usually captured at high resolution with user cooperation. The latter also account for the temporal component in dynamic data (i.e., sequences of 3D scans acquired with 3D cameras). Differing from the static case, these scans can be acquired without user cooperation but, typically, at the cost of lower resolution. Spatial and temporal representations have been used in a variety of analysis and recognition tasks (see Section 3). In general, these applications differ for the face and body, so we present them separately in Section 3.1 and Section 3.2, respectively.

REPRESENTATIONS OF 3D HUMANS

Representations of the 3D human body and face are usually built on low-level descriptors that model static (spatial) and dynamic (spatio-temporal) data to extract meaningful features.
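As a concrete illustration of the kind of low-level geometric descriptor this section discusses, the following is a minimal sketch of a classic hand-crafted one: the D2 shape distribution, a histogram of Euclidean distances between random pairs of surface points. The function name, parameters, and the random point cloud standing in for a real scan are illustrative choices, not taken from the survey.

```python
import numpy as np

def d2_descriptor(points, n_pairs=10000, n_bins=32, seed=0):
    """Illustrative D2 shape-distribution descriptor: a normalized
    histogram of distances between random point pairs drawn from an
    (N, 3) array of 3D surface points."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, d.max() + 1e-9))
    return hist / hist.sum()  # frequency-normalized histogram

# Toy usage: a random point cloud standing in for a body/face scan.
cloud = np.random.default_rng(1).standard_normal((500, 3))
desc = d2_descriptor(cloud)
```

Such a descriptor is pose- and permutation-insensitive by construction, which is one reason distance-based distributions were popular before learned features became dominant.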
We summarize some of the most successful 3D descriptors and outline how they have been used in modeling the body and face. We organize the representation methods proposed so far in the literature into a taxonomy, as illustrated in Figure 1. We further classify the methods based on the main characteristics of the 3D shape they capture. In particular, in the case of spatial modeling, we distinguish between representation methods in the following categories:
• geometric: account for the surface shape in either an extrinsic or intrinsic way;
• volumetric: the volume delimited by the shape surface is accounted for by the representation;
• topological: topological variations of the shape are captured by the representation;
• landmark-based: the shape is represented by a set of landmarks (or fiducial points) with some local surface description attached to them.
Learning-based solutions are also assuming increasing importance: hand-crafted descriptors are substituted by deep features that are learned directly from the data. These deep features can be learned from point sets (and surfaces), landmarks, and volumes.
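To make the idea of features learned directly from point sets more concrete, the following is a minimal, untrained sketch in the spirit of PointNet-style architectures: a shared per-point MLP followed by a symmetric max-pool, which yields a global feature invariant to the ordering of the points. The weights here are random placeholders and the function name is our own; a real system would train the weights end to end.

```python
import numpy as np

def point_set_feature(points, w1, w2):
    """Sketch of a point-set deep feature: a shared per-point MLP
    (applied identically to every point) followed by a symmetric
    max-pooling that produces an order-invariant global feature.
    Weights are illustrative (random), not a trained model."""
    h = np.maximum(points @ w1, 0.0)   # shared layer 1 + ReLU
    h = np.maximum(h @ w2, 0.0)        # shared layer 2 + ReLU
    return h.max(axis=0)               # permutation-invariant pooling

rng = np.random.default_rng(0)
w1 = rng.standard_normal((3, 16))
w2 = rng.standard_normal((16, 32))
pts = rng.standard_normal((100, 3))    # toy stand-in for a 3D scan
feat = point_set_feature(pts, w1, w2)
```

The max-pool is the key design choice: because it is a symmetric function over the points, shuffling the input leaves the feature unchanged, which is exactly the property needed for unordered 3D scans.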
doi:10.1145/3182179