Human Pose Estimation from Monocular Images: A Comprehensive Survey

Wenjuan Gong, Xuena Zhang, Jordi Gonzàlez, Andrews Sobral, Thierry Bouwmans, Changhe Tu, El-hadi Zahzah
2016 Sensors  
Human pose estimation refers to the estimation of the location of body parts and how they are connected in an image. Human pose estimation from monocular images has wide applications (e.g., image indexing). Several surveys on human pose estimation can be found in the literature, but each focuses on a certain category, for example, model-based approaches or human motion analysis. As far as we know, an overall review of this problem domain has yet to be provided. Furthermore, recent [...] based on deep learning have brought novel algorithms for this problem. In this paper, a comprehensive survey of human pose estimation from monocular images is carried out, including milestone works and recent advancements. Based on one standard pipeline for the solution of computer vision problems, this survey splits the problem into several modules: feature extraction and description, human body models, and modeling methods. Problem modeling methods are approached based on two means of categorization in this survey. One way to categorize includes top-down and bottom-up methods, and another way includes generative and discriminative methods. Considering the fact that one direct application of human pose estimation is to provide initialization for automatic video surveillance, there are additional sections for motion-related methods in all modules: motion features, motion models, and motion-based methods. Finally, the paper also collects 26 publicly available data sets for validation and provides error measurement methods that are frequently used.

[...] cameras to capture motions simultaneously. However, they are not suitable for real-life non-invasive applications, and the equipment is quite expensive, confining their applications to lab experiments or long-term, very costly productions such as controlling avatars' movements in animations [2]. So, an increasing number of studies have focused on markerless methods. The inputs are also captured by cameras, but the acting humans are not required to wear any markers. Several types of images can be captured: RGB or grayscale images (which are the input image types we discuss in this survey), infrared images [3], depth images [4], and others. RGB images capture visible light, and are the most frequently seen images on the web; infrared images capture infrared light; and depth images contain information regarding the distance of objects in the image to the cameras.
Infrared images are extremely useful for night vision, but are not in the scope of this review. While ordinary cameras can capture RGB images, depth images require specialized equipment. This equipment is much less expensive than that required to acquire motion capture data, and can be used in everyday life settings. Commercial products include Microsoft Kinect [5], the Leap Motion [6], and GestureTek [7]. These products provide application programming interfaces (APIs) to acquire depth data [8]. The human pose detection problem has seen the most success when utilizing depth images in conjunction with color images: real-time estimation of 3D body joints and pixelwise body part labelling have become possible based on randomized decision forests [9]. Estimation from depth images is comparatively more accurate, but these devices can only acquire images within a certain distance limit (around eight meters), and the vast majority of pictures on the web are RGB or grayscale images with no depth information.

Human pose detection from a single image is a severely under-constrained problem, due to the intrinsic one-to-many mapping nature of this problem. One pose produces various pieces of image evidence when projected from changing viewpoints. This problem has been extensively studied, but is still far from being completely solved. Effective solutions for this problem need to tackle illumination changes, shading problems, and viewpoint variations. Furthermore, human pose estimation problems have specific characteristics. First, the human body has high degrees of freedom, leading to a high-dimensional solution space; second, the complex structure and flexibility of human body parts cause partially occluded human poses, which are extremely hard to recognize; third, depth loss resulting from 3D pose projections to 2D image planes makes the estimation of 3D poses extremely difficult.
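The depth loss mentioned above can be illustrated with a minimal sketch. It assumes an idealized pinhole camera at the origin looking along the z axis; the function names, the focal length, and the two three-joint "arm" poses are hypothetical values chosen for illustration and do not come from the survey.

```python
import math

def project(point3d, focal_length=1.0):
    """Pinhole projection of a 3D point (x, y, z) in camera coordinates."""
    x, y, z = point3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)

def slide_along_ray(point3d, scale):
    """Move a point along its viewing ray; its projection does not change."""
    x, y, z = point3d
    return (x * scale, y * scale, z * scale)

def same_image(points_a, points_b, tol=1e-9):
    """True if two lists of 2D points agree up to a small tolerance."""
    return all(
        math.isclose(a, b, abs_tol=tol)
        for pa, pb in zip(points_a, points_b)
        for a, b in zip(pa, pb)
    )

# Hypothetical arm (shoulder, elbow, wrist), coordinates in metres.
pose_a = [(0.0, 0.0, 2.0), (0.2, 0.0, 2.0), (0.4, 0.0, 2.0)]
# A different 3D pose: elbow and wrist pushed away from the camera
# along their own viewing rays.
pose_b = [
    pose_a[0],
    slide_along_ray(pose_a[1], 1.25),
    slide_along_ray(pose_a[2], 1.5),
]

image_a = [project(joint) for joint in pose_a]
image_b = [project(joint) for joint in pose_b]

print("identical 2D evidence:", same_image(image_a, image_b))
print("identical 3D pose:", pose_a == pose_b)
```

Under these assumptions the two poses differ in 3D yet yield the same 2D joint locations, so the image alone cannot tell them apart; this is one concrete source of the ambiguity that priors such as body models or motion cues are meant to reduce.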
In this paper, we collect milestone works and recent advancements in human pose estimation from monocular images. The papers in the reference section were downloaded during the first semester of 2016 from the following sources: Google Scholar, IEEE Xplore, Scopus Elsevier, Springer, Web of Science, ResearchGate, arXiv, and several research lab homepages. Each section of the paper is a possible component of human pose estimation algorithms. The flow of the sections follows the degree of abstraction: starting from images of low abstraction level to semantic human poses of high abstraction level. Summarizing related works, there are two main ways to categorize human pose estimation methodologies [10]. The first way clusters solutions based on whether the human pose estimation problem is modeled as a geometric projection of a 3D real-world scene or treated as a general classification/regression problem. In geometric projection modeling (Section 4.1.2), a 3D human body model is required (Section 3.3). Furthermore, camera parameters are required for a projection model. From an image processing perspective, human pose estimation can be treated as a regression problem from image evidence. In discriminative methods (Section 4.1.1), distinctive measurements, called features, are first extracted from images. These are usually salient points (like edges or corners) which are useful characteristics for the accomplishment of the estimation task. Later on, these salient points are described in a systematic way, very frequently statistically. This procedure is named "feature description". In this review, we fuse the feature extraction and feature description procedures into a single feature section (Section 2) and categorize features based on their abstraction level: from low-level abstraction to high-level abstraction. Features of high abstraction levels are semantically closer to the human description of a human pose.
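As a rough illustration of the discriminative view described above (extract features, then map them directly to a pose), the following sketch uses nearest-neighbour lookup over a tiny set of training examples. Nearest-neighbour lookup is a deliberately simple stand-in chosen here, not a method singled out by the survey, and the feature vectors, pose labels, and joint angles are hypothetical.

```python
import math

def nearest_neighbour_pose(query_features, training_set):
    """Return the pose paired with the training feature vector that is
    closest (Euclidean distance) to the query feature vector."""
    _, best_pose = min(
        training_set,
        key=lambda pair: math.dist(pair[0], query_features),
    )
    return best_pose

# Hypothetical training data: (feature vector, pose as joint angles in degrees).
training_set = [
    ((1.0, 0.0), {"shoulder": 90.0, "elbow": 0.0}),   # arm raised
    ((0.0, 1.0), {"shoulder": 0.0, "elbow": 0.0}),    # arm down
    ((0.5, 0.5), {"shoulder": 45.0, "elbow": 90.0}),  # arm bent
]

# Features extracted from a new image (also hypothetical).
query = (0.9, 0.1)
print(nearest_neighbour_pose(query, training_set))
```

No 3D body model or camera parameters appear anywhere in this sketch, which is the contrast with geometric projection modeling: the mapping from image evidence to pose is learned (here, merely memorized) from examples.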
Features are then assembled based on a predefined human body structure (Sections 3.1 and 3.2), and the assembled information is fed to a classification or a regression model to predict the human body part layout. Then, various mapping models between extracted features and human poses are utilized (Section 4.1.1). The second approach to categorization splits related works into top-down (Section 4.2.2) and bottom-up (Section 4.2.1) methods based on how pose estimation is carried out: whether it introduces high-level semantics for low-level estimation or whether human poses are recognized from pixel-level image evidence. There are also works that take advantage of different types of approaches simultaneously by fusing them to achieve better estimation accuracy (Sections 4.1.3 and 4.2.3). One straightforward application of monocular human pose estimation is the initialization of smart video surveillance systems. In this scenario, motion cues provide valuable information, and progress in motion-based recognition could be applied to enhance pose estimation accuracy. The advantage is that an image sequence leads to the recognition of higher-level motions (like walking or running), which consist of a complex and coordinated series of events that cannot be understood by looking at only a few frames [11-13], and these pieces of higher-level information could be utilized to constrain low-level human pose estimation. Extracted motion features are introduced in Section 2.4, human motion patterns extracted as motion priors are explained in the last paragraph of Section 3.4, and motion-based methods are described in Section 4.3. The main components of the survey paper are illustrated in Figure 1. As mentioned before, it is not compulsory for a human pose estimation algorithm to contain all three components (features, human body models, and methodologies).
For example, in Figure 1, the first flow line denotes the three components of discriminative methods and bottom-up methods, including three feature types of different abstraction levels, two types of human body models, and their methods. Temporal information provides motion-based components. In Section 5, we collect publicly available datasets for the validation of human pose estimation algorithms, several error measurement methods, and a toolkit for non-expert users to use human pose estimation algorithms. Lastly, in Section 6, we discuss open challenges in this problem.
doi:10.3390/s16121966 pmid:27898003 pmcid:PMC5190962 fatcat:jigvz4ovpbh63eovto3etoefx4