3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network [chapter]

Sijin Li, Antoni B. Chan
2015 Lecture Notes in Computer Science  
In this paper, we propose a deep convolutional neural network for 3D human pose estimation from monocular images. We train the network using two strategies: 1) a multi-task framework that jointly trains pose regression and body part detectors; 2) a pre-training strategy where the pose regressor is initialized using a network trained for body part detection. We evaluate our network on a large dataset and achieve significant improvement over baseline methods. Human pose estimation is a structured prediction problem, i.e., the locations of the body parts are highly correlated. Although we do not add constraints about the correlations between body parts to the network, we empirically show that the network has disentangled the dependencies among different body parts and learned their correlations.

However, the pose space grows cubically with the resolution of the discretization, i.e., doubling the resolution in each dimension will octuple the pose space. Discriminative methods view pose estimation as a regression problem [4, 9-11]. After extracting features from the image, a mapping is learned from the feature space to the pose space. Because of the articulated structure of the human skeleton, the joint locations are highly correlated. To consider the dependencies between output variables, [11] proposes to use a structured SVM to learn the mapping from segmentation features to joint locations. [9] models both the input and output with Gaussian processes, and predicts target poses by minimizing the KL divergence between the input and output Gaussian distributions. Instead of dealing with the structural dependencies manually, a more direct way is to "embed" the structure into the mapping function and learn a representation that disentangles the dependencies between output variables. In this case, models need to discover the patterns of human pose from data, which usually requires a large dataset for learning. [4] uses approximately 500,000 images to train regression forests for predicting body part labels from depth images, but that dataset is not publicly available. The recently released Human3.6M dataset [12] contains about 3.6 million video frames with labeled poses of several human subjects performing various tasks. Such a large dataset makes it possible to train data-driven pose estimation models.
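The cubic growth of the discretized pose space can be checked with a few lines. This is a hypothetical illustration (the bin count is made up, not taken from the paper): discretizing each of 3 spatial dimensions into r bins yields r^3 cells, so doubling r multiplies the space by 2^3 = 8.

```python
def pose_space_size(resolution, dims=3):
    """Number of discrete cells when each of `dims` dimensions is
    split into `resolution` bins."""
    return resolution ** dims

# Assumed example: 16 bins per axis, then double the resolution.
base = pose_space_size(16)       # 16**3 = 4096 cells
doubled = pose_space_size(32)    # 32**3 = 32768 cells
print(doubled // base)           # -> 8, i.e. the pose space octuples
```

This is why a purely enumerative (classification-style) treatment of 3D pose quickly becomes intractable, motivating the regression formulation discussed above.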
Recently, deep neural networks have achieved success in many computer vision applications [13, 14], and deep models have been shown to be good at disentangling factors [15, 16]. Convolutional neural networks are one of the most popular architectures for vision problems because they reduce the number of parameters (compared to fully-connected deep architectures), which makes training easier and reduces overfitting. In addition, the convolutional and max-pooling structure enables the network to extract translation-invariant features. In this paper, we consider two approaches to train deep convolutional neural networks for monocular 3D pose estimation. In particular, one approach is to jointly train the pose regression task with a set of detection tasks in a heterogeneous multi-task learning framework. The other approach is to pre-train the network using the detection tasks, and then refine the network using the pose regression task alone. To the best of our knowledge, we are the first to show that deep neural networks can be applied to 3D human pose estimation from single images. By analyzing the weights learned in the regression network, we also show that the network has discovered correlation patterns of human pose.
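The heterogeneous multi-task objective can be sketched as follows. This is a minimal, self-contained illustration under assumed shapes, random weights, and an assumed task-balancing weight (none of these values come from the paper): a shared feature vector, which a convolutional trunk would produce, feeds a pose-regression head scored with squared error on 3D joint coordinates, and per-part detection heads scored with logistic loss; the two losses are summed into one joint objective.

```python
import math
import random

random.seed(0)
N_JOINTS, FEAT_DIM = 17, 128  # assumed skeleton and feature sizes

# Feature vector a shared convolutional trunk would produce for one image.
features = [random.gauss(0, 1) for _ in range(FEAT_DIM)]

def linear_head(out_dim):
    # Hypothetical linear layer standing in for a network's final layer.
    W = [[random.gauss(0, 0.01) for _ in range(FEAT_DIM)]
         for _ in range(out_dim)]
    return [sum(w * x for w, x in zip(row, features)) for row in W]

pose_pred = linear_head(N_JOINTS * 3)   # predicted 3D joint coordinates
det_logits = linear_head(N_JOINTS)      # per-part presence scores

# Synthetic targets for the illustration.
pose_target = [random.gauss(0, 1) for _ in range(N_JOINTS * 3)]
det_target = [random.randint(0, 1) for _ in range(N_JOINTS)]

# Pose task: mean squared error over all joint coordinates.
pose_loss = sum((p - t) ** 2
                for p, t in zip(pose_pred, pose_target)) / len(pose_target)

# Detection tasks: mean logistic (cross-entropy) loss over body parts.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

det_loss = -sum(t * math.log(sigmoid(z)) + (1 - t) * math.log(1 - sigmoid(z))
                for z, t in zip(det_logits, det_target)) / N_JOINTS

LAMBDA = 0.5  # assumed weight balancing regression against detection
total_loss = pose_loss + LAMBDA * det_loss
```

In the joint-training strategy both terms are minimized together so the detection tasks regularize the shared features; in the pre-training strategy one would first minimize only `det_loss`, then drop the detection heads and fine-tune on `pose_loss` alone.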
doi:10.1007/978-3-319-16808-1_23