Estimating the 3D human pose from a single RGB image [1–8] has drawn intensive research attention over the last decade due to its broad applications. Thanks to powerful deep convolutional neural networks (DCNNs), significant advances have been made in this area. Nevertheless, a large gap remains between images and 3D poses in in-the-wild scenarios. This gap arises from the difficulty of annotating 3D groundtruth positions for skeleton joints.
Many previous works tackle this problem by decomposing the task into two stages, each trained with annotations that are easy to obtain: (a) performing 2D pose estimation; (b) recovering the 3D pose directly from the 2D pose. Although this decomposition helps with data annotation, since 2D pose annotations are easier to obtain for in-the-wild images, it also discards pictorial information needed to resolve the ambiguity in 3D pose recovery. Depth ordering information, e.g., [4, 5], has been shown to be effective for resolving this ambiguity, and it can be annotated efficiently.
Taking [4] as an example, the FBI (forward-or-backward information) only partially reflects the absolute depth information. To advance further along this line, we propose an architecture for extracting multimodal depth information. More specifically, both the FBI, as implicit depth information, and the explicit depth information are exploited to supervise the learning procedure. In addition, we improve the estimation of explicit depth by using a conditional adversarial learning scheme. At the last stage, a linear deep regressor with a novel loss function maps FBI and the explicit depth, together with the corresponding 2D human pose, into the estimated 3D human pose.
Fig. 1. The framework of our method. A multimodal generator is learned to predict a coarse 3D pose and FBI in a supervised way. Adversarial training is adopted to boost the performance of the 3D pose inference part. The estimated coarse 3D pose and FBI are then fed together into a deep regressor for further 3D pose refinement.
The overview of the proposed network architecture is shown in Fig. 1. A coarse 3D pose and FBI are estimated by our multimodal generator. A conditional adversarial learning architecture is employed to fine-tune the coarse 3D pose module. Finally, the coarse 3D pose and FBI are fed into the linear regressor to infer the 3D human pose.
2.1 Multimodal Depth Estimation
Multimodal Generator. The multimodal generator consists of two parallel convNets. One convNet estimates a coarse 3D pose, an explicit depth representation, and the other convNet generates FBI, an implicit depth representation.
Explicit Depth Supervision. Our explicit depth is represented by assigning an extra z coordinate to each 2D joint location. In this work, the 2D joint locations are detected by a widely used estimator [9]. The explicit depth, together with the corresponding 2D joint locations, can be viewed as a coarse 3D pose aligned in the camera coordinate system. Such information can be easily extracted from the 3D pose groundtruth and used to supervise the learning process. We use the same convNet architecture as in [8] for our coarse 3D pose prediction.
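As an illustration of the explicit depth representation described above, the following minimal sketch attaches a depth value to each 2D joint. The function name, the array shapes, and the choice of a root-relative depth are assumptions made for this example, not details given in the text.

```python
import numpy as np

def coarse_pose_from_groundtruth(joints_2d, joints_3d_cam, root_index=0):
    """Build a coarse 3D pose by attaching a depth to each 2D joint.

    joints_2d:     (J, 2) pixel coordinates of the joints.
    joints_3d_cam: (J, 3) groundtruth joints in camera coordinates.
    root_index:    joint used as the depth origin (an assumption).
    Returns a (J, 3) array whose rows are [u, v, z_relative].
    """
    joints_2d = np.asarray(joints_2d, dtype=np.float64)
    joints_3d_cam = np.asarray(joints_3d_cam, dtype=np.float64)
    # Depth of every joint relative to the chosen root joint.
    z = joints_3d_cam[:, 2] - joints_3d_cam[root_index, 2]
    # Append the depth as a third column next to the 2D locations.
    return np.concatenate([joints_2d, z[:, None]], axis=1)
```

Using a root-relative depth keeps the target independent of the subject's distance from the camera; an absolute depth could be substituted if that is what the supervision requires.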
Fig. 2. (a) The architecture of conditional adversarial learning. The discriminator is trained to distinguish real samples from generated ones, and the generator is trained to produce anthropometrically valid 3D poses that fool the discriminator. (b) Visual results of estimated coarse 3D poses for out-of-domain images. Second/third row: results without/with conditional adversarial learning.
Implicit Depth Supervision. FBI is implicit depth information indicating whether a bone is facing forward or backward with respect to the camera's view. We select $m$ ($= 14$) FBI relationships from a human skeleton. Each bone vector $b_i$ ($i = 1, 2, \ldots, m$) has one of three statuses: definitely forward, definitely backward, or possibly parallel to the image plane. The FBI of an image can be defined as a matrix $F \in \{0, 1\}^{m \times 3}$, where each row $F_i$ is a one-hot 3-dimensional vector, i.e., $F_i(j) = 1$ ($j = 0, 1, 2$) means that bone $b_i$ has the $j$-th status. Interested readers are referred to [4] for more details of FBI. It is worth noting that such information is very easy to annotate: the annotator only needs to make a binary selection for each skeleton bone, which takes around 20 seconds per image. We use the same convNets as in [4] for FBI estimation.
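The one-hot FBI encoding can be sketched as follows. This is an illustrative example only: the status ordering, the angular threshold for "possibly parallel", the bone list format, and the convention that the camera looks along +z (so a smaller z is closer to the camera) are all assumptions, and the exact rules used in [4] may differ.

```python
import numpy as np

# Hypothetical status indices; the ordering is an assumption.
FORWARD, BACKWARD, PARALLEL = 0, 1, 2

def fbi_matrix(joints_3d_cam, bones, parallel_angle_deg=10.0):
    """Encode forward-or-backward information as an (m, 3) one-hot matrix.

    joints_3d_cam:      (J, 3) joints in camera coordinates.
    bones:              list of (parent, child) joint index pairs.
    parallel_angle_deg: bones within this angle of the image plane are
                        labelled "possibly parallel" (assumed threshold).
    """
    joints = np.asarray(joints_3d_cam, dtype=np.float64)
    fbi = np.zeros((len(bones), 3), dtype=np.int64)
    sin_thresh = np.sin(np.deg2rad(parallel_angle_deg))
    for i, (parent, child) in enumerate(bones):
        vec = joints[child] - joints[parent]
        length = np.linalg.norm(vec)
        # Depth change over bone length is the sine of the angle
        # between the bone and the image plane.
        ratio = vec[2] / length if length > 0 else 0.0
        if abs(ratio) <= sin_thresh:
            status = PARALLEL
        elif ratio < 0:
            status = FORWARD   # child joint is closer to the camera
        else:
            status = BACKWARD  # child joint is farther from the camera
        fbi[i, status] = 1
    return fbi
```

Each row contains exactly one nonzero entry, so flattening the matrix gives a fixed-length vector of size 3m that can be concatenated with other pose features.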
Adversarial Learning. It has been shown in [10] that adversarial learning helps to predict more realistic poses. The whole process of our conditional adversarial learning is depicted in Fig. 2(a). In the pre-training stage, a coarse 3D pose is predicted by our generator. Subsequently, in the fine-tuning stage, the coarse 3D pose is refined by conditional adversarial learning. The discriminator labels samples as real or fake, which in turn pushes the generator toward plausible 3D poses. Thanks to the generalization power of conditional adversarial learning, our coarse 3D pose module is robust to images with very different domain characteristics; visual results for some such images are shown in Fig. 2(b).
2.2 3D Pose Refinement
Multimodal depth features fed into the network can effectively improve the accuracy of 3D pose estimation. Specifically, the coarse 3D pose and FBI are concatenated together and then mapped into the 3D pose by exploiting two cascaded linear regression blocks used in [11].
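The refinement stage can be sketched as a forward pass in NumPy. This is a simplified illustration, not the trained model: the joint count (17), the hidden width (1024), the random initialization, and the omission of batch normalization and dropout are assumptions made for this example; only the concatenation of the coarse 3D pose with FBI and the use of two cascaded residual linear blocks follow the description above.

```python
import numpy as np

def init_regressor(in_dim, out_dim, hidden=1024, n_blocks=2, seed=0):
    """Create randomly initialized weights for the sketch regressor."""
    rng = np.random.default_rng(seed)

    def dense(n_in, n_out):
        # He-style initialization for layers followed by ReLU.
        weight = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out))
        return weight, np.zeros(n_out)

    return {
        "inp": dense(in_dim, hidden),
        "blocks": [(dense(hidden, hidden), dense(hidden, hidden))
                   for _ in range(n_blocks)],
        "out": dense(hidden, out_dim),
    }

def relu(x):
    return np.maximum(x, 0.0)

def regress_pose(params, coarse_pose, fbi):
    """Map (coarse 3D pose, FBI) to a refined 3D pose.

    coarse_pose: (B, J, 3) array; fbi: (B, m, 3) one-hot array.
    Returns a (B, J_out, 3) array, where J_out = out_dim // 3.
    """
    batch = coarse_pose.shape[0]
    # Concatenate the two depth modalities into one feature vector.
    x = np.concatenate([coarse_pose.reshape(batch, -1),
                        fbi.reshape(batch, -1)], axis=1)
    w, b = params["inp"]
    h = relu(x @ w + b)
    for (w1, b1), (w2, b2) in params["blocks"]:
        r = relu(h @ w1 + b1)
        r = relu(r @ w2 + b2)
        h = h + r  # residual connection around each block
    w, b = params["out"]
    return (h @ w + b).reshape(batch, -1, 3)
```

With 17 joints and m = 14 bones, the input dimension would be 17 * 3 + 14 * 3 = 93 and the output dimension 51, e.g. `init_regressor(93, 51)`.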
Table 1. Results on the official evaluation server (measured in millimeters).
Weighted Regression Loss Function. Let $p$ be the predicted 3D pose and $P$ be the groundtruth 3D pose. The loss function for 3D pose regression combines two terms: a basic L2 loss between $p$ and $P$, and a weighting term that relies on a mean computed over the training dataset. A hyperparameter, set to 0.001 in the experiments, adjusts the trade-off between the two terms. Actions with large poses are commonly hard to learn, and these hard samples should receive more attention. To this end, the weighting term is designed to complement the basic L2 loss, allowing different samples to receive an adaptive supervision focus.
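One possible reading of this weighting idea is sketched below. The specific form is an assumption made purely for illustration, not the loss used in this work: the sketch scales each sample's L2 error by the distance of its groundtruth pose from the training-set mean pose, so that poses far from the mean contribute more. The names `mean_pose` and `lam` are hypothetical.

```python
import numpy as np

def weighted_regression_loss(pred, target, mean_pose, lam=0.001):
    """Illustrative weighted loss (assumed form, not the paper's formula).

    pred, target: (B, J, 3) predicted and groundtruth poses.
    mean_pose:    (J, 3) mean groundtruth pose over the training set.
    lam:          trade-off between the basic and the weighted term.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mean_pose = np.asarray(mean_pose, dtype=np.float64)
    # Basic per-sample squared L2 error.
    l2 = np.sum((pred - target) ** 2, axis=(1, 2))
    # How far each groundtruth pose is from the mean pose.
    deviation = np.sqrt(np.sum((target - mean_pose) ** 2, axis=(1, 2)))
    # Samples with unusual poses get a larger share of the loss.
    weighted = deviation * l2
    return float(np.mean(l2 + lam * weighted))
```

Under this assumed form, two samples with the same prediction error receive different losses when one groundtruth pose lies farther from the mean pose than the other.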
3.1 Training
Dataset. Only 3D human pose data from the ECCV Challenge dataset, a subset of the large-scale dataset Human3.6M [1,12], is used in the training process. FBI and our coarse 3D pose derived from the ECCV Challenge dataset are used for our multimodal generator training.
Implementation Details. The whole framework is implemented in TensorFlow. The linear 3D pose regressor requires less than six hours for training.
The ECCV 3D Pose Challenge only provides RGB images and the corresponding groundtruth 3D pose coordinates. We use the 2D pose estimator [9] to assist in cropping the full human body in an image and resize the crop to 256 × 256.
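The cropping step can be sketched as follows. The margin, the square-box heuristic, the nearest-neighbor sampling, and the edge clamping for boxes that extend past the image border are assumptions chosen to keep the example dependency-free; they are not details given in the text.

```python
import numpy as np

def square_box_from_joints(joints_2d, margin=0.2):
    """Return (x0, y0, side) of a square box enclosing all 2D joints.

    margin is the fraction of the box side added on every edge
    (an assumed value).
    """
    joints = np.asarray(joints_2d, dtype=np.float64)
    x_min, y_min = joints.min(axis=0)
    x_max, y_max = joints.max(axis=0)
    side = max(x_max - x_min, y_max - y_min) * (1.0 + 2.0 * margin)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    return cx - side / 2.0, cy - side / 2.0, side

def crop_and_resize(image, box, out_size=256):
    """Crop the box from an (H, W, C) image and resize it by
    nearest-neighbor sampling to (out_size, out_size, C)."""
    x0, y0, side = box
    height, width = image.shape[:2]
    # Centers of the output pixels, expressed in box coordinates.
    grid = (np.arange(out_size) + 0.5) * side / out_size
    # Clamp to the image so boxes crossing the border repeat edge pixels.
    xs = np.clip(np.floor(x0 + grid).astype(int), 0, width - 1)
    ys = np.clip(np.floor(y0 + grid).astype(int), 0, height - 1)
    return image[ys[:, None], xs[None, :]]
```

A bilinear resize from an image library would give smoother crops; nearest-neighbor sampling is used here only to avoid extra dependencies.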
Results. All the results in Table 1 are obtained from the evaluation server. The method that infers 3D poses only from 2D joint locations, without the coarse 3D pose and FBI, is denoted as "Base". The full method is denoted as "Final".
It is clear that the most challenging task in 3D human pose estimation is the learning of depth. We propose to simultaneously infer the explicit depth and the implicit depth, in a supervised manner, using a convNet architecture with two independent branches. Although FBI lacks the groundtruth of explicit depth for in-the-wild images, it provides useful depth supervision and is also very easy to annotate. We take advantage of the complementary implicit and explicit depth supervision and feed the learned features together into the final regressor for 3D pose inference. A weighted regression loss function provides adaptive feedback for different pose samples. Thanks to these designs, our proposed method achieves competitive 3D human pose estimation results.
We also find that the 2D joint coordinates of our coarse 3D pose are not reliable enough (left-right joint pairs sometimes flip). As future work, we will explore combining a stronger 2D pose detector with a more effective depth feature extractor for 3D human pose estimation.
1. Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. In: International Conference on Computer Vision. (2011)
2. Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4929–4937
3. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: European Conference on Computer Vision, Springer (2016) 561–578
4. Shi, Y., Han, X., Jiang, N., Zhou, K., Jia, K., Lu, J.: FBI-Pose: Towards bridging the gap between 2d images and 3d human poses using forward-or-backward information. arXiv preprint arXiv:1806.09241 (2018)
5. Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3D human pose estimation. In: Computer Vision and Pattern Recognition (CVPR). (2018)
6. Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3d pose and shape estimation of multiple people in natural scenes–the importance of multiple scene constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2148–2157
7. Marinoiu, E., Zanfir, M., Olaru, V., Sminchisescu, C.: 3d human sensing, action and emotion recognition in robot assisted therapy of children with autism. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2158–2167
8. Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3d human pose estimation in the wild: a weakly-supervised approach. In: IEEE International Conference on Computer Vision. (2017)
9. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, Springer (2016) 483–499
10. Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., Wang, X.: 3d human pose estimation in the wild by adversarial learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 1. (2018)
11. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: International Conference on Computer Vision. Volume 1. (2017)
12. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)