Allowing for rotational invariance in the ground-plane is more challenging and requires integration with the data model. We seek the optimal rotations for each pose such that after rotating the poses they are closely approximated by a low-rank compact Gaussian distribution.
We formulate this as a problem of optimization over a set of variables. Given a set of N training 3D poses, each represented as a $(3 \times L)$ matrix $P_i$ of 3D landmark locations, where $i \in \{1, 2, \ldots, N\}$ and L is the number of human joints/landmarks, we seek global estimates of an average 3D pose $\mu$, a set of J orthonormal basis matrices $e$ and noise variance $\sigma$, alongside per-sample rotations $R_i$ and basis coefficients $a_i$, to minimize the following estimate:
$$\arg\min_{R, \mu, a, e, \sigma} \sum_{i=1}^{N} \left( \| P_i - R_i (\mu + a_i \cdot e) \|_2^2 + \sum_{j=1}^{J} (a_{i,j} \cdot \sigma_j)^2 + \ln \sum_{j=1}^{J} \sigma_j^2 \right) \tag{1}$$
where $a_i \cdot e = \sum_j a_{i,j} e_j$ is the tensor analog of a multiplication between a vector and a matrix, and $\|\cdot\|_2^2$ is the squared Frobenius norm of the matrix. Here the y-axis is assumed to point up, and the rotation matrices $R_i$ considered are ground-plane rotations. With the large number of 3D pose samples considered (of the order of 1 million when training on the Human3.6M dataset [15]), and the complex inter-dependencies between samples for $e$ and $\sigma$, the memory requirements mean that it is not possible to solve directly as a joint optimization over all variables using a non-linear solver such as Ceres. Instead, we carefully initialize and alternate between performing closed-form PPCA [38] to update $\mu$, $a$, $e$ and $\sigma$, and updating $R_i$ using Ceres [2] to minimize the above error. As we do this, we steadily increase the size of the basis from 1 through to its target size J. This stops apparent deformations that could be resolved through rotations from becoming locked into the basis at an early stage, and empirically leads to lower-cost solutions.
Figure 2: Visualization of the 3D training data after alignment (see section 4.1) using 2D PCA. Notice how all poses have the same orientation. Standing-up poses a), b), c) and d) are all close to each other and far from sitting-down poses f) and h), which form another clear cluster.
To initialize we use a variant of the Tomasi-Kanade [39] algorithm to estimate the mean 3D pose $\mu$. As the y component is not altered by planar rotations, we take as our estimate of the y component of $\mu$ the mean of each point in the y direction. For the x and z components, we interleave the x and z components of each sample and concatenate them into a large $2N \times L$ matrix M, and find the rank-two approximation of this such that $M \approx A \cdot B$. We then calculate $\hat{A}$ by replacing each adjacent pair of rows of A with the closest orthonormal matrix of rank two, and take $\hat{A}^{\dagger} M$ as our estimate of the x and z components of $\mu$.
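The initialization described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `init_mean_pose`, the `(N, 3, L)` array layout, and the use of an SVD both for the rank-two factorization and for finding the closest orthonormal 2x2 blocks are assumptions, not details taken from the text.

```python
import numpy as np

def init_mean_pose(poses):
    """poses: array of shape (N, 3, L) with rows ordered x, y, z (y points up)."""
    n, _, l = poses.shape
    # y is unaffected by ground-plane rotations, so it is averaged directly.
    mean_y = poses[:, 1, :].mean(axis=0)
    # Interleave the x and z rows of every sample into a (2N, L) matrix M.
    M = poses[:, [0, 2], :].reshape(2 * n, l)
    # Rank-two factorization M ~ A @ B from a truncated SVD.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :2] * S[:2]
    # Replace each adjacent pair of rows (a 2x2 block) with the closest
    # orthonormal matrix, i.e. the orthogonal polar factor of the block.
    A_hat = np.empty_like(A)
    for i in range(n):
        u, _, vt = np.linalg.svd(A[2 * i:2 * i + 2])
        A_hat[2 * i:2 * i + 2] = u @ vt
    # The pseudo-inverse of the corrected factor gives the x and z rows of the mean.
    mean_xz = np.linalg.pinv(A_hat) @ M
    return mean_xz, mean_y
```

As with any factorization of this kind, the x and z components are only recovered up to a global in-plane orthogonal transformation, which is harmless here because the per-sample rotations are re-estimated afterwards.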
The end result of this optimization is a compact low-rank approximation of the data in which all reconstructed poses appear to have the same orientation (see Figure 2). In the next section we extend the model to be described as a multi-modal distribution to better capture the variations in the space of 3D human poses.
Although the learned Gaussian model of section 4.1 can be directly used to estimate the 3D pose (see Table 1), inspection of Figure 2 shows that the data is not Gaussian distributed and is better described using a multi-modal distribution. In doing this, we are heavily inspired both by approaches such as [27], which characterize the space of human poses as a mixture of PCA bases, and by related works such as [42, 8] that represent poses as an interpolation between exemplars. These approaches are extremely good at modeling tightly distributed poses (e.g. walking) where samples in the testing data are likely to be close to poses seen in training. This is emphatically not the case in much of the Human3.6M dataset, which we use for evaluation. Zooming in on the edges of Figure 2 reveals many isolated paths where motions occur once and are never revisited.
Nonetheless, it is precisely these regions of low density that we are interested in modeling. As such, we seek a coarse representation of the pose space that says something about the regions of low density but also characterizes the multi-modal nature of the pose space. We represent the data as a mixture of probabilistic PCA models using few clusters, trained using the EM algorithm [38]. When using a small number of clusters, it is important to initialize the algorithm correctly, as accidentally initializing with multiple clusters about a single mode can lead to poor density estimates. To initialize we make use of a simple heuristic.
We first subsample the aligned poses (which we refer to as P), and then compute the Euclidean distance d among pairs. We seek a set of k samples S such that the distance between points and their nearest sample is minimized:
$$\arg\min_{S} \sum_{p \in P} \min_{s \in S} d(s, p) \tag{2}$$
We find S using greedy selection, holding our previous estimate of S constant, and iteratively selecting the next candidate s such that $\{s\} \cup S$ minimizes the above cost. A selection of 3D pose samples found using this procedure can be seen in the rendered poses of Figure 2. In practice, we stop proposing candidates when they occur too close to the existing candidates, as shown by samples (a–d), and only choose one candidate from the dominant mode.
Given these candidates for cluster centers, we assign each aligned point to a cluster representing its nearest candidate and then run the EM algorithm of [38], building a mixture of probabilistic PCA bases.
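The greedy selection of cluster centers can be sketched as follows, assuming a precomputed matrix of pairwise Euclidean distances between the subsampled poses. The function name `greedy_centers` and the `min_dist` stopping threshold are hypothetical; the text does not state how "too close" is measured or what value is used.

```python
import numpy as np

def greedy_centers(D, k, min_dist=0.0):
    """D: (n, n) pairwise distance matrix. Returns indices of up to k centers."""
    n = D.shape[0]
    selected = []
    nearest = np.full(n, np.inf)  # distance from each point to its nearest center
    for _ in range(k):
        # Cost of adding candidate c: sum over points of the distance to the
        # nearest center in the enlarged set.
        costs = np.minimum(nearest[None, :], D).sum(axis=1)
        if selected:
            costs[selected] = np.inf
            # Hypothetical stopping rule: discard candidates that lie too
            # close to an already selected center.
            costs[D[:, selected].min(axis=1) < min_dist] = np.inf
        c = int(np.argmin(costs))
        if not np.isfinite(costs[c]):
            break
        selected.append(c)
        nearest = np.minimum(nearest, D[c])
    return selected
```

Each point would then be assigned to its nearest selected center, for example with `D[:, selected].argmin(axis=1)`, before running EM.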
5. A New Convolutional Architecture for 2D and 3D Pose Inference
Our 3D pose inference from a single RGB image makes use of a multistage deep convolutional architecture, trained end-to-end, that repeatedly fuses and refines 2D and 3D poses, and a second module which takes the final predicted 2D landmarks and lifts them one last time into 3D space for the final estimate (see Figure 1).
At its heart, the architecture is a novel refinement of the Convolutional Pose Machine of Wei et al. [44], who reasoned exclusively in 2D, and proposed an architecture that iteratively refined 2D pose estimations of landmarks using a mixture of knowledge of the image and of the estimates of landmark locations of the previous stage. We modify this architecture by generating, at each stage, projected 3D pose belief maps which are fused in a learned manner with the standard maps. From an implementation point of view this is done by introducing two distinct layers, the probabilistic 3D pose layer and the fusion layer (see Figure 1).
Figure 3: Results returned by different stages of the architecture. Top Left: Evolution of the 2D skeleton after projecting the 3D points back into the 2D space; Bottom Left: Evolution of the beliefs for the landmark Left hand through the stages. Right: 3D skeleton with the relative mean error per landmark in millimeters. Even with incorrect landmark locations, the model returns a physically plausible solution.
Figure 3 shows how the 2D uncertainty in the belief maps is reduced at each stage of the architecture and how the accuracy of the 3D poses increases with each stage.
The sequential architecture consists of 6 stages. Each stage consists of 4 distinct components (see Figure 1):
Predicting CNN-based belief-maps: we use a set of convolutional and pooling layers, equivalent to those used in the original CPM architecture [44], that combine evidence obtained from learned image features with the belief maps obtained from the previous stage ($t-1$) to predict an updated set of belief maps for the 2D human joint positions.
Lifting 2D belief-maps into 3D: the output of the CNN-based belief maps is taken as input to a new layer that uses a new pretrained probabilistic 3D human pose model to lift the proposed 2D poses into 3D.
Projected 2D pose belief maps: the 3D pose estimated by the previous layer is projected back onto the image plane to produce a new set of projected pose belief maps. These maps encapsulate 3D dependencies between the body parts.
2D Fusion layer: the final layer in each stage (described in section 5.5) learns the weights to fuse the two sets of belief maps into a single estimate passed to the next stage.
Final lifting: The belief maps produced as the output of the final stage (t = 6) are then lifted into 3D to give the final estimate for the pose (see Figure 1) using our algorithm to lift 2D poses into 3D.
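The data flow through the stages listed above can be summarized by the following schematic sketch. All four callables (`predict_beliefs`, `lift_to_3d`, `project_to_beliefs`, `fuse`) are hypothetical placeholders for the corresponding components; their names and signatures are assumptions made purely to illustrate the order of operations.

```python
def run_stages(image_features, predict_beliefs, lift_to_3d,
               project_to_beliefs, fuse, n_stages=6):
    """Schematic forward pass; every callable is a placeholder component."""
    fused = None  # no belief maps are available before the first stage
    for t in range(n_stages):
        # 1. CNN-based belief maps from image evidence and the previous stage.
        beliefs = predict_beliefs(image_features, fused, t)
        # 2. Lift the proposed 2D pose into 3D with the probabilistic model.
        pose3d = lift_to_3d(beliefs)
        # 3. Project the 3D pose back onto the image plane as belief maps.
        projected = project_to_beliefs(pose3d)
        # 4. Fuse both sets of belief maps for use by the next stage.
        fused = fuse(beliefs, projected, t)
    # Final lifting: the output of the last stage is lifted once more.
    return lift_to_3d(fused), fused
```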
Convolutional Pose Machines [44] can be understood as an updating of the earlier work of Ramakrishna et al. [29] to use a deep convolutional architecture. In both approaches, at each stage t and for each landmark p, the algorithm returns dense per-pixel belief maps $b^p_t[u, v]$, which show how confident it is that a joint center or landmark occurs in any given pixel (u, v). For stages $t \geq 2$ the belief maps are a function of not just the information contained in the image but also the information computed by the previous stage.
In the case of convolutional pose machines, and in our work which uses the same architecture, a summary of the convolution widths and architecture design is shown in Figure 1, with more details of training given in [44].
Both [29, 44] predict the locations of different landmarks to those captured in the Human3.6M dataset. As such, the input and output layers in each stage of the architecture are replaced with a larger set to account for the greater number of landmarks. The new architecture is then initialized using the weights of the CPM model for all preexisting layers, with the new layers randomly initialized.
After retraining, CPMs return per-pixel estimates of landmark locations, while the techniques for 3D estimation (described in the next section) make use of 2D locations. To transform these belief maps into locations, we select the most confident pixel as the location of each landmark:
$$Y_p = \arg\max_{(u, v)} b_p[u, v] \tag{3}$$
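Selecting the most confident pixel per landmark amounts to an argmax over each belief map, as in the sketch below. The function name and the convention that (u, v) means (column, row) are assumptions made for illustration.

```python
import numpy as np

def belief_maps_to_landmarks(beliefs):
    """beliefs: (L, H, W) array. Returns an (L, 2) integer array of (u, v),
    where u is the column index and v the row index (assumed convention)."""
    n_landmarks, h, w = beliefs.shape
    flat_idx = beliefs.reshape(n_landmarks, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (h, w))
    return np.stack([cols, rows], axis=1)
```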
We follow [50] in assuming a weak perspective model, and first describe the simplest case of estimating the 3D pose of a single frame using a unimodal Gaussian 3D pose model as described in section 4. This model is composed of a mean shape $\mu$, a set of basis matrices $e$ and variances $\sigma^2$, and from this we can compute the most probable sample from the model that could give rise to a projected image:
$$\arg\min_{R, a} \| Y - s \Pi E R (\mu + a \cdot e) \|_2^2 + \| \sigma \cdot a \|_2^2 \tag{4}$$
where $\Pi$ is the orthographic projection matrix, E a known external camera calibration matrix, and s the estimated per-frame scale. Although, given R, this problem is convex in a and s together, for an unknown rotation matrix R the problem is extremely non-convex (even if a is known) and prone to sticking in local minima using gradient descent. Local optima often lie far apart in pose space and a poor optimum leads to significantly worse 3D reconstructions.
Figure 4: Left: Results from the Human3.6M dataset. The identified 2D landmark positions and 3D skeleton are shown for each pose taken from different actions: Walking, Phoning, Greeting, Discussion, Sitting Down. Right: Results on images from the MPII [5] (columns 1 to 3) and Leeds [18] datasets (last column). The model was not trained on images as diverse as those contained in these datasets; however, it often retrieves correct 2D and 3D joint positions. The last row shows example cases where the method fails either in the identification of 2D or 3D landmarks.
Figure 5: Landmark refinement: Left: 2D predicted landmark positions; Right: improved predictions using the projected 3D pose.
We take advantage of the matrix R's restricted form, which allows it to be parameterized in terms of a single angle $\theta$. Rather than attempting to solve this optimization problem using local methods, we quantize over the space of possible rotations, and for each choice of rotation we hold this fixed and solve for s and a, before picking the minimum-cost solution over all choices of R. With fixed choices of rotation the terms $\Pi E R \mu$ and $\Pi E R e$ can be precomputed and finding the optimal a becomes a simple linear least squares problem.
This process is highly efficient and by oversampling the rotations and exhaustively checking in 10,000 locations we can guarantee that a solution extremely close to the global optimum is found. In practice, using 20 samples and refining the rotations and basis coefficients of the best found solution using a non-linear least squares solver obtains the same reconstruction, and we make use of the faster option of checking 80 locations and using the best found solution as our 3D estimate. This puts us close to the global optimum and has the same average accuracy as finding the global optimum. Moreover, it allows us to upgrade from sparse landmark locations to 3D using a single Gaussian at around 3,000 frames a second using Python code on a standard laptop.
To handle models consisting of a mixture of Gaussians, we follow [27] and simply solve for each Gaussian independently and select the most probable solution.
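The exhaustive search over quantized rotations can be sketched as below for a single Gaussian. Several simplifications are assumptions rather than details from the text: the external calibration matrix is taken to be the identity, the scale s and the scaled coefficients b = s·a are solved for jointly in one regularized linear least squares problem (so the variance penalty is applied to b rather than a), and the names `rot_y` and `lift_2d_to_3d` are hypothetical.

```python
import numpy as np

PI = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])  # orthographic projection onto the image plane

def rot_y(theta):
    """Ground-plane rotation about the y (up) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def lift_2d_to_3d(Y, mu, e, sigma, n_rot=80):
    """Y: (2, L) landmarks, mu: (3, L) mean, e: (J, 3, L) bases, sigma: (J,)."""
    J = e.shape[0]
    target = np.concatenate([Y.ravel(), np.zeros(J)])
    best = None
    for theta in np.linspace(0.0, 2.0 * np.pi, n_rot, endpoint=False):
        P = PI @ rot_y(theta)                                # (2, 3)
        m = (P @ mu).ravel()                                 # projected mean
        G = np.stack([(P @ ej).ravel() for ej in e], axis=1)  # projected bases
        # Unknowns are [s, b_1..b_J]; the extra rows implement the penalty.
        A = np.vstack([np.column_stack([m, G]),
                       np.column_stack([np.zeros(J), np.diag(sigma)])])
        x, *_ = np.linalg.lstsq(A, target, rcond=None)
        cost = np.sum((A @ x - target) ** 2)
        if best is None or cost < best[0]:
            best = (cost, theta, x)
    _, theta, x = best
    s, a = x[0], x[1:] / x[0]
    pose3d = rot_y(theta) @ (mu + np.tensordot(a, e, axes=1))
    return theta, s, a, pose3d
```

Because each rotation only requires one small least squares solve, checking tens of rotations per frame stays cheap; for a mixture model the same routine would be run once per component and the lowest-cost result kept.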
The projected pose model is interleaved throughout the architecture (see Figure 1). The goal is to correct the beliefs regarding landmark locations at each stage, by fusing extra information about 3D physical plausibility. Given the solution R, s, and a from the previous component, we estimate a physically plausible projected 3D pose as
$$\hat{Y} = s \Pi E R (\mu + a \cdot e) \tag{5}$$
which is then embedded in a belief map as
$$\hat{b}^p_{i,j} = \begin{cases} 1 & \text{if } (i, j) = \hat{Y}_p \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
and then convolved using Gaussian filters.
Table 1: A comparison of the 3D pose estimation results of our approach on the Human3.6M dataset against competitors that follow Protocol #1 for evaluation (3D errors are given in mm). We substantially outperform all other methods in terms of average error, showing a 4.7mm average improvement over our closest competitor. Note that some approaches [37, 50] use video as input instead of a single frame.
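Embedding projected landmarks into belief maps can be sketched as a one-hot image per landmark followed by a separable Gaussian blur. The filter width `sigma`, the truncation radius and the function names are assumptions for illustration; the sketch also assumes the maps are larger than the blur kernel.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def projected_belief_maps(landmarks, height, width, sigma=1.5):
    """landmarks: (L, 2) array of (u, v) = (column, row) pixel coordinates.
    Assumes height and width exceed the kernel length 2 * int(3 * sigma) + 1."""
    k = gaussian_kernel_1d(sigma, int(3 * sigma))
    maps = np.zeros((len(landmarks), height, width))
    for p, (u, v) in enumerate(np.rint(landmarks).astype(int)):
        # One-hot embedding of the projected landmark (skipped if off-image).
        if 0 <= v < height and 0 <= u < width:
            maps[p, v, u] = 1.0
        # Separable Gaussian blur: first along columns, then along rows.
        maps[p] = np.apply_along_axis(np.convolve, 0, maps[p], k, mode="same")
        maps[p] = np.apply_along_axis(np.convolve, 1, maps[p], k, mode="same")
    return maps
```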
The 2D belief maps $\hat{b}^p_t$ predicted by the probabilistic 3D pose model are fused with the CNN-based belief maps $b^p_t$ according to the following equation:
$$f^p_t = w_t \, b^p_t + (1 - w_t) \, \hat{b}^p_t \tag{7}$$
where $w_t \in [0, 1]$ is a weight trained as part of the end-to-end learning. This set of fused belief maps is then passed to the next stage and used as an input to guide the 2D re-estimation of joint locations, instead of the belief maps used by convolutional pose machines.
Following [44], the objective or cost function $E_t$ minimized at each stage is the squared distance between the generated fusion maps of the layer, $f^p_t$, and ground-truth belief maps $b^p_*$ generated by Gaussian blurring the sparse ground-truth locations of each landmark p:
$$E_t = \sum_{p=1}^{L} \sum_{(u, v)} \| f^p_t[u, v] - b^p_*[u, v] \|_2^2 \tag{8}$$
For end-to-end training the total loss is the sum over all layers, $\sum_t E_t$. The novel layers were implemented as an extension of the published code of Convolutional Pose Machines [44] inside the Caffe framework [17] as Python layers, with weights updated using Stochastic Gradient Descent with momentum. Details of the novel gradient updates used when lifting estimates through 3D pose space are given in the supplementary materials.
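The fusion step and the per-stage objective reduce to a few array operations, sketched below in plain NumPy rather than as Caffe layers. The function names are hypothetical, and a single scalar weight per stage is assumed.

```python
import numpy as np

def fuse_belief_maps(cnn_maps, projected_maps, w):
    """Convex combination of the two sets of belief maps; w is the learned
    fusion weight of the stage, assumed to lie in [0, 1]."""
    return w * cnn_maps + (1.0 - w) * projected_maps

def stage_loss(fused_maps, gt_maps):
    """Squared distance between fused maps and ground-truth belief maps."""
    return np.sum((fused_maps - gt_maps) ** 2)

def total_loss(fused_per_stage, gt_maps):
    """End-to-end loss: the sum of the per-stage losses."""
    return sum(stage_loss(f, gt_maps) for f in fused_per_stage)
```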
6. Experimental evaluation
Human3.6M dataset: The model was trained and tested on the Human3.6M dataset consisting of 3.6 million accurate 3D human poses [15]. This is a video and mocap dataset of 5 female and 6 male subjects, captured from 4 different viewpoints, that show them performing typical activities (talking on the phone, walking, greeting, eating, etc.).
2D Evaluation: Figure 5 shows how the 2D predictions are improved by the projected pose model, reducing the overall mean error per landmark. The 2D error reduction using our full approach over the estimates of [44] is comparable in magnitude to the improvement due to the change of architecture moving from the work of Zhou et al. [50] to the state-of-the-art 2D architecture [44] (i.e. a reduction of 0.59 pixels vs. 0.81 pixels). See Table 2 for details.
3D Evaluation: Several evaluation protocols have been followed by different authors to measure the performance of their 3D pose estimation methods on the Human3.6M dataset. Tables 1 and 2 show comparisons of the 3D pose estimation with previous works, where we take care to evaluate using the appropriate protocol.
Table 2: Further evaluation on the Human3.6M dataset. Top two tables compare our 3D pose estimation errors against competitors on Protocols #2 or #3. Bottom table compares our 2D pose estimation error against competitors. Our approach, which lifts the 2D landmark predictions into a plausible 3D model and then projects them back into the image, substantially reduces the error. Note that [50] use video as input and knowledge of the action label.
Protocol #1, the most standard evaluation protocol on Human3.6M, was followed by [15, 22, 37, 35, 36, 50, 31]. The training set consists of 5 subjects (S1, S5, S6, S7, S8), while the test set includes 2 subjects (S9, S11). The original frame rate of 50 FPS is down-sampled to 10 FPS and the evaluation is on sequences coming from all 4 cameras and all trials. The reported error metric is the 3D error i.e. the Euclidean distance from the estimated 3D joints to the ground truth, averaged over all 17 joints of the Human3.6M skeletal model. Table 1 shows a comparison between our approach and competing approaches using Protocol #1. Our baseline method using a single unimodal probabilistic PCA model outperforms almost every method in most action types, with the exception of Sanzari et al. [31], which it still outperforms on average across the entire dataset. The mixture model improves on this again, offering a 4.76mm improvement over Sanzari et al., our closest competitor.
Protocol #2, followed by [46, 30], selects 6 subjects (S1, S5, S6, S7, S8 and S9) for training and subject S11 for testing. The original video is down-sampled to every 64th frame and evaluation is performed on sequences from all 4 cameras and all trials. The error metric reported in this case is the 3D pose error, equivalent to the per-joint 3D error up to a similarity transformation (i.e. each estimated 3D pose is aligned with the ground-truth pose, on a per-frame basis, using Procrustes analysis). The error is averaged over 14 joints. Table 2 shows a comparison between our approach and other approaches that use Protocol #2. Although our model was trained using only the 5 subjects used for training in Protocol #1 (one fewer subject), it still outperforms
Protocol #3, followed by [7], selects the same subjects for training and testing as Protocol #1. However, evaluation is only on sequences captured from the frontal camera (“cam 3”) from trial 1 and the original video is not subsampled. The error metric used in this case is the 3D pose error as described in Protocol #2. The error is averaged over a subset of 14 joints. Table 2 shows a comparison between our approach and [7]. Our method outperforms Bogo et al. [7] by almost 3mm on average, even though Bogo et al. exploit a high-quality detailed statistical 3D body model [23], trained on thousands of 3D body scans, that captures both the variation of human body shape and its deformation through pose.
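The two error metrics used by these protocols can be sketched as follows: the plain per-joint 3D error, and the same error after a per-frame similarity (Procrustes) alignment. The function names are hypothetical, and excluding reflections from the alignment is an assumption; the protocols as described here do not say whether reflections are allowed.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint; pred and gt are (K, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def procrustes_align(pred, gt):
    """Align pred to gt with the best scale, rotation and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so that R is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.array([1.0, 1.0, d])
    R = (U * D) @ Vt
    scale = (S * D).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def aligned_error(pred, gt):
    """Per-joint 3D error up to a similarity transformation."""
    return mpjpe(procrustes_align(pred, gt), gt)
```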
MPII and Leeds datasets: The proposed approach, trained exclusively on the Human3.6M dataset, can be used to identify 2D and 3D landmarks of images contained in different datasets. Figure 4 shows some qualitative results on the MPII dataset [5] and on the Leeds dataset [18], including failure cases. Notice how the probabilistic 3D pose model generates anatomically plausible poses even though the 2D landmark estimations are not all correct. However, as shown in the bottom row, even small errors in 2D pose can lead to drastically different 3D poses. These inaccuracies could be mitigated without further 3D data by annotating additional RGB images for training from different datasets.
7. Conclusion
We have presented a novel approach to human 3D pose estimation from a single image that outperforms previous solutions. We approach this as a problem of iterative refinement in which 3D proposals help refine and improve upon the 2D estimates. Our approach shows the importance of thinking in 3D even for 2D pose estimation within a single image, with our method demonstrating better 2D accuracy than [44], the 2D approach it is based upon. Our novel approach for upgrading from 2D to 3D is extremely efficient. When using 3 models, as in Tables 1 and 2, the upgrade for each stage in CPU-based Python code runs at approximately 1,000 frames a second, while a GPU-based real-time approach for Convolutional Pose Machines has been announced. Integrating these systems to provide a reliable real-time 3D pose estimator is a natural future direction, as is integrating this work with a simpler 2D approach for real-time pose estimation on lower power devices.
This work was funded by the SecondHands project, from the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 643950. Chris Russell was partially supported by The Alan Turing Institute under EPSRC grant EP/N510129/1.
[1] A. Agarwal and B. Triggs. Recovering 3d human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58, 2006.
[2] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org.
[3] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1455, 2015.
[4] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[5] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. Human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. Barrón and I. A. Kakadiaris. Estimating anthropometry and pose from a single uncalibrated image. Computer Vision and Image Understanding, 81(3):269–284, 2001.
[7] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
[8] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 690–696. IEEE, 2000.
[9] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems, pages 1736–1744, 2014.
[10] J. Cho, M. Lee, and S. Oh. Complex non-rigid 3d shape recovery using a procrustean normal distribution mixture model. International Journal of Computer Vision, 117(3):226–246, 2016.
[11] C. H. Ek, P. H. S. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In A. Popescu-Belis, S. Renals, and H. Bourlard, editors, MLMI, volume 4892 of Lecture Notes in Computer Science, pages 132–143. Springer, 2007.
[12] A. Elgammal and C. Lee. Inferring 3d body pose from silhouettes using activity manifold learning. In CVPR, 2004.
[13] X. Fan, K. Zheng, Y. Zhou, and S. Wang. Pose locality constrained representation for 3d human pose reconstruction. In European Conference on Computer Vision, pages 174–188. Springer, 2014.
[14] P. Gotardo and A. Martinez. Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[15] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
[16] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. arXiv preprint arXiv:1312.7302, 2013.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In Proceedings of the British Machine Vision Conference, 2010. doi:10.5244/C.24.12.
[19] H.-J. Lee and Z. Chen. Determination of 3d human body postures from a single view. Computer Vision, Graphics, and Image Processing, 30(2):148–168, 1985.
[20] M. Lee, J. Cho, and S. Oh. Procrustean normal distribution for non-rigid structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[21] S. Li and A. B. Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In Asian Conference on Computer Vision, pages 332–347. Springer, 2014.
[22] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2848–2856, 2015.
[23] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
[24] G. Mori and J. Malik. Recovering 3d human body configurations using shape contexts. PAMI, 2006.
[25] V. Parameswaran and R. Chellappa. View independent human body pose estimation from a single perspective image. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–16. IEEE, 2004.
[26] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1913–1921, 2015.
[27] N. Pitelis, C. Russell, and L. Agapito. Learning a manifold as an atlas. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1642–1649. IEEE, 2013.
[28] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human pose from 2d image landmarks. In European Conference on Computer Vision, pages 573–586. Springer, 2012.
[29] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh. Pose machines: Articulated pose estimation via inference machines. In European Conference on Computer Vision, pages 33–47. Springer, 2014.
[30] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in Neural Information Processing Systems, pages 3108–3116, 2016.
[31] M. Sanzari, V. Ntouskos, and F. Pirri. Bayesian image based 3d pose estimation. In European Conference on Computer Vision, pages 566–582. Springer, 2016.
[32] L. Sigal, R. Memisevic, and D. Fleet. Shared kernel information embedding for discriminative inference. In CVPR, 2009.
[33] E. Simo-Serra, A. Ramisa, G. Alenyà, C. Torras, and F. Moreno-Noguer. Single image 3d human pose estimation from noisy observations. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2673–2680. IEEE, 2012.
[34] C. J. Taylor. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 1, pages 677–684. IEEE, 2000.
[35] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3d human pose with deep neural networks. In British Machine Vision Conference (BMVC), 2016.
[36] B. Tekin, P. Márquez-Neila, M. Salzmann, and P. Fua. Fusing 2d uncertainty and 3d cues for monocular body pose estimation. arXiv preprint arXiv:1611.05708, 2016.
[37] B. Tekin, X. Sun, X. Wang, V. Lepetit, and P. Fua. Predicting people’s 3d poses from short sequences. arXiv preprint arXiv:1504.08200, 2015.
[38] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
[39] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
[40] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
[41] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[42] C. Wang, J. Flynn, Y. Wang, and A. L. Yuille. Representing data by a mixture of activated simplices. arXiv preprint arXiv:1412.4102, 2014.
[43] C. Wang, Y. Wang, Z. Lin, A. Yuille, and W. Gao. Robust estimation of human poses from a single image. In Computer Vision and Pattern Recognition (CVPR), 2014.
[44] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. arXiv preprint arXiv:1602.00134, 2016.
[45] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In European Conference on Computer Vision, pages 365–382. Springer, 2016.
[46] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-source approach for 3d pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[47] R. Zhao, Y. Wang, and A. Martinez. A simple, fast and highly-accurate algorithm to recover 3d shape from 2d landmarks on a single image. arXiv preprint arXiv:1609.09058, 2016.
[48] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. arXiv preprint arXiv:1609.05317, 2016.
[49] X. Zhou, M. Zhu, S. Leonardos, and K. Daniilidis. Sparse representation for 3d shape estimation: A convex relaxation approach. arXiv preprint arXiv:1509.04309, 2015.
[50] X. Zhou, M. Zhu, S. Leonardos, K. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. arXiv preprint arXiv:1511.09439, 2015.
1. Computing derivatives for back-propagation through our lifted model
As discussed in the Convolutional Pose Machines paper [1], recurrent-like architectures such as ours have problems with vanishing gradients, and for effective training they require an additional loss function to be defined for each layer that independently drives each individual layer to return correct predictions regardless of how this information is used in subsequent layers.
Before we give the derivation of the gradients it should be emphasized that it is entirely possible to train the network without using them. In fact, similar results can be obtained by only using the 3D lifting for the forward pass, and not back-propagating the lifting derivatives through the rest of the network. As the additional layers make use of custom Python-based derivatives rather than an efficient implementation, for computational reasons it might be preferable to avoid this step. Nonetheless, for completeness we include the derivatives.
There are two reasons the gradients are unneeded. First, the 3D lifting model we use makes its best predictions when the 2D predictions of the same layer are closest to ground truth, and this is a constraint naturally enforced by the objective of equation (8) of the main paper. Second, as with Convolutional Pose Machines [1], our architecture suffers from problems with vanishing gradients. To overcome this, Wei et al. [1] defined an objective at each layer, which acted to locally strengthen the gradients. However, a side effect of this multi-stage objective is that most of the effects of back-propagation happen locally and gradients back-propagated from other layers have little effect on the learning. This makes subtle interactions between layers less influential, and forces the learning process to concentrate on simply making accurate 2D predictions in each layer.
We first give the results for computing the gradients of sparse predicted locations from Y (see section 5 of the main paper), before discussing the gradients induced on the confidence maps by these sparse locations.
1.1. Landmark Gradients
In the interests of readability we neglect the use of indices to indicate stages; the reader should assume that all variables are taken from the same stage. Similarly, when dealing with a mixture of Gaussians, as we are only interested in computing a sub-gradient, the reader should assume that the best model has already been selected in the forward pass and we are computing gradients using only this model.
where R is a discrete set of rotations we exhaustively minimize over, and J is the number of bases in e. Owing to the use of discrete rotations, this mapping from Y to $\hat{Y}$ is a piecewise smooth approximation of the smooth function defined over a continuous R, and sub-gradients can be induced by fixing R to its current state. Hence:
For the remainder of the section, and to compact notation we will write E for the matrix of size the number of landmark points and J being the number of bases in e ) formed by unwrapping tensor . Similarly, we will unwrap the matrices Y and and write them as y and . We also write p for the vector representing the unwrapped set of 2D landmark positions .
We will use [y, 0] for the vector formed by vector y followed by J zeros, and for the matrix of size formed by concatenating E with the matrix that has values along the diagonals and zero everywhere else. We can rewrite
equation (3) in its new notation as:
and given R, we can rewrite equation (2) as
with continuing to represent the pseudo-inverse of . Hence
where is the truncation of .
1.2. Mapping belief gradients to coordinate transform
The coordinates of each predicted landmark induce a Gaussian in the belief map . So a change in the x component of induces an update which is equivalent to a difference of Gaussians.
and the same for the y component as well. For computational purposes we take as one pixel. As such, an induced gradient on the projected belief map near the predicted location induces an updating of that is propagated through to Y using the sub-gradients described in equation (8).
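The difference-of-Gaussians update can be illustrated with a finite-difference sketch: shifting a landmark by one pixel changes its belief map by the difference between two Gaussians, and the inner product of that difference with the gradient arriving on the belief map gives a gradient for the landmark coordinate. The function names, the Gaussian width and the unnormalized Gaussian are assumptions for illustration only.

```python
import numpy as np

def gaussian_map(u, v, height, width, sigma):
    """Unnormalized Gaussian centered at column u, row v."""
    cols = np.arange(width)[None, :]
    rows = np.arange(height)[:, None]
    return np.exp(-((cols - u) ** 2 + (rows - v) ** 2) / (2.0 * sigma ** 2))

def landmark_gradient(grad_belief, u, v, sigma=1.5, delta=1.0):
    """grad_belief: (H, W) gradient of the loss w.r.t. the belief map.
    Returns finite-difference gradients w.r.t. the landmark's u and v,
    using a step of delta pixels (one pixel by default)."""
    h, w = grad_belief.shape
    base = gaussian_map(u, v, h, w, sigma)
    # Moving the landmark by delta changes the map by a difference of Gaussians.
    d_u = gaussian_map(u + delta, v, h, w, sigma) - base
    d_v = gaussian_map(u, v + delta, h, w, sigma) - base
    return np.sum(grad_belief * d_u) / delta, np.sum(grad_belief * d_v) / delta
```

For a squared-error loss against a target map, `grad_belief` would be the difference between the current map and the target, so the returned gradient is negative along a direction that moves the landmark toward the target.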
Updating B: Writing B for the set of all , and assuming is not in the right location, i.e. given updates on
any update of b in which we decrease the belief at and increase anywhere else is a valid sub-gradient. We choose as a sensible update a negative step at of magnitude and a positive update for each element Y of of of the magnitude in the quadrant of a Gaussian of the same width used to generate (i.e. see section 5.6 of main paper) and with the same direction as in each x and y coordinate.