Sparse Representation for 3D Shape Estimation: A Convex Relaxation Approach

2015·Arxiv

Abstract

Abstract

We investigate the problem of estimating the 3D shape of an object defined by a set of 3D landmarks, given their 2D correspondences in a single image. A successful approach to alleviating the reconstruction ambiguity is the 3D deformable shape model and a sparse representation is often used to capture complex shape variability. But the model inference is still challenging due to the nonconvexity in the joint optimization of shape and viewpoint. In contrast to prior work that relies on an alternating scheme whose solution depends on initialization, we propose a convex approach to addressing this challenge and develop an efficient algorithm to solve the proposed convex program. We further propose a robust model to handle gross errors in the 2D correspondences. We demonstrate the exact recovery property of the proposed method, the advantage compared to several nonconvex baselines and the applicability to recover 3D human poses and car models from single images.

1 INTRODUCTION

Recognizing 3D objects from 2D images is a central problem in computer vision. Past years have witnessed an emerging trend towards analyzing 3D geometry of objects including shapes and poses instead of merely providing bounding boxes [1], [2]. The 3D geometric reasoning can not only provide richer information about the scene for subsequent high-level tasks such as scene understanding, augmented reality and human computer interaction, but also improve 2D recognition performances [3], [4]. 3D reconstruction has been a well studied problem and there have been many practically applicable techniques such as structure from motion, multi-view stereo and depth sensors, but these techniques are limited in some scenarios. For example, a large number of acquisitions from multiple views are often required in order to obtain a complete 3D model, which is not preferred in some real-time applications; the depth sensors in general cannot work outdoor and have a limited sensing range; and reconstructing a dynamic scene is still an open problem. In this paper, we aim to investigate the possibility of estimating the 3D shape of an object from a single 2D image, which is complementary to the aforementioned techniques and may potentially address the above issues.

Estimating the 3D geometry of an object from a single view is an ill-posed problem. But it is a possible task for a human observer, since human can leverage visual memory of object shapes. Inspired by this idea, much effort has been made towards 3D model-based analysis leveraging the increasing availability of online 3D model databases, such as the Google 3D warehouse that includes millions of CAD models of various objects and many publicly available shape scans and motion capture data that model the 3D shape and pose of human.

To address intra-class variability and nonrigid deformation and avoid exhaustively enumerating all possibilities, many previous works, e.g. [5], [6], adopted a 3D deformable model originated from the “active shape model” [7] to represent shapes, where each shape is defined by a set of ordered landmarks and the one to be estimated is assumed as a linear combination of some basis shapes usually obtained from principal component analysis (PCA). For human poses, the sparse representation was proposed to handle highly articulated deformation of human bodies which cannot be captured by PCA [8], [9]. In model inference, the 3D deformable model is matched to the landmarks annotated or detected in images, and the problem is reduced to a 3D-to-2D shape fitting problem where the shape (weights of the linear combination) and the viewpoint (camera extrinsic parameters) have to be estimated simultaneously. Figure 1 gives an illustration of the problem.

While this approach has achieved promising results in various applications, the model inference is still a challenging problem, since the subproblems of shape and viewpoint estimation are coupled: the viewpoint needs to be known in order to fit the 3D model to 2D, and inversely, the 3D model needs to be known in order to estimate the viewpoint. The joint estimation of shape and viewpoint results in a nonconvex optimization problem, and the orthogonality constraint on the camera rotation makes the problem even more complicated. Previous methods often rely on an alternating scheme to alternately update the shape and viewpoint parameters, which has no guarantee for global convergence and is sensitive to initialization. As mentioned in many previous works, e.g. [8], [5], most of the failed cases were attributed to bad initialization. Some heuristics have been proposed to address this issue, such as initializing multiple times [9] or using a viewpoint-aware detector [6]. But there is still no guarantee for global optimality.

In this paper, we propose a convex relaxation approach

Fig. 1. Illustration of the problem studied in this paper. The unknown 3D model is defined by a set of landmarks and assumed to be a linear combination of some predefined basis shapes with sparse coefficients. Given the 2D correspondences of the landmarks in a single image, the computational problem is to simultaneously estimate the coefficients of the sparse representation as well as the viewpoint of the camera.

to addressing the aforementioned issue. We use an augmented shape-space model, in which a shape is represented as a linear combination of rotatable basis shapes. This model gives a linear representation of both intrinsic shape deformation and extrinsic viewpoint changes. Next, we use the convex relaxation of the orthogonality constraint to convert the entire problem into a spectral-norm regularized least squares problem, which is a convex program. Then, we develop an efficient algorithm to solve the proposed convex program based on the alternating direction method of multipliers (ADMM). To achieve fast computation, we derive a closed-form solution to compute the proximal operator of the spectral norm. Furthermore, we extend the proposed model to handle outliers in the input 2D correspondences. This work extends its earlier version [10] with the extensions including postprocessing for rotation synchronization, outlier modeling, more experiments and real examples with learned detectors.

The remainder of this paper is organized as follows. We summarize related work in Section 2, formulate the problem in Section 3, introduce the proposed method and technical details in Section 4, report empirical results in Section 5, and finally conclude the paper in Section 6. The code is available at http://cis.upenn.edu/xiaowz/shapeconvex.html.

2 RELATED WORK

The most related work includes the ones that fit a 3D deformable model to the features in a 2D image. A popular application is human face modeling for various tasks such as recognition [11], feature tracking [12] and facial animation [13]. Recently, there has been an increased number of works on 3D object modeling. For example, Hejrati and Ramanan [5] used the deformable model for 3D car modeling. They produced 2D landmarks by a variant of deformable part models [14] and then lifted the 2D model to 3D. Lin et al. [15] proposed a method for joint 3D model fitting and fine-grained classification for cars. Zia [6] et al. developed a probabilistic framework to simultaneously localize 2D landmarks and recover 3D object models.

While the conventional active shape model performs well to describe the shape variability of rigid objects, it can hardly handle the structural variability of nonrigid objects such as human bodies. To address this issue, Ramakrishna et al. [8] proposed a sparse representation based approach to reconstructing 3D human pose from annotated joints in a still image. Wang et al. [9] adopted a 2D human pose detector to automatically locate the joints and used a robust estimator to handle inaccurate joint locations. Fan et al. [16] proposed to improve the performance of [8] by enforcing locality when building the pose dictionary. Zhou and De La Torre [17] formulated human pose estimation as a matching problem, where a learned spatio-temporal pose model was matched to point trajectories extracted from a video. Akhter and Black [18] integrated a joint-angle constraint into the sparse representation to reduce the possibility of invalid reconstruction.

Recently, there was growing interest in reconstructing category-specific object models from a collection of single images of different instances with the same category [19], [20], [21], [22]. The deformable shape model was adopted and fitted to either visual hulls in the pre-segmented images or landmarks annotated in the images, and at the same time the basis shapes were also learned from images during the model inference. The problem studied in this paper can be regarded as a subproblem in this process where the basis shapes are given.

A common component or an intermediate step in the works mentioned above is to fit a 3D deformable model to 2D correspondences. As we mentioned in the introduction, these works relied on nonconvex optimization, which may be sensitive to initialization. The convex formulation proposed in this paper can potentially serve as a building block or provide a good initialization to improve the performance of the existing methods.

Another line of work tried to solve single-image reconstruction by finding the nearest neighbor in the shape collection followed by a refinement step [23], [24]. The initial shape and viewpoint were found by enumerating all instances and viewpoints and comparing the test image with the rendered one. The initial estimate was refined by optimizing both the viewpoint and nonrigid deformation according to image contours. This approach produces very detailed reconstruction and is applicable to a wide range of object categories, but it is computationally expensive and requires accurate image segmentation.

There are also many alternative approaches for 3D human pose recovery from single images such as using a known articulated skeleton [25], [26], [27], probabilistic graphical models [28], [29], explicit regression [30], [31], to name a few. But these approaches are customized for human pose modeling and not straightforwardly generalizable for other objects.

Our work is also closely related to nonrigid structure from motion (NRSfM), where a deformable shape is recovered from multi-frame 2D-2D correspondences. The low-rank shape-space model has been frequently used in NRSfM, but the basis shapes are unknown. The joint estimation of shape/pose variables and basis shapes was typically solved via matrix factorization followed by metric rectifi- cation [32], [33]. In some recent works, iterative algorithms were employed for better precision [34], [35] or sequential processing [36], and the problem studied in this paper is analogous to the step of fixing basis shapes and updating the remaining variables in those works.

3 PROBLEM STATEMENT

The problem studied in this paper can be described by the following linear system:

where denotes the unknown 3D shape, which is represented by 3D locations of p points. denotes their projections in a 2D image. is the camera calibration matrix. To simplify the problem, the weakperspective camera model is usually used, which is a good approximation when the object depth is much smaller than the distance from the camera. With this assumption, the calibration matrix has the following simple form:

where s is a scalar depending on the focal length and the distance to the object.

There are always more variables than equations in (1). To make the problem well-posed, a widely-used assumption is that the unknown shape can be represented as a linear combination of predefined basis shapes, which is originated from the active shape model [7]:

where for represents a basis shape and its weight in the combination. Thus, the reconstruction problem is reduced to a problem of estimating several coefficients by fitting the model (3) to the landmarks in an image, which greatly reduces the number of unknowns.

Since the basis shapes are predefined, the relative rotation and translation between the camera frame and the coordinates that define the basis shapes need to be taken into account, and the 3D-2D projection is depicted by:

where and correspond to the rotation matrix and the translation vector, respectively. R should be in the special orthogonal group

Equation (4) can be further simplified as

where denotes the first two rows of the rotation matrix, and the translation T has been eliminated by centralizing the data, i.e. subtracting each row of W and B by its mean. Note that the scalar s in the calibration matrix has been absorbed into .

In the active shape model [7], the principal components of training samples are used as the basis shapes, which assumes that all shapes lie in a low-dimensional linear space. In more recent work, e.g. [8], [37], [38], [39], it has been shown that the low-dimensional linear space is insufficient to model complex shape variation, e.g., human poses, and a promising approach is to use an over-complete dictionary and represent an unknown shape as a sparse combination of atoms in the dictionary. Such a sparse representation implicitly encodes the assumption that the unknown shape should lie in a union of subspaces that approximates a nonlinear shape manifold. The representabilities of PCA and sparse representation for 3D human pose modeling will be empirically compared in Section 5.2.

Based on the sparse representation of shapes, the fol- lowing type of optimization problem is often considered to estimate an unknown shape:

where and represents the norm of c, which is the convex surrogate of the cardinality. denotes the Frobenius norm of a matrix. The terms in the loss function of (7) correspond to the reprojection error and the sparsity of representation, respectively.

The optimization in (7) is nonconvex and there is an orthogonality constraint. A commonly-used strategy to solve the optimization is the alternating minimization scheme, in which two steps are alternated: (1) fixing and updating c by any minimization solver, e.g., the methods summarized in [40]; (2) fixing c and updating using certain rotation representations such as the quaternions, the exponential map or a manifold representation. In some previous works [8], [18], is updated by the singular value decomposition (SVD), and it is worth noting that this method can only yield an approximate solution since is not a square matrix and . Generally no closed-form solution exists in this case [41].

The alternating algorithm is summarized in Algorithm 1. Due to the nonconvexity of (7), Algorithm 1 may get stuck at local minima when initialization is far away from the true solution.

1 initialize c and ;

2 while not converged do

3 update c using any minimization solver; 4 update using SVD or any local optimization solver over SO(3);

5 end

Algorithm 1: Alternating minimization to solve (7).

4 PROPOSED METHODS

4.1 Convex relaxation

We propose to use the following shape-space model:

in which there is a rotation for each basis shape. The model in (8) implicitly accounts for the viewpoint variability and the projected 2D model is

where is the product of and the first two rows of , which satisfies

The motivation of using the model in (9) is to achieve a linear representation of shape variability in 2D, such that we can get rid of the bilinear form in (6), which is a necessary step towards a convex formulation.

The model in (9) is equivalent to the affine-shape model in literature [42], [43], which uses an augmented linear space to represent the shape variation in 2D caused by both intrinsic shape deformation and extrinsic viewpoint changes. This representation also appears in most NRSfM literature, e.g. [32], [34]. As mentioned in [43], the augmented linear space can represent any 2D shape produced by the 3D shape model projected into the image plane, but the increased degree of freedom may result in invalid shapes. In this work, we try to reduce the possibility of invalid cases by enforcing the orthogonality constraint on s and the sparsity constraint on the number of activated basis shapes. We will show that these constraints can be conveniently imposed by minimizing a convex regularizer.

Next, replacing the orthogonality constraint in (10) by its convex counterpart is considered. The following lemma has been proven in literature [44, Section 3.4]:

Lemma 1. The convex hull of the Stiefel manifold Q = equals the unit spectral-norm ball conv denotes the spectral norm (a.k.a. the induced 2-norm) of a matrix X, which is defined as the largest singular value of X.

Based on Lemma 1, we have the following proposition:

Proposition 1. Given a scalar s, the convex hull of S = equals the spectral-norm ball with a radius of |s|: conv .

Proof. Since , we have

Consequently, the tightest convex relaxation to the constraint in (10) is given by .

Finally, with the modified shape model, the relaxed orthogonality constraint and the assumption of sparse representation, the following optimization is proposed for shape recovery under the noiseless case:

In (11), reducing can decrease the objective and the only constraint on is that . Therefore, is equal to when an optimum is attained. Consequently, (11) is equivalent to the following problem:

The formulation in (12) is a linear inverse problem, where a set of orthogonal matrices1 are estimated by minimizing their spectral norms. Interestingly, the conditions for exact recovery using such a convex program has been theoretically analyzed in [45]. Numerical results will be presented in Section 5.1 to demonstrate the exact recovery property.

To consider noise in real applications, the following formulation is proposed:

The problem (13) is the final formulation, which is a penalized least-squares problem. We have following remarks:

1. The problem in (13) is convex programming, which can be solved globally. We will provide an efficient algorithm to solve it in Section 4.3.

2. Notice that in the above formulations denotes the spectral norm of a matrix. As we will show in Section 4.2, minimizing the spectral norm of a matrix is equivalent to minimizing the -norm of the vector of singular values, which will simultaneously shrink the norm of the matrix towards zero and enforce its singular values to be equal. Therefore, by spectral-norm minimization, we can not only minimize the number of activated basis shapes but also enforce each transformation matrix to be orthogonal (an orthogonal matrix has equal singular values).

3. For the cases with missing or invisible landmarks, the model parameters can be estimated by minimizing reprojection errors of observed landmarks, i.e., including a binary weight matrix in the first term of (13). The unobserved landmarks can be hallucinated from the reconstructed model as their locations are known on the basis shapes.

Fig. 2. Illustration of the proximal operator of the spectral norm.

4.2 Proximal operator of the spectral norm

Before providing the specific algorithm to solve (13), we first prove the following proposition, which will serve as an important building block in our algorithm.

Proposition 2. The solution to the following problem

is given by , where

and denote the left singular vectors, right singular vectors and the singular values of A, respectively. is the projection of a vector to the unit -norm ball.

Proof. The problem in (14) is a proximal problem [46]. The proximal problem associated with a function F is defined as

with the solution denoted by proxand named the proximal operator of F.

For the problem (14), , where means the norm. It says that F is a spectral function operated on singular values of a matrix. Based on the property of spectral functions [46, Section 6.7.2], we have

where f is the -norm. The proximal operator of the -norm can be computed by Moreau decomposition [46, Section 6.5]:

given that the -norm is the dual norm of the -norm.

The solution for the case with two singular values is illustrated in Figure 2, where corresponds to in (15). From the illustration we have two observations: (1) If the projection is on the edge of the -norm ball, then and turns out to be orthogonal; (2) If is small and lies inside the -norm ball, then the projection will be itself, , and turns out to be all-zero. These two observations illustrate the effects of spectral-norm regularization in (13): (1) Enforcing s to be orthogonal; (2) Pruning small s to achieve a sparse representation.

4.3 Optimization

The algorithm to solve (13) is presented here. The noiseless case (12) can be solved similarly.

The proposed algorithm is based on ADMM [47] and the proximal operator of the spectral norm derived in Section 4.2. An auxiliary variable Z is introduced and (13) is rewritten as

where

The augmented Lagrangian of (19) is

where Y is the dual variable and is a parameter controlling the step size in optimization. Then, the ADMM alternates the following steps until convergence:

The subproblem in (22) can be rewritten as

where is the i-th column-triplet of . Therefore, each can be updated separately by solving a proximal problem based on Proposition 2:

For the subproblem in (23), is a quadratic form and admits a closed-form solution:

The steps are summarized in Algorithm 2. Following the standard proof in [47], it can be shown that the sequences of values produced by the iterations in Algorithm 2 converge to the optimal solution of the problem (19), which is also the optimal solution to the original problem (13). In implementation, the convergence criterion and the adaptive scheme for tuning step size from [47, Section 3] are adopted.

1 initialize

2 while not converged do

3 parallel for i = 1 to k do 4 -th column-triplet of

5

6 end 7 Z8 Y

9 end

4.4 Reconstruction

After (13) is solved, two algorithms for reconstructing the 3D shape from the estimated s are proposed.

4.4.1 Direct reconstruction

The 3D shape is reconstructed according to the relaxed shape model (8), where and are recovered from the estimated . The steps are summarized in Algorithm 3, where and denote the j-th row vectors of and , respectively. Note that is also a feasible solution. To eliminate the ambiguity, we assume that and impose this constraint when learning the shape dictionary.

1 for i = 1 to k do

2 ; 3 ; 4 ; 5 ; 6 ; 7 end

8 ; Algorithm 3: Direct reconstruction

4.4.2 Rotation synchronization and refinement

The 3D shape is reconstructed according to the original shape model (3). In order to recover the coefficient vector c and a single rotation from the estimated s, the following rotation synchronization problem is solved:

The problem in (28) corresponds to the “metric projection” step in [34], [35], which can be exactly solved by semidefinite programming [34] or approximately by a more efficient algorithm [35, Section 4.4.1].

The solution from the synchronization described above is not necessarily the solution to the original problem (7), but it can be used as a good initialization and refined by the alternating minimization in Algorithm 1. In Section 5, it will be demonstrated that such a better initialization can clearly improve the performance of alternating minimization. A potential issue in the synchronization is that, when the rotations obtained from the convex program are very different from each other, the consensus of them might be meaningless and fails to provide a good initialization.

The full steps are summarized in Algorithm 4.

1 initialize c and by solving (28) using the algorithm in [35, Section 4.4.1];

2 refine c and by Algorithm 1;

3 ;

4.5 Outlier modeling

In real applications, the 2D correspondences are given by detectors and there are likely to be gross errors (outliers). In this case, we explicitly model the outliers by a sparse matrix E and modify the formulation in (13) as

Note that in this case the translation T cannot be eliminated by centralizing the data at the beginning.

The problem in (29) can be solved by ADMM as well. The algorithm is presented in Appendix A.

4.6 Dictionary learning

In general, one can simply use all existing or a few representative 3D models as the basis shapes. But when the collection is very large, e.g., a motion capture dataset often contains thousands of human skeletons, it is necessary to use the dictionary learning technique to learn a dictionary of basis shapes that can concisely summarize the variability in training data and enable a sparse representation. The dictionary learning problem can be formulated as follows:

where s denote the training shapes, s are the basis shapes to be learned, and represents the i-th coefficient for the j-th training shape. The two terms in the cost function correspond to the reconstruction error and the sparsity of representation, respectively. The nonnegativity constraint is introduced to remove the ambiguity as discussed in Section 4.4.1. The problem in (30) is locally solved by alternately updating C and s via projected gradient descent, an algorithm widely used in dictionary learning that converges to a local optimum [48]. The algorithm is summarized in Appendix B.

Fig. 3. The frequency of exact recovery. The intensity indicates the frequency of success under a variety of problem settings.

5 EXPERIMENTS

5.1 Exact recovery

We investigate whether the spectral-norm minimization in (12) can exactly solve the ill-posed inverse problem based on the prior knowledge of sparsity and orthogonality using simulation.

More specifically, the data is synthesized with the following model:

where (k = 50 and p is varying) are randomly generated bases with entries sampled independently from the normal distribution N(0, 1). Each is generated from , where is sampled from a uniform distribution U(0, 1) and denotes the first two rows of a rotation matrix with angles uniformly sampled from . Only a randomly selected subset of are left as nonzero with a varying cardinality of z. Then, W and B are taken as input and s are estimated by solving (12). The recovery error is defined as

where s are concatenated in , and is the algorithm estimate. A recovery is regarded as exact if the relative error .

Figure 3 shows the frequency of exact recovery with varying p (number of landmarks) and z (cardinality of true coefficients), which is evaluated over 100 trials for each setting. Note that the number of unknowns (6k) is much larger than the number of equations (2p). The proposed convex program can exactly solve the problem with a frequency equal to 1 in the lower-triangular area, where the number of landmarks is sufficiently large and the coefficients are truly sparse. This demonstrates the power of convex relaxation, which has proven to be successful in many inverse problems, e.g., compressed sensing [49] and matrix completion [50]. The performance drops in more ill-posed cases in the upper-triangular area. This observation is analogous to the phase transition in compressive sensing, where the recovery probability also depends on the number of observations and the underlying signal sparsity [51].

Note that the original model (6) is a special case of the relaxed model (31) when s are the scaled versions of each other. The result obtained by using the original model for data generation is similar to Figure 3.

Fig. 4. Representability of principal component analysis (PCA), sparse coding (SC) and nonnegative sparse coding (NNSC) for human poses.

5.2 Human pose estimation

The applicability of sparse representation for 3D human pose recovery has been thoroughly studied in previous work [8], [9], [16]. In this paper, we aim to illustrate the advantage of the proposed convex program compared to the alternating minimization scheme.

Empirical evaluation is carried out on the CMU motion capture dataset [52]. The dataset includes thousands of sequences of 3D human skeletons and part of them are selected for evaluation. More specifically, eight motion categories are considered including walking, running, jumping, climbing, boxing, dancing, sitting and basketball. For each motion, six sequences are collected from different subjects with three sequences for training and the rest for testing. Then, a dictionary is learned with all training sequences from various motions and used to reconstruct the test sequences. A 15-joint model (i.e., head, thorax, pelvis, shoulders, elbows, wrists, hips, knees, and ankles) is adopted to represent a human skeleton.

To build the shape dictionary, all 3D poses in the training set are collected and aligned by the Procrustes method. The size of the dictionary k is set as 128 yielding an overcomplete dictionary since the dimension of the basis vector is 45 (15 joints in 3D). The dictionary is initialized by uniformly sampling k poses from the training data and updated by the algorithm described in Section 4.6. The representability of the learned dictionary is measured by the reconstruction error, i.e., the first term in (30). To evaluate the representability, (30) is solved multiple times with a variety of yielding a sequence of reconstruction errors and the corresponding sparsity levels. The more general case without the nonnegativity constraint is also tested. The result is shown in Figure 4, which indicates that the sparse coding needs much fewer activated bases compared to PCA to achieve the same reconstruction error, and adding the nonnegativity constraint sacrifices the representability but the difference is small in our experiment.

To generate 2D testing data, an orthographic camera rotating around the subject (360 degrees per sequence) is simulated and the 3D joints are projected to 2D with the simulated camera. The following methods are compared:

• Convex: the proposed convex program with direct reconstruction (Algorithm 2+Algorithm 3).

• Convex+refine: the proposed convex program followed by refinement (Algorithm 2+Algorithm 4). The local update of rotation is implemented by the

Fig. 5. Quantitative results on the CMU motion capture dataset. (a) The mean 3D estimation errors for different motions. (b) The box plot of estimation errors. (c) The comparison between objective values achieved by the mean shape initialization and by the convex initialization. (d) The sensitivity to Gaussian noise. (e) The sensitivity to outliers. The length of error bar in (d) and (e) indicates half standard deviation.

manifold optimization over the Stiefel manifold with the Manopt toolbox [53].

• Altern: the alternating algorithm with the mean shape as initialization (Algorithm 1). The rotation is updated by SVD as in previous work [8], [16], [17].

• PMP: Projected Matching Pursuit from Ramakrishna et al. [8], which solves sparse coding by greedily selecting basis shapes instead of minimization. The dictionary for PMP is constructed by the method provided in the original paper but with the same training data as ours.

The estimation error is evaluated by the 3D Euclidian distances between the estimated joint locations and the ground truth in the camera frame up to a translation and scaling. The mean errors for various motions are shown in Figure 5(a) and the statistics of errors in Figure 5(b). The proposed convex method consistently outperforms the nonconvex methods for all motions. Comparing “convex+refine” and “altern” which solve the same problem (7) at the end, the error is apparently decreased by using better initialization (the convex solution). To further verify the fact that the alternating minimization depends on initialization, we compare the objective values achieved by the alternating algorithm initialized by the convex solution and the mean shape. As shown in Figure 5(c), the objective values achieved by convex initialization are always smaller. The mean errors of “convex” and “convex+refine” are comparable on the average. The rotation synchronization and refinement may lead to worse reconstructions if the convex solution fails to provide a good initialization or the optimal solution to (7) is not necessarily the best reconstruction, which may happen when the pose to be reconstructed cannot be well represented by the learned dictionary.

Some visual examples of reconstructed poses for various motions are shown in Figure 6. All methods perform well in the “walk” and “run” examples, where the poses are close to the mean pose (stand upright). But for more complex motions, the nonconvex algorithms may be stuck at local optima as shown in other examples, while the proposed convex algorithm still obtains appealing results. A demonstration video showing several reconstructed 3D pose sequences is included in the supplementary material. The reconstructions by the convex method are much smoother over time compared to the ones by the nonconvex method which suffer from abrupt changes due to the inherent instability of nonconvex optimization.

The robustness of the proposed method against Gaussian noise is also tested. Gaussian noise with a varying standard deviation is added to the 2D input. The estimation error versus the standard deviation of the simulated noise is shown in Figure 5(d). The performances of the convex methods are consistently better than the alternating ones under all noise levels. To test the robustness of the extended model in (29) against outliers, a proportion of landmarks are selected with their 2D locations changed to be some random

Fig. 6. Qualitative results on the CMU motion capture dataset. The rows from top to bottom correspond to the input 2D poses, the ground-truth 3D poses (visualized in a novel view), and the reconstructions from the proposed method, the alternating minimization, and the PMP method [8], respectively. Red and green indicate left and right, respectively.

values within a proper range (roughly the rectangle that covers the skeleton). For comparison, the nonconvex model (7) is also extended by introducing an outlier term similar to the one in (29) and minimized by the alternating scheme. The curves of estimation error versus outlier proportion are shown in Figure 5(e). The robust models clearly outperform the original models and the convex method achieves a better performance than the alternating one.

To illustrate the real applicability of the proposed method, we apply the 2D pose detector from Yang and Ramanan [54] on the PARSE dataset [55], and use the proposed method to lift the detected 2D poses to 3D with the dictionary learned on the CMU dataset. Some selected results are shown in Figure 7.

5.3 Car model estimation

The applicability of the proposed method for car model estimation is demonstrated using the Fine-Grained 3D Car (FG3DCar) dataset [15], which provides images of cars, 2D landmark annotations and some 3D models. The 3D models of 15 cars are concatenated as the shape dictionary to reconstruct the other cars from the visible landmarks annotated in the images (40 points per image). Several examples of reconstruction are shown in Figure 8. For comparison, the reconstructions from the algorithm proposed in the original paper [15] are also shown, which adopts a perspective camera model and locally updates the camera and shape parameters with a Jacobian system initialized by the mean shape and predefined camera parameters. The nonlinear optimization performs well in the sedan example but fails in the SUV and truck examples, where the models deviate far away from the mean shape. Similar results were reported in the original paper [15] and the authors proposed to use the class-specific mean shape for better initialization. Instead, the proposed method can achieve appealing results with arbitrary initialization.

We observed that the alternating minimization in Algorithm 1 also performed very well for car model reconstruction, as the shape variability of cars was relatively small and the mean-shape initialization was very close to the true solution. But when there are outliers in the observation, the initial estimate of viewpoint might be unreliable and the alternating algorithm may fail. To validate this, some outliers are simulated. More specifically, for each image 20 annotated landmarks are randomly selected with their locations changed to be random values in the image plane. Then, the robust models are solved by the convex method and by the alternating algorithm initialized with the mean shape, respectively. Since there is no 3D ground truth (the 3D models provided in the dataset were reconstructed by the authors), the 2D shape fitting error is evaluated, i.e., comparing the 2D projection of the reconstructed 3D model with the original 2D annotations, to see whether the synthesized outliers can be corrected. The comparison is shown in Figure 9. While the estimation errors are identical for most examples, there are still many examples in which the convex method performs apparently better than the alternating method. The same conclusion can be drawn by comparing

Fig. 7. Qualitative results on the PARSE dataset with 2D poses detected by an existing 2D pose detector [54]. In each example, the detected 2D pose superposed on the original image and the reconstruction in two different views are shown. The last row shows two failed examples.

Fig. 8. Qualitative results on the FG3DCar dataset given 2D correspondences. The columns correspond to the original image, the input 2D landmarks, and the 2D/3D models output by the proposed method and the nonlinear optimization [15], respectively. Only visible landmarks (per image) are used for model fitting. The 3D models are visualized in a novel view different from the original image. The car models are the BMW 5 Series 2011 (sedan), the Nissan Xterra 2005 (SUV) and the Dodge Ram 2003 (pick-up truck), respectively.

Fig. 9. Quantitative comparison between the proposed method (“con- vex+refine”) and the alternating minimization (“altern”) on the FG3DCar dataset with synthesized outliers. (a) The 2D shape error. (b) The objective value.

the objective values. The mean 2D errors of the “altern”, “convex” and “convex+refine” methods are 44.47, 37.21 and 34.76 pixels, respectively. The rotation synchronization and refinement is beneficial here as the shape variability is small and a more constrained model is preferred.

Finally, we demonstrate the real applicability of the proposed method with learned landmark detectors. For each landmark, the image patches around the landmark in training images are collected as positive examples and a number of random patches in the background as negative examples. A classifier based on SVM with HOG features is trained with the collected examples. To handle 2D appearance variability, the positive examples are clustered and a mixture of classifiers is learned for each landmark. During testing, a test image is scanned by the learned classifiers and the locations with maximum responses are picked as the landmark locations. The detections are shown in the first column of Figure 10. The false detections are mostly due to self occlusion. The visibility is unknown and we fit the model with all detected landmarks despite of their correctness. The 2D fitted models and 3D reconstructions are shown in the next two columns, which demonstrate that the proposed method can robustly fit the model at the presence of many outliers and without knowing the visibility. The last row shows two failed examples, in which more than half of the landmarks are incorrectly located.

5.4 Parameter tuning

The parameters and in the proposed model (13) and (29) control the sparsity of shape coefficients and outliers, respectively. In general, the best values of and can be obtained by cross validation for a specific task. But empirically we find that the estimation error changes very smoothly with the parameters varying in a proper range and a set of fixed parameters works well after normalizing the data. More specifically, the coordinates of input 2D shape and 3D basis shapes are scaled such that the average variance of each shape along all directions is unit. With the data normalization, the estimation error as a function of parameters over a subset of CMU motion capture data is shown in Figure 11. Each curve has a relatively flat valley. In all experiments in this paper, we fix and after the data normalization.

Fig. 11. Sensitivity to model parameters on the CMU motion capture dataset. (a) Estimation error of model (13) as a function of Estimation error of model (29) as a function of simulated outliers.

5.5 Computational time

The proposed algorithms are implemented in MATLAB and tested on a desktop with a Intel i7 3.4GHz CPU and 8G RAM. In our experiments, the ADMM algorithm generally converges within 500 iterations to reach a stopping criterion of . In human pose estimation, the average computational time of our algorithm is 1.91s per frame, while those of the alternating algorithm and the PMP algorithm [8] are 1.34s and 3.70s, respectively. Note that the for-loops in Algorithm 2 (Line 3) can be paralleled to further accelerate the computation.

6 DISCUSSION

In summary, we proposed a method for aligning a 3D deformable shape model with a sparse representation to 2D correspondences by solving a convex program, which guarantees global optimality. Intuitively, we adopted an augmented 3D shape space to achieve a linear representation of both shape and viewpoint variability and proposed to use the spectral-norm regularization to penalize invalid cases caused by the augmentation. We also extended our model to handle outliers in 2D correspondences. We empirically demonstrated the exact recovery property of the relaxed model as well as the advantage of convex optimization compared to alternative nonconvex methods especially when the shape variability was large and there were outliers.

We demonstrated the applicability of the proposed method for reconstructing human poses and car models. The point correspondences across 3D training shapes are provided in the used datasets. With the increasing availability of techniques for 3D shape matching, e.g. [56], [57], and databases with rich annotations, e.g. PASCAL3D+ [58] and 3D ShapeNet [59], we expect to see more applications of the proposed framework to reconstruct a variety of object categories in future.

Fig. 10. Qualitative results on the FG3DCar dataset with detected 2D landmarks. In each example, the detected landmarks superposed on the original image, 2D fitted model and 3D reconstruction are shown. The 2D landmarks are located by learned detectors based on HOG and SVM. The green and red dots correspond to the inliers and outliers, respectively. The outlier is defined as any detected landmark at least 30-pixel away from the manual annotation. Note that the inlier/outlier classification is only for the purpose of illustration and not for model fitting, i.e., the models are fitted with all landmarks despite of the correctness of detection. The last row shows two failed examples.

APPENDIX A ALGORITHM TO SOLVE THE ROBUST MODEL

The algorithm to solve (29) is presented. The problem is rewritten as follows by introducing an auxiliary variable Z:

The augmented Lagrangian of (33) is

The following steps are iterated until convergence:

The whole procedure is very similar to Algorithm 2. The differences are the steps to update E and T . For (35) and (36), they can be computed similarly as the steps in Algorithm 2:

For (37), the problem is a proximal problem associated with the -norm and can be solved by

where signthat refers to the elementwise soft-thresholding operator. For (38), the solution is simply given by

Note that there is no theoretical guarantee for the convergence of ADMM solving a multi-block problem such as (33) [60], though it always converges in our experiments.

APPENDIX B ALGORITHM TO SOLVE DICTIONARY LEARNING

The algorithm to solve the dictionary learning problem in (30) is presented. The cost function can be rewritten as:

where is the concatenation of s. Then, it is minimized by projected gradient descent and the algorithm is summarized in Algorithm 5.

1 initialize , step sizes and ;

2 while not converged do

3 while not converged do

4 ; 5 for i = 1 to k do

6 for j = 1 to n do

7 if then 8 ; 9

11 end

12 end /* update dictionary */

13 ˜( ˜B, C) ;

14 for i = 1 to k do

15 if then 16 ; 17

18 end

19 end

ACKNOWLEDGMENTS

The authors are grateful for support through the following grants: NSF-DGE-0966142, NSF-IIS-1317788, NSF-IIP-1439681, NSF-IIS-1426840, ARL MASTCTA W911NF-08-2-0004, ARL RCTA W911NF-10-2-0016, ONR N000141310778 The authors gratefully acknowledge Dr Xiaoyan Hu from Beijing Normal University for helpful discussions and processing of motion capture data.

REFERENCES

[1] Y. Xiang and S. Savarese, “Estimating the aspect layout of object categories,” in International Conference on Computer Vision and Pattern Recognition, 2012. 1

[2] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, “Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014. 1

[3] S. Fidler, S. Dickinson, and R. Urtasun, “3d object detection and viewpoint estimation with a deformable 3d cuboid model,” in Advances in Neural Information Processing Systems, 2012. 1

[4] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer, “A joint model for 2d and 3d pose estimation from a single image,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013. 1

[5] M. Hejrati and D. Ramanan, “Analyzing 3d objects in cluttered images,” in Advances in Neural Information Processing Systems, 2012. 1, 2

[6] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler, “Detailed 3d representations for object recognition and modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2608–2623, 2013. 1, 2

[7] T. Cootes, C. Taylor, D. Cooper, and J. Graham, “Active shape models – their training and application,” Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995. 1, 3

[8] V. Ramakrishna, T. Kanade, and Y. Sheikh, “Reconstructing 3d human pose from 2d image landmarks,” in European conference on Computer Vision, 2012. 1, 2, 3, 7, 8, 9, 11

[9] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao, “Robust estimation of 3d human poses from a single image,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014. 1, 2, 7

[10] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis, “3D shape esti- mation from 2D landmarks: A convex relaxation approach,” IEEE International Conference on Computer Vision and Pattern Recognition, 2015. 2

[11] V. Blanz and T. Vetter, “Face recognition based on fitting a 3D mor- phable model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, 2003. 2

[12] L. Gu and T. Kanade, “3D alignment of face in a single image,” in IEEE Conference on Computer Vision and Pattern Recognition, 2006. 2

[13] C. Cao, Y. Weng, S. Lin, and K. Zhou, “3d shape regression for real-time facial animation,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, p. 41, 2013. 2

[14] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008. 2

[15] Y.-L. Lin, V. I. Morariu, W. Hsu, and L. S. Davis, “Jointly optimizing 3d model fitting and fine-grained classification,” in European Conference on Computer Vision, 2014. 2, 9, 10

[16] X. Fan, K. Zheng, Y. Zhou, and S. Wang, “Pose locality constrained representation for 3d human pose reconstruction,” in European Conference on Computer Vision, 2014. 2, 7, 8

[17] F. Zhou and F. De la Torre, “Spatio-temporal matching for human detection in video,” in European Conference on Computer Vision, 2014. 2, 8

[18] I. Akhter and M. J. Black, “Pose-conditioned joint angle limits for 3D human pose reconstruction,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015. 2, 3

[19] T. J. Cashman and A. W. Fitzgibbon, “What shape are dolphins? building 3d morphable models from 2d images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 232– 244, 2013. 2

[20] S. Vicente, J. Carreira, L. Agapito, and J. Batista, “Reconstructing pascal voc,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014. 2

[21] J. Carreira, A. Kar, S. Tulsiani, and J. Malik, “Virtual view networks for object reconstruction,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015. 2

[22] A. Kar, S. Tulsiani, C. Jo˜ao, and J. Malik, “Category-specific object reconstruction from a single image,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015. 2

[23] H. Su, Q. Huang, N. J. Mitra, Y. Li, and L. Guibas, “Estimating image depth using shape collections,” ACM Transactions on Graphics, vol. 33, no. 4, p. 37, 2014. 2

[24] Q. Huang, H. Wang, and V. Koltun, “Single-view reconstruction via joint analysis of image and shape collections,” ACM Transactions on Graphics, vol. 34, no. 4, p. 87, 2015. 2

[25] C. J. Taylor, “Reconstruction of articulated objects from point cor- respondences in a single uncalibrated image,” in IEEE Conference on Computer Vision and Pattern Recognition, 2000. 2

[26] P. Guan, A. Weiss, A. O. B˘alan, and M. J. Black, “Estimating human shape and pose from a single image,” in International Conference on Computer Vision, 2009. 2

[27] S. Leonardos, X. Zhou, and K. Daniilidis, “Articulated motion estimation from a monocular image sequence using spherical tangent bundles,” in IEEE International Conference on Robotics and Automation, 2016. 2

[28] L. Sigal and M. J. Black, “Predicting 3d people from 2d pictures,” in Articulated Motion and Deformable Objects. Springer, 2006, pp. 185–195. 2

[29] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3d pose esti- mation and tracking by detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010. 2

[30] A. Elgammal and C.-S. Lee, “Inferring 3d body pose from silhouettes using activity manifold learning,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2004. 2

[31] A. Agarwal and B. Triggs, “Recovering 3d human pose from monocular images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 44–58, 2006. 2

[32] C. Bregler, A. Hertzmann, and H. Biermann, “Recovering non-rigid 3d shape from image streams,” in IEEE Conference on Computer Vision and Pattern Recognition, 2000. 3, 4

[33] J. Xiao, J. Chai, and T. Kanade, “A closed-form solution to non-rigid shape and motion recovery,” International Journal of Computer Vision, vol. 67, no. 2, pp. 233–246, 2006. 3

[34] M. Paladini, A. Del Bue, J. Xavier, L. Agapito, M. Stoˇsi´c, and M. Dodig, “Optimal metric projections for deformable and articulated structure-from-motion,” International Journal of Computer Vision, vol. 96, no. 2, pp. 252–276, 2012. 3, 4, 6

[35] A. Del Bue, J. Xavier, L. Agapito, and M. Paladini, “Bilinear modeling via augmented lagrange multipliers (balm),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1496– 1508, 2012. 3, 6

[36] A. Agudo, L. Agapito, B. Calvo, and J. Montiel, “Good vibrations: A modal analysis approach for sequential non-rigid structure from motion,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014. 3

[37] S. Zhang, Y. Zhan, M. Dewan, J. Huang, D. Metaxas, and X. Zhou, “Sparse shape composition: A new framework for shape prior modeling,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011. 3

[38] S. Zhu, L. Zhang, and B. M. Smith, “Model evolution: an incremental approach to non-rigid structure from motion,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010. 3

[39] Y. Zhu, D. Huang, F. De la Torre Frade, and S. Lucey, “Complex non-rigid motion 3d reconstruction by union of subspaces,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014. 3

[40] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Foundations and Trends RMachine Learning, vol. 4, no. 1, pp. 1–106, 2012. 3

[41] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998. 3

[42] A. Blake and M. Isard, Active contours. Springer, 2000. 4

[43] J. Xiao, S. Baker, I. Matthews, and T. Kanade, “Real-time combined 2d+ 3d active appearance models,” in IEEE Conference on Computer Vision and Pattern Recognition, 2004. 4

[44] M. Journ´ee, Y. Nesterov, P. Richt´arik, and R. Sepulchre, “Generalized power method for sparse principal component analysis,” The Journal of Machine Learning Research, vol. 11, pp. 517–553, 2010. 4

[45] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805–849, 2012. 4

[46] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2013. 5

[47] S. Boyd, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010. 5

[48] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010. 6

[49] E. J. Cand`es and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21– 30, 2008. 7

[50] E. J. Cand`es and T. Tao, “The power of convex relaxation: Near- optimal matrix completion,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, 2010. 7

[51] D. Donoho and J. Tanner, “Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 367, no. 1906, pp. 4273–4293, 2009. 7

[52] “Mocap: Carnegie mellon university motion capture database. http://mocap.cs.cmu.edu/.” 7

[53] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, “Manopt, a matlab toolbox for optimization on manifolds,” Journal of Machine Learning Research, vol. 15, pp. 1455–1459, 2014. 8

[54] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011. 9, 10

[55] D. Ramanan, “Learning to parse images of articulated bodies,” in Advances in neural information processing systems, 2006. 9

[56] Q.-X. Huang and L. Guibas, “Consistent shape maps via semidefinite programming,” Computer Graphics Forum, vol. 32, no. 5, pp. 177–186, 2013. 11

[57] Q. Huang, F. Wang, and L. Guibas, “Functional map networks for analyzing and exploring large shape collections,” ACM Transactions on Graphics, vol. 33, no. 4, p. 36, 2014. 11

[58] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark for 3d object detection in the wild,” in IEEE Winter Conference on Applications of Computer Vision, 2014. 11

[59] “http://shapenet.cs.stanford.edu/.” 11

[60] C. Chen, B. He, Y. Ye, and X. Yuan, “The direct extension of admm for multi-block convex minimization problems is not necessarily convergent,” Mathematical Programming, vol. 155, no. 1-2, pp. 57– 79, 2016. 13

designed for accessibility and to further open science