HybridPose: 6D Object Pose Estimation under Hybrid Representations

2020·Arxiv

Abstract

We introduce HybridPose, a novel 6D object pose estimation approach. HybridPose utilizes a hybrid intermediate representation to express different geometric information in the input image, including keypoints, edge vectors, and symmetry correspondences. Compared to a unitary representation, our hybrid representation allows pose regression to exploit more and diverse features when one type of predicted representation is inaccurate (e.g., because of occlusion). Different intermediate representations used by HybridPose can all be predicted by the same simple neural network, and outliers in predicted intermediate representations are filtered by a robust regression module. Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time and accuracy. For example, on the Occlusion Linemod [3] dataset, our method achieves a prediction speed of 30 fps with a mean ADD(-S) accuracy of 47.5%, representing state-of-the-art performance. The implementation of HybridPose is available at https://github.com/chensong1995/HybridPose.

1. Introduction

Estimating the 6D pose of an object from an RGB image is a fundamental problem in 3D vision and has diverse applications in object recognition and robot-object interaction. Advances in deep learning have led to significant breakthroughs in this problem. While early works typically formulate pose estimation as end-to-end pose classification [39] or pose regression [16, 42], recent pose estimation methods usually leverage keypoints as an intermediate representation [38, 34], and align predicted 2D keypoints with ground-truth 3D keypoints. In addition to ground-truth pose labels, these methods incorporate keypoints as an intermediate supervision, facilitating smooth model training. Keypoint-based methods are built upon two assumptions:

Figure 1. HybridPose predicts keypoints, edge vectors, and symmetry correspondences. In (a), we show the input RGB image, in which the object of interest (driller) is partially occluded. In (b), red markers denote predicted 2D keypoints. In (c), edge vectors are defined by a fully-connected graph among all keypoints. In (d), symmetry correspondences connect each 2D pixel on the object to its symmetric counterpart. For illustrative purposes, we only draw symmetry correspondences of 50 random samples from 5755 detected object pixels in this example. The predicted pose (f) is obtained by jointly aligning all predictions with the 3D template, which involves solving a non-linear optimization problem.

(1) a machine learning model can accurately predict 2D keypoint locations; and (2) these predictions provide sufficient constraints to regress the underlying 6D pose. Both assumptions easily break in many real-world settings. Due to object occlusions and representational limitations of the prediction network, it is often impossible to accurately predict 2D keypoint coordinates from an RGB image alone.

In this paper, we introduce HybridPose, a novel 6D pose estimation approach that leverages multiple intermediate representations to express the geometric information in the input image. In addition to keypoints, HybridPose integrates a prediction network that outputs edge vectors between adjacent keypoints. As most objects possess a (partial) reflection symmetry, HybridPose also utilizes predicted dense pixel-wise correspondences that reflect the underlying symmetric relations between pixels. Compared to a unitary representation, this hybrid representation enjoys a multitude of advantages. First, HybridPose integrates more signals in the input image: edge vectors encode spatial relations among object parts, and symmetry correspondences incorporate interior details. Second, HybridPose offers more constraints than using keypoints alone for pose regression, enabling accurate pose prediction even if a significant fraction of predicted elements are outliers (e.g., because of occlusion). Finally, it can be shown that symmetry correspondences stabilize the rotation component of pose prediction, especially along the normal direction of the reflection plane (details are provided in the supp. material).

Given the intermediate representation predicted by the first module, the second module of HybridPose performs pose regression. In particular, HybridPose employs trainable robust norms to prune outliers in predicted intermediate representation. We show how to combine pose initialization and pose refinement to maximize the quality of the resulting object pose. We also show how to train HybridPose effectively using a training set for the pose prediction module, and a validation set for the pose regression module.

We evaluate HybridPose on two popular benchmark datasets, Linemod [12] and Occlusion Linemod [3]. In terms of accuracy (under the ADD(-S) metric), HybridPose improves over state-of-the-art methods that rely on keypoints alone. On Occlusion Linemod [3], HybridPose achieves an accuracy of 47.5%, which beats DPOD [44], the current state-of-the-art method on this benchmark dataset.

Despite the gain in accuracy, our approach is efficient and runs at 30 frames per second on a commodity workstation. Compared to approaches that utilize sophisticated network architectures to predict one single intermediate representation (such as Pix2Pose [30]), HybridPose achieves better performance by using a relatively simple network to predict hybrid representations.

2. Related Works

Intermediate representation for pose. To express the geometric information in an RGB image, a prevalent intermediate representation is keypoints, which achieves state-of-the-art performance [34, 32, 36]. The corresponding pose estimation pipeline combines keypoint prediction and pose regression initialized by the PnP algorithm [18]. Keypoint predictions are usually generated by a neural network, and previous works use different types of tensor descriptors to express 2D keypoint coordinates. A common approach represents keypoints as peaks of heatmaps [28, 48], which becomes sub-optimal when keypoints are occluded, as the input image does not provide explicit visual cues for their locations. Alternative keypoint representations include vector-fields [34] and patches [14]. These representations allow better keypoint predictions under occlusion, and eventually lead to improvement in pose estimation accuracy. However, keypoints alone are a sparse representation of the object pose, whose potential in improving estimation accuracy is limited.

Besides keypoints, another common intermediate representation is the coordinate of every image pixel in the 3D physical world, which provides dense 2D-3D correspondences for pose alignment, and is robust under occlusion [3, 4, 30, 20]. However, regressing dense object coordinates is much more costly than keypoint prediction. They are also less accurate than keypoints due to the lack of corresponding visual cues. In addition to keypoints and pixel-wise 2D-3D correspondences, depth is another alternative intermediate representation in visual odometry settings, which can be estimated together with pose in an unsupervised manner [47]. In practice, the accuracy of depth estimation is limited by the representational power of neural networks.

Unlike previous approaches, HybridPose combines multiple intermediate representations, and exhibits collaborative strength for pose estimation.

Multi-modal input. To address the challenges for pose estimation from a single RGB image, several works have considered inputs from multiple sensors. A popular approach is to leverage information from both RGB and depth images [47, 40, 42]. In the presence of depth information, pose regression can be reformulated as the 3D point alignment problem, which is then solved by the ICP algorithm [42]. Although HybridPose utilizes multiple intermediate representations, all intermediate representations are predicted from an RGB image alone. HybridPose handles situations in which depth information is absent.

Edge features. Edges are known to capture important image features such as object contours [2], salient edges [23], and straight line segments [45]. Unlike these low-level image features, HybridPose leverages semantic edge vectors defined between adjacent keypoints. This representation, which captures correlations between keypoints and reveals the underlying structure of the object, is concise and easy to predict. Such edge vectors offer more constraints than keypoints alone for pose regression and have clear advantages under occlusion. Our approach is similar to [5], which predicts directions between adjacent keypoints to link keypoints into a human skeleton. However, we predict both the direction and the magnitude of edge vectors, and use these vectors to estimate object poses.

Symmetry detection from images. Symmetry detection has received significant attention in computer vision. We refer readers to [22, 27] for general surveys, and [1, 41] for recent advances. Traditional applications of symmetry detection include face recognition [31], depth estimation [21], and 3D reconstruction [13, 43].
In the context of object pose estimation, people have studied symmetry from the perspective that it introduces ambiguities for pose estimation (c.f. [25, 36, 42]), since symmetric objects with different poses can have the same appearance in the image. Several works [36, 42, 6, 25, 30] have explored how to address such ambiguities, e.g., by designing loss functions that are invariant under symmetric transformations.

Figure 2. Approach overview. HybridPose consists of intermediate representation prediction networks and a pose regression module. The prediction networks take an image as input, and output predicted keypoints, edge vectors, and symmetry correspondences. The pose regression module consists of an initialization sub-module and a refinement sub-module. The initialization sub-module solves a linear system with predicted intermediate representations to obtain an initial pose. The refinement sub-module utilizes GM robust norm and optimizes (9) to obtain the final pose prediction.

Robust regression. Pose estimation via intermediate representation is sensitive to outliers in predictions, which are introduced by occlusion and cluttered backgrounds [37, 32, 40]. To mitigate pose error, several works assign different weights to different predicted elements in the 2D-3D alignment stage [34, 32]. In contrast, our approach additionally leverages robust norms to automatically filter outliers in the predicted elements.

Besides the reweighting strategy, some recent works propose to use deep learning-based refiners to boost the pose estimation performance [19, 26, 44]. [44, 19] use point matching loss and achieve high accuracy. [26] predicts pose updates using contour information. Unlike these works, our approach considers the critical points and the loss surface of the robust objective function, and does not involve a fixed pre-determined iteration count used in recurrent network based approaches.

3. Approach

The input to HybridPose is an image I containing an object in a known class, taken by a pinhole camera with known intrinsic parameters. Assuming that the class of objects has a canonical coordinate system $\Sigma$ (i.e., the 3D point cloud), HybridPose outputs the 6D camera pose $(R_I \in SO(3), t_I \in \mathbb{R}^3)$ of the image object under $\Sigma$, where $R_I$ is the rotation and $t_I$ is the translation component.

3.1. Approach Overview

As illustrated in Figure 2, HybridPose consists of a prediction module and a pose regression module.

Prediction module (Section 3.2). HybridPose utilizes three prediction networks $f^K_\theta$, $f^E_\phi$, and $f^S_\gamma$ to estimate a set of keypoints $K = \{p_k\}$, a set of edges between keypoints $E = \{v_e\}$, and a set of symmetry correspondences between image pixels $S = \{(q_{s,1}, q_{s,2})\}$. $K$, $E$, and $S$ are all expressed in 2D. $\theta$, $\phi$, and $\gamma$ are trainable parameters.

The keypoint network $f^K_\theta$ employs an off-the-shelf prediction network [34]. The other two prediction networks, $f^E_\phi$ and $f^S_\gamma$, are introduced to stabilize pose regression when keypoint predictions are inaccurate. Specifically, $f^E_\phi$ predicts edge vectors along a pre-defined graph of keypoints, which stabilizes pose regression when keypoints are cluttered in the input image. $f^S_\gamma$ predicts symmetry correspondences that reflect the underlying (partial) reflection symmetry. A key advantage of this symmetry representation is that the number of symmetry correspondences is large: every image pixel on the object has a symmetry correspondence. As a result, even with a large outlier ratio, symmetry correspondences still provide sufficient constraints for estimating the plane of reflection symmetry for regularizing the underlying pose. Moreover, symmetry correspondences incorporate more features within the interior of the underlying object than keypoints and edge vectors.

Pose regression module (Section 3.3). The second module of HybridPose optimizes the object pose to fit the output of the three prediction networks. This module combines a trainable initialization sub-module and a trainable refinement sub-module. In particular, the initialization sub-module performs SVD to solve for an initial pose in the global affine pose space. The refinement sub-module utilizes robust norms to filter out outliers in the predicted elements for accurate object pose estimation.

Training HybridPose (Section 3.4). We train HybridPose by splitting the dataset into a training set and a validation set. We use the training set to learn the prediction module, and the validation set to learn the hyper-parameters of the pose regression module. We have tried training HybridPose end-to-end using one training set. However, the difference between the prediction distributions on the training set and testing set leads to sub-optimal generalization performance.

3.2. Hybrid Representation

This section describes three intermediate representations used in HybridPose. Keypoints. The first intermediate representation consists of keypoints, which have been widely used for pose estimation. Given the input image I, we train a neural network to predict 2D coordinates of a pre-defined set of |K| keypoints. In our experiments, HybridPose incorporates an off-the-shelf architecture called PVNet [34], which is the state-of-the-art keypoint-based pose estimator that employs a voting scheme to predict both visible and invisible keypoints.

Besides outliers in predicted keypoints, another limitation of keypoint-based techniques is that when the difference (direction and distance) between adjacent keypoints characterizes important information of the object pose, inexact keypoint predictions incur large pose error.

Edges. The second intermediate representation, which consists of edge vectors along a pre-defined graph, explicitly models the displacement between every pair of keypoints. As illustrated in Figure 2, HybridPose utilizes a simple network $f^E_\phi$ to predict edge vectors in the 2D image plane, where |E| denotes the number of edges in the pre-defined graph. In our experiments, E is a fully-connected graph, i.e., $|E| = |K|(|K|-1)/2$.

Symmetry correspondences. The third intermediate representation consists of predicted pixel-wise symmetry correspondences that reflect the underlying reflection symmetry. In our experiments, HybridPose extends the network architecture of FlowNet 2.0 [15] that combines a dense pixel-wise flow and the semantic mask predicted by PVNet. The resulting symmetry correspondences are given by predicted pixel-wise flow within the mask region. Compared to the first two representations, the number of symmetry correspondences is significantly larger, which provides rich constraints even for occluded objects. However, symmetry correspondences only constrain two degrees of freedom in the rotation component of the object pose (c.f. [24]). It is therefore necessary to combine symmetry correspondences with other intermediate representations.

A 3D model may possess multiple reflection symmetry planes. For these models, we train HybridPose to predict symmetry correspondences with respect to the most salient reflection symmetry plane, i.e., one with the largest number of symmetry correspondences on the original 3D model.

Summary of network design. In our experiments, $f^K_\theta$, $f^E_\phi$, and $f^S_\gamma$ are all based on ResNet [11], and the implementation details are discussed in Section 4.1. Trainable parameters are shared across all layers except the last convolutional layer. Therefore, the overhead of introducing the edge prediction network $f^E_\phi$ and the symmetry prediction network $f^S_\gamma$ is insignificant.

3.3. Pose Regression

The second module of HybridPose takes predicted intermediate representations {K, E, S} as input and outputs a 6D object pose for the input image I. Similar to state-of-the-art pose regression approaches [35], HybridPose combines an initialization sub-module and a refinement sub-module. Both sub-modules leverage all predicted elements. The refinement sub-module additionally leverages a robust function to model outliers in the predicted elements.

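In the following, we denote 3D keypoint coordinates in the canonical coordinate system as $\bar{p}_k$, $1 \le k \le |K|$. To make notations uncluttered, we denote the output of the first module, i.e., predicted keypoints, edge vectors, and symmetry correspondences, as $p_k \in \mathbb{R}^2$ ($1 \le k \le |K|$), $v_e \in \mathbb{R}^2$ ($1 \le e \le |E|$), and $(q_{s,1}, q_{s,2})$ with $q_{s,1}, q_{s,2} \in \mathbb{R}^2$ ($1 \le s \le |S|$), respectively. Our formulation also uses the homogeneous coordinates $\hat{p}_k$, $\hat{v}_e$, $\hat{q}_{s,1}$, and $\hat{q}_{s,2}$ of $p_k$, $v_e$, $q_{s,1}$, and $q_{s,2}$, respectively. The homogeneous coordinates are normalized by the camera intrinsic matrix.

Initialization sub-module. This sub-module leverages constraints between $(R, t)$ and predicted elements and solves for $(R, t)$ in the affine space, which are then projected to SE(3) in an alternating optimization manner. To this end, we introduce the following difference vectors for each type of predicted elements:

$$\bar{r}^K_{R,t}(p_k) := \hat{p}_k \times (R\,\bar{p}_k + t), \tag{1}$$

$$\bar{r}^E_{R,t}(v_e, p_{e_s}) := \hat{v}_e \times (R\,\bar{p}_{e_t} + t) + \hat{p}_{e_s} \times (R\,\bar{v}_e), \tag{2}$$

$$r^S_{R,t}(q_{s,1}, q_{s,2}) := (\hat{q}_{s,1} \times \hat{q}_{s,2})^T R\,\bar{n}_r, \tag{3}$$

where $e_s$ and $e_t$ are end vertices of edge $e$, $\bar{v}_e = \bar{p}_{e_t} - \bar{p}_{e_s}$, and $\bar{n}_r$ is the normal of the reflection symmetry plane in the canonical system.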
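The role of the symmetry normal can be checked numerically. The sketch below assumes the symmetry constraint takes the form (q1_h x q2_h)^T R n = 0, where q1_h and q2_h are normalized homogeneous pixels of a mirrored point pair and n is the plane normal in the canonical frame. This follows because the two camera-frame points differ by a multiple of R n, and their cross product is orthogonal to both. The plane, pose, and point set are hypothetical test values, not data from the paper.

```python
import numpy as np

def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula for a rotation about a (normalized) axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def symmetry_residual(q1_h, q2_h, R, n_bar):
    """Scalar residual (q1_h x q2_h)^T R n_bar for one correspondence."""
    return float(np.cross(q1_h, q2_h) @ (R @ n_bar))

def to_normalized_homogeneous(P, R, t):
    """Transform canonical points to the camera frame and divide by depth."""
    X = P @ R.T + t
    return X / X[:, 2:3]

rng = np.random.default_rng(0)
n_bar = np.array([1.0, 0.0, 0.0])            # hypothetical symmetry plane x = 0
P1 = rng.uniform(-0.1, 0.1, size=(20, 3))    # points on the object
P2 = P1 * np.array([-1.0, 1.0, 1.0])         # their mirror images

R_true = rotation_from_axis_angle([0.3, -0.5, 0.8], 0.7)
t_true = np.array([0.05, -0.02, 1.0])
q1 = to_normalized_homogeneous(P1, R_true, t_true)
q2 = to_normalized_homogeneous(P2, R_true, t_true)

# Residuals vanish at the true rotation and are non-zero for a perturbed one.
res_true = np.array([symmetry_residual(a, b, R_true, n_bar) for a, b in zip(q1, q2)])
R_wrong = rotation_from_axis_angle([0.0, 1.0, 0.0], 0.3) @ R_true
res_wrong = np.array([symmetry_residual(a, b, R_wrong, n_bar) for a, b in zip(q1, q2)])
```

Note that the residual does not involve the translation at all, and it is unchanged by any rotation that leaves the direction R n fixed. This is one way to see why symmetry correspondences alone cannot determine the full pose.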

HybridPose modifies the framework of EPnP [18] to generate the initial poses. By combining these three constraints from predicted elements, we generate a linear system of the form Ax = 0, where A is a matrix of dimension $(3|K| + 3|E| + |S|) \times 12$, and $x = [r_1^T, r_2^T, r_3^T, t^T]^T \in \mathbb{R}^{12}$ is a vector that contains rotation and translation parameters in affine space. To model the relative importance among keypoints, edge vectors, and symmetry correspondences, we rescale (2) and (3) by hyper-parameters $\alpha_E$ and $\alpha_S$, respectively, to generate A.

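Following EPnP [18], we compute x as

$$x = \sum_{i=1}^{N} \gamma_i v_i, \tag{4}$$

where $v_i$ is the right singular vector of A associated with the $i$-th smallest singular value. Ideally, when predicted elements are noise-free, N = 1 with $x = v_1$ is an optimal solution. However, this strategy performs poorly given noisy predictions. As in EPnP [18], we choose N = 4. To compute the optimal x, we optimize latent variables $\gamma_i$ and the rotation matrix R in an alternating optimization procedure with the following objective function:

$$\min_{R \in \mathbb{R}^{3 \times 3},\, \gamma_i} \Big\| \sum_{i=1}^{4} \gamma_i R_i - R \Big\|_F^2 \quad \text{s.t. } R^T R = I, \tag{5}$$

where $R_i \in \mathbb{R}^{3 \times 3}$ is reshaped from the first 9 elements of $v_i$. After obtaining optimal $\gamma_i$, we project the resulting affine transformation $\sum_{i=1}^{4} \gamma_i R_i$ into a rigid transformation. Due to space constraints, we defer details to the supp. material.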
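The linear initialization can be illustrated on the simplest case. The sketch below is an assumption-laden simplification, not the authors' implementation: it uses only keypoint constraints of the form p_hat x (R p_bar + t) = 0, takes N = 1 (the noise-free case mentioned above), and replaces the alternating optimization with a single SVD projection onto the nearest rotation. Function names and the synthetic pose are hypothetical.

```python
import numpy as np

def skew(v):
    """Matrix form of the cross product: skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def keypoint_rows(p_hat, p_bar):
    """3 x 12 block of A encoding p_hat x (R p_bar + t) = 0 for x = [r1; r2; r3; t]."""
    # R p_bar + t = M x, where r1, r2, r3 are the rows of R.
    M = np.hstack([np.kron(np.eye(3), p_bar[None, :]), np.eye(3)])
    return skew(p_hat) @ M

def initial_pose_from_keypoints(p_hats, p_bars):
    A = np.vstack([keypoint_rows(ph, pb) for ph, pb in zip(p_hats, p_bars)])
    _, _, Vt = np.linalg.svd(A)
    x = Vt[-1]                          # singular vector of the smallest singular value
    R_aff, t_aff = x[:9].reshape(3, 3), x[9:]
    if np.linalg.det(R_aff) < 0:        # the null vector is only defined up to sign
        R_aff, t_aff = -R_aff, -t_aff
    U, s, Wt = np.linalg.svd(R_aff)
    R = U @ Wt                          # nearest rotation to the affine estimate
    t = t_aff / s.mean()                # undo the unknown scale of the null vector
    return R, t

# Synthetic, noise-free check with 8 keypoints.
rng = np.random.default_rng(1)
p_bars = rng.uniform(-0.1, 0.1, size=(8, 3))
a, b = 0.4, -0.3
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(b), -np.sin(b)], [0.0, np.sin(b), np.cos(b)]])
R_true, t_true = Rz @ Rx, np.array([0.02, -0.01, 0.8])
X = p_bars @ R_true.T + t_true
p_hats = X / X[:, 2:3]                  # normalized homogeneous image coordinates

R_est, t_est = initial_pose_from_keypoints(p_hats, p_bars)
```

Edge and symmetry constraints would add further rows to A in the same way, each scaled by its hyper-parameter before stacking.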

Refinement sub-module. Although (5) combines hybrid intermediate representations and admits good initialization, it does not directly model outliers in predicted elements. Another limitation comes from (1) and (2), which do not minimize the projection errors (i.e., with respect to keypoints and edges), which are known to be effective in keypoint-based pose estimation (c.f. [35]).

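Given an initial object pose $(R^{init}, t^{init})$, the refinement sub-module performs local optimization to refine the object pose. We introduce two difference vectors that involve projection errors:

$$r^K_{R,t}(p_k) := P_{R,t}(\bar{p}_k) - p_k, \tag{6}$$

$$r^E_{R,t}(v_e) := P_{R,t}(\bar{p}_{e_t}) - P_{R,t}(\bar{p}_{e_s}) - v_e, \tag{7}$$

where $P_{R,t}: \mathbb{R}^3 \to \mathbb{R}^2$ is the projection operator induced from the current pose (R, t).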
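A minimal sketch of the two projection residuals follows. It assumes the keypoint residual is the projected 3D keypoint minus the predicted 2D keypoint, and the edge residual is the difference of the two projected end vertices minus the predicted edge vector. It works in normalized image coordinates, so no intrinsic matrix appears, and all names and values are illustrative.

```python
import numpy as np
from itertools import combinations

def project(R, t, P_bar):
    """Pinhole projection R^3 -> R^2 in normalized image coordinates."""
    X = P_bar @ R.T + t
    return X[:, :2] / X[:, 2:3]

def keypoint_residuals(R, t, P_bar, p_pred):
    """Projected 3D keypoints minus predicted 2D keypoints, shape (|K|, 2)."""
    return project(R, t, P_bar) - p_pred

def edge_residuals(R, t, P_bar, edges, v_pred):
    """Projected edge (end minus start vertex) minus predicted edge, shape (|E|, 2)."""
    proj = project(R, t, P_bar)
    return (proj[edges[:, 1]] - proj[edges[:, 0]]) - v_pred

# Hypothetical 4-keypoint object and a fully-connected edge graph.
P_bar = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]])
edges = np.array(list(combinations(range(len(P_bar)), 2)))
R_true, t_true = np.eye(3), np.array([0.0, 0.0, 1.0])

# Perfect predictions generated from the true pose.
p_pred = project(R_true, t_true, P_bar)
v_pred = p_pred[edges[:, 1]] - p_pred[edges[:, 0]]
rK = keypoint_residuals(R_true, t_true, P_bar, p_pred)
rE = edge_residuals(R_true, t_true, P_bar, edges, v_pred)

# Residuals under a laterally shifted translation estimate.
t_shift = t_true + np.array([0.05, 0.0, 0.0])
rK_shift = keypoint_residuals(R_true, t_shift, P_bar, p_pred)
rE_shift = edge_residuals(R_true, t_shift, P_bar, edges, v_pred)
```

In this toy case the lateral shift moves every projected keypoint, while the edge residuals are much smaller because edge vectors are differences of projections. The two residual types therefore respond differently to the same pose error.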

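To prune outliers in the predicted elements, we consider a generalized Geman-McClure (or GM) robust function

$$\rho(x, \beta) := \beta_1^2 / (\beta_2^2 + x^2). \tag{8}$$

With this setup, HybridPose solves the following non-linear optimization problem for pose refinement:

$$\min_{R, t} \; \sum_{k=1}^{|K|} \rho(\|r^K_{R,t}(p_k)\|, \beta_K)\, \|r^K_{R,t}(p_k)\|^2_{\Sigma_k} + \frac{|K|}{|E|} \sum_{e=1}^{|E|} \rho(\|r^E_{R,t}(v_e)\|, \beta_E)\, \|r^E_{R,t}(v_e)\|^2_{\Sigma_e} + \frac{|K|}{|S|} \sum_{s=1}^{|S|} \rho(r^S_{R,t}(q_{s,1}, q_{s,2}), \beta_S)\, r^S_{R,t}(q_{s,1}, q_{s,2})^2, \tag{9}$$

where $\beta_K$, $\beta_E$, and $\beta_S$ are separate hyper-parameters for keypoints, edges, and symmetry correspondences. $\Sigma_k$ and $\Sigma_e$ denote the covariance information attached to the keypoint and edge predictions, and $\|x\|_A = (x^T A x)^{1/2}$. When covariances of predictions are unavailable, we simply set $\Sigma_k = \Sigma_e = I_2$. The above optimization problem is solved by the Gauss-Newton method starting from $R^{init}$ and $t^{init}$.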
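The effect of a GM-style robust function on outliers can be seen in a few lines. The sketch assumes the robust weight has the form beta1^2 / (beta2^2 + x^2) and multiplies the squared residual, so that each term equals beta1^2 x^2 / (beta2^2 + x^2) and saturates at beta1^2. The residual values and hyper-parameters are made up for illustration.

```python
import numpy as np

def gm_weight(x, beta1, beta2):
    """Assumed GM-style weight; decays as the residual magnitude x grows."""
    return beta1**2 / (beta2**2 + x**2)

def robust_cost(residual_norms, beta1, beta2):
    """Sum of weighted squared residuals; each term is bounded by beta1**2."""
    r = np.asarray(residual_norms, dtype=float)
    return float(np.sum(gm_weight(r, beta1, beta2) * r**2))

inliers = np.array([0.01, 0.02, 0.015])
with_outlier = np.append(inliers, 50.0)      # one gross outlier

# Plain least squares: the outlier adds its full squared magnitude.
plain_increase = float(np.sum(with_outlier**2) - np.sum(inliers**2))

# Robust cost: the outlier's contribution is capped below beta1**2 = 1.
robust_increase = robust_cost(with_outlier, 1.0, 0.1) - robust_cost(inliers, 1.0, 0.1)
```

Because the outlier's contribution is bounded, it cannot dominate the pose update. Small residuals are left nearly quadratic, scaled by beta1^2 / beta2^2.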

In the supp. material, we provide a stability analysis of (9), and show how the optimal solution of (9) changes with respect to noise in predicted representations. We also show collaborative strength among all three intermediate representations. While keypoints significantly contribute to the accuracy of t, edge vectors and symmetry correspondences can stabilize the regression of R.

3.4. HybridPose Training

This section describes how to train the prediction networks and hyper-parameters of HybridPose using a labeled dataset $T := \{(I, K^{gt}_I, E^{gt}_I, S^{gt}_I, (R^{gt}_I, t^{gt}_I))\}$. With $I$, $K^{gt}_I$, $E^{gt}_I$, $S^{gt}_I$, and $(R^{gt}_I, t^{gt}_I)$, we denote the RGB image, labeled keypoints, edges, symmetry correspondences, and ground-truth object pose, respectively. A popular strategy is to train the entire model end-to-end, e.g., using recurrent networks to model the optimization procedure and introducing loss terms on the object pose output as well as the intermediate representations. However, we found this strategy sub-optimal. The distribution of predicted elements on the training set differs from that on the testing set. Even with careful tuning of the trade-off between supervision on predicted elements and on the final object pose, the pose regression model, which fits the training data, generalizes poorly to the testing data.

Our approach randomly divides the labeled set $T = T_{train} \cup T_{val}$ into a training set and a validation set. $T_{train}$ is used to train the prediction networks, and $T_{val}$ trains the hyper-parameters of the pose regression model. Implementation and training details of the prediction networks are presented in Section 4.1. In the following, we focus on training the hyper-parameters using $T_{val}$.

Initialization sub-module. Let $R^{init}_I$ and $t^{init}_I$ be the output of the initialization sub-module. We obtain the optimal hyper-parameters $\alpha_E$ and $\alpha_S$ by solving the following optimization problem:

$$\min_{\alpha_E, \alpha_S} \sum_{I \in T_{val}} \big( \|R^{init}_I - R^{gt}_I\|_F^2 + \|t^{init}_I - t^{gt}_I\|^2 \big). \tag{10}$$

Since the number of hyper-parameters is rather small, and the pose initialization step does not admit an explicit expression, we use the finite-difference method to compute the numerical gradient, i.e., by fitting the gradient to samples of the hyper-parameters around the current solution. We then apply backtracking line search for optimization.

Refinement sub-module. Let $\beta = \{\beta_K, \beta_E, \beta_S\}$ be the hyper-parameters of this sub-module. For each instance $I \in T_{val}$, denote the objective function in (9) as $f_I(c, \beta)$, where $c \in \mathbb{R}^6$ is a local parameterization of $R_I$ and $t_I$, i.e., $c$ encodes the difference between the current estimated pose and the ground-truth pose in SE(3).

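The refinement module solves an unconstrained optimization problem, whose optimal solution is dictated by its critical points and the loss surface around the critical points. We consider two simple objectives. The first objective forces $\frac{\partial f_I}{\partial c}(0, \beta) \approx 0$, or in other words, the ground-truth is approximately a critical point. The second objective minimizes the condition number $\kappa\big(\frac{\partial^2 f_I}{\partial c^2}(0, \beta)\big) = \lambda_{max} / \lambda_{min}$. This objective regularizes the loss surface around each optimal solution, promoting a large convergence radius for $f_I(c, \beta)$. With this setup, we formulate the following objective function to optimize $\beta$:

$$\min_{\beta} \sum_{I \in T_{val}} \Big( \big\| \tfrac{\partial f_I}{\partial c}(0, \beta) \big\|^2 + \gamma\, \kappa\big( \tfrac{\partial^2 f_I}{\partial c^2}(0, \beta) \big) \Big), \tag{11}$$

where $\gamma$ is a constant hyper-parameter. The same strategy used in (10) is then applied to optimize (11).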
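The numerical recipe shared by both hyper-parameter searches can be sketched generically. The code below shows a central finite-difference gradient, a backtracking (Armijo) line search, and the two quantities used for the refinement objective: the squared gradient norm and the Hessian condition number at c = 0. The two objectives are hypothetical toy functions standing in for the validation losses, since the real ones require running the pose regression module. Taking the condition number as the ratio of largest to smallest Hessian eigenvalue is also an assumption.

```python
import numpy as np

def fd_gradient(f, x, eps=1e-4):
    """Central finite-difference gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def fd_hessian(f, x, eps=1e-3):
    """Central finite-difference Hessian, symmetrized."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps**2)
    return 0.5 * (H + H.T)

def backtracking_descent(f, x0, iters=200, shrink=0.5, c=1e-4):
    """Gradient descent with numerical gradients and Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = fd_gradient(f, x)
        if np.linalg.norm(g) < 1e-8:
            break
        step = 1.0
        while step > 1e-12 and f(x - step * g) > f(x) - c * step * (g @ g):
            step *= shrink
        x = x - step * g
    return x

def critical_point_terms(f_c, dim):
    """Squared gradient norm and Hessian condition number of f_c at c = 0."""
    c0 = np.zeros(dim)
    grad_sq = float(np.sum(fd_gradient(f_c, c0) ** 2))
    eig = np.linalg.eigvalsh(fd_hessian(f_c, c0))
    return grad_sq, float(eig[-1] / eig[0])

def toy_validation_loss(alpha):
    """Hypothetical stand-in for a validation loss over two hyper-parameters."""
    a_e, a_s = alpha
    return (a_e - 0.3) ** 2 + 2.0 * (a_s - 0.1) ** 2 + 0.5 * (a_e - 0.3) * (a_s - 0.1)

def toy_refinement_objective(c):
    """Hypothetical quadratic loss surface around the ground-truth pose (c = 0)."""
    return 0.5 * float(c @ np.diag([1.0, 4.0, 10.0]) @ c)

alpha_opt = backtracking_descent(toy_validation_loss, [1.0, 1.0])
grad_sq, cond = critical_point_terms(toy_refinement_objective, dim=3)
```

A well-conditioned Hessian at the ground truth means the loss surface is close to isotropic there, which is what makes a Gauss-Newton style local solver converge from a wider range of initial poses.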

4. Experimental Evaluation

This section presents an experimental evaluation of the proposed approach. Section 4.1 describes the experimental setup. Section 4.2 quantitatively and qualitatively compares HybridPose with other 6D pose estimation methods. Section 4.3 presents an ablation study to investigate the effectiveness of symmetry correspondences, edge vectors, and the refinement sub-module.

4.1. Experimental Setup

Datasets. We consider two popular benchmark datasets that are widely used in the 6D pose estimation problem, Linemod [12] and Occlusion Linemod [3]. In comparison to Linemod, Occlusion Linemod contains more examples where the objects are under occlusion. Our keypoint annotation strategy follows that of [34], i.e., we choose |K| = 8 keypoints via the farthest point sampling algorithm. Edge vectors are defined as vectors connecting each pair of keypoints. In total, each object has $|E| = |K|(|K|-1)/2 = 28$ edges. We further use the algorithm proposed in [8] to annotate Linemod and Occlusion Linemod with reflection symmetry labels.
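The keypoint and edge annotation can be sketched as follows. This is a minimal farthest point sampling followed by enumeration of all keypoint pairs. The random point cloud is a hypothetical stand-in for the vertices of an object model, and the choice of the first sample (index 0 here) is an assumption, since the text does not specify how sampling is seeded.

```python
import numpy as np
from itertools import combinations

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each maximizing distance to those already chosen."""
    points = np.asarray(points, dtype=float)
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        # Track each point's distance to its nearest chosen keypoint.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def fully_connected_edges(num_keypoints):
    """All unordered keypoint pairs (start, end) with start < end."""
    return np.array(list(combinations(range(num_keypoints), 2)))

rng = np.random.default_rng(0)
model_points = rng.uniform(-1.0, 1.0, size=(2000, 3))   # stand-in for model vertices

kp_idx = farthest_point_sampling(model_points, k=8)
keypoints_3d = model_points[kp_idx]
edges = fully_connected_edges(len(keypoints_3d))
edge_vectors_3d = keypoints_3d[edges[:, 1]] - keypoints_3d[edges[:, 0]]
```

With 8 keypoints the edge list has 28 rows, which matches the count stated above.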

Following the convention described in [4], we select 15% of Linemod examples as the training data, and use the remaining 85% as well as all Occlusion Linemod examples for testing. To avoid overfitting, we use the same synthetic data generation scheme introduced in PVNet [34].

Implementation details. We use ResNet [11] with pretrained weights on ImageNet [7] to build the prediction networks $f^K_\theta$, $f^E_\phi$, and $f^S_\gamma$. The prediction networks take an RGB image I of size (3, H, W) as input, and output a tensor of size (C, H, W), where (H, W) is the image resolution, and C = 1+2|K|+2|E|+2 is the number of channels in the output tensor.

The first channel in the output tensor is a binary segmentation mask M. If M(x, y) = 1, then (x, y) corresponds to a pixel on the object of interest in the input image I. The segmentation mask is trained using the cross-entropy loss.

The 2|K| channels afterwards in the output tensor give x and y components of all |K| keypoints. A voting-based keypoint localization scheme [34] is applied to extract the coordinates of 2D keypoints from this 2|K|-channel tensor and the segmentation mask M.

The next 2|E| channels in the output tensor give the x and y components of all |E| edges, which we denote as Edge. Let $i$ ($0 \le i < |E|$) be the index of an edge. Then

$$\text{Edge}_i = \{(\text{Edge}(2i, x, y), \text{Edge}(2i+1, x, y)) \mid M(x, y) = 1\}$$

is a set of 2-tuples containing pixel-wise predictions of the $i$-th edge vector in Edge. The mean of $\text{Edge}_i$ is extracted as the predicted edge.

The final 2 channels in the output tensor define the x and y components of symmetry correspondences. We denote this 2-channel “map” of symmetry correspondences as Sym. Let (x, y) be a pixel on the object of interest in the input image, i.e., M(x, y) = 1. Assuming $\Delta x = \text{Sym}(0, x, y)$ and $\Delta y = \text{Sym}(1, x, y)$, we consider (x, y) and $(x + \Delta x, y + \Delta y)$ to be symmetric with respect to the reflection symmetry plane.
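The channel layout described in this subsection can be decoded with plain array slicing. The sketch below assumes the channels are ordered as mask, keypoint fields, edge fields, and symmetry field, with consecutive channel pairs holding the x and y components. It treats the mask channel as a probability thresholded at 0.5 and skips the voting-based keypoint localization. The function name and the synthetic tensor are hypothetical.

```python
import numpy as np

def decode_output(out, num_keypoints, num_edges):
    """Split a (C, H, W) output into mask, mean edge vectors, and symmetry pairs."""
    C, H, W = out.shape
    assert C == 1 + 2 * num_keypoints + 2 * num_edges + 2
    mask = out[0] > 0.5                                   # binary segmentation mask M
    edge_start = 1 + 2 * num_keypoints
    sym_start = edge_start + 2 * num_edges

    # Edge i uses channels (2i, 2i + 1); average its predictions over object pixels.
    edge_maps = out[edge_start:sym_start].reshape(num_edges, 2, H, W)
    edges = edge_maps[:, :, mask].mean(axis=2)            # shape (|E|, 2)

    # Each object pixel (x, y) is paired with (x + dx, y + dy).
    sym = out[sym_start:sym_start + 2]
    ys, xs = np.nonzero(mask)
    src = np.stack([xs, ys], axis=1).astype(float)
    dst = src + np.stack([sym[0][mask], sym[1][mask]], axis=1)
    return mask, edges, src, dst

# Synthetic tensor for |K| = 8 keypoints and |E| = 28 edges.
num_k, num_e, H, W = 8, 28, 16, 16
out = np.zeros((1 + 2 * num_k + 2 * num_e + 2, H, W))
out[0, 4:10, 5:12] = 1.0                                  # 6 x 7 object region
edge_start = 1 + 2 * num_k
for i in range(num_e):
    out[edge_start + 2 * i] = float(i)                    # x component of edge i
    out[edge_start + 2 * i + 1] = -float(i)               # y component of edge i
out[-2] = 3.0                                             # symmetry flow dx
out[-1] = 0.0                                             # symmetry flow dy

mask, edges, src, dst = decode_output(out, num_k, num_e)
```

For |K| = 8 and |E| = 28 the tensor has 1 + 16 + 56 + 2 = 75 channels.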

We train all three intermediate representations using the smooth $\ell_1$ loss described in [9]. Network training employs the Adam [17] optimizer for 200 epochs. The learning rate is set to 0.001. Training weights of the segmentation mask, keypoints, edge vectors, and symmetry correspondences are 1.0, 10.0, 0.1, and 0.1, respectively.

The architecture described above achieves good performance in terms of detection accuracy. Nevertheless, it should be emphasized that the framework of HybridPose can incorporate future improvements in keypoint, edge vector, and symmetry correspondence detection techniques. Besides, HybridPose can be extended to handle multiple objects within an image. One approach is to predict instance-level rather than semantic-level segmentation masks by methods such as Mask R-CNN [10]. Intermediate representations are then extracted from each instance, and fed to the pose regression module in Section 3.3.

Figure 3. Pose regression results. HybridPose is able to accurately predict 6D poses from RGB images. HybridPose handles situations where the object has no occlusion (c), light occlusion (b, e, f, h), and severe occlusion (a, d, g). For illustrative purposes, we only draw 50 randomly selected symmetry correspondences in each example.

Evaluation protocols. We use two metrics to evaluate the performance of HybridPose:

1. ADD(-S) [12, 42] first calculates the distance between two point sets transformed by predicted pose and ground-truth pose respectively, and then extracts the mean distance. When the object possesses symmetric pose ambiguity, the mean distance is computed from the closest points between two transformed sets. ADD(-S) accuracy is defined as the percentage of examples whose calculated mean distance is less than 10% of the model diameter.

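2. In the ablation study, we compute and report the angular rotation error (the angle of the relative rotation $(R^{gt}_I)^T R_I$) and the relative translation error $\|t_I - t^{gt}_I\| / d$ between the predicted pose $(R_I, t_I)$ and the ground-truth pose $(R^{gt}_I, t^{gt}_I)$, where d is the object diameter.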
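Both evaluation protocols can be written compactly. The sketch below follows the description above: ADD averages distances between identically indexed model points under the two poses, ADD-S averages distances to the closest transformed point, and accuracy is the fraction of examples below 10% of the model diameter. The rotation error is computed as the angle of the relative rotation, which is one standard reading of "angular rotation error". The test object and poses are hypothetical.

```python
import numpy as np

def add_distance(pts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    pred = pts @ R_pred.T + t_pred
    gt = pts @ R_gt.T + t_gt
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def add_s_distance(pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point."""
    pred = pts @ R_pred.T + t_pred
    gt = pts @ R_gt.T + t_gt
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def add_accuracy(distances, diameter, threshold=0.1):
    """Fraction of examples whose mean distance is below threshold * diameter."""
    return float(np.mean(np.asarray(distances) < threshold * diameter))

def rotation_error_deg(R_pred, R_gt):
    """Angle (degrees) of the relative rotation R_gt^T R_pred."""
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def relative_translation_error(t_pred, t_gt, diameter):
    """Translation error normalized by the object diameter."""
    return float(np.linalg.norm(t_pred - t_gt) / diameter)

# Hypothetical object that is symmetric under a 180-degree rotation about z.
rng = np.random.default_rng(0)
half = rng.uniform(-0.1, 0.1, size=(100, 3))
Rz180 = np.diag([-1.0, -1.0, 1.0])
pts = np.vstack([half, half @ Rz180.T])
I3, t0 = np.eye(3), np.array([0.0, 0.0, 1.0])

add_flip = add_distance(pts, Rz180, t0, I3, t0)      # penalizes the symmetric flip
adds_flip = add_s_distance(pts, Rz180, t0, I3, t0)   # invariant to the flip

th = np.radians(10.0)
R10 = np.array([[np.cos(th), -np.sin(th), 0.0], [np.sin(th), np.cos(th), 0.0], [0.0, 0.0, 1.0]])
rot_err = rotation_error_deg(R10, I3)
trans_err = relative_translation_error(t0 + np.array([0.01, 0.0, 0.0]), t0, diameter=0.2)
acc = add_accuracy([0.005, 0.03, 0.015], diameter=0.2)
```

The flipped pose illustrates why the closest-point variant is used for objects with symmetric pose ambiguity: the two poses are visually indistinguishable, yet plain ADD reports a large error.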

Table 1. Quantitative evaluation: ADD(-S) accuracy on Linemod. Baseline approaches: Tekin et al. [38], BB8 [36], Pix2Pose [30], PVNet [34], CDPN [20], and DPOD [44]. Objects annotated with (*) possess symmetric pose ambiguity.

4.2. Analysis of Results

As shown in Table 1, Table 2, and Figure 3, HybridPose leads to accurate pose estimation. On Linemod and Occlusion Linemod, HybridPose has an average ADD(-S) accuracy of 91.3% and 47.5%, respectively. The result on Linemod outperforms all but one of the state-of-the-art approaches that regress poses from intermediate representations. The result on Occlusion Linemod outperforms all state-of-the-art approaches.

Table 2. Quantitative evaluation: ADD(-S) accuracy on Occlusion Linemod. Baseline approaches: PoseCNN [42], Oberweger et al. [29], Hu et al. [14], PVNet [34], and DPOD [44]. Objects annotated with (*) possess symmetric pose ambiguity.

Baseline comparison on Linemod. HybridPose outperforms PVNet [34], the backbone model we use to predict keypoints. The improvement is consistent across all object classes, which demonstrates a clear advantage of using a hybrid as opposed to a unitary intermediate representation. HybridPose shows competitive results against DPOD [44], winning on six object classes. The advantage of DPOD comes from data augmentation and explicit modeling of dense correspondences between input and projected images, both of which cater to situations without object occlusion. A detailed analysis reveals that the classes of objects on which HybridPose exhibits sub-optimal performance are among the smallest objects in Linemod. This suggests that pixel-based descriptors used in our pipeline are limited by image resolution.

Baseline comparison on Occlusion Linemod. HybridPose outperforms all baselines. In terms of ADD(-S), our approach improves PVNet [34] from 40.8 to 47.5, representing a 16.4% enhancement, which clearly shows the advantage of HybridPose on occluded objects, where predictions of invisible keypoints can be noisy, and visible keypoints may not provide sufficient constraints for pose regression alone. HybridPose also outperforms DPOD, the state-of-the-art model on this dataset.

Running time. On a desktop with a 16-core Intel(R) Xeon(R) E5-2637 CPU and a GeForce GTX 1080 GPU, HybridPose takes 0.6 seconds to predict the intermediate representations and 0.4 seconds to regress the pose. Assuming a batch size of 30, this gives an average processing speed of around 30 fps, enabling real-time analysis.
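Reading the two timings as costs per batch of 30 images (an interpretation; the text does not state this explicitly), the quoted throughput follows from

\[
\frac{30\ \text{images}}{0.6\ \text{s} + 0.4\ \text{s}} = 30\ \text{images/s}.
\]

Under this reading, 30 fps is an amortized throughput; the latency for any single image in the batch is still about one second.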

4.3. Ablation Study

Table 3 summarizes the performance of HybridPose using different predicted intermediate representations on the Linemod dataset.

With keypoints. As a baseline approach, we estimate object poses by only utilizing keypoint information. This gives a mean absolute rotation error of 1.357°, and a mean relative translation error of 0.061.

With keypoints and symmetry. Adding symmetry correspondences to keypoints leads to some performance gain in rotation. On the other hand, the translation error remains almost the same. One explanation is that symmetry correspondences only constrain two degrees of freedom in a total of three rotation parameters, and provide no constraint on translation parameters (see (3)).

Table 3. Quantitative evaluation with different intermediate representations. We report errors using two metrics: the median of absolute angular error in rotation, and the median of relative error in translation with respect to object diameter.

Full model. Adding edge vectors to keypoints and symmetry correspondences leads to a salient performance gain in both rotation and translation estimation. One explanation is that edge vectors provide more constraints on both translation and rotation (see (2)). Edge vectors provide more constraints on translation than keypoints, as they represent the displacement between adjacent keypoints and provide gradient information for regression. Unlike symmetry correspondences, edge vectors constrain all 3 degrees of freedom of the rotation parameters, which further boosts the performance of rotation estimation.

5. Conclusions and Future Work

In this paper, we introduce HybridPose, a 6D pose estimation approach that utilizes keypoints, edge vectors, and symmetry correspondences. Experiments show that HybridPose enjoys real-time prediction and outperforms current state-of-the-art pose estimation approaches in accuracy. HybridPose is robust to occlusion. In the future, we plan to extend HybridPose to include more intermediate representations such as shape primitives, normals, and planar faces. Another possible direction is to enforce consistency across different representations in a similar way to [46] as a self-supervision loss in network training.

6. Acknowledgement

We would like to acknowledge the support of this research from NSF DMS-1700234, a Gift from Snap Research, and a hardware donation from NVIDIA.

References

[1] Ibragim R. Atadjanov and Seungkyu Lee. Reflection symmetry detection via appearance of structure descriptor. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pages 3–18, 2016. 2

[2] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4380–4389, 2015. 2

[3] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pages 536–551. Springer, 2014. 1, 2, 6

[4] Eric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, et al. Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3364–3372, 2016. 2, 6

[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1302–1310, 2017. 2

[6] Enric Corona, Kaustav Kundu, and Sanja Fidler. Pose estimation for objects with rotational symmetry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7215–7222. IEEE, 2018. 3

[7] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. 6

[8] Aleksandrs Ecins, Cornelia Fermüller, and Yiannis Aloimonos. Seeing behind the scene: Using symmetry to reason about objects in cluttered environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7193–7200. IEEE, 2018. 6

[9] Ross Girshick. Fast r-cnn. 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015. 6

[10] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017. 7

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 4, 6

[12] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the 11th Asian Conference on Computer Vision - Volume Part I, ACCV’12, pages 548–562, Berlin, Heidelberg, 2013. Springer-Verlag. 2, 6, 7

[13] Wei Hong, Allen Yang Yang, Kun Huang, and Yi Ma. On symmetry and multiple-view geometry: Structure, pose, and calibration from a single image. International Journal of Computer Vision, 60(3):241–265, 2004. 2

[14] Yinlin Hu, Joachim Hugonot, Pascal Fua, and Mathieu Salzmann. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3385–3394, 2019. 2, 8

[15] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1647–1655, 2017. 4

[16] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015. 1

[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. 6

[18] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o (n) solution to the pnp problem. International journal of computer vision, 81(2):155, 2009. 2, 4, 5

[19] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 683–698, 2018. 3

[20] Zhigang Li, Gu Wang, and Xiangyang Ji. Cdpn: Coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 7678–7687, 2019. 2, 7

[21] Guilin Liu, Chao Yang, Zimo Li, Duygu Ceylan, and Qixing Huang. Symmetry-aware depth estimation using deep neural networks. arXiv preprint arXiv:1604.06079, 2016. 2

[22] Yanxi Liu, Hagit Hel-Or, Craig S. Kaplan, and Luc Van Gool. Computational symmetry in computer vision and computer graphics. Foundations and Trends in Computer Graphics and Vision, 5(1-2):1–195, 2010. 2

[23] Yu Liu and Michael S Lew. Learning relaxed deep supervision for better edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 231–240, 2016. 2

[24] Yi Ma, Stefano Soatto, Jana Kosecka, and S. Shankar Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. Springer-Verlag, 2003. 4

[25] Fabian Manhardt, Diego Martin Arroyo, Christian Rupprecht, Benjamin Busam, Tolga Birdal, Nassir Navab, and Federico Tombari. Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE International Conference on Computer Vision, pages 6841–6850, 2019. 2, 3

[26] Fabian Manhardt, Wadim Kehl, Nassir Navab, and Federico Tombari. Deep model-based 6d pose refinement in rgb. In Proceedings of the European Conference on Computer Vision (ECCV), pages 800–815, 2018. 3

[27] Niloy J. Mitra, Mark Pauly, Michael Wand, and Duygu Ceylan. Symmetry in 3d geometry: Extraction and applications. Comput. Graph. Forum, 32(6):1–23, Sept. 2013. 2

[28] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 483–499, 2016. 2

[29] Markus Oberweger, Mahdi Rad, and Vincent Lepetit. Making deep heatmaps robust to partial occlusions for 3d object pose estimation. Lecture Notes in Computer Science, pages 125–141, 2018. 8

[30] Kiru Park, Timothy Patten, and Markus Vincze. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 7668–7677, 2019. 2, 3, 7

[31] Georgios Passalis, Panagiotis Perakis, Theoharis Theoharis, and Ioannis A Kakadiaris. Using facial symmetry to handle pose variations in real-world 3d face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1938–1951, 2011. 2

[32] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G Derpanis, and Kostas Daniilidis. 6-dof object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2011–2018. IEEE, 2017. 2, 3

[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[34] Sida Peng, Yuan Liu, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Pvnet: Pixel-wise voting network for 6dof pose estimation. CoRR, abs/1812.11788, 2018. 1, 2, 3, 4, 6, 7, 8

[35] Mikael Persson and Klas Nordberg. Lambda twist: An accurate fast robust perspective three point (p3p) solver. In The European Conference on Computer Vision (ECCV), September 2018. 4, 5

[36] Mahdi Rad and Vincent Lepetit. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828–3836, 2017. 2, 3, 7

[37] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 699–715, 2018. 3

[38] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. Real-time seamless single shot 6d object pose prediction. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 292–301, 2018. 1, 7

[39] Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1510–1519, 2015. 1

[40] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3343–3352, 2019. 2, 3

[41] Zhaozhong Wang, Zesheng Tang, and Xiao Zhang. Reflection symmetry detection using locally affine invariant edge correspondence. IEEE Trans. Image Processing, 24(4):1297–1301, 2015. 2

[42] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018, 2018. 1, 2, 3, 7, 8

[43] Tianfan Xue, Jianzhuang Liu, and Xiaoou Tang. Symmetric piecewise planar object reconstruction from a single image. In CVPR 2011, pages 2577–2584. IEEE, 2011. 2

[44] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. Dpod: 6d pose object detector and refiner, 2019. 2, 3, 7, 8

[45] Ziheng Zhang, Zhengxin Li, Ning Bi, Jia Zheng, Jinlei Wang, Kun Huang, Weixin Luo, Yanyu Xu, and Shenghua Gao. Ppgnet: Learning point-pair graph for line segment detection. arXiv preprint arXiv:1905.03415, 2019. 2

[46] Zaiwei Zhang, Zhenxiao Liang, Lemeng Wu, Xiaowei Zhou, and Qixing Huang. Path-invariant map networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11084–11094, 2019. 8

[47] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2017. 2

[48] Xingyi Zhou, Arjun Karpur, Linjie Luo, and Qixing Huang. Starmap for category-agnostic keypoint and viewpoint estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 318–334, 2018. 2

Supplemental Material: HybridPose: 6D Object Pose Estimation under Hybrid Representations

Chen Song*, Jiaru Song, Qixing Huang The University of Texas at Austin

This is the supplemental material to “HybridPose: 6D Object Pose Estimation under Hybrid Representations”. We provide a detailed explanation of the algorithm used in the initialization sub-module. We also conduct a stability analysis of the refinement sub-module, and show how the optimal solution to the objective function changes with respect to noise in predicted representations.

1. Initial Solution for Pose Regression

Recall that we denote 3D keypoint coordinates in the canonical coordinate system as . To make notations uncluttered, we denote output of the first module, i.e., predicted keypoints, edge vectors, and symmetry correspondences as , and , respectively. Our formulation also uses the homogeneous coordinates and of , and respectively. The homogeneous coordinates are normalized by camera intrinsic matrix.

1.1. Three constraints for object pose.

We seek to generalize the EPnP algorithm, which only exploits keypoint 2D-3D correspondences for pose estimation, by leveraging hybrid representations: keypoints, edge vectors, and symmetry correspondences. To this end, we introduce the following difference vectors for each type of predicted elements:

where and are end vertices of edge , and is the normal of the reflection symmetry plane in the canonical system.
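As a numerical sanity check on these definitions, the sketch below evaluates one plausible form of the three difference vectors under a known pose and confirms that each vanishes when the predictions are noise-free. The explicit formulas are our assumption, chosen to be consistent with the surrounding prose (a cross product with the normalized homogeneous keypoint, an edge term built from the two end vertices, and a scalar symmetry term involving the plane normal); the example pose, points, and all names are hypothetical.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def normalized_homogeneous(X):
    """Perspective projection to homogeneous image coordinates (x/z, y/z, 1)."""
    return X / X[2]

# Hypothetical ground-truth pose.
R = rot_z(0.4) @ rot_x(-0.3)
t = np.array([0.1, -0.2, 2.0])

# Two 3D keypoints in the canonical frame: the end vertices of one edge.
p_s = np.array([0.3, 0.1, -0.2])
p_t = np.array([-0.1, 0.25, 0.15])
X_s, X_t = R @ p_s + t, R @ p_t + t
ph_s, ph_t = normalized_homogeneous(X_s), normalized_homogeneous(X_t)

# Keypoint term (assumed form): p_hat x (R p_bar + t).
r_keypoint = np.cross(ph_s, X_s)

# Edge term (assumed form): v_hat x (R p_t + t) + p_hat_s x (R v_bar).
v_bar = p_s - p_t      # 3D edge vector in the canonical frame
v_hat = ph_s - ph_t    # 2D edge vector in homogeneous form (last entry is 0)
r_edge = np.cross(v_hat, X_t) + np.cross(ph_s, R @ v_bar)

# Symmetry term (assumed form): (q_hat_1 x q_hat_2)^T R n_bar.
n_bar = np.array([1.0, 0.0, 0.0])      # reflection plane through the origin
q1 = np.array([0.2, 0.05, 0.1])
q2 = q1 - 2.0 * (q1 @ n_bar) * n_bar   # mirror image of q1
qh1 = normalized_homogeneous(R @ q1 + t)
qh2 = normalized_homogeneous(R @ q2 + t)
r_symmetry = np.cross(qh1, qh2) @ (R @ n_bar)

print(r_keypoint, r_edge, r_symmetry)
```

All three quantities are zero up to round-off, and each is linear in the entries of R and t, which is the property the closed-form initialization relies on.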

Proposition 1 If there is a perfect alignment between the predicted elements and the corresponding 3D keypoint template with respect to the ground-truth pose , then

Proof:

1. The proof of the first equality is straight-forward as there exists a “depth” so that

2. The proof of the second equality follows the first equality. So we have

Replacing the second by in the above equation, we have

3. To prove the third equality, define the depths of and as and and the corresponding 3D model points in the canonical system as and . is a point on the reflectional symmetry plane, whose normal is . Given a symmetry correspondence pair (), we have

Let . Following the camera perspective model, we have

Left multiply both sides of the equation by yields

Geometrically, (5) reveals that is perpendicular to the plane with span of , thus we have

Since is a non-zero scalar, we can delete this term and finally get

1.2. Pose solution in eigenvector space.

A nice feature shared by (1), (2) and (3) is that all constraints are linear in the elements of R and t. This allows us to derive a closed-form solution of R and t in the affine transformation space. Specifically, we can define as a vector that contains rotation and translation parameters in affine space. Expanding constraint (1) and constraint (2) yields three linear equations in x for each predicted element, and expanding constraint (3) yields one linear equation. By concatenating all linear equations of predicted elements together, we can generate a linear system of the form Ax = 0, where A is a matrix and its dimension is .

To model the relative importance among keypoints, edge vectors, and symmetry correspondences, we rescale (2) and (3) by hyper-parameters and , respectively, to generate A. As discussed in the body of this paper, we calculate and by solving an optimization problem using finite differences and backtracking line search.

Then following EPnP [1], we compute x as

where is the smallest right singular vector of A. Ideally, when predicted elements are noise-free, N = 1 with is an optimal solution. However, this strategy performs poorly given noisy predictions. As in EPnP [1], we choose N = 4.
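The construction of Ax = 0 and its null-space solution can be illustrated with a small noise-free sketch. It uses keypoint constraints only, assumes the constraint has the cross-product form p_hat x (R p_bar + t) = 0, and takes N = 1 (the noise-free case mentioned above) rather than N = 4. The layout of x as the nine row-major entries of R followed by t, the helper names, and the example data are our own choices.

```python
import numpy as np

def skew(v):
    """Matrix form of the cross product: skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def keypoint_rows(p_hat, p_bar):
    """Three rows of A for one keypoint, with x = [R.ravel(), t] (12 unknowns)."""
    # M @ x equals R @ p_bar + t, so the rows encode p_hat x (R p_bar + t) = 0.
    M = np.hstack([np.kron(np.eye(3), p_bar[None, :]), np.eye(3)])
    return skew(p_hat) @ M

# Hypothetical ground-truth pose and 3D keypoints in the canonical frame.
c, s = np.cos(0.5), np.sin(0.5)
R_gt = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_gt = np.array([0.2, -0.1, 3.0])
rng = np.random.default_rng(0)
P_bar = rng.uniform(-0.5, 0.5, size=(8, 3))

# Noise-free "predictions": normalized homogeneous image coordinates.
X = P_bar @ R_gt.T + t_gt
P_hat = X / X[:, 2:3]

# Stack three equations per keypoint and take the smallest right singular vector.
A = np.vstack([keypoint_rows(ph, pb) for ph, pb in zip(P_hat, P_bar)])
_, _, Vt = np.linalg.svd(A)
x = Vt[-1]

# The null vector is defined up to scale and sign; a rotation has determinant 1.
R_tilde, t_tilde = x[:9].reshape(3, 3), x[9:]
scale = np.cbrt(np.linalg.det(R_tilde))
R_est, t_est = R_tilde / scale, t_tilde / scale

print(np.abs(R_est - R_gt).max(), np.abs(t_est - t_gt).max())
```

With noisy predictions the smallest singular vector alone is no longer reliable, which is why the text combines several singular vectors and then optimizes their coefficients.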

1.3. Optimize a good linear combination.

To compute the optimal x, we optimize latent variables and the rotation matrix R with following objective function:

where is reshaped from the first 9 elements of . We solve this optimization problem with the following alternating procedure:

1. Fix and solve for R by SVD. i.e. R = given ,

2. Fix R and solve for ’s by optimizing a linear system in an element-wise manner.

To initialize ’s for the above optimization problem, we calculate with i = 1...3 by enforcing that is an orthogonal matrix2:

Since is a symmetric matrix, expanding (8) yields 6 nonlinear constraints for , which are, however, difficult to solve. We then define a new vector y = and form a linear system Cy = z which has the unique solution with z generated from . Afterwards, it is easy to recover from y and optimize from initialized along with .

After optimization, we again apply SVD to project onto the space of SO(3), i.e., and enforce where . Leveraging Ax = 0 defined in Section 1.2, the corresponding translation is

where is reshaped from .
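The SVD projection onto SO(3) used in this section can be sketched as below. This is the standard nearest-rotation construction in the Frobenius norm; the determinant correction that rules out reflections is a common convention and an assumption on our part about the exact variant used. The function name and the perturbed input are hypothetical.

```python
import numpy as np

def project_to_so3(M):
    """Closest rotation matrix to M in the Frobenius norm, via SVD."""
    U, _, Vt = np.linalg.svd(M)
    # Flip the last singular direction if needed so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

# Hypothetical example: a rotation about z corrupted by a small perturbation.
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
perturbation = 0.05 * np.array([[0.3, -1.0, 0.5],
                                [0.8, 0.2, -0.4],
                                [-0.6, 0.7, 0.1]])
R_proj = project_to_so3(R_true + perturbation)

print(np.linalg.det(R_proj))
```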

2. Stability Analysis for Pose Refinement

In this section, we provide a local stability analysis of the pose regression procedure, which amounts to solving the following optimization problem:

When predictions are accurate, then the optimal solution of the objective function described above should recover the underlying ground-truth. However, when the predictions possess noise, then the optimal object pose can drift from the underlying ground-truth. Our focus is local analysis, which seeks to understand the interplay between different objective terms defined by keypoints, edge vectors, and symmetry correspondences. Therefore, we assume the noise level of the input is small, and the perturbation of the output is well captured by low-order Taylor expansion of the output.

Our goal is to characterize the relation between the variance of the input noise and the variance of the output pose. We show that incorporating edge vectors and symmetry correspondences generally helps to reduce the variance of the output.

The remainder of this section is organized as follows. In Section 2.1, we provide a local stability analysis framework for regression problems. In Section 2.2, we describe the structure of the pose regression and apply this framework to provide a preliminary analysis of the stability of pose regression. In Section 2.3, we provide further analysis on a specific example, which indicates the interactions among keypoints, edge vectors, and symmetry correspondences. Finally, Section 2.4 provides proofs of the propositions in this analysis.

2.1. Local Stability Analysis Framework

We begin with a general result regarding an optimization problem of the following form

In the context of this paper, y encodes the noise associated with the predictions, i.e., keypoints, edge vectors, and symmetry correspondences. provides a local parameterization of the output, i.e., the object pose. The specific expressions of y and x will be described in Section 2.2.

Without loss of generality, we further assume that f satisfies the following assumptions (which are valid in the context of this paper):

• . Moreover, f(x, y) = 0 if and only if x = 0 and y = 0. This means , and (0, 0) is the strict global optimal solution.

Our analysis will utilize the following partial derivative of with respect to y.

Proposition 2 Under the assumptions described above, is unique in the local neighborhood of 0, and

Proof. See Section 2.4.2.

Since we are interested in local stability analysis, we assume the magnitude of y is small. Thus,

If we further assume y follows some random distribution whose variance matrix is Var(y), then the variance of the output is given by

Note that in our problem, f consists of non-linear least squares, i.e.,

The following proposition characterizes how to compute

Proposition 3 Under the expression described in (15), the second-order derivatives and at (0, 0) are given by

Proof. See Section 2.4.3.
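The framework of this subsection can be illustrated on a toy least-squares problem. The sketch assumes the standard implicit-differentiation result, in which the sensitivity of the minimizer to y is minus the inverse of the Hessian in x times the mixed second derivative, and the Gauss-Newton form of those derivatives, which is exact when the residuals vanish at (0, 0). We take these to be the forms stated in Propositions 2 and 3; the residual model, the Jacobians, and the noise levels below are hypothetical.

```python
import numpy as np

# Toy residuals r(x, y) = J_x @ x + J_y @ y, so that f(x, y) = ||r(x, y)||^2
# and r(0, 0) = 0. Here x has 2 entries (output) and y has 4 entries (noise).
J_x = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [1.0, -1.0]])
J_y = -np.eye(4)

# Second-order derivatives of f at (0, 0) for a least-squares objective.
H_xx = 2.0 * J_x.T @ J_x
H_xy = 2.0 * J_x.T @ J_y

# Sensitivity of the minimizer x*(y) with respect to y.
sensitivity = -np.linalg.solve(H_xx, H_xy)

# First-order propagation of the input variance to the output variance.
var_y = np.diag([0.01, 0.01, 0.04, 0.04])
var_x = sensitivity @ var_y @ sensitivity.T

# Cross-check: for this linear model the minimizer is an ordinary
# least-squares solution and must equal sensitivity @ y exactly.
y = np.array([0.02, -0.01, 0.03, 0.00])
x_star, *_ = np.linalg.lstsq(J_x, -J_y @ y, rcond=None)

print(var_x)
```

For non-linear residuals the same expressions hold only to first order in y, which is why the analysis assumes small noise.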

2.2. Structure of Pose Stability

We begin by rephrasing the pose-regression problem described in the main paper.

Ground-truth setup. We use the same definition of variables as in the main paper. Recall that is coordinates of keypoint in canonical system. Let and be the ground-truth pose. Then the ground-truth 3D location of in the camera coordinate system is

Let . Then the ground-truth image coordinates of the projected keypoint is given by

Likewise, recall are symmetry correspondence in the world coordinate system, and let

denote the transformed points in the camera coordinate system, where . So the image coordinates of each symmetry correspondence are given by

Noise model. We proceed to describe the noise model used in the stability analysis. In this analysis, we assume each input keypoint is perturbed from the ground-truth location by , i.e.,

Likewise, we assume each input edge vector is perturbed from the ground-truth edge vector by , i.e.,

Finally, for symmetry correspondences, we assume that is not perturbed, and is perturbed by , i.e.,

Local parameterization. We parameterize the 6D object pose locally using exponential map with coefficients , i.e.,

Note this parameterization is quite standard for rigid transformations. Now consider the three terms used in pose regression3:

The following proposition characterizes the derivatives between each term and the parameters of the noise model and the parameters of the local parameterization.

Proposition 4 Define

The derivatives of are given by

The derivatives of are given by

Moreover, the derivatives of are given by

where is homogeneous coordinate of normalized by camera intrinsic matrix.

Proof. See Section 2.4.1.
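The exponential-map parameterization used in this subsection can be sketched as follows. The rotation update uses the Rodrigues formula for the matrix exponential of a skew-symmetric matrix. Applying the update on the left of the ground-truth rotation and adding the translation coefficients directly are assumptions on our part about the exact convention; the function names are hypothetical.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(c):
    """Rotation matrix exp(skew(c)) computed with the Rodrigues formula."""
    theta = np.linalg.norm(c)
    K = skew(c)
    if theta < 1e-12:
        # First-order approximation near the identity.
        return np.eye(3) + K
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def perturb_pose(R_gt, t_gt, c, c_bar):
    """Local pose around (R_gt, t_gt); composition order is an assumption."""
    return exp_so3(c) @ R_gt, t_gt + c_bar

# Hypothetical example: a small perturbation of an identity pose.
R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 1.0])
R_new, t_new = perturb_pose(R_gt, t_gt,
                            c=np.array([0.01, -0.02, 0.03]),
                            c_bar=np.array([0.001, 0.0, -0.002]))

# A quarter turn about the z-axis, as a check of the Rodrigues formula.
R_quarter = exp_so3(np.array([0.0, 0.0, np.pi / 2]))
print(R_quarter.round(6))
```

Setting both coefficient vectors to zero recovers the ground-truth pose, which is why the analysis expands around the origin of this parameterization.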

Let y collect all the random variables in a vector. Let , and collect the Jacobian matrices for the predicted elements under each type in its column. Note that the size of is according to the derivations above. To facilitate the definition below, we reshape as a matrix by placing the original elements in the upper-left corner and zeros elsewhere. Denote and as the weights in front of each term (without loss of generality, we set ). Then the variance matrix Var(c, c) can be approximated by

where

If we consider Var as a function of and and compute its derivatives at and , we obtain

where and are the corresponding components in Var(y). This means whenever

increasing the value of from zero is guaranteed to obtain a positive reduction in the variance matrix (in terms of both the trace-norm and the spectral-norm).

(21) is satisfied when and . In general, when and are uncorrelated, then it is likely that increasing its value can lead to reduction in the output variance matrix.

A very similar argument can be applied to , and we omit the details for brevity.
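The qualitative claim of this subsection, that raising the weight of an additional term from zero reduces the output variance when its noise is independent of the keypoint noise, has a simple scalar analog. The sketch below is a toy illustration of that mechanism and not the matrix expression derived above: two independent noisy measurements of the same scalar are fused by weighted least squares with weights 1 and alpha, and the variance of the fused estimate is evaluated in closed form. All names and noise levels are hypothetical.

```python
def combined_variance(alpha, var_a, var_b):
    """Variance of (z_a + alpha * z_b) / (1 + alpha) for independent z_a, z_b.

    This is the minimizer of (mu - z_a)**2 + alpha * (mu - z_b)**2 over mu.
    """
    return (var_a + alpha**2 * var_b) / (1.0 + alpha) ** 2

# Hypothetical noise levels for the two measurement types.
var_keypoint = 0.04
var_edge = 0.01

baseline = combined_variance(0.0, var_keypoint, var_edge)       # first term only
small_weight = combined_variance(0.1, var_keypoint, var_edge)   # slightly above zero

# Setting the derivative with respect to alpha to zero gives this optimum.
optimal_alpha = var_keypoint / var_edge
optimal = combined_variance(optimal_alpha, var_keypoint, var_edge)

print(baseline, small_weight, optimal_alpha, optimal)
```

In this toy setting the derivative of the variance at alpha = 0 is negative whenever the first measurement is noisy, so any sufficiently small positive weight helps, and the best weight is the ratio of the two noise variances.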

2.3. An Example

We proceed to provide an example that explicitly shows how the variance of Var([c, c]) is reduced by incorporating edge vectors and symmetry correspondences. To this end, we consider a simple object that is given by a square, whose normal direction is along the z-axis in the camera coordinate system. We assume this square object has eight keypoints, whose z coordinates are all 1, i.e., . Their x and y image coordinates are:

Moreover, assume that the normal to the reflection plane is (1, 0, 0). The ground-truth symmetry correspondences are dense, and they are in the form of (x, y) and , where .

With this setup and after simple calculations, we have

where , and . Likewise, we have

Finally, we have

We proceed to assume the following noise model for the input:

In other words, noises in different predictions are independent. Applying Prop. 2.4.3, we have that

are functions of and . For simplicity, we only analyze , which is

It is easy to check that to minimize , the optimal value for is given by

In other words, incorporating edge vectors is helpful for reducing the velocity of the third dimension of the rotational component.

Similar analysis can be done for other . As the rationale is similar, we omit them for brevity.

Contributions of keypoints, edge vectors, and symmetry correspondences. It is very interesting to study the structure of (22). First of all, all elements are relevant to keypoints. Edge vectors provide full constraints on the underlying rotation. Symmetry correspondences also provide constraints on two dimensions of the underlying rotation. However, by analyzing the structure of and , one can see that they do not provide constraints on two dimensions of the underlying translation (albeit on this simple model). This explains why only using edge vectors and symmetry correspondences leads to poor results on object translations.

2.4. Proof of Propositions

2.4.1 Proof of Proposition 4

Derivatives of and . It is straightforward to compute the derivatives of and with respect to and , respectively. In the following, we focus on the derivatives of with respect to (c, c). The derivatives of can be obtained by subtracting those of and those of .

Recall the local parameterization and. We have

Using chain rule, we have

Derivatives of . Again using chain rule, we have

Moreover,

2.4.2 Proof of Proposition 2

Proof: First of all, any optimal solution is a critical point of f. Therefore, it shall satisfy:

Consider a neighborhood, where , and is chosen so that it contains for each y, the critical point with the smallest norm. Assume that is positive semidefinite in this neighborhood.

By contradiction, suppose there exist two distinct local minima and for a given y, i.e.,

Through integration, (24) yields

Since a weighted sum of positive definite matrices is also positive definite, it follows that

In other words, it cannot have a zero eigenvalue with a non-zero eigenvector . Therefore, the critical point is unique. Since the second-order derivatives are positive definite, each critical point is also a local minimum.

Computing the derivatives of (23) with respect to y, we obtain

2.4.3 Proof of Proposition 3

The proof is straight-forward as , and

References

[1] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o (n) solution to the pnp problem. International journal of computer vision, 81(2):155, 2009. 2