A New Rank Constraint on Multi-view Fundamental Matrices, and its Application to Camera Location Recovery

2017·Arxiv

Abstract

Abstract

Accurate estimation of camera matrices is an important step in structure from motion algorithms. In this paper we introduce a novel rank constraint on collections of fundamental matrices in multi-view settings. We show that in general, with the selection of proper scale factors, a matrix formed by stacking fundamental matrices between pairs of images has rank 6. Moreover, this matrix forms the symmetric part of a rank 3 matrix whose factors relate directly to the corresponding camera matrices. We use this new characterization to produce better estimations of fundamental matrices by optimizing an L1-cost function using Iterative Re-weighted Least Squares and Alternate Direction Method of Multiplier. We further show that this procedure can improve the recovery of camera locations, particularly in multi-view settings in which fewer images are available.

1. Introduction

Accurate reconstruction of 3D scenes from multiview stereo images is one of the primary goals of computer vision. Current techniques use point correspondences to estimate either the essential or fundamental matrices between pairs of images, and then use the estimated matrices to recover the camera matrices and structure. Notable success was achieved when sequential methods were introduced [1, 20]. These methods first recover camera matrices and structure from two images. Then, adding one image at a time, they apply bundle adjustment to estimate the camera matrix (and structure) of the new image. Recent work attempts to further improve recovery by considering simul-

Figure 1: Illustration of our rank constraint. Collections of fundamental matrices estimated for pairs of images (top) are arranged in a matrix (bottom). This matrix should be equal (up to noise) to a matrix F or properly scaled fundamental matrix, which in turn forms the symmetric part of a rank 3 matrix A.

taneously subsets of multiple images and recovering camera matrices that are consistent over the entire subsets. Indeed a number of papers have focused on the consistent recovery of either camera orientation or location [2, 19, 18, 23, 24, 16].

This paper introduces new constraints to enable the consistent recovery of fundamental and essential matrices. This is potentially advantageous since those matrices capture simultaneously the location and orientation of the cameras, along (in the case of fundamental matrices) with their internal calibration parameters. For configurations of cameras that are not all collinear, our main result establishes that, when scaled properly, the matrix formed by appending all pairwise fundamental matrices in a multiview setting is of rank 6. More tightly, this matrix forms the symmetric part of a rank 3 matrix whose factors relate directly to the entries of the corresponding camera matrices. We further show that multiview settings of collinear cameras yield a rank 4 matrix.

We use this characterization to develop an optimization formulation for estimating consistent sets of fundamental matrices. Our formulation can accept sets of estimated fundamental matrices in which some are noisy, some are outliers, and some cannot be estimated at all from image pairs (i.e., missing data). In solving this optimization we seek a set of scaled fundamental matrices that satisfy our constraints and fit the estimated fundamental matrices. Our formulation uses an L1 cost function, which is optimized with Iterative Re-weighted Least Squares (IRLS) [12], to remove outliers, and uses Alternate Direction Method of Multipliers (ADMM) [4] to incorporate rank constraints.

Our work is related to a variety of approaches to structure from motion (SfM) that utilize rank constraints. Tomasi and Kanade showed that under an orthographic projection, and after centering, projected points form a rank 3 matrix. Sturm and Triggs [21, 22] extended this to perspective projection by showing that projected points, when scaled properly, form a rank 4 matrix. Unlike their work, which uses rank constraint on tracks of points in images, our work only considers fundamental matrices and so in multiview settings it gives rise to systems with many fewer variables, relying on potentially less noisy estimates. Our approach, which seeks to recover a consistent set of fundamental matrices, is analogous to rotation or translation averaging and to loop closure [10, 6, 7]. In fact, obtaining consistent fundamental matrices can be regarded as simultaneous averaging of rotation, translation and camera calibration and as a way to close all loops. Our experiments indicate that such joint averaging performs better than a separate averaging of rotation and translation.

A number of algorithms have recently been proposed for solving unconstrained, low rank systems with outliers and missing data (e.g., [5, 13, 17]) with remarkable success. Extending such techniques to incorporate SfM constraints is an important next step.

When thousands of images are available, existing methods that use pairwise epipolar constraints or tri-focal tensors can exploit highly over-determined systems to handle noise and outliers quite accurately. However, when fewer images are available the importance of rank constraints grows, and their introduction can potentially yield more accurate estimation of camera parameters. Indeed, we provide experiments that show that using our characterization, essential matrices can be estimated more accurately than with current state-of-the-art methods, and these in turn can be translated to better estimates of camera locations.

2. Low-Rank Characterization of Fundamental Matrices in Multiview Settings

2.1. Background

We first introduce notations and give a short summary of the relevant concepts in multi-view geometry. An extensive discussion of this topic can be found in [11]. Let denote a collection of n images of a scene and let and denote the location and orientation of the i’th camera in a global coordinate system. Let the denote the intrinsic camera calibration matrix for is nonsingular and is typically specified in the form

where, and respectively are the focal lengths in the x and y direction, form the principal point and represents the skew coefficient. Let be a scene point in the global coordinate system. Its projection onto (expressed in homogeneous coordinates) is given by , where . We therefore associate with the camera matrix , where I is a identity matrix and note that scaling does not affect projection.

Next, we consider the relations between pairs of images, and . We can express the camera rotation and translation relating two images by and . Clearly, and . Two images are further related by epipolar line constraints, which are expressed by , where denotes the fundamental matrix relating to can be estimated up to scale from point correspondences. is related to the rotation and translation between and and to their respective calibration matrices by , where denotes the skew-symmetric matrix corresponding to cross-product with . In cases in which the cameras are calibrated we set and replace the fundamental matrix with the essential matrix . Therefore, .

To derive our rank constraint we will need to express the essential and fundamental matrices relative to a global coordinate system. [25] derived an expression in terms of the camera matrices and . Here we will use the more recent derivation of [2] that, as we shall see below, is amenable to factorization:

where .

2.2. Low-rank Construction

We next introduce our main result, which includes a low rank characterization of the collection of fundamental matrices in multiview settings. For our result we will construct a matrix of size , denoted F, in which each of the blocks includes a fundamental matrix (see Figure 1), where we assume that each of the pairwise fundamental matrices in F is scaled properly. We further define for all , and note that this is consistent with (3). Likewise we define the matrix E from the essential matrices . We refer to F (resp. E) as the multiview matrix of fundamentals (essentials).

Claim 1: F (and likewise E) is symmetric and 6. Moreover,

1. If F is produced by n cameras whose centers are not all collinear then rank(F) = 6 and there exists a matrix A with rank(A) = 3 such that .

2. If F is produced by n cameras whose centers are all collinear then and there exists a matrix A with such that .

Proof: To prove the claim we begin by defining the matrix A as follows. Let , and . , and are matrices. Observ- ing (3) and recalling that is skew-symmetric we see that .

Next we construct the matrices U and V as

and set . Clearly, by construction, 3. Moreover, , and so F is symmetric and . Case 1: We show next that unless the cameras are all collinear rank(A) = 3. Clearly rank(V ) = 3. Therefore we need to show that also rank(U) = 3. We prove this by contradiction. Assume rank(U) < 3. Then , , s.t. Ut = 0. This implies that for all . Thus, all the ’s are parallel to t, violating our assumption that not all camera locations are collinear. Consequently rank(U) = 3 and therefore also rank(A) = 3.

Next we show that when the cameras are not all collinear rank(F) = 6. We recall that where and are non-singular. We can therefore write F = where the matrix K is block diagonal with blocks formed by . This implies that rank(F) = rank(E), and so we are left to show that rank(E) = 6.

We assume WLOG that the camera locations are centered at the origin, i.e., (since E is invariant to global translation of the cameras). We further argue that each column of U is orthogonal to each column of V . This is evident from the following identities

Let denote the matrix A where we substitute (so that .) Denote by the SVD of and are and is ). Since we have that and . Now we can decompose E as :

Since the columns of U are orthogonal to those of V , the matrixis column orthogonal. Thus, (5) is the SVD of E. And since is rank 3, is full rank. Consequently, rank(F) = rank(E) = 6.

Case 2: Suppose all camera centers are collinear. WLOG assume that the origin of the global coordinate system is also collinear with the n cameras (since F is unaffected by global translation), and so we can write for where and . Let , then clearly . Define (so ) and let the matrix be formed by stacking on top of each other then

Since T is skew-symmetric its rank is at most 2 and so is rank(A). It follows that

2.3. Tightness of our constraints

Claim 1 provides two constraints on the matrix F.

• and rank(A) = 3.

• The diagonal block of F vanishes, i.e., .

We now investigate how tight these constraints are in producing fundamental matrices that are consistent with a set of camera parameters. We show that the number of degrees of freedom allowed by these constraints is equal to the number of degrees of freedom in the camera matrices. However, we find that there exist matrices that are allowed by these constraints, but do not produce valid fundamental matrices.

Counting arguments show that our constraints allow degrees of freedom (DOFs) in defining F. Specifi-cally, since A is rank 3 it can be written as where U and V are , so together they have 18n entries. The constraint , however, gives rise to a 15 DOF ambiguity that should be subtracted from the number of entries of U and V , as we explain in the next paragraph. The constraint that requires to be skew symmetric, yielding 6n more constraints on the entries of U and V , yielding together DOFs.

note that we can write F as , where J is a permutation matrix defined as J =

(so ). With this notation the ambiguity in factorizing F is obtained by introducing a matrix Q such that so that has 36 entries, but the constraints reduces its degrees of freedom to 15. Denote

and to be skew symmetric and the sum , providing altogether 21 constraints on the 36 entries of Q, leaving 15 DOFs.

Coincidentally, the number of DOFs in factoring F is equal to the DOFs in defining n cameras. In general, the number of DOFs in defining n perspective cameras is 15. However, each camera matrix can be scaled arbitrarily and each choice of scale will (inversely) scale the respective row and column of F. In other words, n camera matrices, , scaled arbitrarily by non zeros , produce a collection of equivalent multiview fundamental matrices defined by

The freedom in choosing the entries of S accounts for the n missing DOFs.

We note however that although the DOFs in factoring F with our constraints are equal to the DOFs in defining n camera matrices there exist matrices that satisfy our constraints but cannot be realized with n cameras. Specifically, these constraints do not guarantee that all the pairwise fundamental matrices are rank deficient. The constraint restricts to be skew-symmetric, implying that either or is rank deficient. If all ’s (or equivalently all ’s) are chosen to be rank deficient then so are all the . If however some of the ’s and some of the ’s are chosen to be full rank then they may produce blocks that are rank 3 and so they are not legal fundamental matrices. Note that the skew-symmetry of guarantees that no more than 1/4 of the ’s can be of full rank. Indeed, our experiments (in Section 4) often produce ’s that are near rank 2; in a typical run the average ratio of the third to second largest singular value , presumably because the problem is so over-constrained.

In conclusion, while our constraints provide a neces-

sary but not sufficient conditions for consistency, counting considerations indicate that our constraints are nearly tight. Below we develop and optimization scheme that utilizes these constraints to infer the missing scale factors for collections of estimated pairwise fundamental matrices, to recover missing fundamentals and to correct noisy ones.

3. Low-rank Constrained Optimization to Recover Fundamental Matrices

In this section we formulate an optimization problem that uses the constraints derived in section 2 to achieve a better recovery of pairwise fundamental matrices. Assume we are given a set of fundamental matrices , where and denotes the subset of image pairs for which fundamental matrices have been estimated. (We will further assume .) We use these matrices to construct our measurement matrix whose block contains if and is zero otherwise. Note that in the absence of errors each non-zero block is related by an unknown scale factor to the corresponding block in the sought multiview matrix of fundamentals F, where depends on the distance between the i’th and j’th cameras. Recovering these scale factors is essential in order to apply our constraints. Our task therefore can be expressed as:

where F is constrained to fulfill the constraints in Claim 1. Here we have chosen to minimize over the sum of Frobenius norms of each block. Such mixed L1-L2 norm minimization is expected to be robust to outlier estimates of ’s.

We note that the formulation (6) is bilinear in F and the scale factors. We could avoid this bilinearity by minimizing instead . Such minimization, however, is subject to a zero trivial solution and so it requires an additional constraint such as . Our experience with such formulation is that it is quite sensitive to errors.

Expressing (6) with the constraints results in the following problem:

where denotes each sub-block of A. Our solution for F then is .

(7) introduces a number of challenges, including the mixed L1-Frobenius norms, the bilinearity, and the rank constraint. This problem is non-convex due to the latter two challenges. Below we describe how we approach these challenges with IRLS and ADMM. Our algorithm is sum- marized in Algorithm 1.

3.1. Handling Outliers with IRLS

We begin by addressing the mixed L1-Frobenius norm in the cost function. We approach this with Iterative Reweighted Least Squares (IRLS) [12]. IRLS converts the problem to weighted least squares where the weights are updated from one iteration to the next. At each iteration t of the IRLS we replace the cost function in (7) with

where

is a regularization parameters (we use ).

To clarify presentation we simplify our notations as follows. Let W and be matrices. Denoting their sub-blocks by and , we set and , where 1 is a matrix with all 1’s. We further use the subscript WF to denote the weighted Frobenius norm, i.e., and use to denote element-wise product of matrices. Therefore, in each IRLS iteration we seek to solve

3.2. Optimization using ADMM

Next, we wish to solve the non-convex optimization problem in (9), including the bilinearity and the rank constraint. To this end we will use a scaled version of Alternate Direction Method of Multiplier (ADMM) [4, 9]. We maintain a second copy of A, which we denote as B, and form the augmented Lagrangian of (9) as:

The last term in this objective, denotes the Lagrangian penalty; is a constant, and is a matrix of Lagrange multipliers of the same size as A that is updated in the ADMM steps. We next describe the ADMM steps, which are applied iteratively.

Step 1: Solving for . In each iteration, k, we solve the following sub-problems:

Since (11) is non-convex we will solve it by alternative minimization of A and

Because of the form of (11) it is useful to separate A into its symmetric and anti-symmetric parts, and , so that with and . Let and respectively denote its symmetric and anti-symmetric part. We can write (11) in terms of and and separately solve for them as follows:

To solve (12) we take the derivative w.r.t. and equate to 0. Thus we update according to

2. Optimize w.r.t. : We minimize the following subproblem

Step 2: Solving for B. This part of the ADMM deals with the rank constraint. It requires a solution to

This is solved by

where SV P(X, r) denotes the Singular Value Projection (SVP) of X into space the of rank-r matrices. To perform SV P(X, r) we compute the SVD of X and keep its top r singular values and the corresponding singular vectors.

Step 3: Update of . The matrix contains Lagrange multipliers that are used in the saddle-point formulation (10) to enforce the equality constraint A = B. The following update is a gradient ascent step that acts to maximize the augmented Lagrangian (10) for For details, see [4, 9].

Empirically we observe monotonic convergence of the cost function defined in Equation 7 with each iteration of IRLS on a sample problem as shown in Figure 2. For every iteration of the IRLS we run ADMM till convergence to optimize Equation 10.

Figure 2: Convergence of our optimization algorithm

Figure 3: SfM pipelines for LUD (left) and our method (right).

4. Experiments

To demonstrate the utility of our method we tested it in the problem of estimating essential matrices and camera locations from multiple images. Current iterative and global approaches to Structure from Motion (SfM) are often tested on large datasets when many pairwise essential matrices can be estimated, achieving outstanding performance. We argue that imposing rank constraints can be useful particularly when the number of images is relatively small. To demonstrate this we run our method on subsets of images of different sizes showing improved performance relative to the existing methods particularly with smaller subsets.

In many common SfM pipelines the intrinsic calibration parameters are recovered separately. Therefore, while our method can be applied when the calibration parameters are unknown, here we assume that the cameras are calibrated and so we apply our optimization algorithm to essential ma-

trices. Note that our derivations in Sections 2 and 3 hold also for essential matrices by setting . We next describe the tested methods:

• LUD [18]: Figure 3 shows the pipeline used by LUD to estimate camera locations and orientations from pairs of images. Starting from pairwise essential matrices estimated with SIFT [15] and RANSAC [3], this method first solves for camera orientations, denoted by in Figure 3, by iteratively applying [2] while rejecting outliers. Using camera orientations it then returns to the image keypoints to estimate pairwise camera directions, denoted by . Using these pairwise directions it applies IRLS to solve for camera locations (), which we compare to our method. In addition, we use the estimated camera locations and orientations to reconstruct the pairwise essential matrices .

• ShapeKick [8]: For this method we use the same pipeline as used with LUD, except that we replace the translation recovery part of LUD with ShapeKick. ShapeKick formulates the location recovery problem as a convex optimization and solves it with ADMM. They achieved comparable performance to LUD on the dataset of [24].

• 1DSfM [24]: This method uses a pre-processing technique, based on projection in many random directions, to remove outliers in the original pairwise direction measurements. In our experiments we use their software, which uses the pipeline described in [24] and only provides camera locations.

• Our method: Figure 3 shows the pipeline used by our method. From the pair-wise essential matrices we minimize (7) using the IRLS-ADMM summarized in Algorithm 1. Since our method is not convex it requires a good initialization. We initialize it with essential matrices produced by the LUD method of Ozyesil et al. [18], denoted . Specifically is used to initialize and A in Algorithm 1. Our algorithm improves these essential matrix estimates, producing a collection of new pairwise estimates in E, denoted . To further produce camera locations we first use and the rotations obtained by the LUD pipeline, , to solve for the pairwise camera directions . Then we apply translation solver of LUD to the with to produce camera locations . As is shown below, our improved estimates of essential matrices lead in turn to improved estimates of camera locations compared to the LUD pipeline.

We tested these methods on real image collections from [24], which comes with ‘ground truth’ estimates of camera locations and essential matrices produced with a sequential method similar to [20]. (These ground truth estimates are used also in [24, 18, 8].) For our experiments we used 14 different scenes from the dataset. For each scene we randomly selected 5 different sub-samples of N images from the dataset. We used N = 50, 100, and 150 images, resulting in 70 different trials for each N. In each trial we compared the quality of the essential matrix recovered by our method to that recovered by LUD and ShapeKick. Likewise, we compared the quality of our recovered camera locations to those obtained by the three competing methods.

Figure 4: These graphs show a comparison of the recovery error of essential matrices achieved with our method compared to LUD (in blue) and ShapeKick (in yellow), for collections of 50, 100, and 150 images from [24], The graphs on the left show the amount of relative improvement and the ones on the right show the fraction of improved trials.

Figures 4-5 show our results. Each graph summarizes the results of 70 trials with each value of N. Figure 4 shows the quality of our essential matrix estimates compared to those obtained with LUD and ShapeKick, and Figure 5 shows the quality of our camera location estimates compared to those achieved by the three competing algorithms. We measure these as follows. In each experiment k we consider the collection of pairwise essential matrices produced by our method. We first normalize each matrix and measure its error to the respective (normalized) ground truth matrix. We then take the mean (or median) of this error over all essential matrices. Denote this error by . We then produce similar error measures for each competing algorithm, denoted . We then report:

• Relative Improvement (in %): Here we report for each N and competing algorithm the average of

Figure 5: A comparison of the recovery error of camera locations achieved with our method compared to LUD (in blue) and ShapeKick (in yellow), and 1DSfM (in red) for collections of 50, 100, and 150 images from [24], The graphs on the left show the amount of relative improvement and the ones on the right show the fraction of improved trials.

• Percent of Improved Trials: This provides the percentage of trials in which our algorithm achieved more accurate results than a competing algorithm, i.e., , where I(.) is the indicator function and K denotes the total number of trials.

We provide similar measures to assess the quality of our camera locations estimates. In Figure 6 we further show the median error of camera location estimates for all methods in all trials for N = 50.

It can be seen overall that our method leads to improved estimation of essential matrices and of camera locations. With 50 images, compared to, e.g., LUD, our algorithm improves the median essential matrix estimates by 17.69%. With 150 images a smaller overall improvement of 6.68% is achieved. This suggests that our constraints are more effective when smaller numbers of images are used. Interestingly, however, despite this reduction the fraction of trials in which our method achieved more accurate estimates compared to LUD in fact increased slightly from 87% with 50 images to 98% with 150 images, indicating that our method remains effective also with larger number of images (albeit yielding smaller improvement). Similar results are observed for camera location estimation. With 50 and 150 images our algorithms improves the median camera location error by 19.73% and 8.77% respectively, while the fraction of trials in which our method achieved more accurate estimates than LUD increased slightly from 84% with 50 images to 90% with 150 images.

In our previous experiments we applied our optimization algorithm to essential matrices, assuming calibration is given. Below we further apply our algorithm to fundamental matrices in an uncalibrated setting. Since not all the entries of a fundamental matrix are of same orders of magnitude, we normalize each of the input pairwise fundamental matrices by centering all the images and scaling them uniformly to within the [1, 1] square and then compute a normalized fundamental matrix. This does not affect our rank constraint and can be inverted at the end of the process. We tested our method on 5 subsamples of 50 images for 14 different scenes and compared it to LUD. To evaluate the quality of the recovered fundamental matrices we convert them to essential matrices by applying the known calibration matrices and further use these to recover camera locations. The results can be seen in Figure 6. Using our method to recover fundamentals (in blue) yielded comparable accuracies to our results for essential matrix recovery (yellow) and both our approaches improve significantly (10-20%) over LUD as shown in Figure 7.

We further performed bundle adjustment (using [14]) initialized by the camera parameters obtained with our method and LUD. After bundle adjustment compared to LUD our method improved camera location estimates on average by 11.52%, 3.13% and 5.43%, improving in 70.59%, 64.29% and 63.77% of all trials for 50, 100 and 150 images respectively in terms of median translation error. These results indicate that our method maintains improved accuracies over LUD also after bundle adjustment.

With 50 images the recovery of essential matrices with our method requires roughly 20 iterations of IRLS and 1000 iterations of ADMM. These take overall about 2 minute on a 2.7 GHz Intel Core i5 computer.

To conclude, these experiments indicate that our characterization of essential matrices in multiview settings can be used to improve essential matrix and cameral location estimates. The advantage of these constraints appear to be particularly pronounced when fewer images are available.

5. Conclusion

We have introduced in this paper novel rank constraints on fundamental matrices in multiview settings. We have shown in particular that with non-collinear cameras the matrix that depicts the pairwise fundamentals is of rank 6 and forms the symmetric part of a rank 3 matrix whose factors are related directly to the entries of the respective camera matrices. We have used these constraints to develop an optimization framework to efficiently recover fundamental matrices for all pairs of images and to estimate their proper

Scenes0 10 20 30 40 50 60 70 0

Figure 6: Median camera location error obtained by the four algorithms for 5 subsets of 50 images for 14 different scenes (‘Notre Dame’, ‘Montreal Notre Dame’, ‘Alamo’, ‘Piazza del Popolo’, ‘Piccadilly’, ‘NYC Library’, ‘Yorkminster’, ‘Union Square’, ‘Madrid Metropolis’, ‘Tower of London’, ‘Vienna Cathedral’, ‘Roman Forum’ and ‘Ellis Island’, ‘Gendarmenmarkt’). For clarity we terminate the median T error axis at 30.

Figure 7: Improvement of our method over LUD using fundamental matrix (in blue) and essential matrix (yellow) for 50 images.

scale factors. Our experiments indicate that our method is able to provide improved estimates of essential matrices and camera locations in global SfM settings. Moreover, these experiments suggest that our constraints are particularly useful when fewer images are available.

Our plans for the future include improving the runtime of our optimization method. We intend to explore different ways to initialize the algorithm, possibly through convex relaxations. We would further want to explore the use of our method in related applications, e.g., in camera autocalibration.

References

[1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building rome in a day. In 2009 IEEE 12th international conference on computer vision, pages 72–79. IEEE, 2009. 1

[2] M. Arie-Nachimson, S. Z. Kovalsky, I. Kemelmacher- Shlizerman, A. Singer, and R. Basri. Global motion estimation from point matches. In 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 81–88. IEEE, 2012. 1, 2, 7

[3] R. C. Bolles and M. A. Fischler. A ransac-based ap- proach to model fitting and its application to finding cylinders in range data. In IJCAI, volume 1981, pages 637–643, 1981. 7

[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eck- stein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends Rin Machine Learning, 3(1):1–122, 2011. 2, 5, 6

[5] E. J. Cand`es and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009. 2

[6] A. Chatterjee and V. Madhav Govindu. Efficient and robust large-scale rotation averaging. In Proceedings of the IEEE International Conference on Computer Vision, pages 521–528, 2013. 2

[7] Z. Cui and P. Tan. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE International Conference on Computer Vision, pages 864–872, 2015. 2

[8] T. Goldstein, P. Hand, C. Lee, V. Voroninski, and S. Soatto. Shapefit and shapekick for robust, scalable structure from motion. In European Conference on Computer Vision, pages 289–304. Springer, 2016. 7

[9] T. Goldstein, B. O’Donoghue, S. Setzer, and R. Bara- niuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014. 5, 6

[10] R. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation averaging. International journal of computer vision, 103(3):267–305, 2013. 2

[11] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003. 2

[12] P. W. Holland and R. E. Welsch. Robust regression us- ing iteratively reweighted least-squares. Communications in Statistics-theory and Methods, 6(9):813–827, 1977. 2, 5

[13] Y. Hu, D. Zhang, J. Ye, X. Li, and X. He. Fast and ac- curate matrix completion via truncated nuclear norm regularization. IEEE transactions on pattern analysis and machine intelligence, 35(9):2117–2130, 2013. 2

[14] M. I. Lourakis and A. A. Argyros. Sba: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software (TOMS), 36(1):2, 2009. 8

[15] D. G. Lowe. Distinctive image features from scale- invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. 7

[16] D. Martinec and T. Pajdla. Robust rotation and trans- lation estimation in multiview reconstruction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 1

[17] T. Okatani and K. Deguchi. On the wiberg algorithm for matrix factorization in the presence of missing components. International Journal of Computer Vision, 72(3):329–337, 2007. 2

[18] O. Ozyesil and A. Singer. Robust camera location es- timation by convex programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2674–2683, 2015. 1, 7

[19] O. Ozyesil, A. Singer, and R. Basri. Stable camera motion estimation using convex programming. SIAM Journal on Imaging Sciences, 8(2):1220–1262, 2015. 1

[20] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. In ACM transactions on graphics (TOG), volume 25, pages 835–846. ACM, 2006. 1, 7

[21] P. Sturm and B. Triggs. A factorization based algo- rithm for multi-image projective structure and motion. In European conference on computer vision, pages 709–720. Springer, 1996. 2

[22] B. Triggs. Factorization methods for projective struc- ture and motion. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR’96, 1996 IEEE Computer Society Conference on, pages 845–851. IEEE, 1996. 2

[23] R. Tron and R. Vidal. Distributed 3-d localization of camera sensor networks from 2-d image measurements. IEEE Transactions on Automatic Control, 59(12):3325–3340, 2014. 1

[24] K. Wilson and N. Snavely. Robust global translations with 1dsfm. In European Conference on Computer Vision, pages 61–75. Springer, 2014. 1, 7, 8

[25] Z. Zhang and G. Xu. A general expression of the fun- damental matrix for both perspective and affine cameras. In Proceedings of the Fifteenth international joint conference on Artifical intelligence-Volume 2, pages 1502–1507. Morgan Kaufmann Publishers Inc., 1997. 2

6. Supplementary Material

6.1. Results on camera location error

In Table 1 we compare Our, LUD, ShapeKick and 1DSfM on 14 different scenes for N = 50, 100 and 150 images. For each scene and each choice of N we report the average of the median camera location error for 5 different trials. In Table 3-16 , each table compares four competing algorithms on 5 different trials for 50, 100 and 150 images.

6.2. Results on essential matrix error

In Table 2 we compare Our, LUD, ShapeKick and 1DSfM on 14 different scenes for N = 50, 100 and 150 images. For each scene and each choice of N we report the average of the median essential error for 5 different trials. Essential matrix error between two cameras is computed as the norm of their difference after they are normalized to unit norm. We multiply the essential matrix error by 100 and report it for convenience. In Table 17-30 , each table compares four competing algorithms on 5 different trials for 50, 100 and 150 images.

Table 1: Average of median camera location error for 5 trials for each scene and each choice of N

Table 2: Average of median essential matrix error for 5 trials for each scene and each choice of N

Table 3: Median translation error on ‘Alamo’

Table 4: Median translation error on ‘Ellis Island’

Table 5: Median translation error on ‘Madrid Metropolis’

Table 6: Median translation error on ‘Montreal Notre Dame’

Table 7: Median translation error on ‘Notre Dame’

Table 8: Median translation error on ‘NYC Library’

Table 9: Piazza Del Popolo

Table 10: Median translation error on ‘Piccadilly’

Table 11: Median translation error on ‘Roman Forum’

Table 12: Median translation error on ‘Tower of London

Table 13: Median translation error on ‘Union Square’

Table 14: Median translation error on ‘Vienna Cathedral’

Table 15: Median translation error on ‘Yorkminster’

Table 16: Median translation error on ‘Gendarmenmarkt’

Table 17: Median essential matrix error on ‘Alamo’

Table 18: Median essential matrix error on ‘Ellis Island’

Table 19: Median essential matrix error on ‘Madrid Metropolis’

Table 20: Median essential matrix error on ‘Montreal Notre Dame’

Table 21: Median essential matrix error on ‘Notre Dame’

Table 22: Median essential matrix error on ‘NYC Library’

Table 23: Median essential matrix error on ‘Piazza Del Popolo’

Table 24: Median essential matrix error on ‘Piccadilly’

Table 25: Median essential matrix error on ‘Roman Forum’

Table 26: Median essential matrix error on ‘Tower of London’

Table 27: Median essential matrix error on ‘Union Square’

Table 28: Median essential matrix error on ‘Vienna Cathedral’

Table 29: Median essential matrix error on ‘Yorkminster’

Table 30: Median essential matrix error on ‘Gendermenmarkt’

designed for accessibility and to further open science