Learning Geometrically Consistent Mesh Corrections

2019·Arxiv

Abstract

Building good 3D maps is a challenging and expensive task, which requires high-quality sensors and careful, time-consuming scanning. We seek to reduce the cost of building good reconstructions by correcting views of existing low-quality ones in a post-hoc fashion using learnt priors over surfaces and appearance. We train a convolutional neural network model to predict the difference in inverse-depth from varying viewpoints of two meshes – one of low quality that we wish to correct, and one of high-quality that we use as a reference.

In contrast to previous work, we pay attention to the problem of excessive smoothing in corrected meshes. We address this with a suitable network architecture, and introduce a loss-weighting mechanism that emphasises edges in the prediction. Furthermore, smooth predictions result in geometrical inconsistencies. To deal with this issue, we present a loss function which penalises re-projection differences that are not due to occlusions. Our model reduces gross errors by 45.3%–77.5%, up to five times more than previous work.

I. INTRODUCTION

Dense 3D maps are a crucial component in many systems and better maps make robots easier to build and safer to operate. Despite recent progress in hardware such as the wide availability of GPUs, and algorithms that scale with the amount of data and the available hardware, high-quality maps, especially at large scales, remain difficult to build cheaply, often requiring expensive sensors.

The main motivation of our work is to reduce the cost of building good dense 3D maps. The reduction in cost can come from either: a) cheaper but noisier sensors, such as stereo cameras; b) less data, and therefore less time spent densely scanning an area. There are two ways in which we can produce better reconstructions with cheaper data. We can learn the kinds of errors a certain modality produces. For a stereo camera, for example, there will be missing data in areas without a lot of texture (walls, roads), and the ambiguity in depth is usually along the viewing rays. In addition, we can learn priors for a target environment: cars usually have known shapes, roads and buildings do not have holes in them, surfaces tend to be vertical or horizontal in an urban environment, etc.

We tackle the problem of correcting dense reconstructions with a convolutional neural network (CNN) that operates on rasterised views of a 3D mesh as a post-processing step, following a classical reconstruction pipeline. To train this model, we start with two meshes, a low-quality one and a high-quality reference one. From each reconstruction, we render multiple types of images (such as inverse-depth, normals, etc.), referred to as mesh features (shown in Figure 2), from

Fig. 1. Illustration of geometric consistency for two views of the same scene a few meters apart. The predictions of our model can be used to compute corrected depth maps for a set of views. The corrected depth-maps should be consistent: they should be describing the same scene. To enforce this, we densely reproject inverse-depth from View 1 to View 2, using the corrected depth of View 2. The absolute difference (bottom) is the inconsistency of View 1 with respect to View 2. This is minimised during the training of our model, which enforces geometrically consistent predictions. The red overlay is the occlusion mask: the parts of the View 2 prediction that cannot be consistent with the View 1 prediction.

multiple viewpoints. We then train the model on the mesh features to predict the difference in inverse-depth between the high-quality reconstruction and the low-quality one, thus enabling us to correct the low-quality mesh.

Previous work [1] has demonstrated the idea of correcting meshes post-hoc via 2D rasterised views. We address its two main limitations. Firstly, we deal with the issue of overly smooth predictions. We propose some architecture and training changes: we add skip connections from the encoder to the decoder in our CNN, which are known to help in localising edges in predictions; we propose a loss-weighting method that penalises incorrect predictions more the closer

they are to an edge. Secondly, predictions on nearby views are not always consistent, i.e. when applying the predicted corrections the geometry of the scene is not always the same. To improve consistency, we employ a view synthesis based loss, and show that this also improves the performance of the network. An illustration of this idea is shown in Figure 1.

Our contributions are as follows:

Error correction: We propose a CNN model that is able to correct 2D views of a dense 3D reconstruction. We propose a novel weighting mechanism to improve performance around edges in the prediction. We evaluate it against existing work and show that we outperform it, especially when images are from multiple viewpoints.

Geometric consistency: We adapt the photometric consistency loss that features in related tasks such as depth-from-mono [2] to the task of correcting reconstructions. This is a novel use of the loss, which has thus far only been employed on RGB images. We leverage the existing reconstruction to compute occlusion masks, and thus exclude from the loss areas where geometric consistency is impossible. We show that this use of geometric consistency as an auxiliary loss further improves our model.

II. RELATED WORK

Several systems for building 3D reconstructions have been proposed, such as BOR2G [3] or KinectFusion [4]. Our approach is to correct the output of such a system by looking at the meshes it builds. As we operate on inverse-depth images, our work is similar to depth refinement. Below, we review some of the literature on that topic, as well as some other methods that our system draws inspiration from.

Learnt depth refinement and completion: Some methods for learning depth map refinement and completion have been recently proposed. [5], [6] propose CNN models for solving the KITTI Depth Completion Challenge, where sparse depth maps produced from laser data are densified. [7] use normalised convolutions [8] to predict dense depth maps that have been sparsely sampled. These methods all rely on having some sort of high-quality, usually laser, depth information (albeit sparse) at run-time, and are only designed to fill in the missing data. In contrast, we only require high-quality data during training, and our method aims to refine depth from a low-quality mesh in a more general sense – both filling in blanks, as well as refining existing surfaces.

A few other methods more closely related in spirit to ours aim to learn depth refinement using meshes as reference. [9] use dictionary learning to model the statistical relationships between raw RGB-D images, and high-quality depth data obtained by fusing multiple depth maps with KinectFusion [4]. Two other recent works use fused RGB-D reconstruction to obtain high-quality reference data and train CNNs to enhance depth. [10] uses a colour image to predict normals and occlusion boundaries, supervised by the 3D reconstruction, and then formulates an optimisation problem to fill in holes in an aligned depth image. [11] introduce a 4000-image dataset of raw/clean depth image pairs, and train a CNN to enhance raw depth maps. Furthermore, they show that 3D

Fig. 2. Example mesh features. We fly a camera through an existing mesh and at each location produce mesh features. During training, inverse-depth images of a high-quality mesh are available, and our model learns a mapping from the mesh features pictured above to the high-quality inverse-depth. We refer the reader to [1] for an analysis of the relative usefulness of mesh features.

reconstructions can be obtained with less data and more quickly when using their depth-enhancing network. These methods all rely on live colour images to guide the refinement or completion of live depth. That means that only limited data is available for training. In contrast, our purely mesh-based formulation allows us to extract many more training pairs from any viewpoint, removing any viewpoint-specific bias that might otherwise surface while learning.

Learnt depth estimation: Another related line of research has been monocular depth estimation, which inspires our

Fig. 3. Visibility masks for views generated at a location. The image regions highlighted in red, green, blue, and cyan are also visible in the (a) left, (b) right, (c) back, and (d) top views, respectively. (a): inverse-depth view from a dense 3D reconstruction; the regions visible in the left, and top views are respectively highlighted in green and cyan. (b): a view of the same scene as (a), 2 m to the right; the regions visible in the left, and top views are respectively highlighted in red and cyan. (c): a view of the same scene as (a), looking back; the region visible in top view is highlighted in cyan. (d): a view of the same scene as (a), looking down from 25 m above; the regions visible in the left, right, and back views are respectively highlighted in red, green, and blue. Areas outside the highlighted regions are occluded – they are not visible from the other view. For example, distant areas of (a)–(c), such as the sky, are not visible from the top view; the region behind the car on the right in (a) is not visible from (b). These occluded parts of the image are ignored when computing the geometric consistency loss (Section III-C.2). For visualisation purposes, morphological closing has been applied to the visibility masks shown here, to remove some of the small-scale noise.

choices of network architecture and the use of geometry as a source of self-supervision. [2], [12], [13], [14], [15], [16], [17], etc. propose various CNN models for depth estimation from single colour images. When explicit depth ground-truth is unavailable, the training is self-supervised using multiview geometry: the predicted depths and relative poses should maximise the photometric consistency of nearby input images. In this work, we show how the photometric consistency loss can be adapted to ensure consistency between inverse-depth predictions directly, without need for colour images.

III. METHOD

A. Training data

In contrast to many existing approaches, this work is about correcting existing reconstructions, and therefore we assume the existence of 3D meshes of a scene. In addition to the low-quality reconstruction we wish to correct, we also have a high-quality reconstruction of the same scene. For example, we learn to correct 3D reconstructions from depth-maps using laser reconstructions as a reference. To build the meshes we train on, we use the BOR2G [3] system.

The training data consists of 2D views of the meshes. We use the same virtual camera to generate aligned views of the low-quality mesh and the high-quality mesh for each scene. Because a mesh is available, we can generate multiple types of images at each viewpoint. In particular, we generate inverse-depth, normals, mesh triangle area, mesh triangle edge ratio (ratio between the shortest and the longest edge of each triangle), and surface-to-camera angle (Figure 2). We refer to these images as mesh features. The ground-truth labels ($\Delta^{*}$) are computed as the difference in inverse-depth between high-quality and low-quality reconstructions:

$$\Delta^{*}(p) = d^{hq}(p) - d^{lq}(p) \qquad (1)$$

where $p$ is a pixel index, and $d^{hq}$ and $d^{lq}$ are inverse-depth images for the high-quality and low-quality reconstruction, respectively. For notational compactness, $\Delta^{*}(p)$ is referred to as $\Delta^{*}$, and future definitions are over all values of $p$, unless otherwise mentioned.

Inverse-depth is used instead of depth for several reasons. Firstly, it emphasises surfaces close to the camera where more information is available per pixel. Secondly, background (non-surface) pixels are not processed separately – they are assigned a value of zero, corresponding to points infinitely far away from the camera. If we used depth, those pixels would either have to be assigned an arbitrary finite value, which would result in semantic discontinuities in the output, or learnt to be ignored by the network, since there is no standard way to deal with infinite values in a CNN. Finally, another advantage of inverse-depth is that resampling it, which is needed to compute geometric consistency, is simpler than resampling depth images.
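As a minimal sketch of the label computation and of the background convention described above, the following Python converts a depth image to inverse-depth and forms the per-pixel label. The function names, the nested-list image representation, and the use of `None` to mark background pixels are assumptions made for illustration; they are not part of the original pipeline.

```python
def to_inverse_depth(depth):
    """Convert a depth image (nested lists, metres) to inverse-depth.

    Background pixels are marked with None (an assumption of this sketch)
    and map to 0, i.e. a point infinitely far away from the camera.
    """
    return [[0.0 if z is None else 1.0 / z for z in row] for row in depth]


def inverse_depth_label(d_hq, d_lq):
    """Per-pixel label: high-quality minus low-quality inverse-depth."""
    return [[hq - lq for hq, lq in zip(row_hq, row_lq)]
            for row_hq, row_lq in zip(d_hq, d_lq)]


# Toy 1x3 example: a misplaced surface, an agreeing pixel, and a hole
# in the low-quality view that the reference fills in.
d_lq = to_inverse_depth([[4.0, 10.0, None]])
d_hq = to_inverse_depth([[2.0, 10.0, 5.0]])
label = inverse_depth_label(d_hq, d_lq)
```

In this toy case the hole in the low-quality view needs no special handling: its inverse-depth is simply zero, so the label at that pixel equals the reference inverse-depth.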

B. Geometric consistency

Intuitively, since the reconstructions we wish to correct are static, the predictions made from overlapping views should be geometrically consistent. In other words, surfaces that appear in a certain location according to a prediction should appear in the same location in all predictions where they are in view.

We resample nearby predicted views according to the predicted geometry of the current view, and minimise the absolute difference in inverse-depth. This dense warping is similar to reprojecting nearby views into the current view, but has the advantage of being differentiable, and of generating dense images instead of sparse reprojected pointclouds. An illustration of this idea is shown in Figure 1.

Normally, dense warping is used in conjunction with colour images, where the values of the pixels are view-independent. Since we are warping inverse-depth images, where the pixel values depend on the viewpoint, we need to compute the absolute difference in the same camera frame. Concretely, for a target view $t$ and a nearby view $n$, let $\Delta_t$ be a CNN prediction for the target view, let $\Delta_n$ be a prediction for the nearby view, and let $\tilde{d}_t = d^{lq}_t + \Delta_t$ and $\tilde{d}_n = d^{lq}_n + \Delta_n$ be the corrected inverse-depth images for the two views. Furthermore, let $p_t$ be pixel coordinates in the target view, $K$ the intrinsic matrix of the virtual camera, and $T_{n,t}$ the SE(3) transform from view $t$ to view $n$. We can then define the geometric inconsistency $|\tilde{d}_{t \to n} - \tilde{d}_{n \to t}|$, where $\tilde{d}_{t \to n}$ is the predicted inverse-depth from view $t$ in the frame of view $n$ and $\tilde{d}_{n \to t}$ is the warped inverse-depth from view $n$. They are defined as follows:

$$\tilde{d}_{t \to n}(p_t) = \frac{x_n^{(w)}}{x_n^{(z)}}, \qquad \tilde{d}_{n \to t}(p_t) = \tilde{d}_n \langle p_n \rangle$$

where $x_n$ is the 3D homogeneous point in view $n$ corresponding to pixel $p_t$, and $p_n$ its projection:

$$x_n = T_{n,t} \begin{bmatrix} K^{-1} p_t \\ \tilde{d}_t(p_t) \end{bmatrix}, \qquad p_n = \frac{K \, x_n^{(x,y,z)}}{x_n^{(z)}}$$

Here $p_t$ is taken in homogeneous form $(u, v, 1)^\top$, and $\langle \cdot \rangle$ denotes sampling an image at a non-integer location. The sample pixel $p_n$ does not necessarily have integer coordinates, and thus may lie in-between pixels in the $\tilde{d}_n$ grid. Note that we add a small value with the same sign as $x_n^{(z)}$ in Equations 3 and 4 to avoid dividing by zero. Under the mild assumption that surfaces between pixels are planar, we can sample $\tilde{d}_n$ by linearly interpolating the four pixels nearest to $p_n$ – another advantage of the inverse-depth formulation.
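The warping described in this section can be sketched for a single pixel in plain Python. This is an illustrative sketch under stated assumptions rather than the training implementation: a pinhole camera given as a hypothetical tuple `(fx, fy, cx, cy)`, the transform from view t to view n as a 4x4 nested list, and images as nested lists indexed `[row][column]`. The 3D point is kept in homogeneous form with the inverse-depth as its last coordinate, which is one way to keep the computation finite for background pixels whose inverse-depth is zero.

```python
import math


def reproject(u, v, inv_depth, intrinsics, T_nt, eps=1e-9):
    """Map pixel (u, v) of view t, with its inverse-depth, into view n.

    Returns the pixel coordinates in view n and the inverse-depth of the
    same 3D point expressed in the frame of view n.
    """
    fx, fy, cx, cy = intrinsics
    # Homogeneous point [ray; inverse-depth]: finite even at inverse-depth 0.
    h = [(u - cx) / fx, (v - cy) / fy, 1.0, inv_depth]
    x, y, z, w = (sum(T_nt[r][c] * h[c] for c in range(4)) for r in range(4))
    # Small offset with the same sign as z, to avoid dividing by zero.
    z += math.copysign(eps, z)
    return fx * x / z + cx, fy * y / z + cy, w / z


def bilinear(image, u, v):
    """Linearly interpolate the four pixels nearest to (u, v)."""
    height, width = len(image), len(image[0])
    u0, v0 = math.floor(u), math.floor(v)
    if u0 < 0 or v0 < 0 or u0 + 1 >= width or v0 + 1 >= height:
        return None  # the sample falls outside the image
    a, b = u - u0, v - v0
    top = (1 - a) * image[v0][u0] + a * image[v0][u0 + 1]
    bottom = (1 - a) * image[v0 + 1][u0] + a * image[v0 + 1][u0 + 1]
    return (1 - b) * top + b * bottom


def inconsistency(u, v, d_t, d_n, intrinsics, T_nt):
    """Absolute inverse-depth difference for pixel (u, v) of view t."""
    u_n, v_n, d_t_in_n = reproject(u, v, d_t[v][u], intrinsics, T_nt)
    d_n_warped = bilinear(d_n, u_n, v_n)
    return None if d_n_warped is None else abs(d_t_in_n - d_n_warped)
```

For example, a fronto-parallel plane 2 m in front of view t appears 1 m away from a view n placed 1 m further forward, so corrected inverse-depth images of 0.5 and 1.0 respectively yield an inconsistency close to zero, while any other value in view n is penalised.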

Occlusion masks: A common problem with this approach is that views cannot be consistent in the presence of occlusions. In our setting, however, since the views are synthetic (and therefore the intrinsic and extrinsic parameters are perfectly known), we are able to create occlusion masks and only apply the geometric consistency loss where there are no occlusions.

We compute occlusion masks from the reference high-quality reconstruction. Each mesh triangle is assigned an index by hashing its world frame coordinates. Then, in addition to mesh features, an image with mesh triangle indices is generated at each location. For each pair of views, each pixel of the triangle index image is resampled from

one view into the other in a similar fashion to the inverse-depth above. Instead of interpolating between the four nearest neighbours, these are returned as four separate samples. The pixels where the indices match at least one of the four samples are considered unoccluded, and the pixels where they do not are considered occluded (see Figure 3).
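The matching rule for occlusion masks can be illustrated with a short Python sketch. It assumes that triangle indices are stored as integers in a nested-list image and that the reprojected coordinates `(u_n, v_n)` of the target pixel in the nearby view have already been computed; both the function names and this data layout are hypothetical.

```python
import math


def four_nearest(index_image, u, v):
    """Return the triangle indices at the four pixels surrounding (u, v).

    Neighbours that fall outside the image are skipped, so fewer than
    four samples may be returned near the border.
    """
    height, width = len(index_image), len(index_image[0])
    u0, v0 = math.floor(u), math.floor(v)
    return [index_image[r][c]
            for r in (v0, v0 + 1) for c in (u0, u0 + 1)
            if 0 <= r < height and 0 <= c < width]


def is_unoccluded(target_index, nearby_indices, u_n, v_n):
    """A pixel is unoccluded if its triangle matches any of the samples."""
    return target_index in four_nearest(nearby_indices, u_n, v_n)


# Toy index image of a nearby view: triangles 7, 8 and 9 are visible.
nearby = [[7, 7, 9],
          [7, 8, 9]]
visible = is_unoccluded(8, nearby, 0.5, 0.5)   # triangle 8 is among the samples
hidden = is_unoccluded(9, nearby, 0.5, 0.5)    # triangle 9 is not
```

Keeping the four samples separate, rather than interpolating them, matters because triangle indices are categorical: averaging two hashes would not produce a meaningful index.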

Small errors appear due to rasterisation when mesh triangles are too small, often those that are far away from the camera. We could avoid these errors by generating occlusion masks with the OpenGL rasterisation pipeline when the rest of the data is generated. However, this greatly increases the space required to store training data – the number of occlusion masks scales quadratically with the number of views we want to enforce consistency between. Generating the occlusion masks on the fly means that geometric consistency can be enforced between arbitrary views, so more settings can be explored without regenerating part of the training data. In Section IV we show that the quality of the occlusion masks is sufficient to demonstrate the advantages of geometric consistency.

C. Model

1) Network architecture: The model used is an encoder-decoder CNN similar to the one proposed in [1]. The encoder is composed of residual blocks based on the ResNet-50 architecture [18], and the decoder uses up-convolutions proposed by [19]. U-Net [20] style skip connections are added between the encoder and the decoder to improve the sharpness of predictions. As a simple way to offer our model some introspective capabilities, we predict a soft attention mask (with values in [0, 1]) in addition to the error in inverse-depth. This mask is multiplied pixel-wise with the error prediction to modulate which parts are going to be used and does not require extra supervision. Table I provides an overview for each of the layers of the proposed CNN.
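To make the role of the attention mask concrete, the sketch below applies a masked correction to a few pixels in plain Python. It assumes the mask is obtained by passing a raw network output through a sigmoid, which is one common way to produce a soft mask; the text does not state how the mask is squashed, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_correction(d_lq, error, mask_logit):
    """Add the mask-modulated error prediction to the input inverse-depth.

    All arguments are flat lists with one entry per pixel.
    """
    return [d + sigmoid(m) * e for d, e, m in zip(d_lq, error, mask_logit)]


# Two pixels with the same predicted error: the second one is masked out.
corrected = apply_correction([0.5, 0.5], [0.2, 0.2], [0.0, -50.0])
```

With a strongly negative mask value the correction is suppressed and the pixel keeps its input inverse-depth, which is the behaviour one would want in regions where the model has no basis for a correction.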

2) Loss: The objective function has several components, as follows. The first term is the data loss that minimises the error between the output and the label. To compute this loss, we use the berHu norm [21]. For large errors, this behaves in the same way as an L2 norm. For small errors, where the gradients of L2 become too small to drive the error completely to zero, the L1 norm is used instead. The advantages of this norm have also been observed in [12], [22]. The data loss is defined as follows:

where $p$ is the pixel index, $V$ is the set of valid pixels (to account for missing data in the ground-truth), $W$ is a per-pixel weight detailed in Section III-C.3, $\Delta$ and $\Delta^{*}$ are the prediction and the target, respectively, and $\lVert \cdot \rVert_{berHu}$ is the berHu norm.
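A small Python sketch of a weighted berHu data loss follows. The berHu (reverse Huber) function itself is standard: linear below a cutoff `c` and quadratic above it. Two choices here are assumptions rather than details taken from the text: the cutoff is set to a fraction of the largest absolute residual, as done in [12], and the weighted sum is normalised by the number of valid pixels.

```python
def berhu(x, c):
    """Reverse Huber: L1 for |x| <= c, scaled L2 above the cutoff c."""
    a = abs(x)
    return a if a <= c else (a * a + c * c) / (2.0 * c)


def data_loss(pred, target, weight, valid, c_frac=0.2):
    """Weighted mean berHu loss over valid pixels (flat per-pixel lists).

    c_frac is a hypothetical parameter: the cutoff is c_frac times the
    largest absolute residual among the valid pixels.
    """
    residuals = [p - t for p, t, ok in zip(pred, target, valid) if ok]
    weights = [w for w, ok in zip(weight, valid) if ok]
    if not residuals:
        return 0.0
    c = c_frac * max(abs(r) for r in residuals)
    if c == 0.0:
        return 0.0  # every valid residual is exactly zero
    total = sum(w * berhu(r, c) for w, r in zip(weights, residuals))
    return total / len(residuals)
```

The two branches of `berhu` meet at `|x| = c` with the same value and slope, so the loss stays continuous and differentiable at the cutoff.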

To improve small-scale details and prevent artefacts in the prediction, while also allowing for sharp discontinuities, we also apply a loss on the gradient of the predictions:

We use the Sobel operator [23] to approximate the gradients in the equation above.
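For illustration, the Sobel approximation of image gradients can be written in a few lines of plain Python. The kernels are the standard 3x3 Sobel kernels; the loss form, a mean absolute difference between the gradients of the prediction and of the target, is an assumption of this sketch, and the validity mask is ignored for brevity.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(image, kernel):
    """Correlate the interior of an image (nested lists) with a 3x3 kernel."""
    height, width = len(image), len(image[0])
    return [[sum(kernel[i][j] * image[r + i - 1][c + j - 1]
                 for i in range(3) for j in range(3))
             for c in range(1, width - 1)]
            for r in range(1, height - 1)]


def gradient_loss(pred, target):
    """Mean absolute difference between Sobel gradients (assumed form)."""
    total, count = 0.0, 0
    for kernel in (SOBEL_X, SOBEL_Y):
        grads_pred = sobel(pred, kernel)
        grads_target = sobel(target, kernel)
        for row_p, row_t in zip(grads_pred, grads_target):
            for g_p, g_t in zip(row_p, row_t):
                total += abs(g_p - g_t)
                count += 1
    return total / count if count else 0.0
```

Only interior pixels are evaluated, so the images need to be at least 3x3; a real implementation would typically pad the border instead.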

The geometric consistency loss guides nearby predictions to have the same 3D geometry, and relies on reprojected nearby views. For a target view t, a set of nearby views N, and the set of pixels Un that are unoccluded in a nearby view n (see Figure 3), this loss is defined as:

Note that Un has no relation to the set of valid pixels (V) from the previous losses, since this loss is only computed between predictions. This enables the network to make sensible predictions even in parts of the image which have no valid label.

Finally, we also include an L2 weight regulariser to reduce overfitting by keeping the weights small. The overall objective is thus defined as:

where s is the scale, and the λs are weights for each of the components (see Table II for values).

3) Loss weight: Inspired by the work of [20] on U-Nets, we use a loss-weighting mechanism based on the Euclidean Distance Transform [24] to emphasise edge pixels when regressing to the error in depth. We first extract Canny edges [25] from the ground-truth labels. Based on these edges, we then compute the per-pixel weights as:

where W(p) is the loss weight for pixel p, EDT(p) is the Euclidean Distance Transform at pixel p, and wmin and wmax are the desired range of the per-pixel weight.
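The sketch below shows one way such edge-aware weights could be computed in plain Python. The edge map is assumed to be given as a binary image (for example the output of a Canny detector), the distance transform is a brute-force version suitable only for tiny images, and the mapping from distance to weight is an assumed exponential falloff with a hypothetical scale `sigma`; the mapping used by the authors may differ.

```python
import math


def distance_to_nearest_edge(edges):
    """Brute-force Euclidean distance transform of a binary edge map."""
    edge_pixels = [(r, c) for r, row in enumerate(edges)
                   for c, is_edge in enumerate(row) if is_edge]
    if not edge_pixels:
        return [[math.inf] * len(row) for row in edges]
    return [[min(math.hypot(r - er, c - ec) for er, ec in edge_pixels)
             for c in range(len(row))]
            for r, row in enumerate(edges)]


def edge_weights(edges, w_min, w_max, sigma=1.0):
    """Weights equal to w_max on edges, decaying towards w_min away from them.

    The exponential falloff and sigma are assumptions of this sketch.
    """
    return [[w_min + (w_max - w_min) * math.exp(-d / sigma) for d in row]
            for row in distance_to_nearest_edge(edges)]


# Toy 1x4 edge map with a single edge pixel in the second column.
weights = edge_weights([[0, 1, 0, 0]], w_min=1.0, w_max=5.0)
```

Whatever the exact mapping, the property that matters is monotonicity: the weight is largest on an edge and decreases with distance, staying within the range set by wmin and wmax.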

TABLE II SUMMARY OF HYPERPARAMETERS USED IN SYSTEM

Fig. 4. Gross error correction. For each threshold (Equation 14), we count how many pixels are incorrect in the predictions over the test set. Our full model removes 45.3% of the smaller errors and 77.5% of the gross errors. The baseline model is unable to effectively handle the multi-view setup, and fails to correct gross errors.

IV. EXPERIMENTS

A. Experimental setup

Training and inference: The network is implemented in Python using TensorFlow v1.12. Each model is trained on an Nvidia Titan V GPU. The weights are optimised for 500 000 steps with a batch size of 16 using the Adam [26] optimiser. The learning rate is decayed linearly for the first 120 000 steps. Generating the mesh features takes an average of 52 ms per view using OpenGL on an Nvidia GTX Titan Black, and inference takes an average of 12.5 ms on the Titan V. All the training hyper-parameters are defined in Table II.

Dataset: Three sequences from the KITTI visual odometry (KITTI-VO) dataset were used as the input to the reconstruction pipeline. For each sequence, two reconstructions are built: one from the stereo camera depth-maps, and one from the laser data. Both the meshes are generated with a fixed voxel width of 0.2 m. In the experiments, we show how to learn a correction of the depth-map reconstruction using the laser reconstruction as reference. Using OpenGL Shading Language (GLSL), we create a virtual camera and project each dense reconstruction into mesh features. We sample

TABLE III GENERALISATION CAPABILITY OF DEPTH ERROR CORRECTION

locations along the original trajectory in each sequence every 0.3 m, and at each location we generate mesh features from four different viewpoints. An illustration of the viewpoints is shown in Figure 3. For all experiments, the KITTI-VO sequences 00, 05, and 06 are used, from which a total of 96 728 training examples of size 96 × 288 are generated. We split each of the mesh feature sequences into three distinct parts: the first 80% we use as training data, the next 10% we use for validating hyperparameter choices, and the last 10% we use for evaluation. All three sequences are predominantly in urban environments with small amounts of visible vegetation.
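The contiguous split described above can be sketched as follows. The function name and the list-of-frames representation are hypothetical; the point of the sketch is that each sequence is cut in order rather than shuffled, so the training, validation, and evaluation parts cover different stretches of the trajectory.

```python
def split_sequence(frames, train_pct=80, val_pct=10):
    """Split an ordered sequence into contiguous train/val/test parts."""
    n = len(frames)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = frames[:n_train]
    val = frames[n_train:n_train + n_val]
    test = frames[n_train + n_val:]
    return train, val, test


# Toy sequence of 10 frame identifiers.
train, val, test = split_sequence(list(range(10)))
```

Integer arithmetic is used for the cut points to avoid floating-point rounding; any remainder frames fall into the last part.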

Performance metrics: We use some metrics common in literature for assessing inverse-depth predictions.

Our first metric measures the accuracy of our network’s ability to estimate errors under a given threshold, serving as an indication of how often our estimate is correct. The thresholded accuracy measure is essentially the expectation that a given pixel in V is within a threshold thr of the label:

$$\delta(thr) = \frac{1}{n} \sum_{p \in V} \mathbb{1}\left( \max\left( \frac{d^{hq}(p)}{\tilde{d}(p)}, \frac{\tilde{d}(p)}{d^{hq}(p)} \right) < thr \right) \qquad (14)$$

where $d^{hq}$ is the reference inverse-depth, $\tilde{d}$ is the predicted inverse-depth, $V$ is the set of valid pixels, $n$ is the cardinality of $V$, and $\mathbb{1}(\cdot)$ represents the indicator function. For granularity, we use $thr \in \{1.05, \ldots, 1.25^2, 1.25^3\}$.
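A plain-Python sketch of the thresholded accuracy is given below, using the usual ratio test from the depth-estimation literature. Inputs are flat per-pixel lists. How pixels with zero inverse-depth are treated is not specified in the text; this sketch counts them as incorrect, which is an assumption.

```python
def thresholded_accuracy(d_ref, d_pred, valid, thr):
    """Fraction of valid pixels whose inverse-depth ratio is below thr."""
    hits, n = 0, 0
    for ref, pred, ok in zip(d_ref, d_pred, valid):
        if not ok:
            continue
        n += 1
        # Non-positive values cannot form a ratio; count them as misses.
        if ref > 0 and pred > 0 and max(ref / pred, pred / ref) < thr:
            hits += 1
    return hits / n if n else 0.0


# Three valid pixels with ratios 1.0, 1.1 and 2.0; the fourth is invalid.
d_ref = [1.0, 1.0, 0.5, 2.0]
d_pred = [1.0, 1.1, 1.0, 2.0]
valid = [True, True, True, False]
acc_loose = thresholded_accuracy(d_ref, d_pred, valid, 1.25)
acc_tight = thresholded_accuracy(d_ref, d_pred, valid, 1.05)
```

The same test, applied per pixel, classifies a pixel as correct or incorrect at a given threshold, so one minus this accuracy is the fraction of incorrect pixels.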

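In addition, the mean absolute error (MAE) and root mean square error (RMSE) metrics provide a quantitative measure of per pixel error and are computed as follows:

$$\mathrm{iMAE} = \frac{1}{n} \sum_{p \in V} \left| d^{hq}(p) - \tilde{d}(p) \right| \qquad (15)$$

$$\mathrm{iRMSE} = \sqrt{\frac{1}{n} \sum_{p \in V} \left( d^{hq}(p) - \tilde{d}(p) \right)^2} \qquad (16)$$

where the ‘i’ indicates that the metrics are computed over inverse-depth images.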
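Both error metrics are straightforward to compute; the following sketch does so over flat per-pixel lists of inverse-depth, restricted to valid pixels. The function names are chosen for illustration only.

```python
import math


def imae(d_ref, d_pred, valid):
    """Mean absolute error over valid inverse-depth pixels."""
    errors = [abs(r - p) for r, p, ok in zip(d_ref, d_pred, valid) if ok]
    return sum(errors) / len(errors) if errors else 0.0


def irmse(d_ref, d_pred, valid):
    """Root mean square error over valid inverse-depth pixels."""
    squared = [(r - p) ** 2 for r, p, ok in zip(d_ref, d_pred, valid) if ok]
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0
```

Because RMSE squares each residual before averaging, it is more sensitive than MAE to a small number of large errors.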

B. Gross error correction

We first look at how well our model corrects gross errors in inverse-depth. Equation 14 can be used to classify pixels in an image as either correct or incorrect at a given threshold. Using this method, we count the number of incorrect pixels in our predictions, and compare it to the number of incorrect pixels in the input inverse-depth. As a baseline, we train the model proposed in [1] on our dataset, and compare it to our full model, trained with and without the geometric consistency loss (Figure 4). Our proposed model outperforms the baseline at correcting both small errors (thr = 1.05) as well as larger errors, reducing the number of errors at thr = $1.25^3$ by 77.5%.

C. Generalisation capability

To be useful in practice, the model needs to be able to generalise to new data. For example, separate models could be trained for indoor scenes and outdoor scenes, or other different types of environments, but certainly within the same kind of environment, one model should work well for a variety of scenes.

We evaluate the ability of our proposed model to generalise by training the full model on a subset of the available sequences, and testing it on the rest (excluding the frames used for validation in Section IV-B). Table III shows the data splits and the performance of the models on the test sequences. Our model performs particularly well on sequence 06, where the input reconstruction has very large errors. This highlights the ability of our model to reduce gross errors.

D. Ablation study

To better understand how different components of our model improve the learnt correction, we perform an ablation study. The results in Table IV show that our proposed geometric consistency loss (rows with GC) improves performance at all error scales.

The proposed attention mask (rows with attn) improves the performance in the absence of geometric consistency, but slightly limits the performance especially with larger errors. However, qualitatively (Figure 5), the attention mask allows us to better handle missing training data. In particular, surfaces are not spuriously removed or added in those regions: the model learns to mask those regions out of the correction and keep them as they are.

V. CONCLUSION

In this paper we present a method for correcting gross errors in dense 3D meshes. We extracted paired 2D mesh features from two reconstructions and trained a neural network to predict the difference in inverse-depth between the two. We addressed the issue of overly-smooth predictions with a U-Net architecture and a loss-weighting mechanism that emphasises edges. The geometric consistency of our

TABLE IV DEPTH ERROR CORRECTION ABLATION STUDY RESULTS

Fig. 5. Illustration of how the quality of the predictions changes as different components are added to the system. Each group of three rows shows one example, as follows: (a) Input inverse-depth, ground-truth error, and ground-truth inverse-depth. The shaded areas in the ground-truth error represent missing data in the reference mesh. (b)–(d) Prediction and corrected inverse-depth. (e), (f): Attention mask, prediction, and corrected inverse-depth. The baseline model (b) only learns very rough predictions and is unable to generalise well to the top viewpoint (last example). Our proposed losses help with generalisation across viewpoints, but without skip connections in the network predictions are not very well localised (c). Our model (d)–(f) makes well-localised predictions. The attention mask removes some of the spurious predictions where there is no reference data (e), and geometric consistency further guides this (f).

predictions is improved with a view-synthesis loss that targets inconsistencies. Our experiments show that the proposed method reduces gross errors in inverse-depth views of the mesh by up to 77.5%.

ACKNOWLEDGMENT

The authors would like to acknowledge the support of the UK’s Engineering and Physical Sciences Research Council (EPSRC) through the Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (AIMS) Programme Grant EP/L015897/1. Paul Newman is supported by EPSRC Programme Grant EP/M019918/1. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work.

REFERENCES

[1] M. Tanner, Ș. Săftescu, A. Bewley, and P. Newman, “Meshed up: Learnt error correction in 3D reconstructions,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, May 2018.

[2] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017.

[3] M. Tanner, P. Piniés, L. M. Paz, and P. Newman, “BOR2G: Building optimal regularised reconstructions with GPUs (in cubes),” in Proceedings of the International Conference on Field and Service Robotics (FSR), Toronto, Canada, Jun. 2015.

[4] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in IEEE International Symposium on Mixed and Augmented Reality, 2011, pp. 127–136.

[5] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant CNNs,” in Proceedings of the IEEE International Conference on 3D Vision (3DV), 2017.

[6] A. Eldesokey, M. Felsberg, and F. S. Khan, “Propagating confidences through CNNs for sparse data regression,” in Proceedings of the British Machine Vision Conference (BMVC), 2018.

[7] J. Hua and X. Gong, “A normalized convolutional neural network for guided sparse depth upsampling,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018.

[8] H. Knutsson and C.-F. Westin, “Normalized and differential convolution: Methods for interpolation and filtering of incomplete and uncertain data,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 1993.

[9] H. Kwon, Y.-W. Tai, and S. Lin, “Data-driven depth map refinement via multi-scale sparse representation,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 159–167.

[10] Y. Zhang and T. A. Funkhouser, “Deep depth completion of a single RGB-D image,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 175–185.

[11] J. Jeon and S. Lee, “Reconstruction-based pairwise depth dataset for depth image enhancement using CNN,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[12] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proceedings of the IEEE International Conference on 3D Vision (3DV), 2016, pp. 239–248.

[13] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “DeMoN: Depth and motion network for learning monocular stereo,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5622–5631.

[14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[15] T. Dharmasiri, A. Spek, and T. Drummond, “ENG: End-to-end neural geometry for robust depth and pose estimation using CNNs,” arXiv:1807.05705v2 [cs.CV].

[16] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5667–5675.

[17] M. Klodt and A. Vedaldi, “Supervising the new with the old: Learning SfM from SfM,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[19] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.

[20] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer Assisted Intervention (MICCAI), 2015.

[21] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, pp. 59–72, 2007.

[22] F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2018.

[23] I. Sobel and G. Feldman, “A 3×3 isotropic gradient operator for image processing,” presented at the Stanford Artificial Intelligence Project (SAIL), 1968.

[24] P. F. Felzenszwalb and D. P. Huttenlocher, “Distance transforms of sampled functions,” Theory of Computing, vol. 8, pp. 415–428, 2012.

[25] J. F. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, pp. 679–698, 1986.

[26] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
