Learning to Correct 3D Reconstructions from Multiple Views

2020·Arxiv

Abstract

This paper is about reducing the cost of building good large-scale 3D reconstructions post-hoc. We render 2D views of an existing reconstruction and train a convolutional neural network (CNN) that refines inverse-depth to match a higher-quality reconstruction. Since the views that we correct are rendered from the same reconstruction, they share the same geometry, so overlapping views complement each other. We take advantage of that in two ways. Firstly, we impose a loss during training which guides predictions on neighbouring views to have the same geometry and has been shown to improve performance. Secondly, in contrast to previous work, which corrects each view independently, we also make predictions on sets of neighbouring views jointly. This is achieved by warping feature maps between views and thus bypassing memory-intensive 3D computation. We make the observation that features in the feature maps are viewpoint-dependent, and propose a method for transforming features with dynamic filters generated by a multi-layer perceptron from the relative poses between views. In our experiments we show that this last step is necessary for successfully fusing feature maps between views.

I. INTRODUCTION

Building good dense 3D reconstructions is essential in many robotics tasks, such as surveying, localisation, or planning. Despite numerous advancements in both hardware and algorithms, large-scale reconstructions remain costly to build.

We approach this issue by trying to reduce the data acquisition cost either through the use of cheaper sensors, or by collecting less data. To make up for the cheaper but lower quality data, we have to turn to prior information from the operational environment (e.g. roads and buildings are flat, cars and trees have specific shapes, etc). To learn these priors, we train a CNN over 2D views of 3D reconstructions, and predict refined inverse-depth maps that can be fused back into a refined 3D reconstruction. We take this detour through two dimensions in order to avoid the high memory requirements a volumetric approach over large-scale reconstructions would impose.

While operating in 2D, neighbouring views are related by the underlying geometry. Previous work [1] has leveraged this relation during training, where a geometric consistency loss is imposed between neighbouring views that penalises mismatched geometry. Here, we take another step and explore how neighbouring views can be used together when predicting refined depth, and to that end introduce a method for aggregating feature maps in the CNN.

To fuse feature maps from multiple views, we could either “un-project” them into a common 3D volume or “collect” them into a common target view through reprojection, as proposed by [2]. As un-projecting into 3D re-introduces the limitation we wished to avoid, we take the latter approach.

Fig. 1. Feature map aggregation. In the top two rows we show inverse-depth images for a front view and a top view, along with two feature maps. To aggregate the feature map of the top view with the front view, we first warp the top feature map into the front view. The relative transform between the top and the front view, $T_{\mathrm{front},\mathrm{top}}$, is processed by a multi-layer perceptron (MLP) to generate the weights for a linear transform that maps the features from the top view to the front view. Finally, the resulting feature map can be aggregated with the front view feature map. Note that in the front feature map the features fade from green to violet towards the horizon, while in the warping of the top view, the features do not change with depth. Only after transforming the features do we see them fading towards the horizon. Analogously, we aggregate the front view feature map with the top view one. For visualisation, the above feature maps are projected to three channels using the same random projection.

Directly aggregating feature maps between views – either in a 3D volume or in a common target view – implies features are somewhat independent of viewpoint. To lift this restriction, we propose a method for transforming features between views, enabling us to more easily aggregate feature maps from arbitrary viewpoints. Concretely, we use the relative pose between views to generate a projection matrix in feature space that can be used to transform feature maps, as illustrated in Figure 1.

This paper brings the following contributions:

1) We introduce a method for fusing multi-view data that decouples much of the multi-view geometry from model parameters. Not only do we warp feature maps between views, but we make the key observation that features themselves can be viewpoint-dependent, and show how to transform the feature space between views.

2) We apply this method to the problem of correcting dense 3D meshes. We render 2D views from reconstructions and learn how to refine inverse-depth, while making use of multi-view information.

In our experiments, we look at two ways of aggregating feature maps, and conclude that the feature space transformation is necessary to benefit from the use of multiple views when correcting reconstructions.

II. RELATED WORK

Our work focuses on refining the output of an existing 3D reconstruction system such as BOR2G [3] or KinectFusion [4], thus producing higher-quality reconstructions. Since we achieve this by operating on 2D projections and refining inverse-depth, our work is related to depth refinement as well. In the following we summarise some of the related literature and methods used in this work.

Mesh correction: Tanner et al. [5] first propose fixing 3D reconstructions by refining 2D projections of them with a CNN, one at a time. The geometrical relation between neighbouring views is leveraged in [1] during training, by the addition of a geometric consistency loss that penalises differences in geometry. In this work, we process neighbouring views jointly not only while training, but also when making predictions.

Learnt depth refinement and completion: There are several depth refinement methods similar to our approach. [6] fuse multiple depth maps with KinectFusion [4] to obtain a high-quality reference mesh, and use dictionary learning to refine raw RGB-D images. Using a CNN on the colour channels of an RGB-D image, [7] predict normals and occlusion boundaries and use them to optimise the depth component, filling in holes. [8] render depth images from a reconstruction at the same locations as the raw depth, obtaining a 4000-image dataset of raw/clean depth image pairs. The authors train a CNN to refine the raw depth maps, and show that using it reduces the amount of data and time needed to build 3D reconstructions. All these methods require a colour image in order to refine depth, and operate on live data, which limits the amount of training data available. Our method is designed to operate post-hoc, on existing meshes. We can therefore generate an arbitrary number of training pairs from any viewpoint, removing any viewpoint-specific bias that might otherwise surface while learning.

Another recent approach proposes depth refinement by fusing feature maps of neighbouring views through warping [2]. While this is similar to our approach, we take the additional step of transforming the features between views, and consider two feature aggregation methods.

Dynamic filter networks: Generating filters for convolutions dynamically, conditioned on network inputs, is presented by [9], where the method is used to predict filters for local spatial transforms that help in video prediction tasks. Our feature transformation is also based on this framework: given a relative transform between two views, we predict the weights that would transform features from one view to another. A key distinction is that, while the filters in the original work are demonstrated over the spatial domain, we operate solely on the channels of the feature map with a 1×1 convolution.

III. METHOD

A. Training Data

Our main goal is to correct existing dense 3D reconstructions. To bypass the need for expensive 3D computation, we operate on 2D projections of a mesh from multiple viewpoints. As we want to capture as much of the 3D geometry as possible in our projections, we render several mesh features for each viewpoint: inverse-depth, colour, normals, and triangle surface area (see Figure 2).

During training, we have access to two reconstructions of the same scene: a low-quality one that we learn to correct, and a high-quality one that we use for supervision. In particular, we learn to correct stereo-camera reconstructions using lidar reconstructions as supervision. Figure 3 shows an overview of our method.

The ground-truth labels are computed as the difference in inverse-depth between high-quality and low-quality reconstructions:

$$\Delta^{*}(p) = d^{hq}(p) - d^{lq}(p), \tag{1}$$

where $p$ is a pixel index, and $d^{hq}$ and $d^{lq}$ are inverse-depth images for the high-quality and low-quality reconstruction, respectively. For notational compactness, $\Delta^{*}(p)$ is referred to as $\Delta^{*}$, and future definitions are over all values of $p$, unless otherwise specified.

There are several advantages to using inverse-depth. Firstly, geometry closer to the camera will have higher values and therefore these areas will be emphasised during training. Secondly, inverse-depth smoothly fades away from the camera, such that the background – that has no geometry and is infinitely far away – has a value of zero. If we were to use depth, we would have to treat background as a special case, since neural networks are not equipped to deal with infinite values out-of-the-box. Finally, when warping images from one viewpoint to another, as described in the following sections, we are, in essence, re-sampling. To correctly interpolate depth values, we would have to use a harmonic mean, which is less numerically stable, whereas interpolating inverse-depth can be done linearly.
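The interpolation argument can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and it only illustrates that a linear blend of inverse-depth values matches a weighted harmonic mean of the corresponding depths, and that background (inverse-depth zero) needs no special handling.

```python
import numpy as np

def residual_label(d_hq, d_lq):
    """Per-pixel training target: high-quality minus low-quality inverse-depth."""
    return np.asarray(d_hq, dtype=np.float64) - np.asarray(d_lq, dtype=np.float64)

def interp_inverse_depth(d0, d1, w):
    """Linear blend of two inverse-depth samples, with weight w on the second."""
    return (1.0 - w) * d0 + w * d1

def interp_depth(z0, z1, w):
    """The same blend written in depth space: a weighted harmonic mean."""
    return 1.0 / ((1.0 - w) / z0 + w / z1)

# Two surface samples at 2 m and 10 m, blended half-way.
z0, z1, w = 2.0, 10.0, 0.5
d_mid = interp_inverse_depth(1.0 / z0, 1.0 / z1, w)
z_mid = interp_depth(z0, z1, w)

# Background has inverse-depth 0, so blending towards it stays finite.
d_edge = interp_inverse_depth(1.0 / z0, 0.0, w)
```

Both routes describe the same surface point (`1 / d_mid` equals `z_mid`), but only the inverse-depth form is a plain weighted sum.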

B. Image Warping

During both training and prediction, we need to fuse information from neighbouring views. While training, we want to penalise the network for making predictions that are geometrically inconsistent between views, using the geometric consistency loss from [1], described in Section III-C.4. When making predictions, we want to be able to aggregate information from multiple views. To enable this, we need to warp images between viewpoints, such that corresponding pixels are aligned.

Fig. 2. Example of training data generated. Each column represents a different view rendered around the same location. The top row shows the inverse-depth images rendered from the lidar reconstruction, with areas visible in the other views shaded: red for left, green for right, blue for back, and cyan for top. The next four rows show the mesh features we render from the stereo camera reconstruction: inverse-depth, colour, normals, and triangle surface area. Our proposed model learns to refine the low-quality inverse-depth (second row), using the rendered mesh features (rows 2–5) as input, processing all four views jointly, and supervised by the high-quality inverse-depth label (first row).

Consider a view $t$, an inverse-depth image $d_t$, and a pixel location within the image, $p_t = [u \; v \; 1]^{\top}$. The homogeneous 3D point corresponding to $p_t$ is:

$$x_t = \begin{bmatrix} K^{-1} p_t \\ d_t(p_t) \end{bmatrix}, \tag{2}$$

where $K$ is the camera intrinsic matrix. Consider further a nearby view $n$, an image $I_n$, and the relative transform $T_{n,t} \in SE(3)$ from $t$ to $n$ that maps 3D points between views: $x_n = T_{n,t} x_t$. The pixel in view $n$ corresponding to $p_t$ is:

$$p_n = K \begin{bmatrix} x_n^{(1)} / x_n^{(3)} & x_n^{(2)} / x_n^{(3)} & 1 \end{bmatrix}^{\top}, \tag{3}$$

where the superscript indexes into the vector $x_n$. We can now warp image $I_n$ into view $t$:

$$I_{t \leftarrow n}(p_t) = I_n(p_n). \tag{4}$$

Note here that $p_n$ might not have integer values, and therefore might not lie exactly on the image grid of $I_n$. In that case, we linearly interpolate between the nearest pixels.

Since the value of inverse-depth is view-dependent, when warping inverse-depth images we make the following additional definition:

$$\tilde{d}_{t \leftarrow n}(p_t) = \frac{x_n^{(4)}}{x_n^{(3)}}, \tag{5}$$

which represents an image aligned with view $t$, with inverse-depth values in the frame of $n$.
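The warping procedure of this section can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`unproject`, `project`, `bilinear_sample`) are hypothetical, images are single-channel, pixels that fall outside the source image are set to zero, and the 1-based vector indices of the text become 0-based array indices.

```python
import numpy as np

def unproject(K, d_t):
    """Homogeneous points [K^-1 p_t ; d_t(p_t)] for every pixel, shape (4, H, W)."""
    H, W = d_t.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    p = np.stack([u, v, np.ones_like(u)])            # homogeneous pixel coordinates
    rays = np.einsum('ij,jhw->ihw', np.linalg.inv(K), p)
    return np.concatenate([rays, d_t[None]], axis=0)

def project(K, T_nt, x_t):
    """Return pixel coordinates in view n, shape (2, H, W), and inverse-depth in frame n."""
    x_n = np.einsum('ij,jhw->ihw', T_nt, x_t)
    z = x_n[2]                                        # third component (index 3 in the text)
    p_n = np.einsum('ij,jhw->ihw', K,
                    np.stack([x_n[0] / z, x_n[1] / z, np.ones_like(z)]))
    return p_n[:2], x_n[3] / z

def bilinear_sample(img, uv):
    """Sample img at real-valued (u, v); contributions from outside the image are dropped."""
    H, W = img.shape
    u, v = uv
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    a, b = u - u0, v - v0
    out = np.zeros_like(u, dtype=np.float64)
    corners = [(0, 0, (1 - a) * (1 - b)), (1, 0, a * (1 - b)),
               (0, 1, (1 - a) * b), (1, 1, a * b)]
    for du, dv, wgt in corners:
        uu, vv = u0 + du, v0 + dv
        ok = (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
        out[ok] += wgt[ok] * img[vv[ok], uu[ok]]
    return out

# Toy example: a fronto-parallel plane 10 m away; frame n sees points 5 m further along z.
K = np.array([[100.0, 0.0, 4.0], [0.0, 100.0, 3.0], [0.0, 0.0, 1.0]])
d_t = np.full((6, 8), 0.1)
T_nt = np.eye(4)
T_nt[2, 3] = 5.0
p_n, d_tilde = project(K, T_nt, unproject(K, d_t))
```

In this toy case every point moves from 10 m to 15 m, so `d_tilde` is 1/15 everywhere, and `bilinear_sample(I_n, p_n)` would give the image of view n aligned with view t.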

Occlusions: Pixel correspondences computed through warping are only valid where there are no occlusions. We therefore need a mask to only take into account unoccluded regions. When rendering mesh features, we also render an additional image where every pixel is assigned the ID of the visible mesh triangle at that location. The triangle ID is computed by hashing the global coordinates of its vertices. We can then warp this image of triangle IDs from the source to the target view. If the ID of a pixel matches between the warped source and the target image, we know that the same surface is in view in both images, and thus that pixel is unoccluded.
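The triangle-ID test described above can be sketched as follows. This is a hypothetical illustration: it assumes IDs are non-negative integers with -1 marking background, and it samples the source ID image with nearest-neighbour lookup (an assumption, made because IDs cannot be meaningfully interpolated).

```python
import numpy as np

def unoccluded_mask(ids_t, ids_n, p_n):
    """True where the triangle visible at a pixel of view t is also the one seen in view n.

    ids_t, ids_n: integer triangle-ID images of the target and source view.
    p_n: (2, H, W) array of (u, v) coordinates in view n for every pixel of view t.
    """
    H, W = ids_n.shape
    u = np.rint(p_n[0]).astype(int)
    v = np.rint(p_n[1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    warped = np.full(ids_t.shape, -1, dtype=np.int64)   # -1 means "no triangle"
    warped[inside] = ids_n[v[inside], u[inside]]
    return inside & (ids_t >= 0) & (warped == ids_t)

# Toy example with an identity warp: one pixel shows a different triangle in view n.
ids_t = np.array([[1, 2], [3, 4]])
ids_n = np.array([[1, 9], [3, 4]])
v_grid, u_grid = np.mgrid[0:2, 0:2].astype(np.float64)
mask = unoccluded_mask(ids_t, ids_n, np.stack([u_grid, v_grid]))
```

Pixels that project outside the source image are treated as occluded, since nothing can be said about them.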

TABLE I OVERVIEW OF THE CNN ARCHITECTURE FOR ERROR PREDICTION

C. Network architecture

1) Model: We use an encoder-decoder architecture, with asymmetric ResNet [10] blocks, the sub-pixel convolutional layers proposed in [11] for upsampling in the decoder, and skip connections between the encoder and the decoder to improve sharpness, as introduced by the U-Net architecture [12]. Throughout the network, we use ELU [13] activations and group normalisation [14]. Table I details the blocks used in our network.

Fig. 3. Illustration of our training set-up. Starting with a low-quality reconstruction (stereo camera in this instance), we extract mesh features from several viewpoints. Our network learns to refine the input inverse-depth with supervision from a high-quality (lidar) reconstruction. The blue arrows above indicate the losses used during training: for each view, we regress to the high-quality inverse-depth, as well as to its gradient; between nearby predictions, we apply a geometric consistency loss to encourage predictions with the same geometry. Within the network, feature maps are aggregated between views so information can propagate within a neighbourhood of views.

Since a fair portion of the input low-quality reconstruction is already correct, we train our model to predict the error in the input inverse-depth, $\Delta$. We then compute the refined inverse-depth from the output of our network:

$$d = \max(d^{lq} + \Delta, 0). \tag{6}$$

Clipping is required here because inverse-depth cannot be negative. However, since we are supervising the predicted error, the network can learn even when the predicted inverse-depth is clipped and would therefore lack a gradient. To ensure our network can deal with any range of inverse-depth, we offset the input such that it has zero mean, scale it to have a standard deviation of 1, and undo the scaling on the predicted error $\Delta$.
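The input normalisation and output clipping described above can be sketched as below. This is a minimal sketch under assumptions: `predict_error` is a hypothetical stand-in for the CNN, the statistics are computed per image, and the small `eps` guarding against a zero standard deviation is not from the paper.

```python
import numpy as np

def normalise(d_lq, eps=1e-8):
    """Offset to zero mean and scale to unit standard deviation; also return the scale."""
    scale = d_lq.std() + eps
    return (d_lq - d_lq.mean()) / scale, scale

def refine(d_lq, predict_error):
    """Apply a predicted correction to the input inverse-depth and clip at zero."""
    x, scale = normalise(d_lq)
    delta = predict_error(x) * scale          # undo the scaling on the predicted error
    return np.maximum(d_lq + delta, 0.0), delta

# A stand-in "network" that always predicts a large negative error.
d_lq = np.array([0.1, 0.2, 0.4])
d_refined, delta = refine(d_lq, lambda x: np.full_like(x, -10.0))
```

Here `d_refined` is clipped to zero everywhere while `delta` stays informative, which is why supervising the error rather than the clipped output keeps a usable training signal.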

2) Feature Map Warping and Aggregation: As our predictions are related by the 3D geometry of a scene, we would like to ensure predictions are consistent between views. This is taken into account during training by using the geometric consistency loss from [1], as described in Section III-C.4.

However, during inference we would like to aggregate information from multiple views to improve predictions. Take, for example, two views, $t$ and $n$, and feature maps in the network, $F_t$ and $F_n$, after a certain number of layers, corresponding to each of the views. We would like to aggregate them such that $F_t \oplus F_n$ is a feature map containing information from both views. Since the feature maps are aligned with their respective input views, we cannot do that pixel-wise. Using the input depth, we warp the feature map of one view into the frame of the other, such that the input geometry is aligned. For the two input views, we can thus warp $F_n$ to the viewpoint of $t$, and then aggregate the two feature maps: $\bar{F}_t = F_t \oplus F_{t \leftarrow n}$, obtaining a feature map aligned with $t$ that combines information from both views.

The aggregation step is necessary (instead of simply concatenating aligned feature maps) to allow for an arbitrary number of views. For the same reason, the aggregation function needs to be invariant to permutations of views. We consider two such aggregation functions: averaging, and attention. For a target view $t$ with neighbourhood $N = \{n_1, n_2, \ldots\}$, averaging is defined as $\bar{F}_t = \frac{1}{|N| + 1} \sum_{n \in N \cup \{t\}} F_{t \leftarrow n}$, where we consider $F_{t \leftarrow t} = F_t$.

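For the attention-based aggregation, we use the attention method proposed by Bahdanau et al. [15]. A per-pixel score $E_{t \leftarrow n} = a(F_t, F_{t \leftarrow n})$ is computed using a small 3-layer convolutional sub-network. The per-pixel weight of each view is obtained by applying softmax to the scores: $A_{t \leftarrow n} = \exp(E_{t \leftarrow n}) / \sum_{m \in N \cup \{t\}} \exp(E_{t \leftarrow m})$. Finally, the aggregation function becomes $\bar{F}_t = \sum_{n \in N \cup \{t\}} A_{t \leftarrow n} \odot F_{t \leftarrow n}$. In both cases, pixels deemed occluded are masked out.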
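The two permutation-invariant aggregation functions can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the attention scores are passed in directly rather than produced by a convolutional sub-network, and the target view itself is assumed to be unmasked at every pixel so that each pixel has at least one contributing view.

```python
import numpy as np

def aggregate_mean(feats, masks):
    """Masked per-pixel average.

    feats: (V, C, H, W) feature maps already warped into the target view.
    masks: (V, H, W) booleans, False where a view is occluded at that pixel.
    """
    m = masks[:, None].astype(feats.dtype)
    return (feats * m).sum(axis=0) / np.maximum(m.sum(axis=0), 1.0)

def aggregate_attention(feats, scores, masks):
    """Masked softmax over views of per-pixel scores (V, H, W), then a weighted sum."""
    s = np.where(masks, scores, -np.inf)          # occluded views get zero weight
    s = s - s.max(axis=0, keepdims=True)          # stabilise the softmax
    w = np.exp(s)
    w = w / w.sum(axis=0, keepdims=True)
    return (feats * w[:, None]).sum(axis=0), w

# Toy example with three views; view 0 plays the role of the target view.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 2, 4, 5))
scores = rng.normal(size=(3, 4, 5))
masks = rng.random((3, 4, 5)) > 0.3
masks[0] = True
fused_mean = aggregate_mean(feats, masks)
fused_att, weights = aggregate_attention(feats, scores, masks)
```

Both functions reduce over the view axis only, so reordering the views leaves the result unchanged, and with equal scores the attention variant falls back to the masked average.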

In our model, we apply this warping and aggregation to every skip connection and to the encoder output, thus mixing information across views and scales. For every view in a batch, we aggregate all the other overlapping views within the batch.

3) Feature Space Reprojection: When the network is trained to make predictions one input view at a time, the feature maps at intermediate layers within the network contain viewpoint-dependent features, as illustrated in Figure 1. Warping a feature map from one view to another only aligns features using the scene geometry, but does not change their dependence on viewpoint.

Imagine, for example, two RGB-D images in an urban scene – one from above looking down, and one at the road level. The surface of the road will have the same colour in both views. The depth of the road, however, will be different: in the top view, it will be mostly uniform, while in the road-level view it will increase towards the horizon. We can easily establish correspondences between pixels in the two images if we know their relative pose. When not occluded, corresponding pixels will have the same colour, but not necessarily the same depth. This is because colour is viewpoint-independent, while depth depends on the viewpoint. We know, however, how to compute one of the depth values, given the corresponding pixel in the other view and the relative pose. Armed with this observation, we propose a way to learn a mapping of features between viewpoints.

Consider again two views $t$ and $n$, and a warped feature map $F_{t \leftarrow n}$ from the vantage point $n$. Given a spatial location $p$, we have a feature vector $f_n = F_{t \leftarrow n}(p)$, its corresponding location in space $x_n$, and a transform $T_{t,n} \in SE(3)$. We compute a matrix $W = g(T_{t,n})$, using a small multi-layer perceptron (MLP) to model $g$. This allows us to learn a linear transform in feature space between view $n$ and view $t$:

$$f_t = W f_n. \tag{7}$$

Without this mechanism, the network would have to learn to extract viewpoint-independent features to allow for feature aggregation between views.

Concretely, we implement this as a dynamic filter network (DFN), with a 4-layer MLP generating filters for a 1×1 linear convolution of the warped feature map $F_{t \leftarrow n}$. To keep the MLP small, we first project the input feature maps to $D = 32$ dimensions, apply the dynamically generated filters, and then project back to the desired number of features. We use the same transformation $W$ across each of the scales that we aggregate.
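A minimal NumPy sketch of this pose-conditioned 1×1 filtering is given below. Several details are assumptions made for illustration and are not specified in the text: the pose is encoded as the 12 entries of its top 3×4 block, the hidden layers have 64 units with ELU activations, the weights are random rather than trained, and all function names are hypothetical.

```python
import numpy as np

D = 32  # reduced feature dimension, as stated in the text

def init_mlp(rng, sizes):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def generate_filters(mlp, T_tn):
    """Map a 4x4 relative pose to a D x D matrix of 1x1 convolution weights."""
    h = T_tn[:3, :].reshape(-1)                   # assumed pose encoding (12 values)
    for k, (Wk, bk) in enumerate(mlp):
        h = h @ Wk + bk
        if k < len(mlp) - 1:
            h = elu(h)
    return h.reshape(D, D)

def transform_features(F_warp, P_in, W_dyn, P_out):
    """Project (C, H, W) features to D channels, apply W_dyn per pixel, project back."""
    f = np.einsum('dc,chw->dhw', P_in, F_warp)
    f = np.einsum('ed,dhw->ehw', W_dyn, f)
    return np.einsum('cd,dhw->chw', P_out, f)

# Toy usage: a 4-layer MLP and a 48-channel warped feature map.
rng = np.random.default_rng(0)
mlp = init_mlp(rng, [12, 64, 64, 64, D * D])
C = 48
P_in, P_out = rng.normal(size=(D, C)), rng.normal(size=(C, D))
T_tn = np.eye(4)
T_tn[:3, 3] = [1.0, 0.0, 2.0]
W_dyn = generate_filters(mlp, T_tn)
F_warp = rng.normal(size=(C, 6, 8))
F_out = transform_features(F_warp, P_in, W_dyn, P_out)
```

Because the filters act on channels only, the same `W_dyn` can be applied to feature maps of any spatial resolution, which is what allows one transformation to be shared across scales.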

In the experiments, we show that this mechanism is essential for enabling effective multi-view aggregation.

Fig. 4. Illustration of the training (blue), validation (orange), and test (green) splits on the three KITTI-VO sequences we are using. Map data copyrighted OpenStreetMap [16] contributors and available from https://www.openstreetmap.org.

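4) Losses: We supervise our training with labels from a high-quality reconstruction, as shown in Figure 3. The labels provide two per-pixel supervision signals, one for direct regression, $\mathcal{L}_{\mathrm{data}}$, and one for prediction gradients, $\mathcal{L}_{\nabla}$:

$$\mathcal{L}_{\mathrm{data}} = \frac{1}{|V|} \sum_{p \in V} \lVert \Delta(p) - \Delta^{*}(p) \rVert_{\mathrm{berHu}}, \tag{8}$$

$$\mathcal{L}_{\nabla} = \frac{1}{|V|} \sum_{p \in V} \left( \lVert \partial_x \Delta(p) - \partial_x \Delta^{*}(p) \rVert_{\mathrm{berHu}} + \lVert \partial_y \Delta(p) - \partial_y \Delta^{*}(p) \rVert_{\mathrm{berHu}} \right), \tag{9}$$

where $V$ is the set of valid pixels (to account for missing data in the ground-truth), $\Delta$ and $\Delta^{*}$ are the prediction and the target, respectively, and $\lVert \cdot \rVert_{\mathrm{berHu}}$ is the berHu norm [17], whose advantages for depth prediction have been explored by [18], [19]. We use the Sobel operator [20] to approximate the gradients in Equation 9.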
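The two supervised terms can be sketched as below. This is a sketch under assumptions rather than the authors' implementation: the berHu threshold `c` is passed explicitly (how it is chosen is not stated here), both terms are averaged over valid pixels, border pixels are dropped in the gradient term, and the function names are hypothetical.

```python
import numpy as np

def berhu(x, c):
    """Reverse Huber: absolute value below the threshold c > 0, quadratic above it."""
    a = np.abs(x)
    return np.where(a <= c, a, (a * a + c * c) / (2.0 * c))

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])

def filter3x3(img, k):
    """3x3 cross-correlation without padding; the output is (H - 2, W - 2)."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + H - 2, j:j + W - 2]
    return out

def data_loss(pred, target, valid, c):
    """Mean berHu penalty on the predicted error over valid pixels."""
    return berhu(pred - target, c)[valid].mean()

def grad_loss(pred, target, valid, c):
    """Mean berHu penalty on Sobel gradients of the difference (Sobel is linear)."""
    diff = pred - target
    gx = filter3x3(diff, SOBEL_X)
    gy = filter3x3(diff, SOBEL_X.T)
    inner = valid[1:-1, 1:-1]
    return (berhu(gx, c) + berhu(gy, c))[inner].mean()

# Toy example: a horizontal ramp has a constant Sobel response in x and none in y.
ramp = np.tile(np.arange(6.0), (5, 1))
valid = np.ones((5, 6), dtype=bool)
gx_ramp = filter3x3(ramp, SOBEL_X)
```

The two branches of `berhu` meet at `c`, so the penalty is continuous while growing quadratically for large residuals.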

The geometric consistency loss guides nearby predictions to have the same 3D geometry, and relies on warped nearby views $d_{t \leftarrow n}$. For a target view $t$, a set of nearby views $N$, and the set of pixels unoccluded in a nearby view, $U_n$ (see Figure 2, top row), this loss is defined as:

$$\mathcal{L}_{\mathrm{gc}} = \sum_{n \in N} \frac{1}{|U_n|} \sum_{p \in U_n} \left| d_{t \leftarrow n}(p) - \tilde{d}_{t \leftarrow n}(p) \right|. \tag{10}$$

Both $d_{t \leftarrow n}$ and $\tilde{d}_{t \leftarrow n}$ are aligned with view $t$ and contain inverse-depth values in frame $n$, as per Equation 4 and Equation 5. Note that $U_n$ has no relation to the set of valid pixels ($V$) from the previous losses, since this loss is only computed between predictions. This enables the network to make sensible predictions even in parts of the image which have no valid label.
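A consistency term of this kind can be sketched as follows. The exact distance and normalisation are assumptions made for illustration (an absolute difference averaged over the unoccluded pixels of each neighbour), and the function name is hypothetical.

```python
import numpy as np

def geometric_consistency(d_warped, d_tilde, unoccluded):
    """Sum over neighbours of the mean absolute disagreement on unoccluded pixels.

    d_warped[k]:   neighbour k's predicted inverse-depth, warped into the target view.
    d_tilde[k]:    the target prediction expressed in neighbour k's frame.
    unoccluded[k]: boolean mask of pixels visible in both views.
    """
    total = 0.0
    for dw, dt, mask in zip(d_warped, d_tilde, unoccluded):
        if mask.any():                      # skip neighbours with no overlap
            total += np.abs(dw - dt)[mask].mean()
    return total

# Toy example with two neighbours that disagree by 0.1 and 0.2 respectively.
a = np.array([[0.1, 0.2], [0.3, 0.4]])
U = np.array([[True, False], [True, True]])
loss = geometric_consistency([a, a], [a + 0.1, a - 0.2], [U, U])
```

No ground-truth label enters this term, so it can be evaluated on any pixel that two predictions share.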

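Finally, we also include an L2 weight regulariser, $\mathcal{L}_{\mathrm{reg}}$, to reduce overfitting by keeping the weights small. The overall objective is thus defined as:

$$\mathcal{L} = \lambda_{\mathrm{data}} \mathcal{L}_{\mathrm{data}} + \lambda_{\nabla} \mathcal{L}_{\nabla} + \lambda_{\mathrm{gc}} \mathcal{L}_{\mathrm{gc}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}}, \tag{11}$$

where the $\lambda$s are weights for each of the components. We use 1, 1, and 10.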

IV. EXPERIMENTS

A. Experimental Setup

1) Dataset: For the experiments, we use sequences “00”, “05” and “06” from the KITTI visual odometry (KITTI-VO) dataset [21]. Using the BOR2G reconstruction system [3], we create pairs of low/high quality reconstructions (meshes) from the stereo camera and lidar, respectively. Following the same trajectory used when collecting data (as it is collision-free), every 0.65 m we render mesh features from four views (left, right, back, top), illustrated in Figure 2. For each view, we render a further 3 samples with small pose perturbations for data augmentation. In total, we obtain 178 544 distinct views of size 96 × 288 over 7.2 km.
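As a rough consistency check on these figures, the short computation below assumes that every rendering location contributes four views with four samples each (the original pose plus three perturbations). That grouping is our reading of the text rather than something it states explicitly.

```python
views_total = 178_544
views_per_location = 4        # left, right, back, top
samples_per_view = 1 + 3      # unperturbed pose plus three perturbed copies
spacing_m = 0.65

renders_per_location = views_per_location * samples_per_view
locations = views_total // renders_per_location
distance_km = locations * spacing_m / 1000.0
```

Under this assumption the total divides evenly into 11 159 locations, or roughly 7.25 km of trajectory, in line with the stated 7.2 km.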

2) Training and inference: We train all our models on Nvidia Titan V GPUs, using the Adam optimiser [22], with 999, and a learning rate that decays linearly from 10to 5 10over 120 000 training steps. We clip the gradient norm to 80. Each training batch contains 4 different examples, and each example is composed of the four views rendered around a single location. Unless otherwise mentioned, we train our models for 500 000 steps. During inference, our full model runs at 11.3 Hz when aggregating 4 input views, compared to the baseline that runs at 12.6 Hz, so our method comes with little computational overhead.

3) Metrics: As our method operates on 2D views extracted from the mesh we are correcting, we measure how well our network predicts inverse-depth images, with the idea that better inverse-depth images result in better reconstructions. We employ several metrics common in the related tasks of depth prediction and refinement.

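One way to quantify performance is to see how often the error in prediction is small enough to be considered correct. The thresholded accuracy measure is essentially the expectation that a given pixel is within a fraction $thr$ of the label:

$$\delta_{thr} = \frac{1}{n} \sum_{p \in V} \mathbb{1}\!\left[ \max\!\left( \frac{d(p)}{d^{hq}(p)}, \frac{d^{hq}(p)}{d(p)} \right) < thr \right], \tag{12}$$

where $d^{hq}$ is the reference inverse-depth, $d$ is the predicted inverse-depth, $V$ is the set of valid pixels, $n$ is the cardinality of $V$, and $\mathbb{1}[\cdot]$ represents the indicator function. For granularity, we use several values of $thr$, including $1.25^2$ and $1.25^3$.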

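TABLE II DEPTH ERROR CORRECTION RESULTS

TABLE III GENERALISATION CAPABILITY OF DEPTH ERROR CORRECTION

In addition, we also compute the mean absolute error (MAE) and root mean square error (RMSE) metrics to quantify per-pixel error:

$$\mathrm{iMAE} = \frac{1}{n} \sum_{p \in V} \left| d(p) - d^{hq}(p) \right|, \tag{13}$$

$$\mathrm{iRMSE} = \sqrt{\frac{1}{n} \sum_{p \in V} \left( d(p) - d^{hq}(p) \right)^2}, \tag{14}$$

where the ‘i’ indicates that the metrics are computed over inverse-depth images.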
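The metrics of this section can be sketched as follows. This is a hypothetical helper, not the authors' evaluation code: the default thresholds are the powers of 1.25 commonly used in depth prediction and are an assumption here, and the valid mask is assumed to exclude pixels whose reference inverse-depth is zero.

```python
import numpy as np

def inverse_depth_metrics(pred, ref, valid, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return (accuracies, iMAE, iRMSE) over valid pixels of two inverse-depth images."""
    p, r = pred[valid], ref[valid]
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.maximum(p / r, r / p)      # a prediction of 0 gives an infinite ratio
    acc = [float((ratio < thr).mean()) for thr in thresholds]
    err = p - r
    return acc, float(np.abs(err).mean()), float(np.sqrt((err ** 2).mean()))

# Toy example: two of the four pixels are within 25% of the reference.
pred = np.array([[1.0, 2.0], [0.5, 1.0]])
ref = np.array([[1.0, 1.0], [1.0, 1.2]])
valid = np.ones_like(ref, dtype=bool)
acc, imae, irmse = inverse_depth_metrics(pred, ref, valid)
```

The accuracy is symmetric in over- and under-estimation because it thresholds the larger of the two ratios.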

B. Gross Error Correction

For the first set of experiments, we take the first 80% of the views from each sequence as training data, the next 10% for validation, and show our results on the last 10%. An illustration of the KITTI sequences and splits is shown in Figure 4.
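The contiguous per-sequence split can be sketched as below. The helper name is hypothetical and integer percentages are used to avoid rounding surprises. Keeping each subset contiguous along the trajectory, as described above, keeps the splits spatially separated.

```python
def split_views(n_views, train_pct=80, val_pct=10):
    """Index ranges for the first 80%, the next 10%, and the remaining views."""
    a = n_views * train_pct // 100
    b = n_views * (train_pct + val_pct) // 100
    return range(0, a), range(a, b), range(b, n_views)

# Example for a sequence with 1000 rendered views, in trajectory order.
train_idx, val_idx, test_idx = split_views(1000)
```

The same function would be applied to each of the three sequences separately.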

As a baseline, we train our model with geometric consistency loss but without any feature aggregation. During inference, this model makes predictions one view at a time.

To illustrate our method, we train a further two models for each aggregation method (averaging and attention): one with the feature transform disabled, and one with it enabled.

As can be seen in Table II, the baseline already refines inverse-depth significantly. Without our feature transformation, the models are unable to use multi-view information because of the vastly different viewpoints, and indeed aggregation slightly hurts performance. Only when transforming the features between viewpoints does the performance increase over the baseline, highlighting the importance of our method for successfully aggregating multiple views.

C. Generalisation

To assess the ability of our method to generalise to unseen reconstructions, we divide our training data by sequence: we use two of the sequences for training, and the third for testing. Sequences 00 and 05 are recorded in a suburban area with narrow roads, while sequence 06 is a loop on a divided road with a median strip, a much wider and visually distinct space. We train models for 200 000 steps and aggregate feature maps by averaging. The results in Table III show that our method successfully uses information from multiple views, even in areas of a city different from the ones it was trained on. Furthermore, they reaffirm the need for our feature transformation method in addition to warping.

V. CONCLUSION AND FUTURE WORK

In conclusion, we have presented a new method for correcting dense 3D reconstructions via 2D mesh feature renderings. In contrast to previous work, we make predictions on multiple views at the same time by warping and aggregating feature maps inside a CNN. In addition to warping the feature maps, we also transform the features between views and show that this is necessary for using arbitrary viewpoints.

The method presented here aggregates feature maps between every pair of overlapping input views. This scales quadratically with the number of views and thus limits the size of the neighbourhood we can reasonably process. Future work will consider aggregation into a shared 2D spatial representation, such as a 360° view, which would scale linearly with the input neighbourhood size.

ACKNOWLEDGMENT

The authors would like to acknowledge the support of the UK’s Engineering and Physical Sciences Research Council (EPSRC) through the Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (AIMS) Programme Grant EP/L015897/1. Paul Newman is supported by EPSRC Programme Grant EP/M019918/1.

REFERENCES

[1] Ș. Săftescu and P. Newman, “Learning geometrically consistent mesh corrections,” arXiv:1909.03471 [cs.CV], 2019.

[2] S. Donne and A. Geiger, “Defusr: Learning non-volumetric depth fusion using successive reprojections,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[3] M. Tanner, P. Piniés, L. M. Paz, and P. Newman, “BOR2G: Building optimal regularised reconstructions with GPUs (in cubes),” in Proceedings of the International Conference on Field and Service Robotics (FSR), Toronto, Canada, Jun. 2015.

[4] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in IEEE International Symposium on Mixed and Augmented Reality, 2011, pp. 127–136.

[5] M. Tanner, Ș. Săftescu, A. Bewley, and P. Newman, “Meshed up: Learnt error correction in 3D reconstructions,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, May 2018.

[6] H. Kwon, Y.-W. Tai, and S. Lin, “Data-driven depth map refinement via multi-scale sparse representation,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 159–167.

[7] Y. Zhang and T. A. Funkhouser, “Deep depth completion of a single RGB-D image,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 175–185.

[8] J. Jeon and S. Lee, “Reconstruction-based pairwise depth dataset for depth image enhancement using CNN,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[9] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[11] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.

[12] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer Assisted Intervention (MICCAI), 2015.

[13] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[14] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[16] OpenStreetMap contributors, “Planet dump retrieved from https://planet.osm.org,” https://www.openstreetmap.org, 2017.

[17] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, pp. 59–72, 2007.

[18] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proceedings of the IEEE International Conference on 3D Vision (3DV), 2016, pp. 239–248.

[19] F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2018.

[20] I. Sobel and G. Feldman, “A 3×3 isotropic gradient operator for image processing,” presented at the Stanford Artificial Intelligence Project (SAIL), 1968.

[21] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.

[22] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
