This paper is about detecting and rectifying errors in dense reconstructions. Dense reconstruction, as a mapping from input images to 3D meshes, is a well-studied area. However, much of that work is feed-forward in the sense that it strives to produce the best mesh in an open loop. In this work we consider how, given an arbitrary generated mesh and the images that were used to create it, one might correct reconstruction errors post hoc.
Prior efforts towards improving the accuracy of generated 3D meshes often focused on regularising the camera depth maps [1], using expensive high-quality sensors [2], or regularising the dense-reconstruction output [3]. While regularisation-based approaches impose structure to smooth local surfaces, gross reconstruction errors still remain. In contrast, the focus of this work is to identify error-prone areas in 3D reconstructions. The goal is to facilitate the removal or repair of these mesh regions in order to improve overall mapping accuracy. As an example, Figure 1 illustrates how the most inaccurate areas in a reconstruction are identified and corrected in a 2D representation of the scene.
To realise this capability, a mapping is learnt from rasterised geometric and appearance features to the error in the camera-based reconstruction. Accurate laser reconstructions serve as our ground-truth to train a CNN that estimates the quality of depth-map reconstructions. A simple example of this process is shown in Figure 2, which displays the same scene reconstructed by laser and camera input data. Compared to the laser reconstruction’s colour and mesh images, the regions in the camera reconstruction that are
Fig. 1. An illustration of using the CNN presented in this paper to correct an inverse-depth image generated from a dense 3D reconstruction. Structural errors in the reconstruction that are not obvious in (a), due to the lack of colours, are evident in (b), especially around the car. An overlay of the error predicted by the CNN over (b) is displayed in (c). Note that this predicted error is signed, with positive error shown in red and negative error shown in blue. (d) shows how simply subtracting this error from the inverse-depth image results in a better representation of the scene (e.g. the car edges are more defined).
likely incorrect can readily be identified – e.g. holes in the road and the “smearing” area between the cars. Ideally, we would have a similar but automated way of perceiving the mesh outputs to identify erroneous regions. In this paper, we present our method for training a neural network to recognise and highlight these regions.
A high-level overview of the proposed system is presented in Figure 3, which shows the framework for building the meshes used in training a CNN. Two dense reconstructions are used: one constructed using camera-only depth-maps and another constructed using more accurate laser data. Both geometric and appearance features are extracted from each
Fig. 2. Camera depth-map and laser mesh reconstructions generated using BOR2G [4], used for network inputs and ground truth respectively. In the top examples ((a) and (b)) one can easily recognise that the “smearing” between the cars and the holes in the road are incorrect. The goal of this work is to use a neural network to automatically recognise these regions in depth-map reconstructions by leveraging training data derived from laser reconstructions (e.g., (c) and (d)).
reconstruction and rasterised to a 2D image suitable for training the CNN (Section II).
In Section III, we describe our network architecture and loss function based on related work in the area of monocular depth estimation [5] and re-purpose it towards the task of correcting for mesh errors. In particular, we investigate the potential and value of providing a diverse set of features (e.g. colour, inverse depth, triangle edge ratios and surface normals) to the network which are not commonly available without a mesh representation. Finally, in Section IV we demonstrate the efficacy of the proposed system in identifying and correcting errors on three meshes created using sequences from the KITTI visual odometry (KITTI-VO) [6] data set. Within this framework, the following contributions are made:
1) The novel use of a CNN applied to rasterised mesh features is shown to identify error-prone regions in a low quality mesh 3D reconstruction;
2) Using the network predictions to correct for error in the inverse-depth images leads to significant improvements over what is extracted from the original mesh;
3) We analyse which of the rasterised mesh features are most informative to the CNN through an ablation study on the trained network.
Much work in the deep learning literature relies upon simple RGB images as the input [8][9][5]. However, as our problem formulation assumes that a dense 3D model is available, we can extract a much richer set of features from the reconstruction. Throughout this work we use the dense mesh reconstruction proposed by Tanner et al. [4] referred to as BOR2G, although our framework is agnostic to the 3D mesh pipeline.
A. Feature Creation
To extract dense 2D features suitable for a CNN, we begin by using GLSL to “fly” a virtual camera through the 3D models. Our vertex shader transforms the 3D model’s vertices into the camera reference frame to compute the depth of each vertex and the normalised camera vector. These camera-frame points are then grouped into triangles and passed along to a geometry shader that computes the triangle’s area, surface normal, and edge lengths. Finally, the fragment shader processes each of the preceding feature values and rasterises them into individual pixels in feature images, as shown in Figure 4. This GLSL process enables us to extract six feature images: two per-pixel features (RGB and depth) and four per-triangle features (area, surface normal, edge length ratio, and surface-to-camera angle). Intuitively, the mesh-triangle-derived features contain a lot of information; they provide geometric cues that have been accumulated from multiple depth-maps and integrated into a 3D reconstruction by BOR2G. We look at whether this intuition is verified in Section IV.
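The per-triangle quantities computed in the geometry shader can be illustrated outside GLSL. The following is a minimal sketch in plain Python, not the shader code used in this work. It assumes the vertices are already expressed in the camera frame (camera at the origin), that the edge length ratio is the shortest edge divided by the longest, and that the surface-to-camera angle is measured between the triangle normal and the ray from the triangle centroid to the camera, independent of winding order. The function name `triangle_features` and these definitions are illustrative assumptions.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def triangle_features(p0, p1, p2):
    """Per-triangle features for three camera-frame vertices (x, y, z)."""
    edge_a, edge_b, edge_c = _sub(p1, p0), _sub(p2, p1), _sub(p0, p2)
    cross = _cross(edge_a, _sub(p2, p0))
    twice_area = _norm(cross)
    if twice_area == 0.0:
        raise ValueError("degenerate triangle")
    normal = tuple(c / twice_area for c in cross)
    lengths = [_norm(edge_a), _norm(edge_b), _norm(edge_c)]
    # Unit vector from the triangle centroid towards the camera origin.
    centroid = tuple((p0[i] + p1[i] + p2[i]) / 3.0 for i in range(3))
    to_camera = tuple(-c / _norm(centroid) for c in centroid)
    # abs() makes the angle independent of winding order (an assumption).
    cos_angle = min(1.0, abs(_dot(normal, to_camera)))
    return {
        "area": 0.5 * twice_area,
        "normal": normal,
        # Assumed definition: shortest edge over longest edge, in (0, 1].
        "edge_length_ratio": min(lengths) / max(lengths),
        "surface_to_camera_angle": math.acos(cos_angle),
    }


features = triangle_features((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
print(features["area"], features["edge_length_ratio"])
```

Under these assumed definitions, long thin triangles give a ratio near zero and surfaces viewed edge-on give an angle near 90 degrees.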
B. Ground-Truth Data
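We define the ground-truth error at each pixel as the signed difference between the rasterised inverse-depth images of the two reconstructions,

$$e = \frac{A}{d_c} - \frac{A}{d_l},$$

where $d_c$ and $d_l$ are pixels in the depth-map features for the camera and laser reconstruction, respectively, with $A$ being the scaling constant. In disparity images from stereo cameras,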
Fig. 3. Machine Learning Data-Flow Pipeline. To train our network, we create separate depth-map and laser dense reconstructions (using the techniques discussed in [7]). We create ground-truth data by subtracting the rasterised inverse-depth images from each reconstruction. The neural network trains on rasterised feature images (see Figure 4) to learn to generalise the ground-truth error in new scenes.
$A = f_x b$, where $f_x$ is the camera’s x focal length and $b$ is the baseline distance between the left and right image sensors. Because our approach is agnostic of the input sensor, $A$ can be arbitrary, and we set $A = 1$ when evaluating performance (Section IV).
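As a minimal sketch of how such ground-truth images could be produced, the snippet below computes a signed inverse-depth error per pixel. It assumes the error is the camera inverse depth minus the laser inverse depth, each scaled by A, which matches the sign convention in the figure captions (positive where the camera depth is too close). It also assumes pixels with a non-positive depth in either image are holes and marks them as invalid with `None`. The function name and the hole convention are illustrative assumptions.

```python
def inverse_depth_error(depth_cam, depth_laser, scale=1.0):
    """Signed per-pixel error scale/d_c - scale/d_l.

    depth_cam and depth_laser are 2D lists of depths of identical shape.
    Pixels where either depth is non-positive are returned as None
    (assumed convention for holes in a rasterised reconstruction).
    """
    error = []
    for row_cam, row_laser in zip(depth_cam, depth_laser):
        row = []
        for d_c, d_l in zip(row_cam, row_laser):
            if d_c > 0.0 and d_l > 0.0:
                row.append(scale / d_c - scale / d_l)
            else:
                row.append(None)
        error.append(row)
    return error


camera = [[2.0, 4.0], [0.0, 5.0]]
laser = [[4.0, 4.0], [3.0, 10.0]]
print(inverse_depth_error(camera, laser))
```

With `scale=1.0` the result is a plain inverse-depth difference. Passing the product of focal length and baseline instead would express the same error in disparity units.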
A. Network Architecture
Our network is based on Laina et al.’s Fully Convolutional Residual Network [5], with concatenated ReLU activation functions on the intermediate layers [10]. This network was designed to infer per-pixel depth from a single RGB image, a task closely related to our aim of estimating depth error given a set of feature inputs. In this work, the input layer is generalised to accept F input channels, dependent on the number of active features used from the previous section. Table I provides high-level details of the CNN, where a series of residual blocks based on the ResNet-50 architecture [11] is followed by up-projection blocks, as proposed in [5]. Note that the fractional strides denote spatial upsampling. The final layer is a simple 3 × 3 convolutional layer which outputs real-valued estimates corresponding to the inverse-depth error at each pixel location.
B. Generalisation Capacity
To be able to deploy a learnt CNN to new meshes created without a high-fidelity laser, we need our network to generalise to unseen data. To this end, three techniques are employed: cropping, downsampling, and regularisation. Firstly, we randomly perturb and crop all input feature images before providing them to the network. After each epoch of training (i.e. once the network has viewed all the training images), the next epoch will receive a slightly different cropped region of each input image. This prevents the network from associating a specific pixel location in the training data with its corresponding output.
Secondly, the feature maps are gradually downsampled via projection blocks (from ResNet-50) creating a bottleneck,
TABLE I OVERVIEW OF THE CNN ARCHITECTURE FOR ERROR PREDICTION
thus reducing representational capacity. This also provides greater context to the convolutional filters in the later layers of the network by increasing the size of their receptive field.
Thirdly, we implement an L2 weight regulariser to further constrain the representational capacity by preventing the network from over-relying on the cost function at the expense of generalisation performance.
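The first of these techniques, random cropping, can be sketched as follows. This is an illustrative example in plain Python rather than the training code used in this work. It covers only the crop (the additional random perturbation is not specified here), and it assumes that the same window must be taken from every feature channel and from the ground-truth image so that inputs and targets stay pixel-aligned. The function name `random_crop` is hypothetical.

```python
import random


def random_crop(features, target, crop_h, crop_w, rng=random):
    """Cut the same random window out of every channel and the target.

    features: list of 2D lists (one per feature channel).
    target:   2D list with the same height and width as each channel.
    rng:      object providing randint(a, b), e.g. random.Random(seed).
    """
    height, width = len(target), len(target[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)    # inclusive bounds
    left = rng.randint(0, width - crop_w)

    def crop(image):
        return [row[left:left + crop_w] for row in image[top:top + crop_h]]

    return [crop(channel) for channel in features], crop(target)


image = [[10 * r + c for c in range(6)] for r in range(5)]
cropped_features, cropped_target = random_crop([image, image], image, 3, 4)
print(cropped_target)
```

Drawing a fresh window each epoch means a given scene point lands at a different pixel location every time it is seen.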
C. Loss Function
Several loss functions are widely used in machine learning applications. The L2 norm is traditionally popular because it heavily penalises large errors and is smooth. However, unlike the L1 norm, it has a near-zero gradient for small errors, and is thus often unable to drive the error completely to zero. Over the years, researchers have proposed alternative norms which combine the “best” (based on application) characteristics of each of these norms. The Huber norm uses L2 near the origin and L1 elsewhere, while BerHu uses the L1 norm near the origin and the L2 elsewhere [12]. We choose BerHu as our cost function on the network’s error prediction since the L1 term places a higher penalty on small errors (when compared to Huber or L2) while still providing strong penalties for larger errors. This is in contrast to regularising depth maps where the Huber norm is more favourable to capture high
Fig. 4. Example Input Features. As our dense reconstructions provide a 3D model of our operating environment, we can extract a variety of features. Pictured above are example features our GLSL pipeline currently extracts. The network has the potential to use these low level features in its intermediate representation.
gradient edges around objects [1]. These benefits were also observed in the related task of depth prediction [5].
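To make the shape of the BerHu norm concrete, the following is a minimal sketch in plain Python. It uses the common reverse-Huber form: the absolute value for residuals up to a threshold c, and (x² + c²)/(2c) beyond it, which keeps the function continuous at c. Setting c to one fifth of the largest absolute residual in the batch follows the choice made in [5]; whether the same value was used here is an assumption, and the names `berhu` and `berhu_loss` are illustrative.

```python
def berhu(residual, c):
    """Reverse Huber penalty: L1 inside [-c, c], scaled L2 outside."""
    magnitude = abs(residual)
    if magnitude <= c:
        return magnitude
    return (residual * residual + c * c) / (2.0 * c)


def berhu_loss(predicted, target, c_fraction=0.2):
    """Mean BerHu penalty over paired lists of predictions and targets.

    c_fraction: threshold as a fraction of the largest absolute residual
    (0.2 as in [5]; an assumed value for this work).
    """
    residuals = [p - t for p, t in zip(predicted, target)]
    c = c_fraction * max(abs(r) for r in residuals)
    if c == 0.0:  # all residuals are zero
        return 0.0
    return sum(berhu(r, c) for r in residuals) / len(residuals)


print(berhu(0.1, 1.0), berhu(1.0, 1.0), berhu(3.0, 1.0))
print(berhu_loss([1.0, 0.0], [0.0, 0.0]))
```

The two branches meet at a residual of c, so the penalty is linear for small errors and grows quadratically for large ones.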
A. Experimental Setup
Dataset: Three sequences from the KITTI-VO dataset were used as the input to the reconstruction pipeline. The input of the network consists of features from the camera-based mesh, while the ground truth used for both training and evaluation comes from the laser-based mesh, as described in Section II-B. For all experiments, the KITTI-VO sequences 00, 05 and 06 were used for training and leave-one-out style cross-validation. All three sequences are predominantly in urban environments with small amounts of visible vegetation. Using GLSL, we created a virtual camera to project the dense reconstruction into input feature images. The trajectory of the virtual camera follows the original trajectory in the KITTI-VO sequences. This enables the use of real colour images as input alongside the generated features. Both the camera and laser reconstruction meshes were generated with a fixed voxel width of 0.2 m.
Network Training and Inference: The network module was implemented in Python using the TensorFlow framework. The network weights were optimised using the ADAM solver [13] with a learning rate of $10^{\ldots}$ for 250 epochs over two of the sequences. At this point, the value of the loss function appears to have converged, with the remaining variance coming from the stochasticity of the mini-batch sampling. To reduce this variance and confirm convergence, the optimisation is run for a further 50 epochs with the learning rate reduced to $10^{\ldots}$. Other training hyper-parameters include a batch size of 16 and an L2 weight decay factor of $10^{\ldots}$ per training step. Inference takes, on average, 72 ms per half-size KITTI image and associated features.
Performance Metrics: This work uses two different metrics to measure the resulting performance. The first metric is the commonly used root mean square error (RMSE) which provides a quantitative measure of per pixel error, computed as follows:
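$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{x \in X} \left(d_l(x) - \hat{d}(x)\right)^2},$$

where $d_l$ is the ground-truth depth, $\hat{d}$ is the predicted depth, $X$ is the set of valid pixels, and $n$ is the cardinality of $X$.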
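Our second metric measures the accuracy of our network’s ability to estimate errors under a given threshold, serving as an indication of how often our estimate is correct. The thresholded accuracy measure from [14] is essentially the expectation that the ratio between the predicted and ground-truth depth at a given pixel in $X$ is less than a threshold $thr^k$:

$$\delta_k = \frac{1}{n} \sum_{x \in X} \mathbb{I}\left(\max\left(\frac{d_l(x)}{\hat{d}(x)}, \frac{\hat{d}(x)}{d_l(x)}\right) < thr^k\right),$$

where $\mathbb{I}(\cdot)$ represents the indicator function. As in [14], $thr = 1.25$ and $k \in \{1, 2, 3\}$.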
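Both metrics can be sketched in a few lines of plain Python. The example below is illustrative rather than the evaluation code used in this work. It assumes images are flattened to lists, that invalid pixels are marked with `None` and skipped, and that the thresholded accuracy uses the max-ratio form common in the depth-prediction literature (the larger of prediction/truth and truth/prediction compared against thr^k), which requires strictly positive values. The function names are hypothetical.

```python
import math


def _valid_pairs(predicted, truth):
    """Keep only pixels where both prediction and ground truth exist."""
    return [(p, t) for p, t in zip(predicted, truth)
            if p is not None and t is not None]


def rmse(predicted, truth):
    """Root mean square error over the valid pixels."""
    pairs = _valid_pairs(predicted, truth)
    return math.sqrt(sum((t - p) ** 2 for p, t in pairs) / len(pairs))


def threshold_accuracy(predicted, truth, k, thr=1.25):
    """Fraction of valid pixels with max(p/t, t/p) < thr**k.

    Assumes strictly positive values at valid pixels.
    """
    pairs = _valid_pairs(predicted, truth)
    hits = sum(1 for p, t in pairs if max(p / t, t / p) < thr ** k)
    return hits / len(pairs)


predicted = [1.0, 2.0, None, 4.0]
truth = [1.0, 4.0, 3.0, 4.0]
print(rmse(predicted, truth))
for k in (1, 2, 3):
    print(k, threshold_accuracy(predicted, truth, k))
```

Larger k loosens the tolerance, so the accuracy is non-decreasing in k. RMSE is dominated by a few large errors, while the thresholded accuracy counts how many pixels are acceptably close.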
B. Correcting Depth Maps
The central question of this paper is whether we can correct 3D mesh reconstructions using a CNN. As a proxy, the mesh surface is projected as an inverse-depth image and used to represent the local 3D scene. For our baseline reference, we compare the inverse-depth image from the camera-only reconstruction (created using BOR2G) with the equivalent inverse-depth image from the laser-only reconstruction, thus $\hat{d} = d_c$. To obtain the refined depth-map, we use the network output
to correct the depth-map reconstruction,
Fig. 5. Analysis of our framework’s sensitivity to disabling different input features on the depth correction accuracy
giving us the corrected inverse-depth image, i.e. the camera inverse-depth image minus the predicted error. An example of this process is shown in Figure 6.
Quantitatively, we find our CNN-based correction provides a 10% relative RMSE improvement over the baseline generated by BOR2G. Furthermore, it is consistently more accurate over different thresholds, indicating better depth values across more pixels. A qualitative visualisation of the network output is provided in Figure 7, alongside the ground-truth difference between the camera and laser inverse-depth maps. This visualisation shows that the network is capable of identifying small errors over broad regions (e.g. the red offset in the road level, potentially caused by imperfect calibration) as well as large but narrow errors commonly found at object boundaries.
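The correction step and the relative improvement figure can be sketched as below. This is an illustrative example in plain Python, not the evaluation code used in this work. It assumes the correction is the per-pixel subtraction of the predicted signed error from the camera inverse-depth image, as described in the caption of Figure 1, and that the relative improvement is the reduction in RMSE divided by the baseline RMSE. The function names and the numbers in the usage lines are hypothetical.

```python
def correct_inverse_depth(inv_depth_cam, predicted_error):
    """Subtract the predicted signed error from the camera inverse depth."""
    return [[d - e for d, e in zip(row_d, row_e)]
            for row_d, row_e in zip(inv_depth_cam, predicted_error)]


def relative_improvement(baseline_rmse, corrected_rmse):
    """Fractional RMSE reduction relative to the uncorrected baseline."""
    return (baseline_rmse - corrected_rmse) / baseline_rmse


# Hypothetical values: the first pixel is predicted to be too close.
inv_depth = [[0.50, 0.25]]
error = [[0.25, 0.00]]
print(correct_inverse_depth(inv_depth, error))
print(relative_improvement(2.0, 1.8))
```

A positive predicted error (camera surface too close) lowers the inverse depth and so pushes the surface further away, while a negative one pulls it closer.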
C. Feature Ablation Study
In the previous experiment, all features were provided to the CNN under the assumption that some features may contain more information than others and that the network can choose to ignore certain features if they contain no information. This will be reflected by measuring the sensitivity in the network output when a specific feature is withheld from the input. It is also possible that the full set of features contains redundant information which would result in the network spreading its internal importance weighting across multiple features. Such behaviour is implicitly encouraged through the L2 weight regularisation placed on the network, meaning that small weights across multiple channels are favoured over large weights on single channels. If the network learns to use redundant features then it should be more robust to the absence of one of the features if the equivalent information could be found elsewhere.
Figure 5 shows the relative feature importance, evaluated by disabling one feature at a time and monitoring the influence on the inverse-depth accuracy. Crucially, the inverse-depth from the camera reconstruction was found to be the most important. On the other hand, mesh-specific features such as triangle area and edge length ratio (Figure 4 (d)(f)) appear to be least useful. To test whether redundancy explains this lack of sensitivity, we keep only RGB, inverse-depth
TABLE III FINE-TUNING WITH REDUCED FEATURE INPUTS
and normals, to fine-tune the network (Table III). Here, the CNN failed to recover the lost performance, supporting our intuition that mesh features provide additional information not easily derived from simple (stereo) camera features.
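The one-feature-at-a-time procedure can be sketched as follows. This is a toy illustration in plain Python rather than the code used in this study. It assumes that a feature is disabled by setting its input to zero, treats each sample as a dictionary of scalar features for brevity, and stands in for the trained network with an arbitrary `predict` callable. All names and the toy model are hypothetical.

```python
import math


def ablation_sensitivity(predict, samples, targets, feature_names):
    """RMSE increase when each feature is zeroed in turn.

    predict:  callable mapping a feature dict to a scalar prediction
              (a stand-in for the trained network).
    samples:  list of dicts, feature name -> value.
    targets:  list of ground-truth scalars, one per sample.
    """
    def rmse_of(inputs):
        squared = [(predict(x) - t) ** 2 for x, t in zip(inputs, targets)]
        return math.sqrt(sum(squared) / len(squared))

    baseline = rmse_of(samples)
    increase = {}
    for name in feature_names:
        # Assumed ablation: zero the feature, leave all others untouched.
        ablated = [{**x, name: 0.0} for x in samples]
        increase[name] = rmse_of(ablated) - baseline
    return baseline, increase


def toy_model(x):
    # Hypothetical model that leans heavily on inverse depth.
    return x["inverse_depth"] + 0.1 * x["triangle_area"]


data = [{"inverse_depth": 1.0, "triangle_area": 1.0},
        {"inverse_depth": 2.0, "triangle_area": 1.0}]
truth = [toy_model(x) for x in data]
print(ablation_sensitivity(toy_model, data, truth,
                           ["inverse_depth", "triangle_area"]))
```

A large increase marks a feature the model depends on. A small increase may mean the feature is uninformative or that its information is available redundantly elsewhere, which is why the fine-tuning experiment is needed to tell the two apart.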
In this paper we present a supervised learning technique for training a neural network to detect and correct error-prone regions of a dense reconstruction. Using a GLSL pipeline, we placed a virtual camera at various positions and orientations throughout the reconstruction to generate both input feature data and, by comparing the laser and camera reconstruction features, ground-truth data to train the network. We demonstrated that the error predictions of our model can be used to make corrections to a mesh projected into an inverse-depth image and provide quantitative evaluation.
The authors would like to acknowledge the support of the UK’s Engineering and Physical Sciences Research Council (EPSRC) through the Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (AIMS) Programme Grant EP/L015897/1, Programme Grant EP/M019918/1, and the Doctoral Training Award (DTA). Additionally, the donation from Nvidia of the Titan Xp GPU used in this work is also gratefully acknowledged.
[1] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2320–2327.
[2] Y. Bok, D.-G. Choi, and I. S. Kweon, “Sensor fusion of cameras and a laser for city-scale 3d reconstruction,” Sensors, vol. 14, no. 11, pp. 20 882–20 909, 2014.
[3] T. Wright and B. Lennox, “Algorithmic approach to planar void detection and validation in point clouds,” in Conference Towards Autonomous Robotic Systems. Springer, 2017, pp. 526–539.
[4] M. Tanner, P. Piniés, L. M. Paz, and P. Newman, “BOR²G: Building optimal regularised reconstructions with GPUs (in cubes),” in Field and Service Robotics. Springer, 2016, pp. 111–124.
[5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in International Conference on 3D Vision. IEEE, 2016, pp. 239–248.
[6] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[7] M. Tanner, P. Piniés, L. M. Paz, and P. Newman, “DENSER Cities: A System for Dense Efficient Reconstructions of Cities,” ArXiv, 2016.
Fig. 6. An example demonstrating how our network corrects for reconstruction error on a test set image. (a), (b): A scene where gross errors in the fence on the left and car on the right are captured within the output predictions of the network (c). Here the colour intensity shows the magnitude of the errors, with red indicating positive (camera depth is too close) and blue indicating negative (camera depth is too far) depth errors. Trivially applying the output of our network as a correction to the low-quality inverse-depth map (b) produces a higher-quality inverse-depth estimate (d), capturing structure missed in the original camera-based reconstruction.
Fig. 7. Example Ground-Truth and Network Prediction. Given the scene present in (a), our network successfully recognises (c) the most error-prone regions of the reconstruction – e.g. unlikely borders around cars, fences, and buildings. The last row shows an instance of a failure where the grass on the left of the road appears to be consistent from the camera view point and the car ahead is dynamic and therefore significantly challenges the static world assumption in the reconstruction process.
[8] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[9] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[10] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units,” in International Conference on Machine Learning, 2016, pp. 2217–2225.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Computer Vision and Pattern Recognition, 2016.
[12] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, pp. 59–72, 2007.
[13] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
[14] F. Liu, C. Shen, and G. Lin, “Deep Convolutional Neural Fields for Depth Estimation from a Single Image,” in Computer Vision and Pattern Recognition (CVPR), 2015.