In a Simultaneous Localization and Mapping (SLAM) system, it is important to estimate the position and orientation of the camera, a task known as visual odometry (VO). VO in SLAM systems has generally been studied through feature-based methods [1–3] and direct methods [4]. These methods focus on optimizing the camera trajectory rather than on frame-to-frame (F2F) estimation.
Our approach aims to estimate F2F VO without optimization methods such as loop closing [1] and bundle adjustment [5]. Geiger et al. [6] proposed a feature-based F2F VO that generates a consistent 3D point cloud through feature matching between frames. Ciarfuglia et al. [7] modeled the correlation between optical flow and camera ego-motion with a non-geometric method based on a support vector machine.
Recently, deep networks have brought remarkable improvements to computer vision [10, 11]. One of the first convolutional neural network (CNN) based methods was proposed by Konda et al. [12, 13], who showed the feasibility of learning F2F visual odometry. Moreover, Costante et al. [16] (P-CNN) and Muller et al. [17] (Flowdometry) predicted F2F camera ego-motion from pre-built optical flow images [14, 15]. They argued that optical flow images are better suited to training than RGB images because they contain displacement information. However, these methods still suffer from the scale problem. SFMLearner [18] and UndeepVO [19] predict depth information and camera ego-motion through unsupervised learning. Considering the potential of deep learning-based methods, we design an end-to-end deep convolutional neural network to estimate F2F monocular visual odometry.
Fig. 1: RGB images [8] and disparity maps [9] (images are in temporal order from top to bottom).
The contributions of our paper are summarized as follows. First, our visual odometry network takes only disparity maps as input (right column of Fig. 1). Since the disparity map carries spatial cues in each frame, we can effectively address the scale problem and obtain better performance than existing optical flow-based networks such as P-CNN [16] and Flowdometry [17]. In addition, our network is designed to extract both an attention map and a feature map through two parallel blocks. The attention block enables the network to focus on sensitive regions of the image. Second, a skip-ordering scheme is introduced to learn larger displacements between frames by training on additional image pairs. These contributions not only improve the performance but also make the camera ego-motion estimate robust across diverse driving environments.
Fig. 2: Frame-to-frame odometry model based on disparity maps. Our network has four blocks: ‘Frame feature’, ‘Attention’, ‘Translation’ and ‘Rotation’.
In this section, we describe a deep learning network that predicts F2F camera ego-motion, as illustrated in Fig. 2. The proposed network focuses on improving odometry estimation accuracy based on disparity maps with a skip-ordering (SO) strategy. In this study, monocular depth estimation based on [9] is employed to generate the disparity maps. The network consists of four blocks: frame feature, attention, translation, and rotation. The details of the proposed network are explained in the following subsections.
2.1. Network architecture
The front part of the network is designed to extract frame feature and attention maps through two parallel blocks, as illustrated in Fig. 2. A frame feature map carries general information for camera ego-motion prediction. An attention map reflects how sensitive the ego-motion estimate is to each region of the frame feature map. Since the intensity values of the disparity map depend strongly on the displacement caused by camera motion, the attention map is learned by exploiting this property. For instance, camera ego-motion is difficult to predict from distant objects in the scene such as buildings, trees, and sky. After extraction, the attention map and frame feature map of each frame are multiplied pixel-wise, and the two resulting maps are then concatenated.
Translation and rotation blocks take the concatenated map as their input. The two blocks learn translation and rotation in parallel, which makes feature learning more robust. The rotation block has two more layers than the translation block because rotation is more nonlinear than translation [20]. Moreover, we normalize the losses to balance rotation and translation errors, as described in detail in the next section.
All layers use ReLU as the activation function except for the attention layers. Each attention layer is followed by a sigmoid activation function so that the values of the attention map lie between 0 and 1.
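As an illustration only, the gating and concatenation step described above can be sketched in NumPy. The channel-first (C, H, W) layout, the array shapes, and the names `sigmoid` and `fuse_frames` are assumptions made for this sketch; they are not taken from the network definition, and the convolutional layers that produce the maps are omitted.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def fuse_frames(features, attention_logits):
    """Gate each frame's feature map with its attention map, then concatenate.

    features, attention_logits: lists holding one (C, H, W) array per frame
    (hypothetical layout). Returns an array of shape (num_frames * C, H, W).
    """
    gated = [f * sigmoid(a) for f, a in zip(features, attention_logits)]
    return np.concatenate(gated, axis=0)

# Two frames with random stand-ins for the outputs of the two parallel blocks.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 6, 8)) for _ in range(2)]
logits = [rng.standard_normal((4, 6, 8)) for _ in range(2)]
fused = fuse_frames(feats, logits)
print(fused.shape)  # (8, 6, 8)
```

Because the sigmoid output lies strictly between 0 and 1, the gating can only attenuate a feature response, which is one way to read the role of the attention map as a per-pixel weighting.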
2.2. Training and testing
The KITTI dataset [8] is used for training (sequences 00 to 07) and testing (sequences 08 to 10), following [16, 17]. The KITTI ground truth provides 12 values per image, of which 9 form the rotation matrix and 3 give the position. These values are expressed with respect to the first frame of each sequence, in the world coordinate system. Therefore, to train the network, they must be converted into relative information between frames. Eq. 1 shows how the relative rotation and translation are obtained from the rotation matrix R and the position P:
$$R_{i,j} = R_i^{-1} R_j, \qquad T_{i,j} = R_i^{-1} (P_j - P_i), \tag{1}$$
where $R_{i,j}$ and $T_{i,j}$ are the rotation matrix and translation
of the j-th frame with respect to the i-th frame.
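The conversion from absolute ground-truth poses to relative motion can be sketched as follows. This is a minimal NumPy example, assuming each ground-truth line stores a row-major 3x4 matrix [R | P]; the function names and the numeric poses are hypothetical and serve only as an illustration.

```python
import numpy as np

def parse_kitti_pose(line):
    """Split one ground-truth line (12 values, row-major 3x4 matrix [R | P])."""
    m = np.array(line.split(), dtype=float).reshape(3, 4)
    return m[:, :3], m[:, 3]

def relative_pose(R_i, P_i, R_j, P_j):
    """Rotation and translation of frame j expressed in the frame of i."""
    R_ij = R_i.T @ R_j            # the inverse of a rotation matrix is its transpose
    T_ij = R_i.T @ (P_j - P_i)
    return R_ij, T_ij

def rot_y(angle):
    """Rotation about the y axis, used here only to build example poses."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Two made-up absolute poses (illustrative values, not KITTI data).
R_i, P_i = rot_y(0.10), np.array([1.0, 0.0, 5.0])
R_j, P_j = rot_y(0.13), np.array([1.2, 0.0, 6.0])
R_ij, T_ij = relative_pose(R_i, P_i, R_j, P_j)
print(np.allclose(R_i @ R_ij, R_j))         # composing back recovers R_j
print(np.allclose(P_i + R_i @ T_ij, P_j))   # and P_j
```

The two checks at the end compose the relative motion back onto frame i, which is the same operation needed at test time to rebuild the trajectory.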
A rotation matrix is generally represented by Euler angles, a quaternion, or the Lie group SO(3). However, we empirically found that quaternion vectors are limited for learning the rotation matrix. When the amount of rotation is small, one of the four quaternion elements takes a much larger value than the others, which can bias the result. For the Lie group SO(3), the one-to-one mapping to and from the rotation matrix breaks down when no rotation occurs. Therefore, Euler angles are suitable for F2F camera ego-motion estimation in our network.
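The imbalance among quaternion elements for small rotations can be checked numerically. The sketch below is an illustration under stated assumptions: the Euler convention R = Rz Ry Rx, the (w, x, y, z) quaternion ordering, and the 0.01 rad example rotation are all choices made here, since the text does not specify them.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Build R = Rz @ Ry @ Rx from angles in radians (assumed convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def matrix_to_euler(R):
    """Invert euler_to_matrix; valid while |ry| < pi/2."""
    rx = np.arctan2(R[2, 1], R[2, 2])
    ry = -np.arcsin(R[2, 0])
    rz = np.arctan2(R[1, 0], R[0, 0])
    return np.array([rx, ry, rz])

def matrix_to_quaternion(R):
    """Unit quaternion (w, x, y, z); valid for rotation angles below 180 degrees."""
    w = 0.5 * np.sqrt(1.0 + np.trace(R))
    x = (R[2, 1] - R[1, 2]) / (4.0 * w)
    y = (R[0, 2] - R[2, 0]) / (4.0 * w)
    z = (R[1, 0] - R[0, 1]) / (4.0 * w)
    return np.array([w, x, y, z])

# A small frame-to-frame rotation: 0.01 rad about the y axis (illustrative value).
R_small = euler_to_matrix(0.0, 0.01, 0.0)
q = matrix_to_quaternion(R_small)
e = matrix_to_euler(R_small)
print(q)  # w is close to 1 while the other three elements stay near 0
print(e)  # only the y angle is non-zero
```

With this example the scalar element of the quaternion is nearly 1 and the remaining elements are of the order of half the rotation angle, whereas the Euler angles carry the rotation directly in one component.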
2.2.1. Training
In the training phase, three consecutive disparity maps $(d_t, d_{t+1}, d_{t+2})$ are used as input, where d represents the disparity map and t denotes the time order. As described above, the skip-ordering scheme is introduced to learn larger displacements between frames by training on additional image pairs. Thus, we use $(d_t, d_{t+2})$ as the skip-ordering (SO) pair, so that the elements are paired as $(d_t, d_{t+1})$, $(d_{t+1}, d_{t+2})$, and $(d_t, d_{t+2})$. Through the SO strategy, the network becomes robust to various motion changes.
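The pairing implied by skip-ordering can be sketched in plain Python. The function names and the use of frame indices in place of disparity maps are assumptions of this example; how the original training pipeline samples or deduplicates pairs is not stated in the text.

```python
def triplet_pairs(t):
    """Pairs formed from three consecutive frames starting at index t.

    The first two are ordinary consecutive pairs; the last one skips a frame.
    """
    return [(t, t + 1), (t + 1, t + 2), (t, t + 2)]

def training_pairs(num_frames, use_skip_ordering=True):
    """All index pairs for a sequence, without duplicating consecutive pairs."""
    pairs = [(t, t + 1) for t in range(num_frames - 1)]
    if use_skip_ordering:
        pairs += [(t, t + 2) for t in range(num_frames - 2)]
    return pairs

print(triplet_pairs(0))   # [(0, 1), (1, 2), (0, 2)]
print(training_pairs(4))  # [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
```

Each skip pair needs its own label, namely the relative motion between the two frames it contains, which is typically larger than the motion between adjacent frames.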
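Our network loss is a weighted sum of the rotation and translation parts, which can be expressed as:
$$L = L_T + \beta L_R, \tag{2}$$
where $L_R$ and $L_T$ are the rotation and translation losses, respectively. The weighting factor $\beta$ is known to require a large value to normalize the losses of rotation and translation, because rotation is more nonlinear than translation [20]. The value of $\beta$ was set experimentally. The mean squared error L used for each part is described in Eq. 3:
$$L = \frac{1}{N} \sum_{k=1}^{N} \lVert \hat{y}_k - y_k \rVert_2^2, \tag{3}$$
where $\hat{y}_k$ and $y_k$ denote the predicted and ground-truth values of the k-th sample, and
in which N is the number of training samples.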
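A weighted sum of two mean squared errors can be written compactly in NumPy. This is a sketch under assumptions: the weight is placed on the rotation term, each sample is a 3-vector, and the value `beta=100.0` in the example is a placeholder chosen for illustration, not the value used in the experiments.

```python
import numpy as np

def mse(pred, target):
    """Mean over N samples of the squared L2 error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def odometry_loss(t_pred, t_true, r_pred, r_true, beta):
    """Translation MSE plus beta times rotation MSE (beta is a free weight)."""
    return mse(t_pred, t_true) + beta * mse(r_pred, r_true)

# One sample: 1 m of translation error and 0.1 rad of rotation error.
t_pred, t_true = [[1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]
r_pred, r_true = [[0.1, 0.0, 0.0]], [[0.0, 0.0, 0.0]]
loss = odometry_loss(t_pred, t_true, r_pred, r_true, beta=100.0)  # placeholder beta
print(loss)
```

In this example the raw rotation error contributes 0.01 against 1.0 for translation, so a weight of the order of 100 is what brings the two terms to a comparable magnitude.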
The loss L is minimized with the Adam optimizer [21]. The learning rate starts from 15 and is then reduced by a factor of 2 every 5 epochs, until epoch 30.
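The step decay described above can be expressed as a small function. The sketch below leaves the initial rate as a parameter and uses 1.0 purely as a stand-in in the example; the function name and defaults are assumptions of this illustration.

```python
def learning_rate(epoch, initial_lr, decay_every=5, factor=2.0):
    """Step decay: divide the rate by `factor` once every `decay_every` epochs."""
    return initial_lr / (factor ** (epoch // decay_every))

# Rates at the start of each 5-epoch stage over 30 epochs (initial_lr=1.0 is a stand-in).
schedule = [learning_rate(e, 1.0) for e in range(0, 30, 5)]
print(schedule)  # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Over 30 epochs the rate is halved five times, so the final stage trains at 1/32 of the initial rate.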
2.2.2. Testing
The network takes two consecutive disparity maps as input and yields the relative translation and Euler angles between the two frames.
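To obtain the rotation matrix and position vector of the j-th frame with respect to the first frame, the following equations are required:
$$R_j = R_{j-1} R_{j-1,j}, \qquad P_j = P_{j-1} + R_{j-1} T_{j-1,j},$$
where $R_{j-1,j}$ and $T_{j-1,j}$ are the relative rotation matrix and translation estimated between frames j-1 and j. The rotation matrix of the first frame ($R_1$) is set to the 3x3 identity matrix, and the first frame is located at the origin of the world coordinates, (0, 0, 0). The obtained results are evaluated using the code provided by KITTI [8].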
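Chaining the frame-to-frame estimates into a trajectory can be sketched as follows. The example assumes that the predicted Euler angles have already been converted to rotation matrices and that each relative motion is expressed in the coordinate system of the earlier frame; the function name and the test motion are hypothetical.

```python
import numpy as np

def accumulate_trajectory(relative_motions):
    """Chain relative motions (R_rel, T_rel) into absolute poses.

    The first frame is the identity rotation located at the origin.
    """
    R, P = np.eye(3), np.zeros(3)
    poses = [(R, P)]
    for R_rel, T_rel in relative_motions:
        P = P + R @ T_rel   # move along the heading of the previous frame
        R = R @ R_rel       # then update the heading
        poses.append((R, P))
    return poses

# Two identical steps: advance 1 unit along z, then turn 90 degrees about y.
R_turn = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
step = (R_turn, np.array([0.0, 0.0, 1.0]))
poses = accumulate_trajectory([step, step])
print(poses[-1][1])  # final position, approximately (1, 0, 1)
```

Since every pose is built on the previous one, an error in a single frame-to-frame estimate propagates to all later poses, which is why per-frame accuracy and scale consistency matter for the reconstructed trajectory.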
The proposed method is compared with the handcraft-based method VISO2-M [6] and three learning-based methods: SVR-VO [7], P-CNN [16], and Flowdometry [17].
All methods were evaluated using the KITTI benchmark metrics. VISO2-M and SVR-VO were run with the code provided in [6, 16], respectively. The results of P-CNN and Flowdometry are those reported in [16, 17]. Since VISO2-M does not solve the scale problem, we recovered its scale from the range of positions in the ground truth. Table 1 shows the performance of the compared algorithms.
To evaluate the structure of our network, RGB-VO and D-VO were additionally tested. These networks were trained end-to-end on monocular RGB images and disparity maps, respectively, using SO and without attention layers. As presented in Table 2, RGB-VO achieves better average translation performance than VISO2-M and SVR-VO. However, it is worse than the other deep network approaches. D-VO, which simply replaces the input domain with the disparity map, has better average translation performance than RGB-VO. The translation error is reduced from 13.83% to 12.44%, indicating that the disparity map is effective for translation accuracy. However, since no significant improvement in rotation is found, merely using the disparity map is not sufficient.
To assess the value of the attention block, AD-VO with SO was additionally evaluated. Comparing D-VO and AD-VO, both with SO, the translation error was reduced from 12.44% to 8.59% and the rotation error from 0.0474% to 0.0334%. These results indicate that the attention block is effective. Accordingly, AD-VO with SO shows the best average translation error and the best rotation accuracy on sequence 10. A further experiment on D-VO was conducted without SO to determine the effect of SO. Comparing D-VO without and with SO, the translation error was reduced from 16.02% to 12.44% and the rotation error from 0.0562% to 0.0474%. This comparison shows that both the attention layer and SO improve the performance.
The attention layer and SO not only improve the performance but also stabilize the results. We analyzed the standard deviation of each algorithm and network across the test sequences. AD-VO with SO had the smallest standard deviation among the algorithms, with 0.868 in translation and 0.001 in rotation. Unlike the other networks, our method shows small variation in translation error between test sequences and gives stable results regardless of the environment. Through the attention block, the network determines which regions should be considered. Fig. 3 shows the reconstructed trajectories for the test sequences.
Fig. 3: Reconstructed trajectories of the test sequences (08, 09, 10). Our algorithm is AD-VO with skip-ordering.
Table 1: Performance of the compared algorithms. Translation and rotation errors on the test sequences, computed with the KITTI devkit.
Table 2: Performance of our algorithms. Translation and rotation errors on the test sequences, computed with the KITTI devkit. AD-VO works reliably in a variety of driving environments. ‘D’ means that the model uses the disparity map and ‘A’ means that the model adopts the attention block.
In this paper, we presented a novel system to obtain F2F camera ego-motion from monocular images. We studied four different algorithms and compared their performance using the evaluation metrics provided by the KITTI benchmark. Our system is designed not only to resolve the scale problem but also to achieve stable results in various environments. The attention block lets the network learn to focus on influential regions of the image. Moreover, we proposed a skip-ordering scheme to train on larger displacements of camera ego-motion. To the best of our knowledge, AD-VO is the first F2F camera ego-motion network using only disparity maps. In the future, we aim to enhance the rotation performance by combining optical flow and disparity maps.
[1] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[2] Georg Klein and David Murray, “Parallel tracking and mapping for small ar workspaces,” in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE, 2007, pp. 225–234.
[3] Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse, “Monoslam: Real-time single camera slam,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
[4] Jakob Engel, Thomas Schöps, and Daniel Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
[5] Chris Engels, Henrik Stewénius, and David Nistér, “Bundle adjustment rules,” Photogrammetric computer vision, vol. 2, no. 2006, 2006.
[6] Andreas Geiger, Julius Ziegler, and Christoph Stiller, “Stereoscan: Dense 3d reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011 IEEE. IEEE, 2011, pp. 963–968.
[7] Thomas A Ciarfuglia, Gabriele Costante, Paolo Valigi, and Elisa Ricci, “Evaluation of non-geometric methods for visual odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014.
[8] Andreas Geiger, Philip Lenz, and Raquel Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
[9] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in CVPR, 2017, vol. 2, p. 7.
[10] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[11] Kyungjae Lee, Junhyeop Lee, Joosung Lee, Sangwon Hwang, and Sangyoun Lee, “Brightness-based convolutional neural network for thermal image enhancement,” IEEE Access, vol. 5, pp. 26867–26879, 2017.
[12] Kishore Reddy Konda and Roland Memisevic, “Learning visual odometry with a convolutional network,” in VISAPP (1), 2015, pp. 486–490.
[13] Kishore Konda and Roland Memisevic, “Unsupervised learning of depth and motion,” arXiv preprint arXiv:1312.3429, 2013.
[14] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert, “High accuracy optical flow estimation based on a theory for warping,” in European conference on computer vision. Springer, 2004, pp. 25–36.
[15] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[16] Gabriele Costante, Michele Mancini, Paolo Valigi, and Thomas A Ciarfuglia, “Exploring representation learning with cnns for frame-to-frame ego-motion estimation,” IEEE robotics and automation letters, vol. 1, no. 1, pp. 18–25, 2016.
[17] Peter Muller and Andreas Savakis, “Flowdometry: An optical flow and deep learning based approach to visual odometry,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 624–631.
[18] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, 2017, vol. 2, p. 7.
[19] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu, “Undeepvo: Monocular visual odometry through unsupervised deep learning,” arXiv preprint arXiv:1709.06841, 2017.
[20] Alex Kendall, Matthew Grimes, and Roberto Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 2938–2946.
[21] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.