Acquiring information about the road lane structure is a crucial step for autonomous navigation. To this end, several approaches tackle this task from different perspectives such as lane marking detection or semantic lane segmentation. However, to the best of our knowledge, there is as yet no purely vision-based end-to-end solution that answers the precise question: how can one estimate the relative number, or "ID", of the currently driven lane within a multi-lane road or highway?
In this work, we propose a real-time, vision-only (i.e. monocular camera) solution to the problem based on a dual left-right convention. We interpret this task as a classification problem by limiting the maximum number of lane candidates to eight. Our approach is designed to meet low-complexity specifications and limited runtime requirements. It harnesses the temporal dimension inherent to the input sequences to improve upon high-complexity state-of-the-art models. We achieve more than 95% accuracy on a challenging test set with extreme conditions and different routes.
As modern autonomous driving systems progressively target the general consumer market, they have become increasingly reliant on the analysis and processing of visual information provided by mounted cameras, which for many applications represent a suitable alternative to expensive sensors such as high-accuracy GPS or LIDARs. This was mostly enabled by the remarkable advances of vision-based approaches in robustness and accuracy, which benefited from recent progress in artificial intelligence and machine learning.
More specifically, self-driving cars operating at level three or higher require an accurate representation and understanding of the surrounding environment as a prerequisite for efficient decision-making and safe drive control. Many tasks prove necessary for this, including pixel-level semantic segmentation, object recognition and detection, mapping and localization, and safe path planning, among others. For the latter two, accurate lane-level knowledge of the position of the car on a multi-lane road or highway could play a central role in improving the accuracy of the outcome. This revolves around the capability of the car to localize itself at lane level and determine which lane it is currently driving on relative to a fixed reference such as the left or right border of the road. The resulting knowledge represents valuable features which can be stored and added to the map. Similarly, it can contribute to improving the localization step and the planning of the safest path to take.
Knowing exactly the current lane of the vehicle on a multi-lane road could be necessary for many applications. This procedure is known as 'Lane ID estimation'. It should not be confused with the task of lane detection, segmentation or estimation, which is widely popular in the literature and defined as the task of semantically distinguishing the drivable lane as a pixel-level labeled entity. A notable approach for accurately estimating the ID of the currently driven lane is LaneQuest [Aly et al., 2015]. It relies on the ubiquitous inertial sensors available in commodity smartphones to provide an accurate estimate of the car's current lane without any visual input. In [Knoop et al., 2017], a GPS-based method known as GPS-PPP is introduced. It allows for sub-meter accurate real-time positioning of vehicles on multi-lane motorways. [Dao et al., 2007] suggests a Markov-based alternative for lane-level localization that leverages connectivity between neighboring cars, which take part in the traffic within a certain range and exchange information to precisely locate each other. There exist, however, certain approaches which harness visual cues to perform lane-level positioning for autonomous vehicles. [Cui et al., 2015] proposes an accurate real-time positioning method for robotic cars in urban environments. This method uses a robust lane marking detection algorithm, as well as an efficient shape registration algorithm between the detected lane markings and a GPS-based road shape prior, to improve the robustness and accuracy of the global localization. In [Nedevschi et al., 2004], a 3D lane detection method is introduced where the availability of 3D information allows the separation between the road and the obstacle features. Consequently, the lane can be modeled as a 3D surface and the prediction of its current parameters is performed using past information and vehicle dynamics.
Figure 1: Dual left-right convention for lane ID estimation
3.1 Dual Left-Right Convention
In the context of autonomous driving, reliable perception and interpretation of traffic rules in general, and of roadway lanes in particular, is necessary for vehicles to travel safely. Autonomous cars must be reliably aware of the road structure to make proper driving decisions. In this work, we define lane ID estimation as the task of determining the relative number (ID) of the currently driven lane based only on the content of a scene captured by a monocular camera. Unlike the standard lane estimation problem, generally defined as the task of recovering the semantic, topological and geometric properties of the driven road, our work answers a very specific question: which lane, out of the many surrounding the vehicle, is currently being driven? This ID is defined with respect to a fixed reference, which can be the left or the right border of the road. Two conventions can therefore be used to define the lane ID: the left ID $ID_L$ is counted from the left road side and the right ID $ID_R$ from the right one. With lanes numbered from 1 on each side, these two conventions are related as follows:
$$ID_L + ID_R = C + 1$$
with $C$ being the lane count (total number of lanes in the image), $ID_L$ the driven lane ID using the left convention and $ID_R$ the right one. An example of our dual convention is shown in Fig. 1.
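The link between the two conventions can be illustrated with a short sketch. It assumes that lanes are numbered starting from 1 on each side, so that the left and right IDs sum to the lane count plus one; the function names are hypothetical and chosen only for this illustration.

```python
def right_id_from_left(left_id: int, lane_count: int) -> int:
    """Convert a left-convention lane ID into the right convention.

    Assumption of this sketch: lanes are numbered from 1 on both sides.
    """
    if not 1 <= left_id <= lane_count:
        raise ValueError("left_id must lie in [1, lane_count]")
    return lane_count + 1 - left_id


def is_consistent(left_id: int, right_id: int, lane_count: int) -> bool:
    """Check whether a (left ID, right ID, lane count) triple is coherent."""
    return left_id + right_id == lane_count + 1
```

For a four-lane road, a vehicle on the leftmost lane would have left ID 1 and right ID 4 under this numbering.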
The motivation behind using this dual scheme is to enforce more constraints on the estimated output of the network during the training phase. The model is expected to deliver three classification vectors: the left and right estimated lane IDs together with the lane count. The redundancy enforced by using two conventions improves the likelihood that both, or at least one, of the ID estimates is correct. We approach the task from two different points of view, namely the left and right sides of the road, and benefit from this cooperation by always using the more reliable side, i.e. the one that offers more visual information and better features. This helps in challenging situations such as strongly occluded vision or cluttered scenes. We also assume that estimating the relative lane ID with respect to the closest road side is more reliable, since features and landmarks on the border are richer and easier for the model to learn. Hence, the more visual information the camera captures from the road side, the more accurate and reliable the network estimation is likely to be.
Figure 2: The "Moka-convLSTM" architecture used for relative lane ID estimation
3.2 Model Architecture
As the backbone of the present solution, we propose an in-house designed neural network architecture for relative lane ID estimation, inspired by the model suggested in [Halfaoui et al., 2016]. Our new architecture, called "Moka-convLSTM" and shown in Fig. 2, is composed of an encoder part, which gradually extracts high-level features by down-sampling the image/feature maps, and a decoder part, which enables the recovery of the full resolution. Long-range dense links connect both parts, which improves the quality of the feature maps during both the contractive and the refinement stages. Multiple feature maps representing different high-level abstractions from previous stages are combined at each resolution level to obtain a wide range of stacked features, which are passed to the next layers using the same strategy. At the end, three independent fully connected blocks are linked to the final layer of the decoder in order to generate the required classification vectors corresponding to the left, right and lane count estimates. As a further extension, our network is equipped with ConvLSTM cells [Xingjian et al., 2015] to provide the recurrence necessary to process sequences. These cells are interposed at each level in order to capture the temporal dimension. Each one is built from a memory cell accumulating state information over the given input sequence, a forget gate deciding how much information from past cells should be retained, and input and output gates responsible, respectively, for receiving and emitting information. The main advantage of convLSTM is the ability to hold information on previous data introduced to the network and to use it to make decisions about the current input or to forecast future states. In other words, these cells help the model remember distant occurrences from the past and take them into account in the final output.
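For reference, the gate structure described above can be written out using the ConvLSTM formulation of [Xingjian et al., 2015], where $*$ denotes convolution and $\circ$ the element-wise product. The symbols follow that formulation rather than anything fixed in this section, and whether the cells used here retain the peephole terms ($W_{ci}$, $W_{cf}$, $W_{co}$) is an open assumption of this sketch.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Here $X_t$ is the input feature map at time $t$, $C_t$ the memory cell state, $H_t$ the hidden state, and $i_t$, $f_t$, $o_t$ the input, forget and output gates.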
3.3 Training Setup
The proposed model was trained end-to-end to perform relative lane ID estimation on a set of 244 sequences of 2500 images each (about 600k images). The data, which cannot be disclosed due to legal and security restrictions, was recorded exclusively in the Chinese city of Shanghai on different dates and routes and in different seasons, weather conditions and times of day, between June 2018 and February 2019. For the labeling, a semi-automatic strategy based on a high-accuracy GPS device and a pre-stored high-definition map of the city was used to generate lane count and lane ID labels for each image, following the dual convention scheme detailed above. For training, we employ a custom version of PyTorch [pyt, 2019 accessed July 7 2019] and make use of the Adam optimizer [Kingma and Ba, 2014]. We train the network for up to 300k iterations with batches of size 2 containing sequences of 4 consecutive frames, and a learning rate that is divided by 2 every 20k iterations starting from iteration 150k. We set momentum values to and with weight-decay . Input images were introduced in sequential form under the resolution and underwent heavy random augmentations performed on the fly...
Our proposed network performs a classification task to estimate the lane count and the left and right IDs out of a total number of N = 8 classes. The proposed cost function for training the network is composed of three terms:
1. A standard cross-entropy loss term $L_{CE}$ for multi-class classification, applied to each output and defined as:
$$L_{CE}(x, y) = -\sum_{i=1}^{N} y_i \log(x_i)$$
where $x$ is the estimated probability classification vector and $y$ the corresponding ground-truth label in one-hot encoding form. This term is valid for all three outputs and their corresponding ground-truth labels.
2. An adaptive penalty term that gives more weight to the smaller of the left and right ID estimates, in line with our previous assumption that the nearest road side is the more suitable reference point for defining the lane ID. Here $z$ denotes the predicted scalar value for the left or the right estimate.
3. A triangular constraint enforcing the linear dependence between the three predicted scalar outputs estimated by the network (right ID $z_R$, left ID $z_L$ and lane count $z_C$), which are expected to fulfill the following relation when lanes are numbered from 1 on each side:
$$z_L + z_R = z_C + 1$$
To sum up, the final training cost function combines these three terms, where the subscripts $L$, $R$ and $C$ refer to left, right (IDs) and lane count respectively: $x_L, x_R, x_C$ are the probability classification vectors estimated by the network, $y_L, y_R, y_C$ the corresponding ground-truth labels in one-hot encoding form, and $z_L, z_R, z_C$ the final scalar estimates for the IDs and lane count.
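Two of the loss ingredients described above can be sketched in a few lines. The cross-entropy follows the standard definition over a probability vector and a one-hot label. The triangular term is shown as an absolute residual of the relation between left ID, right ID and lane count, which is an assumption of this sketch (as is the numbering of lanes from 1); the exact penalty shape, the adaptive term and the weighting of the terms are not specified here and are left out.

```python
import math


def cross_entropy(probs, one_hot, eps=1e-12):
    """Standard multi-class cross-entropy between a probability vector
    and a one-hot ground-truth vector of the same length."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot))


def triangular_residual(z_left, z_right, z_count):
    """Deviation from the relation left + right = count + 1.

    Assumption: lanes are numbered from 1 and the deviation is measured
    as an absolute difference; a value of 0 means the triple is coherent.
    """
    return abs(z_left + z_right - (z_count + 1))
```

A coherent triple such as left ID 2, right ID 3 and lane count 4 gives a zero residual, while any mismatch between the three outputs gives a positive one.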
4.1 Brightness Adjustment For Recurrent Models
We propose a pre-processing step to adjust the brightness level of input images. Images of a single input sequence need to depict reasonable brightness levels, without strong fluctuations, so that features are easily extractable and learnable. Adjusting the brightness domain in which the recurrent network operates therefore proves necessary, as the training set covers a wide range of weather conditions, times of day and seasons, which in turn causes considerable illumination variations among the training images. In particular, we notice that any significant abrupt change in brightness between images of the same input sequence is tightly correlated with a decrease in lane ID estimation accuracy, as past information tracked through the convLSTM cells contributes significantly to shaping the decision of the network. If the model is presented with a sequence with considerable brightness fluctuations, the convLSTM cells do not benefit from tracking past information over time, but rather suffer from it, as current and past internal states differ substantially.
As a measure for enforcing consistency, we first keep track of the running average perceived brightness of the input images belonging to the same sequence. Then, in the deployment phase, we process each incoming image before feeding it to the model according to how its brightness compares to the tracked average. Specifically, if the perceived brightness of the current frame is below the tracked value, we adjust its brightness via a linear transformation of pixel intensities inspired by gamma correction. As an example, assume a car is driving on a normal highway in clear conditions when a tunnel appears. If image I, the currently processed image, is the first frame captured inside the tunnel, then it depicts totally different brightness properties compared to past images, as the lighting conditions have changed abruptly. Assume that I is split into its red, green and blue color channels, and consider its perceived brightness together with the running average brightness tracked over all images captured outside the tunnel. We first calculate an adjusting factor from these two brightness values.
Then, we use this factor to generate new color channels for the image I, producing three new channel matrices pixel by pixel, with the pixel location defined by the tuple (x, y).
After reconstructing the brightness-adjusted image from the newly generated color channels, we finally feed it to the pre-trained model to perform relative lane ID estimation.
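The pre-processing step can be sketched as follows. Several details are assumptions of this sketch and are not taken from the text: perceived brightness is computed with the Rec. 601 luma weights, the adjusting factor is the ratio between the tracked average and the current brightness, the running average is updated with the unadjusted brightness, and an image is represented as a flat list of (r, g, b) tuples in the range 0 to 255. The class and method names are hypothetical.

```python
class BrightnessEqualizer:
    """Tracks the running average brightness of a sequence and brightens
    frames that fall below it (a sketch under the stated assumptions)."""

    def __init__(self):
        self.mean_brightness = None  # running average over past frames
        self.count = 0

    @staticmethod
    def perceived_brightness(pixels):
        # Assumed brightness measure: mean Rec. 601 luma over all pixels.
        total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
        return total / len(pixels)

    def adjust(self, pixels):
        current = self.perceived_brightness(pixels)
        adjusted = pixels
        if self.mean_brightness is not None and 0 < current < self.mean_brightness:
            # Assumed adjusting factor: ratio of tracked to current brightness.
            factor = self.mean_brightness / current
            # Linear scaling of every channel, clipped to the valid range.
            adjusted = [tuple(min(255.0, c * factor) for c in px) for px in pixels]
        # Update the running average with the unadjusted brightness (assumption).
        self.count += 1
        if self.mean_brightness is None:
            self.mean_brightness = current
        else:
            self.mean_brightness += (current - self.mean_brightness) / self.count
        return adjusted
```

In the tunnel example, the first dark frame would be scaled up towards the brightness level observed outside the tunnel before being passed to the network.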
Figure 3: The decision module for final adopted convention
4.2 Decision Module For Final Used Convention
In the deployment phase, our model simultaneously delivers two ID estimates for each input image, one according to the left convention and one according to the right convention, together with the lane count. Hence, an additional decision step is needed to pick a single final output out of these two candidates. Even if both estimates are correct, only a single ID should be reported, otherwise confusion might arise for the user. For the same reason, the final estimated lane ID must always be specified together with the adopted convention. Many strategies could be used to make this decision. We opt for a specific technique we refer to as the entropy-based decision (we borrow the "entropy" concept from physics to describe how dispersed the distribution of a certain set of elements is). This decision process can be summarized as follows:
After obtaining the output classification vectors delivered by the pre-trained model, we consider each one separately. We first select the maximum probability value out of the N elements of the vector. Then, we calculate the mean value of all elements of the vector. Finally, we subtract the resulting mean from the selected maximum, which yields a value that roughly describes the distribution of elements in each vector. These values are compared to decide on the final output lane ID, and the convention with the higher value is adopted as the final estimate.
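The decision rule described above can be sketched as follows. The function names are hypothetical, lane IDs are assumed to be the class index plus one, and ties are resolved in favor of the left convention, which is an arbitrary choice of this sketch.

```python
def confidence(vector):
    """Maximum element minus the mean of all elements."""
    return max(vector) - sum(vector) / len(vector)


def choose_convention(left_vector, right_vector):
    """Return the adopted convention and its lane ID (assumed 1-based)."""
    left_score = confidence(left_vector)
    right_score = confidence(right_vector)
    if left_score >= right_score:  # tie goes to the left convention (assumption)
        winner, vector = "left", left_vector
    else:
        winner, vector = "right", right_vector
    lane_id = max(range(len(vector)), key=vector.__getitem__) + 1
    return winner, lane_id
```

Note that if both vectors are normalized probability distributions of equal length, their means coincide and the comparison reduces to comparing the two maxima; the criterion only differs from a plain maximum when the vectors are not normalized in the same way.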
Furthermore, a term penalizing fluctuations between estimates for adjacent input images is introduced. The idea is to enforce smooth estimation by avoiding unreasonable jumps in the predicted lanes between consecutive frames. A vehicle cannot abruptly move from the first to the seventh lane, for example, within the very short time span separating the capture of two consecutive images. Therefore, before comparing the previously discussed entropy values, we weight each of them with a factor P specific to each convention, which is computed from the final output at the current time t and the output of the previous frame using the same convention.
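To illustrate the idea of weighting each convention's score against lane jumps, here is a small sketch. The form of the factor used below, 1 / (1 + |current ID - previous ID|), is purely hypothetical and chosen only for illustration; it is not the formula of the approach, which is not reproduced here.

```python
def jump_penalty(current_id, previous_id):
    """Hypothetical weighting factor: 1 for no lane change, smaller for
    larger jumps between consecutive frames."""
    return 1.0 / (1.0 + abs(current_id - previous_id))


def weighted_score(score, current_id, previous_id):
    """Scale a convention's decision score by its jump penalty."""
    return score * jump_penalty(current_id, previous_id)
```

With such a factor, a convention that predicts a jump from lane 1 to lane 7 would need a much higher raw score to be preferred over a convention whose estimate stays on the same lane.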
For evaluation, we run a battery of experiments to assess the proposed approach. As the targeted application is quite specific, our literature research did not reveal other methods aiming to solve the precise task of visual end-to-end lane ID estimation. Generally, the community is interested in a larger scope including applications such as lane segmentation [Oliveira et al., 2016; Jang et al., 2018; Meyer et al., 2018], estimation [Kim and Park, 2017; Gurghian et al., 2016; Rabe et al., 2016], prediction [Tang et al., 2018; Son et al., 2015] or detection [Li et al., 2016; Lee et al., 2017; Niu et al., 2016; Jung et al., 2015]. Although these methods could be leveraged to solve the task at hand, this would require additional intermediate steps and further processing to estimate the current lane ID. Since we are interested in an efficient solution intended as a building block of a large processing chain for high-complexity applications such as mapping, localization and path planning, a complex and sophisticated approach that is expensive in time and resources would be unfit for these purposes. For this reason, we avoid complex methods and focus instead on the aspects that make the solution suitable for the requirements of the use case (efficiency, real-time usage and accuracy). We analyse the performance of some popular custom architectures for lane ID estimation from monocular images. These include Alexnet [Krizhevsky, 2014], VGG [Simonyan and Zisserman, 2014], Resnet [He et al., 2016] and Densenet [Huang et al., 2017]... For the sake of consistency, all models were trained using the previously detailed setup and tested on a new set of 163 previously unseen sequences of 2500 images each (about 400k images).
A numerical comparison between the architectures in terms of performance, runtime required to process a single image on an NVIDIA GeForce GTX TITAN X, memory size and number of parameters is shown in Table 1. Results indicate that our dual convention scheme offers a meaningful improvement over considering left and right accuracies separately. This holds for all models, both for the raw accuracy (the upper-bound case where we always use the better of the two conventions, i.e. at least one ID is correct) and for the final accuracy, which is calculated after applying a decision module that chooses a single output out of the two candidates (and may not always make the correct choice). Reaching high accuracy levels comes at the cost of complexity. Hence, it is expected that the two models with the highest complexity, namely VGG16 and VGG19, perform best in terms of both raw and final accuracy. Our basic Moka model (without recurrence), designed to fulfill the tight time and memory requirements of our business use case, performs quite decently considering its limited number of parameters, with 92.75% raw accuracy and 86.23% final accuracy. The question that arose was how to improve the performance of the basic Moka model while keeping its complexity low. To answer this, we decided to rely on the temporal dimension of input image sequences instead of stacking additional layers on top of the current architecture. Two alternatives were considered to improve the performance without losing too much on complexity. First, we extended the fully connected blocks after the decoder part with standard LSTM cells to track past estimates, where the inputs are the one-dimensional classification vectors. We named this version Moka-stdLSTM. Second, we proposed the Moka-convLSTM version with interposed memory cells at each convolutional layer to keep track of the slightest details aggregated from previous runs. Unlike the standard case, the incoming cell inputs are two-dimensional (2D) convolved feature maps. Both modified versions improved significantly upon the original without sacrificing the important properties of the basic Moka model. Although slightly outperformed, they reached the accuracy range of the best networks with a significant advantage in terms of required runtime and complexity, as shown in Table 1.
Table 1: Numerical performance analysis for custom architectures Alexnet, VGG, Resnet, Densenet, ResNext, Shufflenet, Mobilenet and different versions of our Moka model (basic, convLSTM, standard LSTM) on test set
Table 2: Performance analysis using different brightness thresholds for pre-processing before applying the various models on test set
Figure 4: Final accuracies using different pre-processing setups for brightness adjustment
Table 3: Performance comparison (final accuracy) using different decision criteria for the choice between left and right conventions. Considering left and right classification vectors, we compare Max: Maximum values, Max-M: Maximums - means of the corresponding vectors, E: Entropy values, Max-E: Maximums - entropies, Z-score: Statistical standard scores.
Figure 5: Visual result samples using best performing model Moka-convLSTM-B130 (LtoR: left convention, RtoL: right convention)
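The distinction between raw and final accuracy can be made concrete with a short sketch. It follows the definitions given in the text: a sample counts towards raw accuracy if at least one of the two convention estimates is correct, and towards final accuracy only if the estimate picked by the decision module is correct. The tuple layout and function name are assumptions of this sketch.

```python
def raw_and_final_accuracy(samples):
    """Compute (raw, final) accuracy.

    Each sample is a tuple
    (left_pred, right_pred, left_truth, right_truth, chosen),
    where chosen is "left" or "right" as picked by the decision module.
    """
    raw_hits = 0
    final_hits = 0
    total = 0
    for left_pred, right_pred, left_truth, right_truth, chosen in samples:
        left_ok = left_pred == left_truth
        right_ok = right_pred == right_truth
        raw_hits += left_ok or right_ok  # upper bound: either side correct
        final_hits += left_ok if chosen == "left" else right_ok
        total += 1
    if total == 0:
        raise ValueError("no samples given")
    return raw_hits / total, final_hits / total
```

By construction the final accuracy can never exceed the raw accuracy, and the gap between the two measures how often the decision module picks the wrong convention when only one of them is correct.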
As a further improvement measure, we proposed, as previously discussed, an additional pre-processing step to enforce brightness level consistency. A thorough examination of different brightness thresholds was performed during experiments, but for the sake of simplicity we only show the results for a few chosen values in Table 2. As depicted in Table 2 and Fig. 4, this measure has a much stronger effect on recurrent models, which consider the temporal dimension, than on custom networks trained on single images. Despite the additional time overhead needed for each incoming image, the low-complexity recurrent models still fulfill the required runtime constraints and are now clearly ahead of the VGG models in terms of accuracy, with MOKA-convLSTM-B130 as the new best performer achieving 98.41% and 95.36% for raw and final accuracy, respectively.
The difference between raw and final accuracies for all models is a sign that the decision step we use to choose between left and right conventions can be improved. There is still a gap to close between the final accuracy reached and the best performance achievable if we fully exploit the model estimates for both conventions. As a final test, we consider several decision criteria in combination with all models to thoroughly explore the impact of this step on final accuracies. The goal is to find the decision strategy that minimizes this gap.
The comparison in Table 3 shows that the different criteria affect performance differently depending on the model they are combined with. However, we observe that the recurrent model MOKA-convLSTM-B130 is still the best performer in three cases out of five and, more importantly, the best overall performer, despite a sustained difference between the raw accuracy of 98.71% and the new final accuracy value of 95.47%.
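The five decision criteria named in the caption of Table 3 can be sketched as scoring functions over a classification vector. Their exact definitions are not given in the text, so the forms below are assumptions: entropy is taken as Shannon entropy (negated so that a higher score always means higher confidence), and the Z-score is the standard score of the maximum element with respect to the vector's mean and population standard deviation.

```python
import math
from statistics import mean, pstdev


def shannon_entropy(vector, eps=1e-12):
    """Shannon entropy of a probability vector (natural logarithm)."""
    return -sum(p * math.log(p) for p in vector if p > eps)


def z_score_of_max(vector):
    """Standard score of the maximum element; 0 for a constant vector."""
    spread = pstdev(vector)
    return (max(vector) - mean(vector)) / spread if spread > 0 else 0.0


# Higher score = more confident convention (assumed orientation).
CRITERIA = {
    "Max": max,
    "Max-M": lambda v: max(v) - mean(v),
    "E": lambda v: -shannon_entropy(v),
    "Max-E": lambda v: max(v) - shannon_entropy(v),
    "Z-score": z_score_of_max,
}


def pick_convention(left_vector, right_vector, criterion):
    """Return "left" or "right" according to the named criterion."""
    score = CRITERIA[criterion]
    return "left" if score(left_vector) >= score(right_vector) else "right"
```

On a sharply peaked left vector and a nearly flat right vector all five criteria agree; they only diverge on harder cases where both vectors are similarly dispersed, which is where the choice of criterion affects the final accuracy.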
In the context of autonomous driving, there is a growing consensus that the more boundaries we push towards full autonomy, the more specific and challenging the technical issues to be addressed become. In this work, we provide a solution to the particular task of lane ID estimation that can be beneficial to several applications such as mapping and localization, path planning and safe path estimation.
Our novel end-to-end low-complexity solution relies only on monocular visual information and harnesses the temporal dimension inherent to the input sequences to yield precise lane ID estimation under real-time requirements using a convLSTM-based network.
[Aly et al., 2015] Heba Aly, Anas Basalamah, and Moustafa Youssef. Lanequest: An accurate and energy-efficient lane detection system. In 2015 IEEE International Conference on Pervasive Computing and Communications (PerCom), pages 163–171. IEEE, 2015.
[Cui et al., 2015] Dixiao Cui, Jianru Xue, and Nanning Zheng. Real-time global localization of robotic cars in lane level via lane marking detection and shape registration. IEEE Transactions on Intelligent Transportation Systems, 17(4):1039–1050, 2015.
[Dao et al., 2007] Thanh-Son Dao, Keith Yu Kit Leung, Christopher Michael Clark, and Jan Paul Huissoon. Markov-based lane positioning using intervehicle communication. IEEE Transactions on Intelligent Transportation Systems, 8(4):641–650, 2007.
[Gurghian et al., 2016] Alexandru Gurghian, Tejaswi Koduri, Smita V Bailur, Kyle J Carey, and Vidya N Murali. Deeplanes: End-to-end lane position estimation using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 38–45, 2016.
[Halfaoui et al., 2016] Ibrahim Halfaoui, Fahd Bouzaraa, and Onay Urfalioglu. Cnn-based initial background estimation. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 101–106. IEEE, 2016.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
[Huang et al., 2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
[Jang et al., 2018] Wonje Jang, Jhonghyun An, Sangyun Lee, Minho Cho, Myungki Sun, and Euntai Kim. Road lane semantic segmentation for high definition map. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1001–1006. IEEE, 2018.
[Jung et al., 2015] Soonhong Jung, Junsic Youn, and Sanghoon Sull. Efficient lane detection based on spatiotemporal images. IEEE Transactions on Intelligent Transportation Systems, 17(1):289–295, 2015.
[Kim and Park, 2017] Jiman Kim and Chanjong Park. End-to-end ego lane estimation based on sequential transfer learning for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 30–38, 2017.
[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Knoop et al., 2017] Victor L Knoop, Peter F de Bakker, Christian CJM Tiberius, and Bart van Arem. Lane determination with gps precise point positioning. IEEE Transactions on Intelligent Transportation Systems, 18(9):2503–2513, 2017.
[Krizhevsky, 2014] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks, 2014.
[Lee et al., 2017] Seokju Lee, Junsik Kim, Jae Shin Yoon, Seunghak Shin, Oleksandr Bailo, Namil Kim, TaeHee Lee, Hyun Seok Hong, Seung-Hoon Han, and In So Kweon. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1947–1955, 2017.
[Li et al., 2016] Jun Li, Xue Mei, Danil Prokhorov, and Dacheng Tao. Deep neural network for structural prediction and lane detection in traffic scene. IEEE transactions on neural networks and learning systems, 28(3):690–703, 2016.
[Meyer et al., 2018] Annika Meyer, N Ole Salscheider, Piotr F Orzechowski, and Christoph Stiller. Deep semantic lane segmentation for mapless driving. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 869–875. IEEE, 2018.
[Nedevschi et al., 2004] Sergiu Nedevschi, Rolf Schmidt, Thorsten Graf, Radu Danescu, Dan Frentiu, Tiberiu Marita, Florin Oniga, and Ciprian Pocol. 3d lane detection system based on stereovision. In Proceedings. The 7th International IEEE Conference on Intelligent Transportation Systems (IEEE Cat. No. 04TH8749), pages 161–166. IEEE, 2004.
[Niu et al., 2016] Jianwei Niu, Jie Lu, Mingliang Xu, Pei Lv, and Xiaoke Zhao. Robust lane detection using two-stage feature extraction with curve fitting. Pattern Recognition, 59:225–233, 2016.
[Oliveira et al., 2016] Gabriel L Oliveira, Wolfram Burgard, and Thomas Brox. Efficient deep models for monocular road segmentation. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4885–4891. IEEE, 2016.
[pyt, 2019 accessed July 7 2019] Pytorch version 1.0, 2019 (accessed July 7, 2019).
[Rabe et al., 2016] Johannes Rabe, Marc Necker, and Christoph Stiller. Ego-lane estimation for lane-level navigation in urban scenarios. In 2016 IEEE Intelligent Vehicles Symposium (IV), pages 896–901. IEEE, 2016.
[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[Son et al., 2015] Young Seop Son, Wonhee Kim, SeungHi Lee, and Chung Choo Chung. Predictive virtual lane method using relative motions between a vehicle and lanes. International Journal of Control, Automation and Systems, 13(1):146–155, 2015.
[Tang et al., 2018] Jinjun Tang, Fang Liu, Wenhui Zhang, Ruimin Ke, and Yajie Zou. Lane-changes prediction based on adaptive fuzzy neural network. Expert Systems with Applications, 91:452–463, 2018.
[Xingjian et al., 2015] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.