Fully autonomous driving requires understanding the environment around vehicles. Various perception modules are fused for this understanding, and many pattern recognition and computer vision techniques are applied in these perception modules [1], [2]. Lane detection, which localizes the drivable area on a road, is a major perception technique. There are many ways to recognize lanes, but most techniques utilize traffic line detection [3], [4] or road region segmentation [5], [6]. In this paper, we focus on traffic line detection for recognizing lanes. Fig. 1 shows the purpose of our proposed method, which predicts exact key points of lanes from input RGB images and, using embedding features extracted by the proposed network, distinguishes key points into individual instances. In addition, the proposed network is trained end-to-end, and the network size can be modified according to the computing power of the target system without any change of the network architecture or additional training.
This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) (No. 2014-3-00077, Development of Global Multi-target Tracking and Event Prediction Techniques Based on Real-time Large-Scale Video Analysis), the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) (No. 2019R1A2C2087489), and the GIST Research Institute (GRI) grant funded by GIST in 2019.
Y. Ko, Y. Lee, S. Azam, F. Munir and M. Jeon are with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005, South Korea (e-mail: {koyeongmin, brightyoun, shoaibazam, farzeen.munir, mgjeon}@gist.ac.kr).
W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6R 2V4, Canada, with the Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia, and also with the Systems Research Institute, Polish Academy of Sciences, Warsaw 01-447, Poland (email: wpedrycz@ualberta.ca).
Fig. 1. System overview. The proposed framework predicts key points on traffic lines and distinguishes individual instances regardless of the number of traffic lines. In addition, if a user wants to run the trained model on a system with weak computing power, such as an embedded board, the network can be clipped and transferred without additional training.
Most traditional methods of traffic line detection extract low-level traffic line features using various hand-crafted features such as color [7], [8] or edges [9], [10]. These low-level features can be combined using a Hough transform [11], [12] or a Kalman filter [13]; the combined features generate traffic line segment information. These methods are simple and can be adapted to various environments without significant modification. Still, their performance depends on the conditions of the testing environment, such as lighting and occlusion.
Deep learning methods show outstanding performance for complex scenes. Among deep learning methods, Convolutional Neural Network (CNN) methods are primarily applied for feature extraction in computer vision [14], [15]. Semantic segmentation methods [16], [17], [18], a major research area in computer vision, are frequently applied to traffic line detection problems to make inferences about shapes and locations [19], [20], [21], [22]. Some methods use multi-class approaches to distinguish individual traffic line instances. Therefore, even though these methods can achieve outstanding performance, they can only be applied to scenes that contain a fixed number of traffic lines. As a solution to this problem, instance segmentation methods are applied to distinguish individual instances. These semantic segmentation-based traffic line detection methods require some post-processing to estimate the exact location values of the predicted traffic lines. To avoid this post-processing, several other methods directly predict traffic line locations [23], [24].
Fig. 2. Proposed framework with three main parts. The input data is compressed by the resizing network; the compressed input is fed to the predicting network, which includes four hourglass modules. Three output branches are applied at the end of each hourglass block; they predict confidence, offset, and embedding feature. The loss function can be calculated from the outputs of each hourglass block. By clipping several hourglass modules, the required computing resources can be adjusted.
The existing methods have certain limitations. Semantic segmentation methods require pixel-level labeling or pre-processing for training, which is cumbersome. These methods also predict many unnecessary points, because semantic segmentation generates classified pixel images with the same size as the input image even though only a few points are required to recognize traffic lines. In addition, existing methods cannot be adapted to the available computing power. To apply them to light systems such as embedded boards, the entire architecture must be modified and trained again.
To overcome these limitations, our proposed method uses a deep learning model inspired by the stacked hourglass network to predict a few key points on traffic lines. The stacked hourglass network [25] is usually applied in key points estimation fields such as pose estimation [26] and object detection [27], [28]. Using a sequence of down-sampling and up-sampling, the stacked hourglass network can extract information at various scales. Because the stacked hourglass network includes several hourglass modules that are trained by the same loss function, we can simultaneously obtain various models that have different parameter sizes by clipping some modules from the whole structure. Using a simple method inspired by point cloud instance segmentation [29], each key point is assigned to an individual instance.
Camera-based traffic line detection has been actively developed, and many state-of-the-art methods [30], [24] perform almost perfectly on public datasets. However, some methods have high false positive rates. False negatives, that is, traffic lines that the module fails to detect, do not suddenly change the control values, and correct control values can be predicted from other detected traffic lines or previous results. However, false positives can lead to severe risks; incorrect identification of traffic lines by the module can cause rapid changes of the control values.
In summary, Fig. 2 shows our proposed framework for traffic line detection. It has three output branches and predicts the exact location and instance features of points on traffic lines. More details are given in Section III. The primary contributions of this study are as follows:
• Using the key points estimation approach, we propose a novel method for traffic line detection. It produces a more compact prediction output than semantic segmentation-based methods.
• The framework consists of several hourglass modules, and so we can obtain various models that have different sizes by simple clipping because each hourglass module is trained simultaneously using the same loss function.
• The proposed method can be applied to various scenes that include any orientation of traffic lines, such as vertical or horizontal traffic lines, and arbitrary numbers of traffic lines.
• The proposed method achieves a lower false positive rate together with noteworthy accuracy, which supports the stability of the autonomous driving car.
A. Traffic Line Detection
Lane detection is an important research area in autonomous driving. Lane detection modules recognize drivable areas on roads from input data. Traffic line detection is considered a main method for lane detection; it usually localizes the line markings that delimit drivable areas on roads. For RGB input images in particular, various handcrafted features have been proposed to detect traffic lines [31], [32], [33], [34], [35]. However, these methods show limitations in complex scenarios.
Recently, deep learning has become a dominant method in computer vision research. Semantic segmentation [16], [17], [18], [36] is a major topic in perception research; it classifies each pixel of the input image into an individual class. Generative methods [37], [38] can also perform a similar function. Therefore, semantic segmentation methods and generative methods are suitable for expressing complex shapes of lines. [20], [30], [39], and [40] show applications of semantic segmentation and generative models to traffic line detection. Some methods use multi-class approaches to distinguish each instance; however, multi-class approaches can classify only a fixed number of instances. Instance segmentation approaches are proposed as solutions to this limitation. Neven et al. [41] attempted to solve this problem of multi-class approaches with instance segmentation. Their proposed LaneNet has a shared encoder and two decoders. One of these decoders performs binary lane segmentation; the other predicts embedding features for instance segmentation.
Fig. 3. Details of the hourglass block, consisting of three types of bottle-neck layers: same bottle-necks, down bottle-necks, and up bottle-necks. Output branches are applied at the ends of hourglass layers; the confidence output is forwarded to the next block.
Although semantic segmentation methods can predict lines that have complex shapes, they require pixel-level labeled data for training and post-processing to extract exact points on lines. Some direct methods [23], [24] directly generate exact points on lines. [23] predicts exact starting and terminal points, and x-axis values at fixed y-axis values for each traffic line. [24] presents the Line Proposal Unit (LPU), inspired by the Region Proposal Network (RPN) of Faster R-CNN [42]. The LPU predicts horizontal offsets for fixed y-axis values along certain pre-defined line proposals.
These approaches, namely the semantic segmentation method, the generative method, and the direct method, produce many unnecessary output values. In semantic segmentation and generative methods, not all pixels are required to recognize traffic lines; an exact line can be predicted from a few key points. Direct methods also make certain unnecessary predictions, such as the length, starting points, and terminal points of the target traffic lines, which are unknown.
B. Key Points Estimation
Key points estimation techniques predict certain important points, called key points, from input images. Human pose estimation [26] is a major research topic in the key points estimation area. Stacked hourglass networks [25] consist of several hourglass modules that are trained simultaneously. The hourglass module can transfer information at various scales to deeper layers, helping the whole network obtain both global and local features. Because of this property, an hourglass network is frequently utilized to detect centers or corners of objects in the object detection area. In addition to network architectures and loss functions, refinement methods that can be adapted to existing networks have been developed for key point estimation. [43] suggests a feature aggregation and coarse-to-fine supervision method that can be applied to other multi-stage methods. [44] proposes a refinement network that improves the results of other existing models. In this paper, these refinement methods are not applied, so that the results indicate the performance of the proposed framework itself; however, they can be applied to improve the performance.
For lane detection, we train a neural network that consists of several hourglass modules. The network, which we will refer to as the Point Instance Network (PINet), generates points on lanes and distinguishes the predicted points into individual instances. To achieve these tasks, our proposed neural network includes three output branches: a confidence branch, an offset branch, and an embedding branch. The confidence and offset branches predict exact points of traffic lines; loss functions inspired by YOLO [45] are applied. The embedding branch generates the embedding features of each predicted point; the embedding feature is fed to the clustering process to distinguish each instance. The loss function of the embedding branch is inspired by an instance segmentation method. The Similarity Group Proposal Network (SGPN) [29], an instance segmentation framework for 3D point clouds, introduces a simple technique and a loss function for instance segmentation. Based on the contents proposed by SGPN, we design a loss function suited to discriminating each instance of the predicted traffic lines. Section III-A introduces details of the main architecture; Section III-B consists of details about the loss function; and Section III-C shows the implementation in detail.
Fig. 4. Details of bottle-neck. The three kinds of bottle-neck have different first layers according to their purposes.
A. Architecture
Fig. 2 shows the proposed framework of the network. The input RGB image is fed to the resizing network, where a sequence of convolution layers compresses it to a smaller size; the output of the resizing network is fed to the predicting network. An arbitrary number of hourglass modules can be included in the predicting network; four hourglass modules are used in this study. All hourglass modules are trained simultaneously by the same loss function. After the training step, a user can choose how many hourglass modules to use according to the available computing power, without any additional training. The following sections provide details about each network.
1) Resizing Network: The resizing network reduces the input image's size to save memory and inference time. This network consists of three convolution layers, each applied with stride 2 and padding size 1. PReLU [46] and batch normalization [47] are utilized after each convolution layer. Finally, this network generates the resized output that is fed to the predicting network. Table I shows details of the constituent layers.
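The effect of the three stride-2 layers on the spatial size can be checked with the standard convolution output-size formula. The following is only a sketch: the filter size of 3 and the 512 x 256 example input are assumptions made for illustration, not values confirmed by this section.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula).

    kernel=3 is an assumed value; stride and padding follow the text.
    """
    return (size + 2 * padding - kernel) // stride + 1


def resized_shape(width, height, num_layers=3, kernel=3, stride=2, padding=1):
    """Apply the same convolution geometry num_layers times to both axes."""
    for _ in range(num_layers):
        width = conv_output_size(width, kernel, stride, padding)
        height = conv_output_size(height, kernel, stride, padding)
    return width, height


# Hypothetical 512 x 256 input: each stride-2 layer halves both axes.
print(resized_shape(512, 256))  # (64, 32)
```

Under these assumptions each output cell covers 8 input pixels per side, which agrees with the cell size used later for the offset branch.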
TABLE I DETAILS OF RESIZING NETWORK
2) Predicting Network: The output of the resizing network is fed to the predicting network, which is described in this section. This part predicts the exact points on the traffic lines and the embedding features for instance segmentation. The network consists of several hourglass modules, each including an encoder, a decoder, and three output branches, as shown in Fig. 3. Skip-connections transfer information at various scales to deeper layers. Each colored block in Fig. 3 is a bottle-neck module; these bottle-neck modules are described in Fig. 4. There are three kinds of bottle-neck: same, down, and up bottle-necks. The same bottle-neck generates output that has the same size as the input. The down bottle-neck is applied for down-sampling in the encoder; its first layer is replaced by a convolution layer with filter size 3, stride 2, and padding 1. For the up bottle-neck in the up-sampling layers, a transposed convolution layer with filter size 3, stride 2, and padding 1 is applied. Each output branch has three convolution layers and generates a grid. For each cell in the output grid, the output branches predict a confidence value about key point existence, an offset, and an embedding feature. Table II shows details of the predicting network. Because a deeper network has better performance [25], it can act as a teacher network. Therefore, using knowledge distillation techniques, we can expect better performance for clipped short networks. The number of channels differs for each output branch (confidence: 1, offset: 2, embedding: 4), and the corresponding loss function is applied according to the goal of each output branch.
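Because every hourglass module ends in its own output branches, a clipped network can be obtained by simply stopping after the first few modules. The sketch below illustrates this idea with toy callables; the function name `run_clipped` and the module interface (a callable returning the next feature map and the branch outputs) are hypothetical and are not taken from the authors' implementation.

```python
def run_clipped(resized_input, hourglass_modules, num_modules):
    """Run only the first num_modules hourglass modules.

    Each module is assumed to be a callable mapping a feature map to
    (next_feature_map, branch_outputs). The outputs of the last module
    that was run are returned.
    """
    if not 1 <= num_modules <= len(hourglass_modules):
        raise ValueError("num_modules must be between 1 and the number of trained modules")
    features = resized_input
    outputs = None
    for module in hourglass_modules[:num_modules]:
        features, outputs = module(features)
    return outputs


# Toy stand-ins for trained modules: each adds 1 and reports its own index.
def make_toy_module(index):
    return lambda x: (x + 1, {"module": index, "value": x + 1})


toy_network = [make_toy_module(i) for i in range(1, 5)]
print(run_clipped(0, toy_network, 2))  # {'module': 2, 'value': 2}
```

No weights are changed in this procedure; the shorter model reuses the parameters that were trained as part of the full stack.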
TABLE II DETAILS OF PREDICTING NETWORK
B. Loss Function
For training, four loss functions are applied to the output branches of each hourglass module. The following sections provide details of each loss function. As in Table II, the output branch generates a 64 grid, and each cell in the output grid consists of the predicted values of 7 channels, including the confidence value (1 channel), the offset value (2 channels), and the embedding feature (4 channels). The confidence value determines whether or not a key point of a traffic line exists; the offset value localizes the exact position of the key point predicted by the confidence value; and the embedding feature is utilized to distinguish key points into individual instances. Therefore, three loss functions, that is, all except the distillation loss function, are applied to each cell of the output grid. The distillation loss function, which distills the knowledge of the teacher network, is applied to the distillation layer of each encoder, as shown in Table II. Details of each predicted value and feature are given in the following sections.
1) Confidence Loss: The confidence output branch predicts the confidence value of each cell. If a key point is present in the cell, the confidence value is close to 1; if not, it is close to 0. The output of the confidence branch has 1 channel, and it is fed to the next hourglass module. The confidence loss consists of two parts, the existence loss and the non-existence loss. The existence loss is applied to cells that include key points; the non-existence loss is utilized to reduce the confidence value of each background cell. The non-existence loss is computed only at background cells whose predicted confidence value is higher than 0.01. Because cells far from key points converge rapidly, this technique helps the training concentrate on cells closer to the key points. The following shows the loss function of the confidence branch:
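$$
\begin{aligned}
L_{exist} &= \frac{1}{N_e} \sum_{c_c \in G_e} (c_c^* - c_c)^2, \\
L_{non\text{-}exist} &= \frac{1}{N_n} \sum_{c_c \in G_n,\; c_c > 0.01} (c_c^* - c_c)^2 + 0.00001 \cdot \sum_{c_c \in G_n} c_c^2,
\end{aligned} \tag{1}
$$
where $N_e$ denotes the number of cells that include key points, $N_n$ denotes the number of cells that do not include any key points, $G_e$ denotes the set of cells that include key points, $G_n$ denotes the set of cells that do not include key points, $c_c$ denotes the predicted value of each cell in the confidence output branch, and $c_c^*$ denotes the ground-truth value. The ground-truth value of a cell that has a key point is 1; otherwise it is 0. At inference time, if the confidence value is larger than a pre-defined threshold, we consider that a key point exists at the cell. The second term of $L_{non\text{-}exist}$ is a regularization term.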
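The two parts of the confidence loss can be sketched in plain Python as follows. This is an illustration of the description above rather than the authors' code: the function name, the flat-list input format, and the default regularization weight `reg_weight` are assumptions.

```python
def confidence_losses(pred, gt, bg_threshold=0.01, reg_weight=1e-5):
    """Return (existence_loss, non_existence_loss) for flat lists of cells.

    pred: predicted confidence per cell; gt: 1 for key point cells, else 0.
    reg_weight is an assumed value for the regularization term.
    """
    exist = [(p, g) for p, g in zip(pred, gt) if g == 1]
    background = [(p, g) for p, g in zip(pred, gt) if g == 0]

    # Existence loss: mean squared error over cells that contain key points.
    loss_exist = sum((g - p) ** 2 for p, g in exist) / max(len(exist), 1)

    # Non-existence loss: only background cells above bg_threshold contribute,
    # but the sum is normalized by the number of all background cells.
    loss_non = sum((g - p) ** 2 for p, g in background if p > bg_threshold)
    loss_non /= max(len(background), 1)

    # Regularization term over all background predictions.
    loss_non += reg_weight * sum(p ** 2 for p, _ in background)
    return loss_exist, loss_non


pred = [0.9, 0.2, 0.005, 0.5]
gt = [1, 0, 0, 1]
print(confidence_losses(pred, gt))
```

In this example the background cell with prediction 0.005 is skipped by the first non-existence term because it is already below the 0.01 threshold.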
2) Offset Loss: From the offset branch, PINet predicts the exact location of the key points for each output cell. The output of each cell has a value between 0 and 1; this value indicates the position relative to the corresponding cell. In this paper, a cell is matched to 8 pixels of the input image. For example, if the predicted offset value is 0.5, the real position of the key point is 4 pixels away from the edge of the cell. The offset branch has two channels for predicting the x-axis and y-axis offsets. Equation 2 shows the loss function:
$$
L_{offset} = \frac{1}{N_e} \sum_{c_x \in G_e} (c_x^* - c_x)^2 + \frac{1}{N_e} \sum_{c_y \in G_e} (c_y^* - c_y)^2, \tag{2}
$$
where $c_x$ and $c_y$ denote the predicted x-axis and y-axis offsets of a cell, and $c_x^*$ and $c_y^*$ denote the corresponding ground-truth values. Because the ground truth does not exist at cells that include no key points, these cells are ignored when the offset loss is calculated.
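To make the role of the offset concrete, the following sketch decodes grid outputs into pixel coordinates. It assumes that the offset is measured from the left and top edges of a cell and that grids are given as lists of rows; the function name and the data layout are illustrative only.

```python
def decode_keypoints(confidence, offset_x, offset_y, threshold, cell_size=8):
    """Return (x, y) pixel positions for cells whose confidence exceeds threshold.

    confidence, offset_x, offset_y: grids given as lists of rows.
    cell_size=8 follows the text (one cell is matched to 8 input pixels).
    """
    points = []
    for row, conf_row in enumerate(confidence):
        for col, conf in enumerate(conf_row):
            if conf > threshold:
                x = (col + offset_x[row][col]) * cell_size
                y = (row + offset_y[row][col]) * cell_size
                points.append((x, y))
    return points


confidence = [[0.1, 0.9], [0.8, 0.2]]
offset_x = [[0.0, 0.5], [0.25, 0.0]]
offset_y = [[0.0, 0.5], [0.75, 0.0]]
print(decode_keypoints(confidence, offset_x, offset_y, threshold=0.5))
# [(12.0, 4.0), (2.0, 14.0)]
```

The first detected point lies in column 1 with an x offset of 0.5, that is, 4 pixels past the left edge of that cell at x = 8.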
3) Embedding Feature Loss: The loss function of this branch is inspired by SGPN, a 3D point cloud instance segmentation method [29]. The branch is trained to bring the embedding features of two cells closer together if the cells belong to the same instance. Equations 3 and 4 show the loss function of the feature branch:
$$
L_{feature} = \frac{1}{N_e^2} \sum_{i}^{N_e} \sum_{j}^{N_e} l(i, j), \tag{3}
$$
$$
l(i, j) =
\begin{cases}
\lVert F_i - F_j \rVert_2 & \text{if } I_{ij} = 1, \\
\max(0,\; K - \lVert F_i - F_j \rVert_2) & \text{if } I_{ij} = 2,
\end{cases} \tag{4}
$$
where $F_i$ denotes the predicted embedding feature of a cell i, $I_{ij}$ indicates whether cell i and cell j are the same instance or not, and K is a constant such that K > 0. If $I_{ij} = 1$, the cells are the same instance, and if $I_{ij} = 2$, these cells are different instances. When the network is trained, the loss function pulls features closer together when the cells belong to the same instance; it pushes features apart when the cells belong to different instances. We can then distinguish key points into individual instances using a simple distance-based clustering technique. In this study, if the embedding features of certain predicted key points are within a certain distance, we consider that they are the same instance. The feature size is set to 4 in this study, but this size is observed to have no major effect on the performance.
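The pairwise behavior of the feature loss and the distance-based grouping can be sketched as below. The greedy assignment to the first sufficiently close cluster is an assumption chosen for simplicity, since the text only states that points within a certain distance are treated as one instance; all names are illustrative.

```python
import math


def pair_loss(feature_i, feature_j, same_instance, margin):
    """Pull same-instance features together; push others at least margin apart."""
    distance = math.dist(feature_i, feature_j)
    return distance if same_instance else max(0.0, margin - distance)


def cluster_by_distance(features, threshold):
    """Greedy clustering: join the first cluster whose reference is close enough.

    The reference of a cluster is the feature of its first member.
    Returns a list of index lists, one per instance.
    """
    clusters = []  # list of (reference_feature, member_indices)
    for index, feature in enumerate(features):
        for reference, members in clusters:
            if math.dist(reference, feature) <= threshold:
                members.append(index)
                break
        else:
            clusters.append((feature, [index]))
    return [members for _, members in clusters]


features = [(0.0, 0.0, 0.0, 0.0), (0.01, 0.0, 0.0, 0.0),
            (1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.02)]
print(cluster_by_distance(features, threshold=0.08))  # [[0, 1], [2, 3]]
```

Because no cluster count is fixed in advance, the number of instances follows from the data, which matches the goal of handling an arbitrary number of traffic lines.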
4) Distillation Loss: According to Newell et al. [25], better performance is observed when more hourglass modules are stacked. Therefore, the deepest hourglass module can serve as a teacher network, and we expect that clipped short networks, which are lighter than the teacher network, will show better performance if a knowledge distillation method is applied. Zagoruyko and Komodakis [48] proposed a simple knowledge distillation method that can be applied to CNN models. This method allows a student network to imitate a teacher network; Hou et al. [30] show that the method can improve the performance of the whole framework. Equation 5 shows the loss function for distillation:
$$
L_{distillation} = \sum_{m=1}^{M} D\big(F(A_M) - F(A_m)\big), \qquad
F(A_m) = S\big(G(A_m)\big), \qquad
G(A_m) = \sum_{i=1}^{C} |A_{mi}|^2, \tag{5}
$$
where D denotes the sum of squares, S denotes the spatial softmax, C denotes the number of channels, $A_m$ denotes the distillation layer output at the m-th hourglass module, as shown in Table II, M denotes the number of hourglass modules, $A_{mi}$ denotes the i-th channel of $A_m$, and all operators like sum, power, and absolute value ($|\cdot|$) are elementwise.
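The total loss is equal to the weighted sum of the above four loss terms, with the confidence loss split into its existence and non-existence parts, and the whole network is trained using an end-to-end procedure with the following total loss:
$$
L_{total} = \gamma_e L_{exist} + \gamma_n L_{non\text{-}exist} + \gamma_o L_{offset} + \gamma_f L_{feature} + \gamma_d L_{distillation}. \tag{6}
$$
In the training step, we set $\gamma_o$ to 0.2, $\gamma_f$ to 0.5, and $\gamma_d$ to 0.1. $\gamma_e$ and $\gamma_n$ are described in Section IV. The proposed loss function is applied to the output branches of each hourglass module; this helps the whole network to be trained stably.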
TABLE III DATASET SUMMARY
Fig. 5. Data augmentation methods. (a) is the original image, and (b), (c), (d), (e), (f), and (g) show examples of the applied data augmentation methods.
C. Implementation Detail
All input images are resized to the network input size, and their RGB values are normalized, before the data are fed to the proposed network in both training and testing. The two public datasets used for the evaluation of the proposed method, TuSimple [49] and CULane [20], provide x-axis values of traffic lines at fixed y-axis values. Because of this annotation method, traffic lines that are close to horizontal are annotated sparsely. To solve this problem, we make additional annotations every 10 pixels along the x-axis by linear regression from the original data. Various data augmentation methods such as shadowing, adding noise, flipping, translation, rotation, and intensity changing are also applied; these methods are shown in Fig. 5.
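The densification of sparse annotations can be sketched as follows. Note that this sketch uses piecewise linear interpolation between consecutive annotated points as a simple stand-in for the linear regression mentioned above; the function name and the point format are assumptions.

```python
def densify_annotations(points, step=10):
    """Insert points every `step` pixels along x between consecutive annotations.

    points: list of (x, y) annotations ordered along the traffic line.
    New points are linearly interpolated between neighboring annotations.
    """
    if not points:
        return []
    dense = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dense.append((x0, y0))
        span = x1 - x0
        direction = 1 if span > 0 else -1
        for k in range(1, int(abs(span) // step) + 1):
            x = x0 + direction * k * step
            if x == x1:
                break  # the annotated end point is added by the next segment
            t = (x - x0) / span
            dense.append((x, y0 + t * (y1 - y0)))
    dense.append(points[-1])
    return dense


# A nearly horizontal segment annotated only at its two ends.
print(densify_annotations([(100, 200), (120, 210)]))
# [(100, 200), (110, 205.0), (120, 210)]
```

Segments that span fewer than `step` pixels along x, such as near-vertical lines, are left unchanged because they are already densely covered by the fixed y-axis annotations.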
Additionally, the two public datasets include many image frames; however, the data are imbalanced. For example, the testing set of the CULane dataset consists of various categories such as normal, night, and crossroad, and the number of frames per category varies widely. The exact ratios of the CULane categories can be found in Section IV-B, the results section. To resolve this issue, we sample hard data that show poor loss values in the training step, and increase the selection ratio of the hard data. The concept is similar to the hard negative mining technique.
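One possible way to increase the selection ratio of hard data is weighted sampling, sketched below. The fraction of samples treated as hard and the boost factor are hypothetical parameters chosen for illustration; the text does not specify how the ratio is increased.

```python
import random


def selection_weights(losses, hard_fraction=0.2, boost=3.0):
    """Give the highest-loss fraction of samples a larger sampling weight.

    hard_fraction and boost are assumed values, not taken from the paper.
    """
    num_hard = max(1, int(len(losses) * hard_fraction))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    hard = set(ranked[:num_hard])
    return [boost if i in hard else 1.0 for i in range(len(losses))]


def sample_batch(num_samples, weights, batch_size, rng=random):
    """Draw sample indices with replacement according to the weights."""
    return rng.choices(range(num_samples), weights=weights, k=batch_size)


losses = [0.1, 0.9, 0.2, 0.3, 0.05]
weights = selection_weights(losses)
print(weights)  # [1.0, 3.0, 1.0, 1.0, 1.0]
print(sample_batch(len(losses), weights, batch_size=6))
```

The batch size of six in the usage example mirrors the training batch size reported in this section.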
We use one GPU (GTX 2080ti 11GB) for training and testing; the source code is written in PyTorch. In the training step, each batch contains six images; hyper-parameters such as thresholds and coefficients are determined experimentally.
The exact values of the hyper-parameters are shown in the following section. PINet predicts the exact position of key points on traffic lines, and the spline curve fitting method is applied to obtain a smoother curve.
In this section, we evaluate PINet on two public datasets, TuSimple [49] and CULane [20]. The following Section A introduces the overview and evaluation metric used for each dataset in the official evaluation methods. Section B shows the evaluation results of PINet; Section C includes an ablation study on the effect of the knowledge distillation method.
A. Dataset
Our proposed network, PINet, is trained on both TuSimple and CULane. Table III summarizes information about the two datasets. TuSimple is simpler than CULane because it consists only of highway scenes and contains fewer obstacles. We use the official evaluation source code to evaluate PINet; the details of the datasets and evaluation metrics are described in the following sections.
1) TuSimple: The TuSimple dataset consists of 3,626 training sets and 2,782 testing sets. Accuracy is the main evaluation metric of the TuSimple dataset, defined by the following equation according to the average number of correct points:
$$
accuracy = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}},
$$
where $C_{clip}$ denotes the number of points correctly predicted by the trained module for the given image clip, and $S_{clip}$ denotes the number of ground-truth points in the clip. The false negative (FN) and false positive (FP) rates are also provided by the following equation:
$$
FP = \frac{F_{pred}}{N_{pred}}, \qquad FN = \frac{M_{pred}}{N_{gt}},
$$
where $F_{pred}$ denotes the number of wrongly predicted lanes, $N_{pred}$ denotes the number of predicted lanes, $M_{pred}$ denotes the number of missed lanes, and $N_{gt}$ denotes the number of ground-truth lanes.
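The three TuSimple quantities reduce to simple ratios of counts, as the following sketch shows. It assumes the per-clip counts have already been produced by a point-matching step, which is part of the official evaluation code and is not reproduced here; the function names are illustrative.

```python
def tusimple_accuracy(clips):
    """clips: iterable of (correct_points, ground_truth_points), one per clip."""
    clips = list(clips)
    total_correct = sum(correct for correct, _ in clips)
    total_gt = sum(gt for _, gt in clips)
    return total_correct / total_gt


def false_positive_rate(wrong_predicted_lanes, predicted_lanes):
    """Share of predicted lanes that are wrong."""
    return wrong_predicted_lanes / predicted_lanes


def false_negative_rate(missed_lanes, ground_truth_lanes):
    """Share of ground-truth lanes that were missed."""
    return missed_lanes / ground_truth_lanes


print(tusimple_accuracy([(45, 48), (39, 48)]))  # 0.875
print(false_positive_rate(1, 4))                # 0.25
print(false_negative_rate(1, 5))                # 0.2
```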
TABLE IV EVALUATION RESULTS FOR CULANE DATASET. (FIRST AND SECOND BEST RESULTS ARE HIGHLIGHTED IN RED AND BLUE.)
2) CULane: The CULane dataset includes 88,880 training images and 34,680 testing images. Unlike the TuSimple dataset, the CULane dataset covers various road types and conditions, such as urban and night scenes. We follow the official evaluation metric [20] for the evaluation of the CULane dataset. According to [20], each traffic line is assumed to have a width of 30 pixels, and we calculate the intersection-over-union (IoU) between the prediction of the evaluated model and the ground truth. In the CULane dataset, the F1-measure is the major evaluation metric; it is defined by the following equation:
$$
F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},
$$
where $Precision = \frac{TP}{TP + FP}$ and $Recall = \frac{TP}{TP + FN}$. TP is a true positive, which means a prediction that has a larger IoU than the threshold, 0.5. FP is a false positive and FN is a false negative.
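The F1-measure can be computed from the three counts as in the sketch below. The handling of empty denominators (returning 0.0) is a convention chosen here for robustness and is not specified by the official metric.

```python
def f1_measure(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 80 lanes matched with IoU above 0.5, 20 wrong predictions, 20 missed lanes.
print(f1_measure(80, 20, 20))
```

With equal numbers of false positives and false negatives, precision and recall coincide, so the F1-measure equals both of them (0.8 in this example, up to floating-point rounding).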
B. Result
1) TuSimple: Evaluation on the TuSimple dataset requires exact x-axis values for certain fixed y-axis values. The detailed evaluation results can be seen in Table V; Fig. 6 shows certain results for the TuSimple dataset. The value nH in Tables IV - VI means that the network consists of n hourglass modules. Even though pre-trained weights and extra datasets are not used, PINet shows high performance in terms of accuracy and false positive rate. The false negative rate also shows a reasonable value.
Table VI shows the number of parameters and the fps on the GTX 2080ti GPU according to the number of hourglass modules. Most components of PINet are built from bottle-neck layers; this architecture can save a lot of memory. PINet can run at 25 fps when all hourglass modules are used, and if only one hourglass module is applied, the network runs at about 40 fps. When a short network is evaluated, it is simply clipped from the whole trained network, without any additional training. The deepest network has the highest performance, but the performances of the clipped short networks differ only slightly from that of the deepest network. The distance threshold is 0.08 to distinguish each instance; confidence thresholds are 0.35 (4H), 0.32 (3H), 0.30 (2H), and 0.52 (1H); and the coefficients of the existence and non-existence losses, $\gamma_e$ and $\gamma_n$, are 1.0 and 1.0.
2) CULane: Table IV and Fig. 7 show detailed results of PINet on the CULane dataset. We observe three features in the results. First, PINet shows a particularly low false positive rate on the CULane dataset. This means that wrong predictions of lanes are rarer with PINet than with other methods, which is favorable for safety. Second, the clipped networks 2H and 3H show performance similar to that of the whole network; only 1H has poor performance. It looks as if the effect of distillation is optimal when the depth is three hourglass modules in our proposed architecture. Finally, PINet works better than other methods under hard lighting conditions. The night and dazzle light categories in the CULane dataset include such conditions, and PINet shows higher performance in these categories. However, because PINet is based on the key points estimation method, local occlusions or unclear traffic lines can negatively influence the performance. The crowded, arrow, and curve categories are examples in which PINet shows slightly lower performance. PINet shows the highest performance for the overall F1-measure on the CULane dataset. The distance threshold is 0.08 for distinguishing each instance; confidence thresholds are 0.94 (4H), 0.95 (3H), 0.96 (2H), and 0.97 (1H); and the coefficients of the existence and non-existence losses, $\gamma_e$ and $\gamma_n$, are initially set to 1.0 and 1.0. $\gamma_e$ is changed from 1.0 to 2.5 for the last 40 epochs.
TABLE V EVALUATION RESULTS FOR TUSIMPLE DATASET. (FIRST AND SECOND BEST RESULTS ARE HIGHLIGHTED IN RED AND BLUE.)
TABLE VI PARAMETER SIZE AND FPS (ON GTX2080TI) OF PINET
C. Ablation Study
Fig. 6. Results for TuSimple dataset. The first row is the ground truth; the second row is the predicted results of PINet.
Fig. 7. Results for CULane dataset. The first row is the ground truth; the second row is the predicted results of PINet.
We investigate the effects of the knowledge distillation method, whose purpose is to reduce the gap between the clipped short network and the deepest network that acts as a teacher network. Table VII shows the results of the ablation study. The average performance gap is calculated using the following equation:
$$
Gap_{nH} = \frac{1}{N} \sum_{i=1}^{N} \left| P_{4H}^{i} - P_{nH}^{i} \right|,
$$
where $Gap_{nH}$ denotes the average performance gap between 4H and nH, N denotes the total number of training epochs for this ablation study, and $P_{nH}^{i}$ denotes the performance of nH at the i-th epoch. The performance is evaluated on the TuSimple test set; we collect data for the first 30 epochs. When the distillation method is applied, the average performance gap between the whole network and the clipped short networks is lower than when the distillation method is not applied. This means that the distillation method helps the clipped short network to mimic the teacher network well.
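The average performance gap can be sketched as a mean of per-epoch differences. The use of the absolute difference is an assumption made here so that the gap is non-negative; the function name and the example values are illustrative and are not results from the paper.

```python
def average_gap(perf_full, perf_clipped):
    """Mean absolute per-epoch gap between the full (4H) and a clipped (nH) network.

    Both arguments are lists with one performance value per training epoch.
    """
    if not perf_full or len(perf_full) != len(perf_clipped):
        raise ValueError("both lists must be non-empty and have the same length")
    total = sum(abs(full - clipped) for full, clipped in zip(perf_full, perf_clipped))
    return total / len(perf_full)


# Made-up accuracies for three epochs, for illustration only.
print(average_gap([0.90, 0.93, 0.95], [0.88, 0.92, 0.95]))
```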
TABLE VII AVERAGE PERFORMANCE GAP BETWEEN WHOLE NETWORK AND CLIPPED SHORT NETWORK ON TUSIMPLE DATASET (LOWER IS BETTER).
In this study, we have proposed a novel lane detection method, PINet, which combines key point estimation with point instance segmentation. The method can run in real time. In addition, PINet can be clipped according to the computing power of the target system; the clipped network can be applied directly without any additional training. PINet achieves high performance and a lower rate of false positives; the low false positive rate supports the safety of autonomous driving cars because wrongly predicted lanes rarely occur. In particular, PINet shows better performance than other methods in difficult light conditions such as night, shadow, and dazzling light; however, PINet has limitations when local occlusions or unclear traffic lines exist. We have shown by an ablation study that the knowledge distillation method improves the performance of the clipped short networks. As a result, we have observed that the performance of the clipped short networks is close to that of the whole network.
[1] Y. Lee, J. Lee, H. Ahn, and M. Jeon, “Snider: Single noisy image denoising and rectification for improving license plate recognition,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0, 2019.
[2] F. Munir, S. Azam, A. M. Sheri, Y. Ko, and M. Jeon, “Where am i: Localization and 3d maps for autonomous vehicles,” in VEHITS, pp. 452–457, 2019.
[3] Y. Wang, E. K. Teoh, and D. Shen, “Lane detection and tracking using b-snake,” Image and Vision Computing, vol. 22, no. 4, pp. 269–280, 2004.
[4] Z. Kim, “Robust lane detection and tracking in challenging scenarios,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 16–26, 2008.
[5] Y. Li, W. Ding, X. Zhang, and Z. Ju, “Road detection algorithm for autonomous navigation systems based on dark channel prior and vanishing point in complex road scenes,” Robotics and Autonomous Systems, vol. 85, pp. 1–11, 2016.
[6] H. Wang, Y. Sun, and M. Liu, “Self-supervised drivable area and road anomaly segmentation using rgb-d data for robotic wheelchairs,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4386–4393, 2019.
[7] Y. He, H. Wang, and B. Zhang, “Color-based road detection in urban traffic scenes,” IEEE Transactions on intelligent transportation systems, vol. 5, no. 4, pp. 309–318, 2004.
[8] K.-Y. Chiu and S.-F. Lin, “Lane detection using color-based segmentation,” in IEEE Proceedings. Intelligent Vehicles Symposium, 2005, pp. 706–711, IEEE, 2005.
[9] Y. Wang, D. Shen, and E. K. Teoh, “Lane detection using catmull-rom spline,” in IEEE International Conference on Intelligent Vehicles, vol. 1, pp. 51–57, 1998.
[10] C. Lee and J.-H. Moon, “Robust lane detection and tracking for real-time applications,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 12, pp. 4043–4048, 2018.
[11] R. O. Duda and P. E. Hart, “Use of the hough transformation to detect lines and curves in pictures,” Communications of the ACM, vol. 15, no. 1, pp. 11–15, 1972.
[12] S. Luo, X. Zhang, J. Hu, and J. Xu, “Multiple lane detection via combining complementary structural constraints,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–10, 2020.
[13] A. Borkar, M. Hayes, and M. T. Smith, “Robust lane detection and tracking with ransac and kalman filter,” in 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 3261–3264, IEEE, 2009.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
[16] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[17] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[18] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, Springer, 2015.
[19] W.-J. Yang, Y.-T. Cheng, and P.-C. Chung, “Improved lane detection with multilevel features in branch convolutional neural networks,” IEEE Access, vol. 7, pp. 173148–173156, 2019.
[20] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang, “Spatial as deep: Spatial cnn for traffic scene understanding,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] W. Van Gansbeke, B. De Brabandere, D. Neven, M. Proesmans, and L. Van Gool, “End-to-end lane detection through differentiable least-squares fitting,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0, 2019.
[22] Q. Zou, H. Jiang, Q. Dai, Y. Yue, L. Chen, and Q. Wang, “Robust lane detection from continuous driving scenes using deep neural networks,” IEEE transactions on vehicular technology, vol. 69, no. 1, pp. 41–54, 2019.
[23] Z. Chen, Q. Liu, and C. Lian, “Pointlanenet: Efficient end-to-end cnns for accurate real-time lane detection,” in 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 2563–2568, IEEE, 2019.
[24] X. Li, J. Li, X. Hu, and J. Yang, “Line-cnn: End-to-end traffic line detection with line proposal unit,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 1, pp. 248–258, 2020.
[25] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European conference on computer vision, pp. 483–499, Springer, 2016.
[26] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, “Learning feature pyramids for human pose estimation,” in proceedings of the IEEE international conference on computer vision, pp. 1281–1290, 2017.
[27] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint triplets for object detection,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6568–6577, 2019.
[28] X. Zhou, J. Zhuo, and P. Krahenbuhl, “Bottom-up object detection by grouping extreme and center points,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859, 2019.
[29] W. Wang, R. Yu, Q. Huang, and U. Neumann, “Sgpn: Similarity group proposal network for 3d point cloud instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578, 2018.
[30] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning lightweight lane detection cnns by self attention distillation,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1013–1021, 2019.
[31] H. Deusch, J. Wiest, S. Reuter, M. Szczot, M. Konrad, and K. Dietmayer, “A random finite set approach to multiple lane detection,” in 2012 15th International IEEE Conference on Intelligent Transportation Systems, pp. 270–275, IEEE, 2012.
[32] Y. U. Yim and S.-Y. Oh, “Three-feature based automatic lane detection algorithm (tfalda) for autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 4, no. 4, pp. 219–225, 2003.
[33] A. Borkar, M. Hayes, and M. T. Smith, “A novel lane detection system with efficient ground truth generation,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 1, pp. 365–374, 2011.
[34] D. C. Andrade, F. Bueno, F. R. Franco, R. A. Silva, J. H. Z. Neme, E. Margraf, W. T. Omoto, F. A. Farinelli, A. M. Tusset, S. Okida, M. M. D. Santos, A. Ventura, S. Carvalho, and R. d. S. Amaral, “A novel strategy for road lane detection and tracking based on a vehicle’s forward monocular camera,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 4, pp. 1497–1507, 2019.
[35] J. M. Álvarez, A. M. López, T. Gevers, and F. Lumbreras, “Combining priors, appearance, and context for road detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1168–1178, 2014.
[36] H. Choi, H. Ahn, K. Joonmo, and M. Jeon, “Adfnet: Accumulated decoder features for real-time semantic segmentation,” IET Computer Vision, 2020.
[37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
[38] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.
[39] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, “Multi-class lane semantic segmentation using efficient convolutional networks,” in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6, IEEE, 2019.
[40] M. Ghafoorian, C. Nugteren, N. Baka, O. Booij, and M. Hofmann, “El-gan: Embedding loss driven generative adversarial networks for lane detection,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0, 2018.
[41] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, “Towards end-to-end lane detection: an instance segmentation approach,” in 2018 IEEE intelligent vehicles symposium (IV), pp. 286–291, IEEE, 2018.
[42] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
[43] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, “Rethinking on multi-stage networks for human pose estimation,” arXiv preprint arXiv:1901.00148, 2019.
[44] G. Moon, J. Y. Chang, and K. M. Lee, “Posefix: Model-agnostic general human pose refinement network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7773–7781, 2019.
[45] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
[46] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
[47] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456, JMLR.org, 2015.
[48] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016.
[49] “The tusimple lane challenge,” in http://benchmark.tusimple.ai/.
[50] S. Yoo, H. Seok Lee, H. Myeong, S. Yun, H. Park, J. Cho, and D. Hoon Kim, “End-to-end lane marker detection via row-wise classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1006–1007, 2020.
Yeongmin Ko received the B.S. degree from the School of Electrical Engineering, Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea, in 2017. He is currently pursuing the Ph.D. degree with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology. His current research interests include computer vision, self-driving, and deep learning.
Younkwan Lee received the B.S. degree in computer science from Korea Aerospace University, Gyeonggi, South Korea, in 2016. He is currently pursuing the Ph.D. degree with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea. His current research interests include computer vision, machine learning, and deep learning.
Shoaib Azam received the B.S. degree in Engineering Sciences from Ghulam Ishaq Khan Institute of Science and Technology, Pakistan, in 2010, and the M.S. degree in Robotics and Intelligent Machine Engineering from National University of Science and Technology, Pakistan, in 2015. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea. His current research interests include artificial intelligence and machine learning, computer vision, robotics, and autonomous driving.
Farzeen Munir received the B.S. degree in Electrical Engineering from Pakistan Institute of Engineering and Applied Sciences, Pakistan, in 2013, and the M.S. degree in System Engineering from Pakistan Institute of Engineering and Applied Sciences, Pakistan, in 2015. She is currently pursuing the Ph.D. degree in Electrical Engineering and Computer Science at Gwangju Institute of Science and Technology, Korea. Her current research interests include machine learning, deep neural networks, autonomous driving, and computer vision.
Moongu Jeon received the B.S. degree in architectural engineering from Korea University, Seoul, South Korea, in 1988, and the M.S. and Ph.D. degrees in computer science and scientific computation from the University of Minnesota, Minneapolis, MN, USA, in 1999 and 2001, respectively. As a master's degree researcher, he worked on optimal control problems at the University of California at Santa Barbara, Santa Barbara, CA, USA, from 2001 to 2003, and then moved to the National Research Council of Canada, where he worked on sparse representation of high-dimensional data and image processing until July 2005. In 2005, he joined the Gwangju Institute of Science and Technology, Gwangju, South Korea, where he is currently a Full Professor with the School of Electrical Engineering and Computer Science. His current research interests include machine learning, computer vision, and artificial intelligence.
Witold Pedrycz received the M.Sc., Ph.D., and D.Sc. degrees from the Silesian University of Technology, Gliwice, Poland. He is a Professor and the Canada Research Chair of Computational Intelligence with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is also with the Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland. Dr. Pedrycz is a Foreign Member of the Polish Academy of Sciences and a fellow of the Royal Society of Canada. He has authored 17 research monographs and edited volumes covering various aspects of computational intelligence, data mining, and software engineering. His current research interests include computational intelligence, fuzzy modeling and granular computing, knowledge discovery and data science, fuzzy control, pattern recognition, knowledge-based neural networks, relational computing, and software engineering.
Dr. Pedrycz was a recipient of the prestigious Norbert Wiener Award from the IEEE Systems, Man, and Cybernetics Society in 2007; the IEEE Canada Computer Engineering Medal; the Cajastur Prize for Soft Computing from the European Centre for Soft Computing; the Killam Prize; and the Fuzzy Pioneer Award from the IEEE Computational Intelligence Society. He is vigorously involved in editorial activities. He is an Editor-in-Chief of Information Sciences, Editor-in-Chief of WIREs Data Mining and Knowledge Discovery (Wiley), and International Journal of Granular Computing (Springer). He currently serves on the Advisory Board of IEEE Transactions on Fuzzy Systems and is a member of a number of editorial boards of other international journals.