Recent developments in field robotics inaugurated a new class of small-sized unmanned surface vehicles (USVs). These vessels are ideal for operation in coastal waters and narrow marinas due to their portability, and can be used for automated inspection of hazardous and difficult-to-reach areas. Uninterrupted and safe navigation requires a high level of autonomy. One of the main challenges in autonomous navigation is timely detection and avoidance of nearby obstacles. Various sensors have been considered for this task (e.g., RADAR [1], LIDAR [2], SONAR [3]), among which cameras have shown great potential as affordable, lightweight and powerful obstacle detection devices [4], [5], [6], [7], [8].
Traditional maritime camera-based obstacle detection methods rely on background subtraction [7], but these are not appropriate for USVs due to inherent scene dynamics, which make the system non-robust and prone to false-positive detections. Stereo-based reconstruction methods [8], [9] address the dynamic environment, but require a sufficiently textured scene and obstacles that significantly stick out of the water. Calm, poorly textured water and flat floating objects thus lead to detection failure. Stereo baselines have to be kept small to maintain the USV stability, which also reduces the detection range. Semantic segmentation methods based on fitting a structured model to the image [4], [10], [6] have achieved excellent results and are currently the state-of-the-art in this domain. But these approaches rely on simple features which fail to fully capture the scene appearance diversity. Segmentation quality is thus degraded, particularly in the presence of visual ambiguities and reflections [10].
*This work was supported in part by the Slovenian research agency (ARRS) programmes P2-0214 and P2-0095, and the Slovenian research agency (ARRS) research project J2-8175.
Borja Bovcon and Matej Kristan are with University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Fig. 1: Architecture of the proposed WaSR network. Encoder generates rich deep features, which are gradually fused in the decoder with a horizon mask computed from an IMU readout to boost detection and water edge estimation. A water-obstacle separation loss computed at the end of encoder drives learning of discriminative features, further reducing false positives and increasing true positives.
Richer features can be learned by deep convolutional neural nets, and indeed developments in autonomous ground vehicles (AGVs) [11], [12], [13], [14] have demonstrated that these methods achieve remarkable semantic segmentation results. But due to many differences between the AGV and USV domains, these networks cannot be readily applied to USVs. The most obvious difference is that the navigable surface in a maritime domain (water) is non-flat, dynamic, varies significantly in appearance and is greatly affected by the weather conditions.
A recent study [15] analyzed the performance of state-of-the-art AGV segmentation networks on a maritime domain. The study has shown that these networks, when trained on a large maritime dataset, outperform, or perform on par with, the model-based segmentation approaches [4], [15], but several issues remain. The large water appearance variability causes poor estimation of the water edge and produces many false positives. Worse yet, the networks often missed small obstacles, which leads to dangerous false negative detections.
Following the findings from [10], [15] we propose a novel water-obstacle separation and refinement network (WaSR) designed as an encoder-decoder architecture (Figure 1). A deep encoder is used to extract rich features from the input image, while a shallow decoder is used to gradually refine the segmentation. Our first contribution is the fusion of external inertial sensor data from an IMU with the visual information in the decoder, leading to a more accurate water edge segmentation and overall improved detection. Our second contribution is a new water-obstacle separation loss that aids learning of features that compactly encode a range of water appearances, while enforcing a separation from the features corresponding to the obstacles. The loss is applied in a late stage of the encoder to foster learning of discriminative features and simplify learning of the subsequent classifiers in the decoder. WaSR shows impressive results on the currently most challenging USV dataset and sets a new state-of-the-art in USV obstacle detection.
Cameras, combined with computer vision algorithms, have proven to be powerful yet affordable obstacle detection devices [16], [4], [5], [6], [7], [8]. Numerous image-processing methods for obstacle detection have been proposed. An in-depth experimental evaluation of background subtraction methods [7] has shown that misleading dynamics of water cause a large number of false-positive detections. Stereo reconstruction methods [17], [9], [8] are only capable of detecting obstacles well above the water surface. Poorly textured and partially submerged objects are likely to remain undetected, while the detection of distant obstacles largely depends on the stereo baseline. On the other hand, semantic segmentation methods based on fitting a structured model to the image [4], [10], [6] have achieved promising results and are capable of detecting obstacles protruding through the water surface, as well as floating and distant ones. However, these approaches rely on simple features which fail to correctly address the diversity of a marine scene, thus leading to poor segmentation in the presence of visual artefacts on water (wakes, sea foam, glitter, reflections, etc.).
Deep convolutional neural networks enable extraction of richer features, mandatory for accurate segmentation in the presence of visual ambiguities. Their training procedure requires a huge amount of carefully annotated data. Therefore, a large variety of urban datasets [18], [19], [20] have greatly contributed to a rapid development of deep neural nets [11], [13], [14], [21] for AGVs which achieve astonishing segmentation results. However, due to many differences between the AGV and USV domain, these networks cannot be readily applied to USVs. For instance, the navigable surface in a maritime domain (water) is non-flat, extremely dynamic and varies significantly in its appearance. Moreover, turbulent waters cause USVs to rotate around roll-axis, while ground vehicles do not experience this phenomenon.
Nonetheless, [22] and [23] proposed using Faster R-CNN [24] in their approach to detect and classify different types of ships. However, Faster R-CNN cannot detect arbitrary obstacles without providing additional training data. Alternatively, [25] suggested an online segmentation approach for water component extraction. The segmentation accuracy continuously improves as online training progresses. However, their method requires a long and non-autonomous “calibration” procedure to start producing satisfactory results.
Recently, two separate studies [26], [15] have evaluated the performance of commonly used deep segmentation networks from the AGV domain on the task of obstacle detection in maritime surveillance. Cane et al. [26] used a filtered ADE20k [12] dataset for training and several maritime datasets (MODD [4], IPATCH [27], SEAGULL [28] and SMD [29]) for evaluation. On the other hand, Bovcon et al. [15] trained the nets on their proposed pixel-wise annotated maritime dataset (MaSTr1325) and performed the evaluation on MODD2 [10]. In both studies, the evaluated methods have shown consistent drawbacks in water component segmentation and misclassification of smaller obstacles.
The architecture of WaSR is described in Section III-A, a new water-obstacle separation loss is described in Section III-B, and Section III-C describes conversion of the segmentation result into obstacle detection output.
A. Architecture overview
The proposed WaSR (Figure 1) architecture consists of a contracting path (encoder) and an expansive path (decoder). The purpose of the encoder is construction of deep rich features, while the primary task of the decoder is fusion of inertial and visual information, increasing the spatial resolution and producing the segmentation output.
Following the recent analysis [15] of deep networks on a maritime segmentation task, we base our encoder on the low-to-mid level backbone parts of DeepLab2 [11], i.e., a ResNet-101 [30] backbone with atrous convolutions. In particular, the model is composed of four residual convolutional blocks (denoted as res2, res3, res4 and res5) combined with max-pooling layers (see Figure 1). Hybrid atrous convolutions are added to the last two blocks to increase the receptive field and to encode local context information into deep features.
One of the primary tasks of the decoder is fusion of visual and inertial information. We introduce the inertial information by constructing an IMU feature channel that encodes the location of the horizon at a pixel level. In particular, the camera-IMU projection [10] is used to estimate the horizon line, and a binary mask with all pixels below the horizon set to one is constructed (Figure 1). This IMU mask serves as a prior probability of water location and helps improve the estimated location of the water edge in the output segmentation.
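The mask construction can be illustrated with a minimal sketch. It assumes the horizon has already been projected into the image as a straight line, described by the rows at which it crosses the first and last image column; the function name and these two inputs are hypothetical, and the camera-IMU projection of [10] itself is not reproduced here.

```python
import numpy as np

def horizon_mask(height, width, left_y, right_y):
    """Binary mask with ones strictly below a straight horizon line.

    left_y / right_y are hypothetical inputs: the image rows at which the
    projected horizon crosses the first and the last image column.
    """
    cols = np.arange(width)
    # Row of the horizon in every column (linear interpolation between ends).
    horizon_rows = np.interp(cols, [0, width - 1], [left_y, right_y])
    rows = np.arange(height)[:, None]  # shape (H, 1)
    # Image rows grow downwards, so "below the horizon" is a larger row index.
    return (rows > horizon_rows[None, :]).astype(np.float32)

# Toy example: a tilted horizon (USV rolling) in a 6 x 4 image.
mask = horizon_mask(height=6, width=4, left_y=1.0, right_y=4.0)
```

A tilted line is used in the toy example because roll of the vessel rotates the horizon in the image, which is exactly the information the IMU channel is meant to convey.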
The IMU mask is treated as an externally generated feature channel, which is fused with the encoder features at multiple levels of the decoder. However, the values in the IMU channel and the encoder features are at different scales. To avoid having to manually adjust the fusion weights, we apply an approach called Attention Refinement Modules (ARM) proposed by [13] to learn an optimal fusion strategy.
The decoder starts with the ARM1 block (Figure 2), which differs from ARM [13] in the way the input is pre-processed. The IMU mask is resized and concatenated with the encoder output features. The remaining steps follow [13]: global average pooling followed by depth reduction and normalization is used to learn channel weights, which are subsequently used to re-weight the concatenated feature channels. The resulting features are further fused with res3 output features and the IMU mask using another ARM block called ARM2 (Figure 2). ARM2 first applies an ARM1 block to fuse the IMU mask and the features from lower part of the decoder. This is followed by a set of convolutions to double the number of feature channels, which are per-channel summed with the res3 features from the encoder.
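The ARM1-style re-weighting described above can be sketched as a forward pass in NumPy. This is an illustration under stated assumptions, not the WaSR implementation: the weights are random placeholders, the 1x1 convolution is applied to the pooled vector as a plain matrix product, batch normalization is omitted, and the sigmoid gating follows the ARM design of [13]. The names `resize_nearest` and `arm_fuse` are hypothetical.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D mask to (out_h, out_w)."""
    rows = np.arange(out_h) * mask.shape[0] // out_h
    cols = np.arange(out_w) * mask.shape[1] // out_w
    return mask[rows[:, None], cols[None, :]]

def arm_fuse(features, imu_mask, weight, bias):
    """Concatenate the IMU mask with the features and re-weight all channels.

    features: (H, W, C) encoder features, imu_mask: (H_img, W_img) binary mask,
    weight: (C + 1, C + 1) and bias: (C + 1,) stand in for a learned 1x1 conv.
    """
    h, w, _ = features.shape
    mask = resize_nearest(imu_mask, h, w)[..., None]
    fused = np.concatenate([features, mask], axis=-1)  # (H, W, C + 1)
    pooled = fused.mean(axis=(0, 1))                   # global average pooling
    logits = pooled @ weight + bias                    # 1x1 conv on pooled vector
    attention = 1.0 / (1.0 + np.exp(-logits))          # sigmoid channel weights
    return fused * attention                           # broadcast over H and W

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6, 8))
imu_mask = np.zeros((16, 24))
imu_mask[8:, :] = 1.0                                  # lower half is "water prior"
out = arm_fuse(features, imu_mask, rng.normal(scale=0.1, size=(9, 9)), np.zeros(9))
```

Because the channel weights are computed from the concatenated tensor, the mask channel and the visual channels are rescaled jointly, which is the reason given in the text for avoiding hand-tuned fusion weights.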
Yu et al. [13] have argued the benefits of using a learnable fusion technique called Feature Fusion Module (FFM) for fusing low-level and high-level features in CNNs. In contrast to ARM, this module can implement fusion pathways of higher complexity. Our decoder thus up-samples the ARM2 output features and concatenates them with the res2 features and IMU mask. The depth of the resulting feature channels is halved by convolution block and normalized by a batch-normalization block. A weight vector is then computed similarly to ARM1 and used to re-weight the features, leading to feature selection and fusion.
Our recent study [15] has shown that Atrous Spatial Pyramid Pooling [11] (ASPP) leads to significant improvements in segmentation of small structures, yet entails only a small computational overhead. Thus the ASPP block, followed by a softmax, is added as the final block of our decoder. The resolution of the output features is a quarter of the input resolution, forming a truncated U-shape net with skip connections. A smaller, non-symmetrical decoder contributes to the speed due to a lower number of up-sampling operations and convolutions. Finally, the decoder output is up-sampled by a factor of four to match the input resolution.
B. Enforcing water-obstacle features separation
Care has to be taken when designing a loss function for maritime environment. While some obstacles may be large, the majority of pixels in a typical marine scene belong either to water or sky. This leads to a class imbalance, which overwhelms the classical cross entropy loss. Furthermore, segmentation difficulty vastly ranges between different water regions. For example, it may be easy to classify regions of mildly rippled blue water, but it is much more difficult to classify glitter and mirrored reflections of objects in the water as the water component. Therefore, to adjust the focus of the network to challenging regions during training, we employ a focal loss [31], $L_{foc}$, adapted for segmentation. A classical $L_2$ loss, $L_{L2}$, is added for weight regularisation [32].
Fig. 2: Attention refinement modules ARM1, ARM2 and feature fusion module FFM adjust the scale of heterogeneous input feature channels and gradually fuse inertial and visual information in the WaSR decoder.
Our recent study [15] has shown that water appearances like glitter and object mirroring pose a significant challenge to water segmentation networks. While mistaking water for sky does not pose a threat, mistaking obstacles for water and vice versa does lead to a potential USV collision or frequent false alarms, rendering the network useless for practical navigation. To avoid this, the network should ideally learn encoding in early layers such that it produces very similar features for a variety of water appearances and very different features for obstacles. This makes the subsequent learning of the classifier in the higher layers of the network easier.
Fig. 3: Raw image captured by the USV (left), WaSR segmentation output (middle) and post-processed segmentation output (right). Water, sky and obstacles are depicted with cyan, deep blue and yellow colour respectively. Extracted water-edge and obstacles are denoted with a pink line and yellow bounding boxes, respectively.
We propose enforcing early feature separation by designing a novel loss. Let $\{x_j^c\}_{j \in W}$ and $\{x_i^c\}_{i \in O}$ be features in channel $c$ belonging to pixels in the water region $W$ and the obstacle regions $O$, respectively. Since we would like to enforce clustering of water features, we can approximate their distribution by a Gaussian with per-channel means $\{\mu_c\}_{c=1:N_c}$ and variances $\{\sigma_c^2\}_{c=1:N_c}$, where $N_c$ is the number of channels, and we assume channel independence for computational tractability. Similarity of all other pixels corresponding to obstacles can be measured as a joint probability under this Gaussian, i.e.,
$$p(\{x_i^c\}_{i \in O}) \propto \prod_{c=1}^{N_c} \prod_{i \in O} \exp\left(-\frac{1}{2}\frac{(x_i^c - \mu_c)^2}{\sigma_c^2}\right). \tag{1}$$
We would like to enforce learning of features that minimize this probability. By expanding the equation for water per-channel standard deviations, taking the log of (1), flipping the sign and inverting, we arrive at the following equivalent obstacle-water separation loss
$$L_{ws} = \frac{1}{N_c}\sum_{c=1}^{N_c} \frac{\frac{1}{N_W}\sum_{j \in W}(x_j^c - \mu_c)^2}{\frac{1}{N_O}\sum_{i \in O}(x_i^c - \mu_c)^2}, \tag{2}$$
where $\frac{1}{N_c}$ and $\frac{1}{N_O}$ are added as normalisation constants making the scale independent of the number of channels and obstacle pixels in individual frames, and $N_W$ is the number of water pixels entering the variance estimate. The final loss is a weighted summation of individual losses
$$L = L_{foc} + \lambda_1 L_{L2} + \lambda_2 L_{ws}, \tag{3}$$
where $\lambda_1$ and $\lambda_2$ are the weights.
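To make the separation loss concrete, the sketch below computes one plausible reading of it in NumPy: for each channel, the variance of the water features is divided by the mean squared distance of the obstacle features from the water mean, and the ratios are averaged over channels. The function name, the `eps` stabiliser and the toy data are assumptions made for illustration only.

```python
import numpy as np

def water_obstacle_separation_loss(features, water, obstacle, eps=1e-8):
    """Per-channel water variance over obstacle-to-water-mean distance.

    features: (H, W, C) feature map, water / obstacle: boolean (H, W) masks.
    eps is a hypothetical stabiliser guarding against division by zero.
    """
    fw = features[water]        # (N_W, C) water features
    fo = features[obstacle]     # (N_O, C) obstacle features
    mu = fw.mean(axis=0)        # per-channel water mean
    var_w = ((fw - mu) ** 2).mean(axis=0)    # per-channel water variance
    dist_o = ((fo - mu) ** 2).mean(axis=0)   # mean squared obstacle distance
    return float(np.mean(var_w / (dist_o + eps)))

# Toy single-channel example: water features 0 and 2, one obstacle feature.
water = np.array([[True, True, False]])
obstacle = np.array([[False, False, True]])
far = water_obstacle_separation_loss(np.array([[[0.0], [2.0], [5.0]]]), water, obstacle)
near = water_obstacle_separation_loss(np.array([[[0.0], [2.0], [2.5]]]), water, obstacle)
```

In the toy example the loss is smaller when the obstacle feature lies far from the water cluster (`far`) than when it lies close to it (`near`), which is the behaviour the text asks of the loss: compact water features and well-separated obstacle features.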
C. Segmentation post-processing
The model proposed in Section III-A outputs a segmentation mask where each pixel belongs to exactly one semantic component (water, sky or environment). Pixels marked with water label are used to construct the water-region mask as described in [10]. The largest connected component in the water-region mask represents the navigable surface of the USV, and its upper edge corresponds to the edge of the water. The list of potential obstacles is obtained by extracting blobs of pixels marked with environment label within the water-region. The post-processing procedure and its results are illustrated in Figure 3.
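The post-processing steps can be sketched as follows. This is a simplified illustration rather than the procedure of [10]: the label encoding (0 for environment, 1 for water, 2 for sky), the use of 4-connectivity, and the definition of the water region as all pixels at or below the water edge in each column are assumptions, and both function names are hypothetical.

```python
from collections import deque
import numpy as np

def connected_components(mask):
    """Label 4-connected components of a boolean mask; returns (labels, count)."""
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or labels[r, c]:
                continue
            count += 1
            labels[r, c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                        labels[ny, nx] = count
                        queue.append((ny, nx))
    return labels, count

def postprocess(seg, water_label=1, obstacle_label=0):
    """Return the per-column water-edge row and obstacle bounding boxes."""
    water_labels, n = connected_components(seg == water_label)
    if n == 0:
        return None, []
    sizes = np.bincount(water_labels.ravel())[1:]      # skip background label 0
    largest = water_labels == (np.argmax(sizes) + 1)   # navigable surface
    h = seg.shape[0]
    # Water edge: topmost row of the navigable surface in each column (h if none).
    edge = np.where(largest.any(axis=0), largest.argmax(axis=0), h)
    # Water region: everything at or below the water edge in its column.
    region = np.arange(h)[:, None] >= edge[None, :]
    blobs, m = connected_components((seg == obstacle_label) & region)
    boxes = []
    for k in range(1, m + 1):
        ys, xs = np.nonzero(blobs == k)
        boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return edge, boxes

seg = np.array([
    [2, 2, 2, 2, 2],
    [0, 0, 2, 2, 2],   # shoreline above the water edge: not an obstacle
    [1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1],   # floating object inside the water region
    [1, 1, 1, 1, 1],
])
edge, boxes = postprocess(seg)
```

In the toy segmentation, the environment pixels on the shoreline lie above the water edge and are therefore not reported, while the blob surrounded by water yields one bounding box in (x_min, y_min, x_max, y_max) form.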
The dataset and the evaluation protocol are described in Section IV-A, the implementation details are given in Section IV-B, and the comparison to state-of-the-art and the ablation study are given in Section IV-C and Section IV-D, respectively.
A. Performance evaluation protocol and the dataset
We follow the recent protocol for evaluation of segmentation-based obstacle detectors in marine environment [15]. The network is trained on the MaSTr1325 dataset [15], which is currently the largest annotated maritime segmentation dataset. The dataset was captured in a coastal sea area with a real USV during a period of 24 months and consists of 1325 high-resolution images (pixels) of various representative marine environment scenes (see Figure 4 top row). Each image is manually segmented per pixel by human annotators into three semantic components: sea, sky and obstacles. The edges of the semantically different components are labelled with the “unknown” category in order to address the annotation uncertainty and to allow automatic exclusion of these pixels from learning. Each image is accompanied by a read-out from an IMU sensor on-board the USV.
Fig. 4: MaSTr1325 [15] (top) and Modd2 [10] (bottom) datasets exhibit a large scene and appearance variability.
Performance is evaluated on the Modd2 dataset [10], which is currently the most challenging public USV dataset due to the large variety of scenarios it contains (object mirroring, glitter, and various weather conditions). Examples of images from this dataset are shown in the bottom of Figure 4. The dataset consists of 28 stereo sequences, time-synchronised with measurements of the on-board IMU. Following the guidelines from [15], the left-camera image is used for evaluation. Obstacles and the water edge are manually annotated with bounding boxes and a polygon, respectively.
As in MaSTr1325 [15], we use the standard performance evaluation measures from [4]. The accuracy of water-edge estimation is reported by mean-squared error computed over all sequences, while the accuracy of detected obstacles is measured by the number of true positives (TP), false positives (FP), false negatives (FN) and by the overall F-measure, i.e., a harmonic mean of precision and recall.
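As a small worked example of the detection measure, the F-measure can be computed from the three counts as below. The counts in the example are made up for illustration and are not results from this paper.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall computed from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 0.8 and recall = 0.8.
score = f_measure(tp=80, fp=20, fn=20)
```

Because the harmonic mean is dominated by the smaller of the two terms, a detector cannot reach a high F-measure by trading many false positives for a few extra true positives, or the other way round.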
B. Implementation details
Fast and accurate detection is crucial for autonomous systems. To gain speed, all input images were scaled to the resolution pixels by bilinear interpolation. This resolution retains all hazardous obstacles visible. Detections, as well as ground truth obstacles, with surface area of less than
pixels were ignored, since they do not pose a threat at the given resolution.
Dataset augmentation is used to increase generalisation capability of the trained networks. We applied vertical mirroring and central rotations of degrees on whole training images, while elastic deformation was applied solely on the water component of training images. Following [15], we also applied colour-transfer augmentation, resulting in total of 54325 training images.
All networks were trained using an RMSProp optimizer with a momentum of 0.9, initial learning rate and standard polynomial reduction decay of 0.9. The weights of the ResNet-101 backbone were pre-trained on ImageNet [33], while the remaining additional trainable parameters of our model (e.g., those from adding the IMU channel and those in FFM, ARM and ASPP) were randomly initialised using Xavier [34]. The networks were fine-tuned on the augmented training set for five epochs.
WaSR was implemented in TensorFlow and all experiments were run on a desktop computer with an Intel Core i7-7700 3.6 GHz CPU and an NVIDIA GTX 1080 Ti GPU with 11 GB of GPU memory.
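The learning-rate schedule can be sketched as below, assuming that the "polynomial reduction decay of 0.9" refers to the commonly used polynomial policy with power 0.9. The base learning rate and the number of steps in the example are placeholders chosen for illustration, not the values used for training WaSR.

```python
def poly_learning_rate(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    step = min(step, max_steps)  # keep the rate at zero after the last step
    return base_lr * (1.0 - step / max_steps) ** power

# Placeholder base rate and horizon, for illustration only.
schedule = [poly_learning_rate(1e-4, s, 1000) for s in (0, 500, 1000)]
```

With a power below one, the rate decays slightly slower than linearly for most of the training and drops to zero at the final step.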
C. Comparison with the state-of-the-art
WaSR from Section III was compared to five recent state-of-the-art networks: PSPNet [12], SegNet [35] and BiSeNet [13] were selected since they obtain state-of-the-art performance on segmentation tasks for autonomous cars, DeepLab3+ [36] (denoted as DL3+) was selected as a state-of-the-art general-purpose segmentation network, and a DeepLab variant called DeepLab2 [15] (denoted as DL2) was chosen since it achieved the best performance on a maritime segmentation problem [15] among several networks. The results are summarised in Table I.
On the task of water-edge estimation, the proposed WaSR outperforms all other networks by a large margin. The second best is BiSeNet, lagging behind by three pixels, followed by DL2, SegNet, PSPNet and DL3+. Visual inspection shows that other networks struggle with accurately estimating the water edge in the presence of haze on the horizon, while WaSR neither overshoots nor undershoots its location. Some examples are shown in Figure 5. WaSR also shows impressive robustness to severe environmental mirroring in the water and estimates the water edge accurately even under these conditions (Figure 5 third row), while operating in real-time at approximately 10 frames-per-second.
WaSR detects the highest number of true positives, followed by PSPNet, SegNet, BiSeNet, DL3+ and DL2. Qualitative comparison (Figure 5 second, third and fourth row) shows that WaSR detects smaller obstacles more accurately than the other networks. While most of the other networks produce false positives on glitter, reflections and wakes, WaSR is largely robust to these and does not mistake them for obstacles (Figure 5 fourth and fifth row). A closer observation of the first two rows in Figure 5 shows that the other networks perform poorly in the presence of distinct wakes caused by boats. This results either in deteriorated water-edge estimation or false detections on the wake edges. Several networks experience noisy false detections across the image when the USV faces a hazy open sea (Figure 5 fourth row), while WaSR remains unaffected. In fact, WaSR obtains the second-lowest false positive rate, closely following DL2. However, this is because DL2 is prone to poor detection of isolated obstacles, leading to a high false negative rate and a relatively low true positive rate.
D. Ablation study
The two major novelties in the WaSR architecture are (i) the water-obstacle separation loss (2), and (ii) fusion of the external IMU sensor with the image data (Section III). To evaluate the contribution of each, two variants of WaSR were created and evaluated by the procedure from Section IV-C. The first variant was WaSR with the water-obstacle separation loss removed and the second variant was WaSR with the IMU fusion removed. Results in Table II indicate that both the separation loss and the IMU fusion substantially improve the performance.
TABLE I: Results on Modd2 [10] report water-edge estimation error in pixels, the number of true positives (TP), false positives (FP), false negatives (FN) and the F-measure.
TABLE II: Ablation study results on Modd2 [10], determining the importance of the IMU information and water-obstacle separation loss in the proposed architecture. We report the water-edge estimation error, measured in pixels, the number of true positive (TP), false positive (FP), false negative (FN) detections and the F-measure.
A detailed inspection of Table II shows that the water-obstacle separation loss significantly improves the detection accuracy, resulting in an increase of true positives and a notable reduction of false positives. This is illustrated in Figure 6 (second row), where a small buoy in the distance is detected only by the network variants that use the separation loss during training (the full WaSR and the variant without IMU fusion). The separation loss also improves segmentation of nearby large objects, which is illustrated on an example of a pier in Figure 6 (third row). Benefits are also apparent in water-edge estimation accuracy when the USV faces towards mainland or large proximal obstacles.
Improvements of water-edge estimation from IMU fusion are most apparent when the USV faces the open water. An example in Figure 6 (first row) shows that the water edge is strongly overestimated when not using the IMU, leading to misclassifying an entire island on the far-left side. Similarly, in Figure 6 (second row) the water edge above the dinghy is more accurately estimated when the IMU is used.
A failure case is illustrated in Figure 6 (last row). All variants of WaSR experience segmentation difficulties. Even though the IMU fusion improves the estimated water edge, part of it is still under-estimated and a small false positive is detected on the wake. While this type of misclassification does not lead to a USV collision, it clearly shows room for further improvement.
Fig. 5: Qualitative comparison of segmentation quality. The sky, obstacles and water components are denoted with deep-blue, yellow and cyan colour, respectively. Correctly detected obstacles are marked with green bounding box, false positive detections with orange bounding box and undetected obstacles with red bounding box.
Fig. 6: Qualitative analysis of the effects of using the water-obstacle separation loss and IMU fusion. The sky, obstacles and water are denoted by deep-blue, yellow and cyan, respectively. Detected obstacles are denoted by green (true positive), orange (false positive) and red (false negative).
A novel obstacle detection deep neural network, WaSR, for USV navigation was presented. WaSR improves the water-edge segmentation and overall obstacle detection by fusing visual information with inertial sensory data from an on-board IMU. A deep encoder extracts rich visual features from the input image, while a non-symmetric and shallow decoder fuses the visual features with inertial data. Additional robustness is achieved by introducing a novel water-obstacle separation loss at the end of the encoder, which enforces learning a feature space in which separation between water and obstacle appearances is increased.
Experimental results show that WaSR outperforms the state-of-the-art by over 14% in F-measure. Compared to the second-best method BiSeNet [13], WaSR increases true positives by 8%, and reduces false positives and false negatives by 64% and 69%, respectively. Water edge estimation accuracy is improved by three pixels, which means that the obstacle localization error is reduced by several hundred meters for obstacles close to the horizon. The ablation study further validated the importance of individual design choices of WaSR, in particular, the new water-obstacle separation loss and the IMU fusion pipeline.
Our future work will focus on further speeding up the segmentation, while maintaining the accuracy. Given a significant performance boost on the USV domain, it will be interesting to test whether the architecture generalises to other, non-USV, maritime [29], [27] and AGV [18] scenarios.
[1] C. Onunka and G. Bright, “Autonomous marine craft navigation: On the study of radar obstacle detection,” in ICCAR 2010, Dec 2010, pp. 567–572.
[2] A. R. J. Ruiz and F. S. Granja, “A short-range ship navigation system based on ladar imaging and target tracking for improved safety and efficiency,” ITS, vol. 10, no. 1, pp. 186–197, March 2009.
[3] H. K. Heidarsson and G. S. Sukhatme, “Obstacle detection and avoidance for an autonomous surface vehicle using a profiling sonar,” in ICRA 2011, May 2011, pp. 731–736.
[4] M. Kristan, V. S. Kenk, S. Kovačič, and J. Perš, “Fast image-based obstacle detection from unmanned surface vehicles,” IEEE TCYB, vol. 46, no. 3, pp. 641–654, 2016.
[5] T. Cane and J. Ferryman, “Saliency-based detection for maritime object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 18–25.
[6] B. Bovcon and M. Kristan, “Obstacle detection for usvs by joint stereo-view semantic segmentation,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
[7] D. K. Prasad, C. K. Prasath, D. Rajan, L. Rachmawati, E. Rajabally, and C. Quek, “Object detection in a maritime environment: Performance evaluation of background subtraction methods,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–16, 2018.
[8] J. Muhovič, B. Bovcon, M. Kristan, J. Perš, et al., “Obstacle tracking for unmanned surface vessels using 3-d point cloud,” IEEE Journal of Oceanic Engineering, 2019.
[9] H. Wang and Z. Wei, “Stereovision based obstacle detection system for unmanned surface vehicle,” in ROBIO, 2013, pp. 917–921.
[10] B. Bovcon, J. Perš, M. Kristan, et al., “Stereo obstacle detection for unmanned surface vehicles by IMU-assisted semantic segmentation,” Robotics and Autonomous Systems, vol. 104, pp. 1–13, 2018.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE TPAMI, vol. 40, no. 4, pp. 834–848, 2018.
[12] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
[13] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.
[14] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
[15] B. Bovcon, J. Muhovič, J. Perš, and M. Kristan, “The mastr1325 dataset for training deep usv obstacle detection models,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019.
[16] J. Larson, M. Bruch, R. Halterman, J. Rogers, and R. Webster, “Advances in autonomous obstacle avoidance for unmanned surface vehicles,” SPAWAR San Diego, Tech. Rep., 2007.
[17] T. Huntsberger, H. Aghazarian, A. Howard, and D. C. Trotz, “Stereo vision–based navigation for autonomous surface vessels,” Journal of Field Robotics, vol. 28, no. 1, pp. 3–18, 2011.
[18] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
[19] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3354–3361.
[20] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving video database with scalable annotation tooling,” arXiv preprint arXiv:1805.04687, 2018.
[21] C. Y. Jeong, H. S. Yang, and K. D. Moon, “Horizon detection in maritime images using scene parsing network,” Electronics Letters, vol. 54, no. 12, pp. 760–762, 2018.
[22] S.-J. Lee, M.-I. Roh, H.-W. Lee, J.-S. Ha, I.-G. Woo, et al., “Image-based ship detection and classification for unmanned surface vehicle using real-time object detection neural networks,” in The 28th International Ocean and Polar Engineering Conference. International Society of Offshore and Polar Engineers, 2018.
[23] J. Yang, Y. Li, Q. Zhang, and Y. Ren, “Surface vehicle detection and tracking with deep learning and appearance feature,” in 2019 5th International Conference on Control, Automation and Robotics (ICCAR). IEEE, 2019, pp. 276–280.
[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[25] W. Zhan, C. Xiao, Y. Wen, C. Zhou, H. Yuan, S. Xiu, Y. Zhang, X. Zou, X. Liu, and Q. Li, “Autonomous visual perception for unmanned surface vehicle navigation in an unknown environment,” Sensors, vol. 19, no. 10, p. 2216, 2019.
[26] T. Cane and J. Ferryman, “Evaluating deep semantic segmentation networks for object detection in maritime surveillance,” in 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2018, pp. 1–6.
[27] L. Patino, T. Nawaz, T. Cane, and J. Ferryman, “Pets 2017: Dataset and challenge,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 2126–2132.
[28] R. Ribeiro, “The seagull dataset,” http://vislab.isr.ist.utl.pt/seagull-dataset, [Online; accessed 26-February-2019].
[29] D. K. Prasad, D. Rajan, L. Rachmawati, E. Rajabally, and C. Quek, “Video processing from electro-optical sensors for object detection and tracking in a maritime environment: a survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 8, pp. 1993–2016, 2017.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[32] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009, pp. 248–255.
[34] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
[35] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[36] H. Liu, Y. Liu, X. Gu, Y. Wu, F. Qu, and L. Huang, “A deeplearning based multi-modality sensor calibration method for USV,” in 2018 IEEE Fourth International Conference on Multimedia Big Data