Visual perception is at the heart of autonomous systems and vehicles [18], [17]. This field has seen tremendous progress during the recent wave of Deep Neural Network (DNN) architectures and methods [52], [16], [14]. The large majority of computer vision benchmarks are currently dominated by diverse and increasingly effective models, encouraging further use in practical applications, e.g., automatic diagnosis for healthcare, traffic surveillance, autonomous vehicles, etc. Such methods reach top performance on individual tasks by leveraging multi-million parameter models that require powerful hardware, usually for training but also for prediction. Perception systems in autonomous vehicles must analyse and understand their surroundings at all times in order to support the multiple micro-decisions needed in traffic, e.g., steering, accelerating, braking, signaling, etc. Consequently, a plethora of specific tasks must be addressed simultaneously, e.g., object detection, segmentation [47], depth estimation [27], motion estimation [48], [56], localization [37], soiling detection [54]. Meanwhile, hardware constraints in vehicles significantly limit DNN capacity and the number of tasks that can be solved. Using a separate DNN for each individual task becomes infeasible. Multi-Task Learning (MTL) is thus an appealing solution that strikes a good compromise between the competing requirements of reliability, high performance and limited hardware.
Multi-task networks consist of a shared network backbone followed by a collection of “heads”, typically one for each task. The flexibility of DNNs makes it easy for practitioners to envision diverse architectures according to the available data and annotations. A major advantage of this unified model is computational efficiency [50], [49]. Moreover, such models save development and training time as shared layers replace learning of multiple sets of parameters in different models. Unified models learn features across tasks, increasing robustness to over-fitting by acting as a regularizer, as shown in previous multi-task networks [24], [39], [53].
Fig. 1: Overview of Task Weighting Methods
However, multi-task networks are typically difficult to train, as different tasks need to be adequately balanced such that the learned parameters are useful across all tasks. Furthermore, tasks might have different difficulties and learning paces [13] and can negatively impact each other once one task starts overfitting before the others. Several MTL approaches have recently attempted to mitigate this problem through optimization of multi-task architectures [36], [35], learning relationships between tasks [34], [51] or, most commonly, by weighting the task losses [3], [22], [32] (Figure 1). In most works a new problem and task configuration is proposed and only a few baselines are considered. For a new problem and dataset, it is difficult to decide a priori which technique is better. In this work we benchmark multiple task-weighting methods for a better view on the progress so far.
Meta-learning derived techniques are increasingly popular for solving the tedious and difficult task of tuning hyper-parameters for training a network. Recent methods show encouraging results in finding the network architecture for a given task [59], [31]. We propose an evolutionary meta-learning strategy for finding the optimal task weights and exploit our proposed benchmark for emphasizing the interest of such an approach for this problem.
In summary, the contributions of our work are: (1) We conduct a thorough evaluation of several popular and high-performing task-weighting approaches on a two-task setup across three automotive datasets. We notice that among state-of-the-art methods there is no clear winner across datasets, as methods are relatively close in performance (including simple baselines) and the ranking varies. (2) We propose a simple weight learning technique for the two-task setting, where the network learns the task weights by itself. (3) We propose learning the optimal task weights by combining evolutionary meta-learning with task-based selective backpropagation (deciding which tasks are turned off for a number of iterations). This method outperforms baseline methods across tasks and datasets.
Multi-task learning. MTL is not a novel problem and has been studied before the deep learning revival [2]. MTL has been applied to various applications outside computer vision, e.g. natural language processing [7], speech processing [20], reinforcement learning [28]. For additional background on MTL we refer the reader to this recent review [44].
Multi-task networks. In general, MTL is compatible with several computer vision problems where the tasks are rather complementary and help optimization. MultiNet [53] introduces an architecture for semantic segmentation, object detection and classification. With UberNet [24], Kokkinos tackles 7 computer vision problems over the same backbone architecture. Cross-stitch networks [38] learn to combine multi-task neural activations at multiple intermediate layers. Progressive Networks [45] consist of multiple neural networks added sequentially with new tasks, and transfer knowledge from previously trained networks to the newly added one (previous networks are frozen each time a new task and network are added). In PackNet [36], a network is trained over a sequence of tasks and for each new task only the least-active neurons are trained selectively. AuxNet [5] uses auxiliary tasks to improve the performance of the main task using multi-task learning. Rebuffi et al. [40] train a network over 10 datasets and tasks, and for each task require a reduced set of parameters attached to several intermediate layers. In some cases, a single computer vision problem can be transformed into an MTL problem, e.g., Mask R-CNN [14] decomposes instance segmentation into object detection + classification + semantic segmentation. This approach is common in object detection [42].
Task loss weighting. Initial Deep MTL networks made use of a weighted sum of individual task losses [53], [25]. Recently, more complex heuristics have started to emerge for balancing the task weights using: per-task uncertainty estimation [22], difficulty of the tasks in terms of precision and accuracy [13], statistics from task losses over time [32] or from their corresponding gradients [3].
Meta-learning is a learning mechanism that uses experience from other tasks. The most common use-case of meta-learning is the automatic adaptation of an algorithm for a task at hand. More specifically, meta-learning can be used for hyper-parameter optimization [29], for exploring network architectures [59], [31], [21] or various non-trivial combinations of variables, e.g. data augmentation [9]. In this line of research, we adapt an evolutionary meta-learning strategy for finding the optimal task weights along with the strategy for asynchronously training the two tasks.
In the following, we provide a formal definition of the MTL setting, which will allow us to provide a common background and easier understanding of the multiple task weighting approaches compared and proposed in this work. Consider an input data space $\mathcal{X}$ and a collection of $T$ tasks with corresponding label spaces $\{\mathcal{Y}^t\}_{t=1}^{T}$. In MTL problems, we have access to a dataset of $N$ i.i.d. samples $\mathcal{D} = \{x_i, y_i^1, \dots, y_i^T\}_{i=1}^{N}$, where $y_i^t$ is the label of the data point $x_i$ for the task $t$. In computer vision $x_i$ usually corresponds to an image, while $y_i^t$ can correspond to a variety of data types, e.g., scalar(s), class label, 2D heatmap/class map, etc.
The main component in any MTL is a model $f$, which in our case is a CNN with learnable parameters $\theta$. The most commonly encountered approach for MTL in neural networks is hard parameter sharing [2], where there is a set of hidden layers shared between all tasks, i.e., the backbone, to which multiple task-specific layers are connected. Formally, the model $f$ becomes:
$$f(x; \theta) = \left( f^1(x; \theta^{sh}, \theta^1), \dots, f^T(x; \theta^{sh}, \theta^T) \right),$$
where $\theta^{sh}$ are the parameters of the shared backbone. For clarity, we denote by $\{\theta^t\}_{t=1}^{T}$ the set of parameters coming from all task-specific layers. Each task has its own specific loss function $\mathcal{L}^t$ attached to both its specific layers $\theta^t$ and the common backbone $\theta^{sh}$. The optimization objective for $f$ boils down to the joint minimization of all the $T$ task losses as follows:
$$\min_{\theta^{sh}, \theta^1, \dots, \theta^T} \sum_{t=1}^{T} w^t \mathcal{L}^t(\theta^{sh}, \theta^t),$$
where $w^t$ are per-task weights that can be static, computed dynamically or learned by $f$, in which case $w^t \in \theta$.
Weighted losses for MTL are intuitive and easy to formulate; however, they are more difficult to deploy. The main challenge is related to computing the per-task weights $w^t$. This is non-trivial as the optimal weights for a given task can evolve in time depending on the difficulty of the task and on the content of the training set [3], [13], e.g., diversity of samples, class imbalance, etc. Also, the task weights can depend on the affinity between the considered tasks [58] and the way they complement [51] or counter each other [46], relationships that potentially evolve across training iterations. Optimization algorithms with momentum or adaptive step sizes, e.g., SGD with momentum or ADAM [23], can also influence the dynamics of MTL, by attenuating the impact of a wrongly tuned weight or, on the contrary, by keeping the bias of a previously wrong direction active for more iterations. In practice, this challenging problem is solved via lengthy and expensive grid search or alternatively via a diversity of heuristics with varying degrees of complexity. In this work, we rather explore the latter type of approaches and propose two heuristics for estimating optimal weights to improve performance, namely simple dynamic task weighting loss approaches and a meta-learning based approach with asynchronous backpropagation.
In this section, we first review the most frequent task weighting methods encountered in literature and in practice (IV-A), and then describe our contributed approaches for this problem (IV-B, IV-C, IV-D). Here we consider a two-task setup, where we train a CNN for joint object detection and semantic segmentation (Figure 2). In the following we will adapt the definitions of the task weighting methods to this setup with T = {det, seg}.
A. Baselines
1) No task weighting: An often encountered approach in MTL is to not assign any weights to the task losses [53], [39], [25]. The optimized loss is then just the sum of individual task losses with all task weights set to 1.0. This can also occur when the practitioner adds an extra loss at the output of the network, not necessarily realising that the problem has become MTL. While very simple, this approach has a number of issues. First, the network is now extremely sensitive to imbalances in task data and to differences in task loss ranges and scales (cross-entropy, etc.). Due to these variations and desynchronization, some of the task losses advance faster than the others. Consequently, by the time the “slower” task converges, the “faster” task will have already overfitted. This highlights the necessity of balancing losses during training.
2) Handcrafted task weighting: Here, the loss weights are found and set manually. We can achieve this by inspecting the value of the loss for several samples. The losses are then weighted such that they are brought to the same scale; the weights are computed using the values of the losses at the first iterations and remain constant during training:
$$\mathcal{L} = w_{seg} \mathcal{L}_{seg} + w_{det} \mathcal{L}_{det},$$
where $w_{seg}$ and $w_{det}$ are the weights, $\mathcal{L}_{seg}$ and $\mathcal{L}_{det}$ the losses for the semantic segmentation and object detection branches respectively, while $\mathcal{L}_t^{0}$ is the loss for task $t$ at the first training iterations, from which the weights are derived.
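As an illustration of the handcrafted scheme, the following minimal sketch derives static weights from the initial loss values. The exact weighting rule is an assumption made here for illustration (inverse initial loss, normalized to sum to one), as are the function names and the numeric values.

```python
def handcrafted_weights(initial_losses):
    """Return static weights that bring all task losses to a common scale.

    initial_losses maps a task name to the loss observed at the first
    training iterations (must be positive).
    ASSUMPTION: weights are the inverse initial losses, normalized to sum to 1.
    """
    inverse = {task: 1.0 / value for task, value in initial_losses.items()}
    total = sum(inverse.values())
    return {task: inv / total for task, inv in inverse.items()}


def total_loss(losses, weights):
    """Weighted sum of the task losses; the weights stay fixed during training."""
    return sum(weights[task] * losses[task] for task in losses)


# Hypothetical initial losses: detection starts 40 times larger than segmentation.
initial = {"seg": 0.5, "det": 20.0}
weights = handcrafted_weights(initial)
print(weights, total_loss(initial, weights))
```

With these weights both weighted losses contribute equally at the start of training, which is the "same scale" property described above.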
3) Dynamic task loss scaling: For this method, we take into account the evolution of per-task losses during training. We compute the task weights dynamically at the end of every training epoch from $\bar{\mathcal{L}}_t$, the average loss of task $t$ over the previous epoch.
Fig. 2: Multi-task visual perception network architecture
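A minimal sketch of such an epoch-wise update is given below. The update rule is an assumption chosen for illustration (weights inversely proportional to the average loss of the previous epoch, normalized to sum to one), and the class and method names are hypothetical.

```python
class DynamicTaskWeights:
    """Recompute task weights at the end of every epoch from average losses."""

    def __init__(self, tasks):
        self.weights = {task: 1.0 / len(tasks) for task in tasks}
        self._sums = {task: 0.0 for task in tasks}
        self._batches = 0

    def record_batch(self, losses):
        """Accumulate the per-task losses of one mini-batch."""
        for task, value in losses.items():
            self._sums[task] += value
        self._batches += 1

    def end_epoch(self):
        """ASSUMPTION: new weight is proportional to 1 / average epoch loss."""
        averages = {t: s / self._batches for t, s in self._sums.items()}
        inverse = {t: 1.0 / avg for t, avg in averages.items()}
        total = sum(inverse.values())
        self.weights = {t: inv / total for t, inv in inverse.items()}
        # Reset the accumulators for the next epoch.
        self._sums = {t: 0.0 for t in self._sums}
        self._batches = 0
        return self.weights


balancer = DynamicTaskWeights(["seg", "det"])
balancer.record_batch({"seg": 1.0, "det": 3.0})
balancer.record_batch({"seg": 1.0, "det": 5.0})
print(balancer.end_epoch())
```

Unlike the handcrafted scheme, the weights here follow the losses as they evolve, so a task whose loss drops quickly is given a progressively larger weight under this assumed rule.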
4) Uncertainty-based weighting: Kendall et al. [22] propose looking into aleatoric uncertainty for computing the task weights adaptively during training. They argue that each task has its own homoscedastic uncertainty, which can be learned by the network for each task during training (through a learnable noise parameter $\sigma_t$). Since they are based on homoscedastic uncertainty, the task weights are not input-dependent and converge to a constant value after a number of iterations [22]. The Gaussian likelihood is used as loss function for this method.
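The sketch below illustrates the idea with a commonly used simplification of the objective in [22]: each task loss is scaled by $\frac{1}{2}\exp(-s_t)$ and regularized by $\frac{1}{2}s_t$, where $s_t = \log \sigma_t^2$ is learned. The use of the log-variance parameterization for every task, the plain gradient descent loop and the fixed loss values are assumptions made for illustration only.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Sum over tasks of 0.5 * exp(-s_t) * L_t + 0.5 * s_t, with s_t = log sigma_t^2."""
    total = 0.0
    for task, loss in task_losses.items():
        s = log_variances[task]
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


def log_variance_gradients(task_losses, log_variances):
    """Analytic gradient of the objective above with respect to each s_t."""
    return {
        task: -0.5 * math.exp(-log_variances[task]) * loss + 0.5
        for task, loss in task_losses.items()
    }


# Hypothetical fixed task losses, used only to show where the weights settle.
losses = {"seg": 0.5, "det": 20.0}
log_vars = {"seg": 0.0, "det": 0.0}
for _ in range(2000):
    grads = log_variance_gradients(losses, log_vars)
    log_vars = {t: log_vars[t] - 0.1 * grads[t] for t in log_vars}

effective_weights = {t: 0.5 * math.exp(-log_vars[t]) for t in log_vars}
print(effective_weights)
```

For a fixed loss value the learned variance settles at $\sigma_t^2 = \mathcal{L}_t$, so the task with the larger loss ends up with the smaller effective weight, which matches the convergence to constant weights mentioned above.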
5) GradNorm: This method from [3] views multi-task network training as a problem of unbalanced gradient magnitudes back propagated through the shared layers (encoder). This solution normalizes the unbalanced task gradients by optimizing a new gradient loss that controls the task loss weights. Task loss weights are updated using gradient descent of this new loss.
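To make the gradient loss concrete, the sketch below evaluates it for given gradient norms, following the formulation in [3]: each task's gradient norm is pulled towards the mean norm scaled by the task's relative inverse training rate raised to a power `alpha`. In practice the gradient norms come from automatic differentiation with respect to the last shared layer; here they are passed in as plain numbers, and all names and values are illustrative assumptions.

```python
def gradnorm_loss(grad_norms, current_losses, initial_losses, alpha=1.5):
    """Return the GradNorm gradient loss and the per-task target norms.

    grad_norms[t] is the norm of the gradient of w_t * L_t with respect to
    the shared weights; alpha controls the strength of the balancing.
    """
    tasks = list(grad_norms)
    # Loss ratio: how much of the initial loss is still left for each task.
    loss_ratios = {t: current_losses[t] / initial_losses[t] for t in tasks}
    mean_ratio = sum(loss_ratios.values()) / len(tasks)
    # Relative inverse training rate: > 1 for tasks that learn slowly.
    relative_rates = {t: loss_ratios[t] / mean_ratio for t in tasks}
    mean_norm = sum(grad_norms.values()) / len(tasks)
    targets = {t: mean_norm * relative_rates[t] ** alpha for t in tasks}
    loss = sum(abs(grad_norms[t] - targets[t]) for t in tasks)
    return loss, targets


# Hypothetical values: detection has a much larger gradient norm.
loss, targets = gradnorm_loss(
    grad_norms={"seg": 1.0, "det": 3.0},
    current_losses={"seg": 1.0, "det": 4.0},
    initial_losses={"seg": 2.0, "det": 8.0},
)
print(loss, targets)
```

The task weights are then updated by descending this loss while the target norms are treated as constants.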
6) Geometric loss: The Geometric Loss Strategy [4] is a parameter-free loss function that avoids the manual fine-tuning of task weights. It consists of a geometric mean of losses instead of the usual weighted arithmetic mean. For example, a $T$-task loss function can be expressed as
$$\mathcal{L} = \prod_{t=1}^{T} \sqrt[T]{\mathcal{L}_t}.$$
The loss strategy was tested with a three-task network on the KITTI [11] and Cityscapes [8] datasets. The loss function acts as a dynamically adapted weighted arithmetic sum in log space; these weights act as regularizers and control the rate of convergence between the losses.
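The geometric mean of the task losses can be computed in log space, which also makes the "arithmetic sum in log space" interpretation explicit. This is a minimal sketch with a hypothetical function name; it assumes strictly positive loss values.

```python
import math


def geometric_loss(task_losses):
    """Geometric mean of positive task losses, computed as exp(mean(log L_t))."""
    log_sum = sum(math.log(loss) for loss in task_losses)
    return math.exp(log_sum / len(task_losses))


# Two hypothetical task losses on very different scales.
print(geometric_loss([2.0, 8.0]))
```

Because the logarithm turns the product into a sum, the gradient contribution of each task is scaled by the inverse of its own loss, so no task weight needs to be tuned by hand.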
In the following we describe our proposed approaches for task weighting.
B. Weight learning
Doersch and Zisserman [10] use weighted cross connections between the shared encoder and task-specific decoders, adjusted via learning. In [22] task weighting parameters are learned during training. Inspired by these two approaches, we propose a single-parameter learning strategy for a two-task network as follows:
$$\mathcal{L} = \alpha \mathcal{L}_{seg} + (1 - \alpha) \mathcal{L}_{det}, \quad \alpha = \mathrm{sigmoid}(\beta),$$
where $\alpha$ is the weight balancing term computed from the learnable parameter $\beta$, which is updated by backpropagation at each training iteration. Note that here the task weights are updated after each mini-batch.
This simple weight learning method enables the network to adjust by itself the pace of learning of the two tasks. The sigmoid outputting the term $\alpha$ serves as a gating mechanism [6] to balance the two tasks while taking into account the interactions between the two. Bounding the weights in [0, 1] implicitly regularizes learning by removing the risk of having extremely unbalanced task weights.
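The gating mechanism can be sketched as follows. The parameter name `beta`, the assignment of the gate to the segmentation loss and of its complement to the detection loss, and the plain gradient step are assumptions made for illustration; in a real pipeline the gradient would be produced by the framework's automatic differentiation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_loss(loss_seg, loss_det, beta):
    """Two-task loss balanced by a single sigmoid gate (ASSUMED form)."""
    alpha = sigmoid(beta)
    return alpha * loss_seg + (1.0 - alpha) * loss_det


def gated_loss_grad_beta(loss_seg, loss_det, beta):
    """Analytic derivative of gated_loss with respect to beta."""
    alpha = sigmoid(beta)
    return alpha * (1.0 - alpha) * (loss_seg - loss_det)


# One hypothetical mini-batch update of the learnable parameter.
beta = 0.0
learning_rate = 0.01
beta -= learning_rate * gated_loss_grad_beta(0.5, 2.0, beta)
print(sigmoid(beta), 1.0 - sigmoid(beta))
```

Both weights always stay in [0, 1] and sum to one, whatever value the learnable parameter takes.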
C. Task weighting using Evolutionary Meta-learning
Task weighting can be viewed as a hyper-parameter optimization problem with T numeric variables, one per task. We use as base method an efficient and extended version of Evolution Strategies [41] (ES). The extensions of ES allow the simultaneous optimization of linearly and exponentially scaled numerical variables as well as categorical variables [1]. All variables are treated independently so that the system can handle any number of variables. Furthermore, the variable gradient information is exploited in a semi-greedy way in the mutation operation, which is inspired by Natural Evolution Strategies [55]. The gradient towards the last most promising direction with respect to the target metric is added as a bias for every numerical value. Together with the random noise of the mutation, this allows the algorithm to escape local minima while converging fast. Finally, in order to prevent repeated evaluations of the same region in the search space, a Tabu search method [12] is applied. A history of all tested configurations is stored and a distance metric between them is defined for all numerical variables with respect to the relative differences normalized to the search space range. The mutation operation then generates candidates that have to fulfill a minimum distance of at least 0.1% of the search space range towards already tested solutions.
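The mutation step with directional bias and Tabu check can be sketched as below. This is a simplified illustration rather than the implementation of [1]: variables are mutated on a linear scale, the distance is taken as the largest per-variable relative difference, and the parameter names (`sigma`, `bias`, `max_tries`) and their values are assumptions.

```python
import random


def relative_distance(a, b, ranges):
    """Largest per-variable difference, relative to the search-range width."""
    return max(abs(a[k] - b[k]) / (ranges[k][1] - ranges[k][0]) for k in ranges)


def mutate(parent, direction, ranges, history, sigma=0.1, bias=0.5,
           min_distance=0.001, rng=random, max_tries=100):
    """Generate a child that is at least min_distance away from all tested ones.

    direction holds, per variable, the last promising step (e.g. the
    difference between the two most recent best configurations).
    """
    for _ in range(max_tries):
        child = {}
        for key, (low, high) in ranges.items():
            noise = rng.gauss(0.0, sigma * (high - low))
            step = bias * direction.get(key, 0.0) + noise
            # Clip the mutated value to the search range.
            child[key] = min(high, max(low, parent[key] + step))
        if all(relative_distance(child, old, ranges) >= min_distance
               for old in history):
            history.append(child)  # Tabu list: remember every tested candidate.
            return child
    raise RuntimeError("no sufficiently novel candidate found")


ranges = {"w_seg": (0.0, 1.0), "w_det": (0.0, 1.0)}
parent = {"w_seg": 0.5, "w_det": 0.5}
history = [dict(parent)]
child = mutate(parent, {"w_seg": 0.05}, ranges, history, rng=random.Random(0))
print(child)
```

The bias term moves offspring along the direction that last improved the target metric, while the noise term and the minimum-distance check keep the search from revisiting the same region.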
The search space is defined by one numerical variable per task, the task weight $w_t$, bounded to a fixed range. The weight is optimized on an exponential scale as the optimal weight ratio can be non-linear. Furthermore, the final task weight coefficients are normalized such that their sum is one, with the goal of leaving the overall magnitude of the loss unchanged, i.e., $\sum_{t} w_t = 1$.
In order to guide the optimization to an equilibrium between the tasks, the geometric mean between the detection mAP and the segmentation mIoU is used as target metric.
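The three ingredients described above (exponential scaling, normalization and the target metric) can be sketched as follows. The mapping from a unit interval to the weight range is an assumption about how the exponential scale is realized; the ranges in the usage example are the ones reported in the implementation details, and the function names are hypothetical.

```python
import math


def exp_scaled_weight(u, low, high):
    """Map u in [0, 1] to [low, high] on an exponential (log-uniform) scale."""
    return low * (high / low) ** u


def normalize(weights):
    """Rescale the task weights so that they sum to one."""
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


def target_metric(map_det, miou_seg):
    """Geometric mean of detection mAP and segmentation mIoU."""
    return math.sqrt(map_det * miou_seg)


raw = {
    "seg": exp_scaled_weight(0.5, 0.1, 1000.0),
    "det": exp_scaled_weight(0.5, 0.1, 100.0),
}
print(normalize(raw), target_metric(0.25, 1.0))
```

The geometric mean penalizes configurations in which one task performs well at the expense of the other more strongly than an arithmetic mean would, which is why it steers the search towards an equilibrium.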
We accelerate optimization by adopting a relaxed version of network morphisms [19] that can be understood as a soft weight transfer that reuses the weights of the last best model as initialization for the offspring networks during the training. The offspring networks then only need to be fine-tuned, which yields a fourfold speedup compared to training from scratch.
Fig. 3: Performance of meta-learning task weighting with asynchronous backpropagation method on WoodScape.
For each new configuration of hyper-parameters, we do not start from scratch, but instead train from the previously best model. In this way the number of epochs for each run can be effectively reduced (e.g., to 8 epochs for the WoodScape dataset) by doing continuous fine-tuning while simultaneously tuning the hyper-parameters. One drawback of the meta-learning approach is its increased computational cost, as many partial trainings need to be performed to find the optimal solutions. This can lead to 4-6 times longer total runtimes compared to a single training. However, the ES optimization can readily exploit multiple GPUs for speeding up. While training might take longer on particular known datasets, in the long run, for new datasets for which typical training heuristics must be adapted and tested, meta-learning approaches prove their effectiveness and utility. Subsequent experiments on new architectures and/or partially changed data can start from known parameter values to allow shorter optimization runtimes.
D. Asynchronous backpropagation with task weighting using Evolutionary Meta-learning
In order to balance the pace of optimization of the tasks, one option is to control the backpropagation frequency of the tasks [53]. In this way, a task that converges faster is updated less often than a task that takes more time to learn.
Fig. 4: Task weights & asynchronous frequency of detection task with Meta Asynchronous method on WoodScape dataset.
An implementation trick is to set the task loss weight to 0.0 for the epochs for which we want to slow down training for the fast task. We denote by $f_{det}$ the update frequency of the detection task. This frequency is optimized by the meta-learning method described in the previous section using a numeric variable in the range of 1 to 10, followed by a rounding operation to an integer. As the segmentation takes longer to converge, $f_{seg}$ is set to 1. Note that this scheduling can be coupled with data for which annotations for only one of the tasks are available, e.g., segmentation.
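The weight-zeroing trick can be sketched as a simple per-epoch schedule. The rule used here (a task is updated only on epochs that are multiples of its frequency) is an assumption about how the frequency is applied, and the function names and values are hypothetical.

```python
def to_frequency(value):
    """Round the optimized numeric variable to an integer in the range 1..10."""
    return min(10, max(1, int(round(value))))


def effective_weights(epoch, weights, frequencies):
    """Zero the loss weight of a task on epochs where it should not be updated.

    ASSUMPTION: a task with frequency f is updated once every f epochs.
    """
    return {
        task: (w if epoch % frequencies[task] == 0 else 0.0)
        for task, w in weights.items()
    }


weights = {"seg": 0.9, "det": 0.1}
frequencies = {"seg": 1, "det": to_frequency(4.6)}
for epoch in range(6):
    print(epoch, effective_weights(epoch, weights, frequencies))
```

With a frequency of 1 a task is updated at every epoch, while a larger frequency slows the corresponding task down by that factor.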
We conduct experiments on three automotive datasets. The proposed meta-learning method (IV-D) outperforms the state-of-the-art techniques [3] and [22] on all three datasets with a 3-4% margin. We describe below the datasets we considered for this study, the evaluation protocol and metrics, and the results along with some insights into the effect of our meta-learning method.
A. Datasets
The KITTI [11] dataset for object detection consists of 7481 training images, which we split into a training and a validation set. The dataset has bounding box annotations for cars, pedestrians and cyclists. For the semantic segmentation task we have used the annotations of [26], which provide 445 images. Instead of 11 semantic classes we used only road and sidewalk, and merged the other classes into void. This helps to simplify the analysis, as the semantic data is already highly imbalanced, with much less data than for object detection.
The Cityscapes dataset [8] consists of 5000 images with pixel-level annotations. We extracted bounding boxes and semantic annotations from the provided polygon annotations. As the test data is not defined for bounding box regression, we have used a 60/20/20 split of the provided 5000 images for training, validation and testing.
WoodScape [57] is an automotive fisheye dataset with annotations for multiple tasks like detection, segmentation and motion estimation. The dataset consists of 6K training, 2K validation and 2K test images. Instead of the 40 available semantic classes, we used only road, lanemarks, curb, person, two and four wheelers.
The proposed asynchronous meta-learning method will be particularly useful on unbalanced datasets like KITTI as it avoids overfitting of segmentation task on a small training set. Even for balanced datasets like Cityscapes and WoodScape, the method helps to regulate task convergence issues for detection task.
B. Implementation details
Network architecture. We have tested all the task weighting methods discussed in the previous section with a two-task network. We have designed a simple model which is suitable for low-power hardware. It consists of a shared ResNet10 encoder, i.e., a light version (10 layers) of residual networks with rapid convergence [15], a YOLO-style bounding box decoder [43] and an FCN8-style semantic segmentation decoder [33]. Our YOLO decoder, composed of two convolutional layers, is much simpler and faster to train than two-stage object detectors like Faster-RCNN. Figure 2 shows our network architecture. The encoder is pre-trained on ImageNet for all the experiments.
Training settings. For object detection we use the YOLO loss, which is a combination of squared-error losses, and for semantic segmentation the categorical cross-entropy loss. For all the experiments, we train using the ADAM optimizer with a learning rate of 0.0001 and a mini-batch size of 8. We train for 60 epochs on KITTI and Cityscapes and 50 epochs on WoodScape, until convergence of the tasks. For the meta-learning experiments, which take longer, we reduced the number of epochs to 30 on Cityscapes and to 16 or 8 on WoodScape. All the experiments run on a single GTX 1080Ti 11GB GPU, except the meta-learning ones that exploit multiple GPUs. The training pipeline uses the Tensorflow Keras framework.
Meta-learning configuration. All the methods optimize two parameters simultaneously, namely the segmentation loss weight $w_{seg}$ and the detection loss weight $w_{det}$. When meta-learning is combined with asynchronous backpropagation, it optimizes two additional parameters, namely the asynchronous frequencies for segmentation $f_{seg}$ and detection $f_{det}$.
TABLE I: Task weights and asynchronous backpropagation frequencies computed by several task-weighting methods.
TABLE II: Comparison of various task-weighting methods for two-task network training.
Fig. 5: Quantitative results on WoodScape (top) and Cityscapes (bottom) validation dataset.
The variable ranges for the two task weights are [0.1, 1000] for segmentation and [0.1, 100] for detection, as the segmentation task usually profits from a higher weight due to its longer convergence time. Table I shows the optimal values found by the optimization. The reported values are normalized to the range [0, 1]. The following optimization parameters for these experiments were determined empirically: size of the initial population: 4, number of newly generated configurations: 4, number of parents per generated configuration: 2.
C. Evaluation metrics
All the experiments are evaluated using standard metrics on the validation set: mAP (mean Average Precision) [30] is used for object detection and mIoU (mean Intersection over Union) is used for semantic segmentation. For object detection training and evaluation, small objects whose area is under 300 pixels are filtered as they are typically too far from the ego vehicle and hence unimportant. Finally, we use geometric mean G(mAP, mIoU) (Table I) as the combined metric for the two tasks to enable comparison.
D. Results
We benchmarked the nine task weighting methods on the validation set of each dataset; results are detailed in Table II. The proposed meta-learning combined with asynchronous backpropagation method outperforms the others on the three datasets. Table I shows the optimized task loss weights for the meta-learning methods (IV-C and IV-D), whose values are normalized. We added the static weights computed by the handcrafted task weighting method as a reference for comparison, as they give the order of magnitude of the scale difference between the segmentation and detection losses. The detection loss is 70 times larger than the segmentation loss on KITTI, 40 times on Cityscapes and 100 times on WoodScape. In general, all the methods weigh segmentation more than detection to compensate for this imbalance, as seen in Table I. The asynchronous version works better because it slows down segmentation training on KITTI by a factor of 7 to avoid overfitting on the small training set. Even for balanced datasets (same number of samples for both tasks) where segmentation does not show overfitting, detection usually converges faster than segmentation and might overfit. In this case, the asynchronous version slows down detection training by a factor of 5 on Cityscapes and 2 on WoodScape, as seen in Table I.
Insights into the meta-learning method. In order to understand the optimization of the proposed meta-learning approach, some insights into the results on the WoodScape dataset are discussed in the following. Figure 3a shows the target metric over tested configurations for the optimization of task weights and the asynchronous backpropagation parameter. From initially low values, a slow but steady increase is observed. The best configuration is obtained after 44 iterations.
Figure 3b shows the progression of the metrics of the two tasks during optimization. The segmentation performance is initially low and noisy and then steadily increases. The detection metric reaches its maximum early, then degrades slightly to allow a compromise in favor of the segmentation towards the end of the optimization. Figure 4a shows the progression of the task loss weights, and Figure 4b the progress of the asynchronous backpropagation parameter over time during optimization. Figure 5 contains qualitative examples on the WoodScape and Cityscapes validation datasets demonstrating improvements by the proposed method.
Multi-task learning provides promising performance in autonomous driving applications and is key in enabling efficient implementations at a system level. In this work, we take a closer look at this paradigm, which, albeit popular, has rarely been benchmarked across the same range of tasks and datasets. We thus evaluate nine different weighting strategies for finding the optimal method of training an efficient two-task model. We further propose two novel methods for learning the optimal weights during training: an adaptive one and one based on meta-learning. Our proposed method outperforms state-of-the-art approaches by 3% in compromise value. In future work, we intend to extend our benchmarking to additional tasks, e.g. on the wide range of tasks from the WoodScape dataset [57].
[1] F. Bürger and J. Pauli. Understanding the interplay of simultaneous model selection and representation optimization for classification tasks. In ICPRAM, pages 283–290, 2016.
[2] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, pages 41–48. Morgan Kaufmann, 1993.
[3] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.
[4] S. Chennupati, G. Sistu, S. Yogamani, and S. A Rawashdeh. Multinet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[5] S. Chennupati, G. Sistu, S. Yogamani, and S. Rawashdeh. Auxnet: Auxiliary tasks enhanced semantic segmentation for automated driving. In Proc of the 14th International Conference on Computer Vision Theory and Applications (VISAPP), 2019.
[6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[7] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
[8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
[10] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051–2060, 2017.
[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[12] F. Glover. Future paths for integer programming and links to artificial intelligence. Computers & operations research, 13(5):533–549, 1986.
[13] M. Guo, A. Haque, D.-A. Huang, S. Yeung, and L. Fei-Fei. Dynamic task prioritization for multitask learning. In European Conference on Computer Vision, pages 282–299. Springer, 2018.
[14] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] M. Heimberger, J. Horgan, C. Hughes, J. McDonald, and S. Yogamani. Computer vision in automated parking systems: Design, implementation and challenges. Image and Vision Computing, 68:88–101, 2017.
[18] J. Horgan, C. Hughes, J. McDonald, and S. Yogamani. Vision-based driver assistance systems: Survey, taxonomy and advances. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pages 2032–2039. IEEE, 2015.
[19] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
[20] Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. Lee. Rapid adaptation for deep neural networks through multi-task learning. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[21] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
[22] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6129–6138, 2017.
[25] I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5454–5463, July 2017.
[26] I. Krešo, D. Čaušević, J. Krapac, and S. Šegvić. Convolutional scale invariance for semantic segmentation. In German Conference on Pattern Recognition, pages 64–75. Springer, 2016.
[27] V. R. Kumar, S. Milz, C. Witt, M. Simon, K. Amende, J. Petzold, S. Yogamani, and T. Pech. Monocular fisheye camera depth estimation using sparse lidar supervision. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018.
[28] A. Lazaric and M. Ghavamzadeh. Bayesian multi-task reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 599–606, 2010.
[29] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
[30] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
[31] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
[32] S. Liu, E. Johns, and A. J. Davison. End-to-end multi-task learning with attention. arXiv preprint arXiv:1803.10704, 2018.
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[34] M. Long, Z. Cao, J. Wang, and P. S. Yu. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.
[35] A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[36] A. Mallya and S. Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[37] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani. Visual slam for automated driving: Exploring the applications of deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 247–257, 2018.
[38] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[39] D. Neven, B. D. Brabandere, S. Georgoulis, M. Proesmans, and L. V. Gool. Fast scene understanding for autonomous driving, 2017.
[40] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.
[41] I. Rechenberg. Evolutionsstrategien. In Simulationsmethoden in der Medizin und Biologie, pages 83–114. Springer, 1978.
[42] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[43] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[44] S. Ruder, J. Bingel, I. Augenstein, and A. Søgaard. Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142, 2017.
[45] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[46] O. Sener and V. Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pages 525–536, 2018.
[47] M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–8. IEEE, 2017.
[48] M. Siam, H. Mahgoub, M. Zahran, S. Yogamani, M. Jagersand, and A. El-Sallab. Modnet: Motion and appearance based moving object detection network for autonomous driving. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2859–2864. IEEE, 2018.
[49] G. Sistu, I. Leang, S. Chennupati, S. Yogamani, C. Hughes, S. Milz, and S. Rawashdeh. Neurall: Towards a unified visual perception model for automated driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 796–803. IEEE, 2019.
[50] G. Sistu, I. Leang, and S. Yogamani. Real-time joint object detection and semantic segmentation network for automated driving. arXiv preprint arXiv:1901.03912, 2019.
[51] T. Standley, A. R. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese. Which tasks should be learned together in multi-task learning? arXiv preprint arXiv:1905.07553, 2019.
[52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[53] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. Multinet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), 2018.
[54] M. Uřičář, P. Křížek, G. Sistu, and S. Yogamani. Soilingnet: Soiling detection on automotive surround-view cameras. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 67–72. IEEE, 2019.
[55] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
[56] M. Yahiaoui, H. Rashed, L. Mariotti, G. Sistu, I. Clancy, L. Yahiaoui, V. R. Kumar, and S. Yogamani. Fisheyemodnet: Moving object detection on surround-view cameras for autonomous driving. arXiv preprint arXiv:1908.11789, 2019.
[57] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea, M. Uricar, S. Milz, M. Simon, K. Amende, C. Witt, H. Rashed, S. Chennupati, S. Nayak, S. Mansoor, X. Perrotton, and P. Perez. Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[58] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
[59] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.