Surgical data science is an evolving research field that aims to "observe all that is occurring within and around the treatment process" in order to "improve the quality of interventional healthcare and its value by capturing, organizing, analyzing and modelling data" [17]. An international consortium comprising leading researchers from engineering and medicine suggested that context-aware assistance in minimally invasive surgery may be a key clinical application of surgical data science [17]. The computer vision challenges in this context include detection, segmentation and tracking of medical devices in endoscopic video data, organ classification, and surgical action/phase recognition. While extremely promising results can be obtained with state-of-the-art supervised machine learning approaches, these methods typically do not generalize well. An example is provided in Fig. 1, in which a state-of-the-art convolutional neural network (CNN) performs well when trained and tested on the data from the MICCAI instrument tracking challenge 2017, organized as part of the MICCAI endoscopic vision challenge 2017. However, mean performance drops by more than 50% when the network is applied to endoscopic video data from another site. This is an important limitation, as curation of (sufficient) training data is extremely labor-intensive and is currently hindering progress in the field. Related methods address this challenge with crowdsourcing-based approaches [15, 16], i.e., methods which outsource annotation tasks to masses of anonymous workers in an online community. In this paper, we investigate an entirely new approach that has been inspired by recent achievements in the field of self-supervised learning (see e.g. [1, 4, 22, 23, 30]) and is based on the observation that it is often the small amount of annotated medical image data, rather than the amount of raw medical data, that causes the bottleneck related to training data acquisition in surgical data science.
Our hypothesis is that masses of unlabeled video data can be used to learn a representation of the target domain that can boost the performance of state-of-the-art machine learning algorithms when used for pre-training. We investigate this hypothesis using CNN-based medical instrument segmentation as an example. Sec. 2 presents the general concept, a first prototype implementation and the study design for hypothesis validation. The results are presented in sec. 3, followed by a discussion of our findings in the context of related work in sec. 4.
Fig. 1: Limited generalization capabilities of state-of-the-art machine learning algorithms. When training and testing are performed on data from the same hospital, state-of-the-art segmentation performance is achieved with the algorithm presented in sec. 2.2. However, when the same model is applied to data from a different site, accuracy drops dramatically.
2.1 Concept overview
Our approach, which we refer to as Pre-training with Auxiliary Task (PAT), is illustrated in Fig. 2 and has the following components:
– Target task: Endoscopic vision task to be solved by the algorithm, e.g. segmentation of medical instruments from endoscopic video data
– Unlabeled data: Large number N_unlabeled of unlabeled endoscopic images I_unlabeled that are representative of the target domain (e.g. laparoscopic video data from a specific hospital for a specific application)
– Labeled data: Comparatively small number N_labeled << N_unlabeled of images I_labeled labeled according to the target task.
– Architecture for target task: A CNN-based architecture designed to solve the target task using I_labeled, e.g. a U-Net [25]
– Auxiliary task: Task designed to leverage information in the unlabeled data (e.g. image re-colorization as described below).
– Architecture for auxiliary task: A CNN-based architecture designed to solve the auxiliary task with a self-supervised learning approach using I_unlabeled.
The core of the method is the auxiliary task, which leverages the information available in unlabeled image data from the target domain to improve the generalization capabilities of CNNs. In this paper, we use an adversarial approach to train the target-task network to re-colorize grayscale images. Labeled training data is then used to refine the model for the task of interest (here: segmentation). The following section gives a concrete example of how to instantiate the concept.
2.2 Prototype implementation
We implemented the proposed concept using re-colorization as auxiliary task (sec. 2.2.1) and medical instrument segmentation, for which CNNs are currently the most widely used method [5,6,21,29], as target task (sec. 2.2.2). Important design decisions are the use of a combined reconstruction and adversarial loss for realistic re-colorization results and the choice of a U-Net, a commonly used architecture for segmentation, as target-task network.
Fig. 2: Our approach applied to the specific task of instrument segmentation in endoscopic video data. A pre-training step leverages information available in unlabeled video data from the target domain. In this study, GAN-based re-colorization of video data (i.e. mapping the L-channel to the a- and b-channels) was chosen as auxiliary task. The labeled training dataset is then used for fine-tuning the net according to the target application (i.e. segmentation).
2.2.1 Auxiliary task: re-colorization
The complete architecture of our pre-training is illustrated in Fig. 3. To train re-colorization, we first transform all images into the CIE 1976 L*a*b* color space. The axes (L, a, b) of the color space are defined by the lightness (L-channel), the color gradient from green to red (a-channel) and the color gradient from blue to yellow (b-channel). Using the L-channel as input, we train the network to predict the corresponding a- and b-channels [30]. In this context, it is worth noting that quantitative automatic assessment of image similarity is challenging due to the lack of appropriate metrics. Common metrics are often sensitive to changes which preserve the semantics of an image and are imperceptible to humans (such as slightly shifted pixels) [12]. To address this challenge, we additionally adopt a Generative Adversarial Network (GAN) approach, as described in Larsen et al. [12]. GANs consist of two competing neural networks, a generator (see par. Generator) and a discriminator (see par. Discriminator) [8]. The discriminator's role is to distinguish between real and fake images, while the generator tries to create fake images which fool the discriminator. This encourages the generator to produce better images and to approximate the local data distribution [26]. Since most of the low-level semantic information is already encoded in the L-channel, we only use the discriminator output. This is similar to conditional GAN approaches such as the pix2pix architecture [8,11].
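The color-space split used for the auxiliary task can be illustrated with a minimal sketch. It assumes sRGB input images with values in [0, 1] and the D65 reference white; the function names `rgb_to_lab` and `split_for_recolorization` are hypothetical and not taken from the original implementation.

```python
import numpy as np

# D65 reference white and linear-sRGB -> XYZ matrix (assumed input encoding).
_WHITE = np.array([0.95047, 1.0, 1.08883])
_RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                        [0.2126729, 0.7151522, 0.0721750],
                        [0.0193339, 0.1191920, 0.9503041]])


def rgb_to_lab(rgb):
    """Convert an sRGB image of shape (H, W, 3), values in [0, 1], to CIE L*a*b*."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Undo the sRGB gamma curve to obtain linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ, normalized by the reference white.
    xyz = linear @ _RGB_TO_XYZ.T / _WHITE
    # CIE nonlinearity with its linear segment for small values.
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    lightness = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])   # green (-) to red (+)
    b = 200.0 * (f[..., 1] - f[..., 2])   # blue (-) to yellow (+)
    return np.stack([lightness, a, b], axis=-1)


def split_for_recolorization(rgb):
    """Return (network input, prediction target) = (L-channel, ab-channels)."""
    lab = rgb_to_lab(rgb)
    return lab[..., :1], lab[..., 1:]
```

The generator would receive the first array as input and be trained to predict the second.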
Generator As generator G, we use a U-Net [25] which, given the lightness channel I_l as input, predicts the corresponding a- and b-channels ˆI_ab. In contrast to the originally published U-Net architecture, our blocks consist of two consecutive convolutional layers followed by a batch normalization layer. Our final output layer uses a tanh activation. We train the generator U-Net to generate realistically re-colorized images with a loss function L_G that is composed of three terms, L1, L2 and L3, which are combined using two weighting factors (eq. 1). The first loss term L1 (see eq. 2) is the commonly used least squares GAN loss [18], defined by the output of the discriminator D(ˆI) for a fake image ˆI = G(I_l) and the label Y^D_real. L1 thus drives the generator to produce images which can fool the discriminator.
Due to an unbalanced distribution of color values and to improve the correct colorization of instruments and medical equipment, we extend the loss function of the generator L_G with the term L2 (eq. 3). The distribution of ab values in endoscopic images is strongly biased towards red, yellow and black values, owing to the appearance of background such as adipose tissue and blood, and to the fact that the circular content area of most images is enclosed within a black background (see Fig. 4). Re-balancing is required to compensate for this and to prevent the re-colorized images from being dominated by the most frequent values. Inspired by Zhang et al. [30], we obtain the empirical color distribution ˜p_c for each channel c ∈ {a, b} separately, with a quantization of the color space by a grid size of 1. The loss function L2, as defined in eq. 3, ensures a high penalty for wrong values in image regions with rare values. Its weighting factor P_c, given by ˜p_c, is the relative color frequency of each value in the input image. A quantized heatmap of the color distribution ˜p_c for the a- and b-channels can be seen in Fig. 4. Learning from rare values alone would result in miscolored images (for example, a purple coloring). As an antagonist to L2, we therefore define L3, which allows the network to learn rare values while still producing a valid colorization similar to the original image.
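The three loss terms described above can be summarized in a hedged form. The following is only a sketch consistent with the textual description and not necessarily the exact formulation used in this work: the symbols \lambda_2 and \lambda_3 for the two weighting factors, their placement, the inverse-frequency weight (1 - P_c) in L_2 and the plain squared error chosen for L_3 are all assumptions. N denotes the number of pixels and i the pixel index.

```latex
\begin{align*}
L_G &= L_1 + \lambda_2\, L_2 + \lambda_3\, L_3, \\
L_1 &= \bigl(D(\hat{I}) - Y^{D}_{\mathrm{real}}\bigr)^2,
       \qquad \hat{I} = G(I_l), \\
L_2 &= \frac{1}{2N} \sum_{c \in \{a,b\}} \sum_{i=1}^{N}
       \bigl(1 - P_c(i)\bigr)\,\bigl(\hat{I}_c(i) - I_c(i)\bigr)^2, \\
L_3 &= \frac{1}{2N} \sum_{c \in \{a,b\}} \sum_{i=1}^{N}
       \bigl(\hat{I}_c(i) - I_c(i)\bigr)^2 .
\end{align*}
```

Under these assumptions, L_2 emphasizes pixels whose true color is rare, while L_3 keeps the overall colorization close to the original image.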
While the pre-training is based only on the L-channel, the target task makes additional use of the a- and b-channels. To this end, we added two zero-initialized dummy input layers for the pre-training. After the pre-training, the network had learned to ignore the two additional empty channels. During the target task training, these two layers were filled and the network learned to include them in its decision.
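One possible reading of the dummy input layers is that the network always receives three input channels, of which the a- and b-channels are filled with zeros during pre-training. The sketch below illustrates this interpretation only; the helper names are hypothetical.

```python
import numpy as np


def pretraining_input(l_channel):
    """Stack the L-channel (H, W) with two all-zero placeholder channels -> (H, W, 3)."""
    l_channel = np.asarray(l_channel, dtype=np.float32)
    empty = np.zeros_like(l_channel)
    return np.stack([l_channel, empty, empty], axis=-1)


def target_task_input(lab_image):
    """For the target task, the full L*a*b* image (H, W, 3) is passed unchanged."""
    return np.asarray(lab_image, dtype=np.float32)
```

With this layout, the first layer keeps the same shape in both training stages, so all pre-trained weights can be reused for the target task.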
Discriminator We use an untrained ResNet18 [9] as discriminator, with output D(I). For training purposes, we show the network real images I and re-colorized images ˆI. As loss function, we use the mean squared error (MSE) with the labels Y^D_fake = 0 and Y^D_real = 1 [18].
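The least squares objectives with the labels given above can be sketched as follows. This is a minimal illustration operating on arrays of discriminator outputs; averaging the real and fake terms with a factor of 0.5 is an assumption, and the function names are hypothetical.

```python
import numpy as np

Y_FAKE, Y_REAL = 0.0, 1.0  # labels as stated in the text


def discriminator_loss(d_real, d_fake):
    """MSE between discriminator outputs and their labels (real -> 1, fake -> 0)."""
    d_real = np.asarray(d_real, dtype=np.float64)
    d_fake = np.asarray(d_fake, dtype=np.float64)
    real_term = np.mean((d_real - Y_REAL) ** 2)
    fake_term = np.mean((d_fake - Y_FAKE) ** 2)
    return 0.5 * (real_term + fake_term)  # the 0.5 averaging is an assumption


def generator_adversarial_loss(d_fake):
    """Adversarial generator term: fake images should be scored as real."""
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return np.mean((d_fake - Y_REAL) ** 2)
```

A discriminator that separates both classes perfectly yields a discriminator loss of zero, whereas the generator term is zero only if all re-colorized images are scored as real.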
Fig. 3: Pre-training using self-supervised learning and a generative adversarial network (GAN) approach. First, the image is transformed into the L*a*b* color space. The lightness channel l is fed into the generator G(I_l) (U-Net), which is trained to generate the corresponding ˆa and ˆb channels. The discriminator D (ResNet18) is trained to differentiate between real images I = {l, a, b} from the target domain and fake images ˆI = {l, ˆa, ˆb} produced by the generator.
2.2.2 Target task: Instrument segmentation
For target task training, we propose two variants of our method. The first one (PAT) does not require any additional labeled data, while the second one (PAT-Ext) uses labeled data from a different but similar domain (in this instance: DS MICCAI L, see par. Validation Data).
1. PAT: After pre-training our model as described in sec. 2.2.1, we use the U-Net that was pre-trained on the re-colorization task and fine-tune it for image segmentation. For this purpose, we adopt all pre-trained layers and only randomly initialize the last layer, which outputs the final segmentation c ∈ {INSTRUMENT, BACKGROUND} for each pixel. We use the cross entropy (see eq. 6) between the output and the ground truth as loss function.
2. PAT-Ext: We use additional available labeled data from a similar domain to extend the pre-training for the PAT-Ext model. To this end we extract the trained U-Net from the previously performed re-colorization task and re-initialize the last layer, similar to PAT. Following this, we re-train our U-Net on our additional available segmentation data as described above, before we finally fine-tune it on our sparsely labeled dataset.
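The weight transfer and the segmentation loss described for both variants can be sketched as follows. The representation of a network as a dictionary of named weight arrays, the layer names, the initialization scale and the function names are hypothetical and serve illustration only.

```python
import numpy as np


def init_for_segmentation(pretrained, last_layer, n_classes=2, seed=0):
    """Adopt all pre-trained weights and randomly re-initialize only the last layer.

    `pretrained` maps layer names to arrays; the last layer is assumed to have
    shape (in_features, out_channels).
    """
    rng = np.random.default_rng(seed)
    weights = {name: w.copy() for name, w in pretrained.items()}
    in_features = pretrained[last_layer].shape[0]
    weights[last_layer] = rng.normal(0.0, 0.01, size=(in_features, n_classes))
    return weights


def pixelwise_cross_entropy(logits, target):
    """Mean cross entropy over all pixels.

    logits: (H, W, 2) class scores; target: (H, W) integer labels
    (0 = BACKGROUND, 1 = INSTRUMENT; this encoding is an assumption).
    """
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target, dtype=np.int64)
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, target[..., None], axis=-1)
    return float(-picked.mean())
```

For PAT-Ext, the same initialization would be followed by two training stages, first on the labeled data from the similar domain and then on the sparsely labeled target data.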
2.3 Experiments
Based on endoscopic video data from two different sites (par. Validation Data), we (1) assessed the performance of our method as a function of the number of labeled images (par. Effect of training data size), (2) evaluated the effect of data augmentation on our method (par. Effect of data augmentation), (3) investigated the effect of different data domains used for pre-training with our method (par. Effect of the unlabeled data domain) and (4) compared our pre-training methods to related work using labeled data (par. Comparison with other pre-training methods).
Validation Data We used the following datasets (L: labeled; UL: unlabeled) for validation purposes:
– DS COCO L: All 2,818 images of cats from the COCO dataset [14] and the corresponding segmentations. We chose cats as target class because unlike other classes of the COCO dataset, the corresponding images do not suffer from poor references or ambiguities [10] and have the target object in the foreground (similar to medical instruments in endoscopic data). Note, however, that the color distribution of DS COCO L is comparable to that of the whole COCO dataset.
– DS COCO UL: 20k natural images from the COCO dataset [14]. Comprises all 2,818 cat images and an additional 16,692 images selected randomly from the remaining 91 classes.
– DS MICCAI L: 2,400 endoscopic images with the corresponding instrument segmentations as used by the robotic instrument segmentation challenge that was part of the MICCAI endoscopic vision challenge 2017. The sets of training and testing images are disjoint and predefined by the challenge.
– DS MICCAI VID: 21 unpublished endoscopic videos, from which the images of DS MICCAI L were extracted.
– DS HD UL: 30 endoscopic videos used in the surgical workflow challenge that was part of the MICCAI endoscopic vision challenge 2017.
– DS HD L [3]: 809 annotated images from 6 surgeries with the corresponding binary instrument segmentations. The images were extracted from the DS HD UL dataset. We split our data into a training set of 413 images (three surgeries), a validation set of 119 images (one surgery) and a test set of 277 images (two surgeries). The sets of videos corresponding to testing images and training/validation images are disjoint and were randomly chosen.
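Splitting at the level of surgeries rather than images, as described for DS HD L, can be sketched as follows. The mapping from image identifiers to surgery identifiers and the function name are hypothetical; the default of three training surgeries and one validation surgery follows the text.

```python
import random


def split_by_surgery(image_to_surgery, n_train=3, n_val=1, seed=0):
    """Assign whole surgeries to train/val/test so that no surgery is shared."""
    surgeries = sorted(set(image_to_surgery.values()))
    random.Random(seed).shuffle(surgeries)  # random choice of surgeries per set
    groups = {
        "train": set(surgeries[:n_train]),
        "val": set(surgeries[n_train:n_train + n_val]),
        "test": set(surgeries[n_train + n_val:]),
    }
    return {
        name: sorted(img for img, s in image_to_surgery.items() if s in chosen)
        for name, chosen in groups.items()
    }
```

Because every image inherits the set of its surgery, frames from the same video can never appear in both the training and the test set.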
Training hyperparameters Unless otherwise stated in the method section, all networks (ResNet18, U-Net) that were trained for the experiments have the same architecture as described in their original publications [9,25]. In this manuscript, we only mention important deviations from the original implementations. Hyperparameters were optimized with 80% of the training data, using a fixed validation set of 1/5 of the training data. We performed a preliminary hyperparameter space search and kept the parameters fixed for all subsequent experiments. For the re-colorization task, we used the Adam optimizer with a batch size of 12 and a learning rate of 0.0005 for the generator and 0.002 for the discriminator. Following visual exploration of our loss function on validation data, we stopped the pre-training after 20 epochs. For our experiments, we set the two weighting factors of the generator loss such that L1, L2 and L3 are of comparable scale. For every unlabeled dataset (dataset name ending with UL, see par. Validation Data), we computed a color distribution ˜p, which was later used for the re-colorization training on the corresponding dataset. The instrument segmentation was trained with the Adam optimizer and a learning rate of 0.0005. We used a scheduler that reduced the learning rate by a factor of 0.1 after a loss plateau lasting for 10 epochs. The batch size was 6. We stopped the training routine after 150 epochs.
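The learning rate schedule used for segmentation training can be illustrated with a small stand-alone sketch. It assumes that a plateau means 10 consecutive epochs without a new best loss; which loss is monitored is not stated in the text, and the class name is hypothetical.

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.0005, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Call once per epoch with the monitored loss; returns the current learning rate."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

With the values stated above, the learning rate would drop from 0.0005 to 0.00005 after the first plateau.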
Effect of training data size To investigate the performance of our PAT method and its variant PAT-Ext (cf. sec. 2.2.2) as a function of the number of labeled training images, we used the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU) as target metrics. Based on DS HD L, we generated training datasets of size k·N, with k denoting the fraction of labeled training images used and N = 413 denoting the total number of labeled training images available. For each k, five randomly selected disjoint subsets of training and validation data were generated (if possible). For PAT-Ext, we used the labeled dataset DS MICCAI L for model refinement. Testing for all our experiments was done on the complete DS HD L test dataset.
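Both target metrics can be computed from binary masks as sketched below. The convention of returning 1.0 when prediction and reference are both empty is an assumption, and the function name is hypothetical.

```python
import numpy as np


def dice_and_iou(pred, truth):
    """Return (DSC, IoU) for two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (assumption)
    union = total - intersection
    return 2.0 * intersection / total, intersection / union
```

The two metrics are monotonically related via IoU = DSC / (2 - DSC), so they rank individual predictions identically, although averages over images can differ.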
Effect of data augmentation Data augmentation is commonly used, especially if only a small amount of training data is available [7]. To investigate whether the proposed method complements the benefits of data augmentation, we repeated the experiments described in par. Effect of training data size with the original training data complemented via data augmentation. We used mirroring, rotation (90°, 180°, 270°) and the addition of Gaussian noise (20%). Each transformation was applied randomly with a probability of 50%.
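The augmentation scheme can be sketched as follows. Several details are assumptions: mirroring is performed horizontally, the rotation angle is drawn uniformly from the three options, and the 20% noise level is interpreted as a standard deviation of 0.2 for images scaled to [0, 1]. The function name is hypothetical.

```python
import numpy as np


def augment(image, mask, rng, noise_std=0.2):
    """Randomly augment an image (H, W, C) and its mask (H, W).

    Each transformation is applied independently with a probability of 50%.
    """
    if rng.random() < 0.5:  # mirroring along the width axis
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:  # rotation by 90, 180 or 270 degrees
        k = int(rng.integers(1, 4))
        image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:  # additive Gaussian noise on the image only
        image = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```

Geometric transformations are applied to image and mask alike, whereas noise only affects the image so that the reference segmentation stays valid.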
Effect of the unlabeled data domain To examine the effect of the pre-training domain, we compared the performance of the PAT method when instantiated with different datasets, which were from the target domain (DS HD UL), a similar medical domain (DS MICCAI UL) and a non-medical domain (DS COCO UL). After pre-training, all models were fine-tuned on DS HD L with varying amounts of data and tested as described in par. Effect of data augmentation.
Comparison with other pre-training methods We implemented two commonly applied state-of-the-art pre-training methods with labeled data:
1. SOA (non-medical): We performed a segmentation pre-training with the non-medical dataset (DS COCO L) and then fine-tuned the net with data from DS HD L, as described in par. Effect of training data size.
2. SOA (medical): Similarly to SOA (non-medical), we performed pre-training with a medical dataset representing a similar domain (DS MICCAI L).
These methods were compared with our approach when using only unlabeled data for pre-training (PAT) and when using both unlabeled and labeled data for pre-training (PAT-Ext). For a description of PAT and PAT-Ext, see sec. 2.2.2.
Performance of re-colorization According to our experiments, re-colorization of images using the proposed GAN-based approach produces realistic-looking images when trained on medical data (see Fig. 4). In contrast, training on natural images as provided by the DS COCO UL dataset results in re-colored images that do not resemble the endoscopic images encountered in practice. Quantitative assessment of the method is (indirectly) provided in the following paragraphs, which investigate the effects of the pre-training method on the segmentation method.
Fig. 4: Log color distributions of the a- and b-channels in the L*a*b* color space for the corresponding datasets. The re-coloring was based on these distributions, which led to different color reconstructions, as not all colors are equally represented. Red and yellow values occur more frequently in endoscopic videos than outside of the endoscopic context (DS COCO UL).
Effect of training data size In order to investigate whether the results of our methods are statistically significantly better than those of the baseline method, we calculated the mean DSC across all 5 splits for each test image of a fraction separately and performed an arcsine transformation to obtain normally distributed data. Subsequently, we fitted linear mixed models [19] for each fraction from 1/20 to 1/2, with the training method as fixed effect and the image as random effect. The resulting p-values from comparisons of baseline, PAT and PAT-Ext were adjusted for multiple testing by Dunnett's test within fractions and by Bonferroni-Holm correction across fractions. The analysis shows that both PAT and PAT-Ext are significantly better (significance level 0.05) than the baseline within each fraction between 1/20 and 1/3 with p-values < 0.001, except for the fraction 1/20 in the PAT method with a p-value of 0.01. The p-values for fraction 1/2 are 0.89 and 0.53 for PAT-Ext and the baseline. The median difference of DSC between the compared methods for fractions 1/20 to 1/3 lies within the range [0.04, 0.06] for PAT and within [0.04, 0.13] for PAT-Ext. For fraction 1/2, differences in performance are negligible. The performance of our method compared to the baseline method (no pre-training) is shown in Fig. 5a. In all experiments, our pre-training method based solely on unlabeled data (PAT) clearly boosted the performance of the segmentation method. When using 1/16th of the training images (i.e. 25 images), we obtained a higher median performance than the baseline method trained with 1/8 of the data. This corresponds to a decrease in the manual annotation effort of more than 60%. Analogously, we reduced the labeling effort by more than 50% and by around 25% when training on 1/8th and 1/4th of the training data, respectively. Even better results can be obtained when combining our pre-training on unlabeled data with pre-training using labeled data for the target task from a similar domain (PAT-Ext). Tab. 1 provides descriptive statistics for both target metrics (DSC and IoU) when using 1/16th and other small fractions of the training set images.
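Two steps of the statistical analysis can be illustrated with a short sketch. It assumes that the arcsine transformation takes its common variance-stabilizing form arcsin(sqrt(x)) for proportions in [0, 1]; the exact variant used is not stated in the text. The Bonferroni-Holm adjustment follows the standard step-down procedure. The mixed-model fit and Dunnett's test are not covered, and the function names are hypothetical.

```python
import math


def arcsine_transform(dsc):
    """Variance-stabilizing transform for a proportion in [0, 1] (assumed form)."""
    return math.asin(math.sqrt(dsc))


def holm_adjust(p_values):
    """Bonferroni-Holm adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

An adjusted p-value below the chosen significance level indicates a significant difference after accounting for the number of fractions tested.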
Effect of data augmentation As shown in Fig. 5b, data augmentation leads to a similar increase in performance for both the baseline method and the proposed methods. However, combining our pre-training on unlabeled data with pre-training using labeled data for the target task from a similar domain (PAT-Ext) yielded better results than data augmentation alone, especially for fractions of less than 1/6 of the training data. The best performance was achieved by extending the training of PAT-Ext with data augmentation during the fine-tuning step (PAT-Ext augmented).
Effect of pre-training domain Fig. 6 shows the performance of our method for different domains used in the pre-training process. Regardless of the number of labeled training images used for model fine-tuning, the best results are achieved when the pre-training is performed on the target domain. Using non-medical data decreases accuracy but still provides better performance than the baseline (no pre-training).
Fig. 5: Median Dice Similarity Coefficient (DSC) and interquartile range (IQR) as a function of training data size, as described in par. Effect of training data size. Our method clearly outperforms the baseline method without pre-training and is even better than data augmentation for very small numbers of available training images.
Fig. 6: Effect of pre-training domain. For small training datasets, medical images yield better results than non-medical images.
Comparison and combination with other pre-training methods We extended the comparison of the performance of our method to state-of-the-art pre-training methods that rely on labeled data. Our method outperformed the state-of-the-art approach of applying pre-trained nets from a non-medical domain (SOA (non-medical)) that have been trained with labeled images for the target task. However, training on labeled images of a similar domain (SOA (medical)) generally yielded better results than pre-training exclusively on unlabeled data (PAT). The best results were achieved when combining state-of-the-art pre-training on medical data with our approach (PAT-Ext). Detailed results are listed in Tab. 1. The more training images are used for fine-tuning, the more similar the results of all methods become.
Table 1: Descriptive statistics for the target metrics DSC and IoU when using 1/4th, 1/8th and 1/16th of the training set images (i.e. 102, 50 and 25 labeled images for final model refinement). Two state-of-the-art (SOA) methods (see sec. 2.3) are compared to three variants of our method (see sec. 2.2.2). The mean values are shown along with the improvement in % compared to the baseline method (no pre-training).
While approaches to semi-supervised learning, which typically handle unlabeled and labeled data from the same data distribution simultaneously, are increasingly common in the field of Medical Image Computing (MIC) [2], we are, to our knowledge, the first to investigate the concept of self-supervised learning to reduce manual labeling effort in medical image segmentation. In contrast to state-of-the-art pre-training methods [24,27,28,32], we initialized our model on the target domain using only unlabeled data rather than on a different domain with labeled data. This is achieved with an auxiliary task that can be assumed to learn a representation of the target domain that is well-suited for the target task. According to the experiments in this study, our approach is suitable for leveraging information in unlabeled endoscopic video data to reduce the amount of labeled training data required. Our method not only outperformed the baseline method without pre-training by a large margin, but also yielded better results than the state-of-the-art pre-training method requiring labeled data. This is particularly apparent for small sets of labeled data. The related literature on pre-training with self-supervised learning is very recent (with some of it being produced in parallel to our work) and originates mainly from the computer vision community. Analysis of various auxiliary tasks (inpainting [22], re-colorization [13,30,31], classification [20,30], re-ordering [20] and prediction [1]) suggests that re-colorization is the most promising approach for a number of applications. The closest work to ours was recently authored by Bodenstedt et al. [4], who introduced an auxiliary task that estimates the order of appearance of two video frames in order to pre-train a CNN for surgical phase recognition. To our knowledge, however, using re-colorization as auxiliary task has not yet been investigated in the field of medical image analysis. Our results suggest that the proposed method complements the benefits gained from data augmentation.
If we compare the improvement in performance resulting from the two complementary methods, it can be concluded that our method is particularly well-suited to situations where only little training data (up to 1/6 of the training data) is available. In these cases, the benefits of PAT-Ext pre-training are greater than those of data augmentation. It should be pointed out that the goal of this work was not to optimize the performance of an algorithm for a specific application. Instead, our aim was to explore ways to make optimal use of available data sources. An additional increase in accuracy could, for example, be gained by optimizing the weights in our loss function. However, an interesting side-effect is that our method achieves state-of-the-art performance on the most recent MICCAI endoscopic vision dataset for instrument segmentation (Fig. 1), even without data augmentation. In contrast, absolute performance is much worse on our own dataset. We attribute this to the comparatively low variability of the MICCAI images as well as the challenging nature of our own images. The auxiliary task chosen in this paper (GAN-based re-colorization) appears to be a very good match for the target task of medical instrument segmentation, as suggested by the experimental results. We are currently planning to test our method on further target tasks. Future work should focus on finding optimal auxiliary tasks for a given target application.
In conclusion, we have developed a pre-training approach that makes optimal use of all the available data sources: public and non-public, labeled and unlabeled. As it can potentially be applied to a wide range of target tasks, the potential impact on the research community and possible clinical applications is high.
Acknowledgements We acknowledge the support of the European Research Council (ERC-2015-StG-37960). This work was supported by Intuitive Surgical, who provided us with the raw video data from which the MICCAI 2017 robotic challenge data were extracted. We further acknowledge the support of the Federal Ministry of Economics and Energy (BMWi) and the German Aerospace Center (DLR) within the OP 4.1 project. Finally, we would like to thank Simon Kohl for inspiring us to write this paper. Conflict of Interest: The authors declare that they have no conflict of interest. Ethical approval: For this type of study formal consent is not required. Informed consent: This article contains patient data from publicly available datasets.
1. Agrawal, P., Carreira, J., Malik, J.: Learning to See by Moving. In: Proceedings of the IEEE Internat. Conference on Computer Vision, pp. 37–45 (2015)
2. Baur, C., Albarqouni, S., Navab, N.: Semi-supervised deep learning for fully convolutional networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 311–319. Springer (2017)
3. Bittel, S., Roethlingshoefer, V., Kenngott, H., Wagner, M., Bodenstedt, S., Ross, T., Speidel, S., Maier-Hein, L.: How to Create the Largest In-Vivo Endoscopic Dataset
4. Bodenstedt, S., Wagner, M., Katić, D., Mietkowski, P., Mayer, B., Kenngott, H., Müller-Stich, B., Dillmann, R., Speidel, S.: Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. arXiv:1702.03684 [cs] (2017). ArXiv: 1702.03684
5. Garcia-Peraza-Herrera, L.C., Li, W., Fidon, L., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Poorten, E.V., Stoyanov, D., Vercauteren, T., Ourselin, S.: ToolNet: Holistically-Nested Real-Time Segmentation of Robotic Surgical Tools. arXiv:1706.08126 [cs] (2017). ArXiv: 1706.08126
6. García-Peraza-Herrera, L.C., Li, W., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Poorten, E.V., Stoyanov, D., Vercauteren, T., Ourselin, S.: Real-Time Segmentation of Non-rigid Surgical Tools Based on Deep Learning and Tracking. In: Computer-Assisted and Robotic Endoscopy, Lecture Notes in Computer Science, pp. 84–95. Springer, Cham (2016). DOI 10.1007/978-3-319-54057-3_8
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016). http://www.deeplearningbook.org
8. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)
10. Heim, E., Seitel, A., Andrulis, J., Isensee, F., Stock, C., Ross, T., Maier-Hein, L.: Clickstream analysis for crowd-based object segmentation with confidence. arXiv:1611.08527 [cs] (2016). ArXiv: 1611.08527
11. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004 (2016)
12. Larsen, A.B.L., Sønderby, S.K., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. In: International Conference on Machine Learning, pp. 1558–1566 (2016)
13. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. arXiv preprint arXiv:1703.04044 (2017)
14. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer (2014)
15. Maier-Hein, L., Mersmann, S., Kondermann, D., Bodenstedt, S., Sanchez, A., Stock, C., Kenngott, H.G., Eisenmann, M., Speidel, S.: Can Masses of Non-Experts Train Highly Accurate Image Classifiers? In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014, Lecture Notes in Computer Science, pp. 438–445. Springer, Cham (2014). DOI 10.1007/978-3-319-10470-6_55
16. Maier-Hein, L., Ross, T., Gröhl, J., Glocker, B., Bodenstedt, S., Stock, C., Heim, E., Götz, M., Wirkert, S., Kenngott, H., Speidel, S., Maier-Hein, K.: Crowd-Algorithm Collaboration for Large-Scale Endoscopic Image Annotation with Confidence. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, Lecture Notes in Computer Science, pp. 616–623. Springer, Cham (2016). DOI 10.1007/978-3-319-46723-8_71
17. Maier-Hein, L., Vedula, S.S., Speidel, S., Navab, N., Kikinis, R., Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., Hashizume, M., Katic, D., Kenngott, H., Kranzfelder, M., Malpani, A., März, K., Neumuth, T., Padoy, N., Pugh, C., Schoch, N., Stoyanov, D., Taylor, R., Wagner, M., Hager, G.D., Jannin, P.: Surgical data science for next-generation interventions. Nature Biomedical Engineering 1(9), 691 (2017). DOI 10.1038/s41551-017-0132-7
18. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: 2017 IEEE Internat. Conference on Computer Vision (ICCV), pp. 2813–2821. IEEE (2017)
19. McCulloch, C.E., Neuhaus, J.M.: Generalized linear mixed models. Wiley Online Library (2001)
20. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision, pp. 69–84. Springer (2016)
21. Pakhomov, D., Premachandran, V., Allan, M., Azizian, M., Navab, N.: Deep Residual Learning for Instrument Segmentation in Robotic Surgery. arXiv:1703.08580 [cs] (2017). ArXiv: 1703.08580
22. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
23. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y.: Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the 24th internat. conference on Machine learning, pp. 759–766. ACM (2007)
24. Ravishankar, H., Sudhakar, P., Venkataramani, R., Thiruvenkadam, S., Annangi, P., Babu, N., Vaidya, V.: Understanding the Mechanisms of Deep Transfer Learning for Medical Images. In: Deep Learning and Data Labeling for Medical Applications, Lecture Notes in Computer Science, pp. 188–196. Springer, Cham (2016). DOI 10.1007/978-3-319-46976-8_20
25. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
26. Sønderby, C.K., Caballero, J., Theis, L., Shi, W., Huszár, F.: Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490 (2016)
27. Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, M.B., Liang, J.: Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Transactions on Medical Imaging 35(5), 1299–1312 (2016). DOI 10.1109/TMI.2016.2535302
28. Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, M.B., Liang, J.: On the Necessity of Fine-Tuned Convolutional Neural Networks for Medical Imaging. In: Deep Learning and Convolutional Neural Networks for Medical Image Computing, Advances in Computer Vision and Pattern Recognition, pp. 181–193. Springer, Cham (2017). DOI 10.1007/978-3-319-42999-1_11
29. Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., Mathelin, M.d., Padoy, N.: EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. IEEE Transactions on Medical Imaging 36(1), 86–97 (2017). DOI 10.1109/TMI.2016.2593957
30. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision, pp. 649–666. Springer (2016)
31. Zhang, R., Isola, P., Efros, A.A.: Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 645–654 (2017). DOI 10.1109/CVPR.2017.76
32. Zhou, Z., Shin, J., Zhang, L., Gurudu, S., Gotway, M., Liang, J.: Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. In: IEEE conference on computer vision and pattern recognition, Hawaii, pp. 7340–7349 (2017)