There are growing demands for tumor identification in pathology for time-consuming tasks such as measuring tumor burden, grading tissue samples, determining cellularity and many others. Recognizing tumor in histology images continues to be a challenging problem due to complex textural patterns and appearances in the tissue bed. With the addition of tumor, subtle changes which occur in the underlying morphology are difficult to distinguish from healthy structures and require expertise from a trained pathologist to interpret. An accurate automated solution for recognizing tumor in vastly heterogeneous pathology datasets would be of great benefit, enabling high-throughput experimentation, greater standardization and easing the burden of manual assessment of digital slides.
Deep convolutional neural networks (CNNs) are now a widely adopted architecture in machine learning. Indeed, CNNs have been adopted for tumor classification in applications such as analysis of whole slide images (WSI) of breast tissue using AlexNet [9] and voxel-level analysis for segmenting tumor in CT scans [16]. Such applications of CNNs continue to grow, and the traditional architecture of a CNN has also evolved since its origination in 1998 [5]. A basic CNN architecture encompasses a combination of convolution and pooling operations. As we traverse deeper into the network, the network size decreases, resulting in a series of outputs, whether classification scores or regression outcomes. In the final layers of a typical CNN, fully-connected (FC) layers are required to learn non-linear combinations of learned features. However, the transition from a series of two-dimensional convolutional layers to a one-dimensional FC layer is abrupt, making the network susceptible to overfitting [7]. In this paper we propose a method for transitioning between convolutional layers and FC layers by introducing a framework which encourages generalization. Unlike other regularizers [11,4], our method aggregates high-dimensional data from feature maps produced in convolutional layers in an efficient manner before flattening is performed.
Fig. 1. Digital slide shown at multiple resolutions. Regions-of-interest outlined in red are shown at greater resolutions from left to right.
To ease the dimensionality reduction between convolutional and FC layers we propose a transition module, inspired by the inception module [14]. Our method encompasses convolution layers of varying filter sizes, capturing learned feature properties at multiple scales, before collapsing them to a series of average pooling layers. We show that this configuration gives considerable performance gains for CNNs in a tumor classification problem in scanned images of breast cancer tissue. We also evaluate the performance of the transition module compared to other commonly used regularizers (section 5.1).
In the histopathology literature, CNN architectures tend to follow a linear trend with a series of layers sequentially arranged from an input layer to a softmax output layer [9,8]. Recently, however, there have been adaptations to this structure to encourage multiscale information at various stages in the CNN design. For example, Xu et al. [17] proposed a multichannel side supervision CNN which merges edges and contours of glands at each convolution layer for gland segmentation. In cell nuclei classification, Buyssens et al. [2] learn multiple CNNs in parallel with input images at various resolutions before aggregating classification scores. These methods have been shown to be particularly advantageous in histology images as they mimic pathologists’ interpretation of digital slides when viewed at multiple objectives (Fig. 1).
However, capturing multiscale information in a single layer is a recent advancement that followed the introduction of inception modules (Section 3.1). Since then, there have been some adaptations, but these have been limited to convolution layers in a CNN. Liao and Carneiro [6] proposed a module which combines multiple convolution layers via a max-out operation as opposed to concatenation. Jin et al. [3] designed a CNN in which independent FC layers are learned from the outputs of inception layers created at various levels of the network structure. In this paper, we focus on improving the network structure when changes in dimensionality occur between convolution and FC layers. Such changes occur when the network has already undergone substantial reduction and approaches the final output layer; generalization is therefore key for optimal class separation.
3.1 Inception Module
Inception modules, originally proposed by Szegedy et al. [14], are a method of gradually increasing feature map dimensionality, thus increasing the depth of a CNN without adding extreme computational costs. In particular, the inception module enables filters to be learned at multiple scales simultaneously in a single layer, also known as a sub-network. Since its origination, there have been multiple inception networks incorporating various types of inception modules [15,13], including GoogLeNet. The base representation of the original inception module is shown in Fig. 2 (left).
Each inception module is a parallel series of convolutional layers restricted to filter sizes 1x1, 3x3 and 5x5. By encompassing various convolution sub-layers in a single deep layer, features can be explored at multiple scales simultaneously. During training, the combination of filter sizes which results in optimal performance is weighted accordingly. However, on its own, this configuration results in a very large network with increased complexity. Therefore, for practical purposes, the inception module also encompasses 1x1 convolutions which act as dimensionality reduction mechanisms. The Inception network is defined as a stack of inception modules with occasional max-pooling operations. The original implementation of the Inception network encompasses nine inception modules [14].
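To make the dimensionality-reduction role of the 1x1 convolutions concrete, the following minimal sketch counts the weights of a 5x5 convolution applied directly and applied after a 1x1 bottleneck. The channel counts are hypothetical values chosen only for illustration; they are not taken from [14] or from the networks used in this paper.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights (biases ignored) in a square 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size


# Hypothetical channel counts, for illustration only.
in_channels = 256
out_channels = 64
reduced_channels = 32  # channels kept by the 1x1 bottleneck

# 5x5 convolution applied directly to the input.
direct = conv_weights(in_channels, out_channels, 5)

# 1x1 reduction first, then the 5x5 convolution on the reduced input.
bottleneck = (conv_weights(in_channels, reduced_channels, 1)
              + conv_weights(reduced_channels, out_channels, 5))

print("direct 5x5 weights:    ", direct)
print("with 1x1 bottleneck:   ", bottleneck)
```

Under these assumed sizes the bottleneck path needs far fewer weights than the direct path, which is the practical motivation for the 1x1 layers described above.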
3.2 Transition Module
In this paper, we propose a modified inception module, called the “transition” module, explicitly designed for the final stages of a CNN, in which learned features are mapped to FC layers. Whilst the inception module successfully captures multiscale information from input data, the bridge between learned feature maps and classification scores is still treated as a black box. To ease this transition process we propose a method for enabling 2D feature maps to be downscaled substantially before tuning FC layers. In the transition module, instead of concatenating outcomes from each filter size, as in [14], independent global average pooling layers are configured after the convolution layers, which enable feature maps to be compressed via an average operation.
Fig. 2. Original inception module (left) and the proposed transition module (right).
Originally proposed by Lin et al. [7], global average pooling layers were introduced as a method of enforcing correspondences between categories of the classification task (i.e. the softmax output layer) and filter maps. As the name suggests, in a global average pooling layer a single output is retrieved from each filter map in the preceding convolutional layer by averaging it. For example, given an input of 256 3x3 feature maps, a global average pooling layer would form an output of size 256. In the transition module, we use global average pooling to average out spatial information at multiple scales before collapsing each averaged filter to independent 1D output layers. This approach has the advantage of introducing generalizability and encourages a more gradual decrease in network size earlier in the network structure. As such, subsequent FC layers in the network are also smaller in size, making the task of delineating classification categories much easier. Furthermore, the pooling operation itself introduces no additional parameters to tune.
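The 256-map example above can be sketched in a few lines of plain Python. This is only an illustration of the averaging operation on toy values (each map is filled with a constant), not the implementation used in our experiments.

```python
def global_average_pool(feature_maps):
    """Collapse each 2D feature map to the mean of its entries."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy input: 256 feature maps of size 3x3; map k is filled with the value k.
feature_maps = [[[float(k)] * 3 for _ in range(3)] for k in range(256)]

pooled = global_average_pool(feature_maps)
print(len(pooled))  # one output per feature map
```

The output length depends only on the number of feature maps, not on their spatial size, which is why the layer removes spatial information without adding weights.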
The structure of the transition module is shown in Fig. 2 (right). Convolution layers were batch normalized as proposed in Inception-v2 for further regularization.
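As a rough illustration of how the module shrinks the input to the first FC layer, the sketch below tracks tensor shapes only. It uses the 3x3, 5x5 and 7x7 branches with stride 2 described in the experimental setup, but several things are assumptions made for the example: the input feature map size (256 channels of 13x13), the absence of padding, and the joining of the three pooled vectors by concatenation. The helper names are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride, pad=0):
    """Spatial size of a convolution output (square input, square kernel)."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


def transition_module_shapes(in_size, num_filters,
                             kernel_sizes=(3, 5, 7), stride=2):
    """Shapes produced by each branch: convolution, then global average pooling."""
    branches = []
    for k in kernel_sizes:
        out = conv_output_size(in_size, k, stride)
        branches.append({
            "kernel": k,
            "conv_shape": (num_filters, out, out),
            # Global average pooling keeps one value per filter.
            "pooled_length": num_filters,
        })
    return branches


# Assumed input to the module: 256 feature maps of 13x13 (illustrative only).
in_channels, in_size = 256, 13
branches = transition_module_shapes(in_size, num_filters=1024)

# Length of the vector fed to the FC layer with plain flattening...
flattened_without_module = in_channels * in_size * in_size
# ...and with the module, assuming the pooled branches are concatenated.
pooled_total = sum(b["pooled_length"] for b in branches)

for b in branches:
    print(b["kernel"], b["conv_shape"], b["pooled_length"])
print(flattened_without_module, pooled_total)
```

Under these assumptions the FC layer receives a much shorter vector than it would after direct flattening, and its length is fixed by the number of filters rather than by the spatial size of the feature maps.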
We evaluated the performance of the proposed transition module using a dataset of 1229 image patches extracted and labelled from breast WSIs scanned at x20 magnification by a Scanscope XT (Aperio technologies, Leica Biosystems) scanner. Each RGB patch of size 512x512 was hand selected from 31 WSIs, each one from a single patient, by a trained pathologist. Biopsies were taken from patients with invasive breast cancer who subsequently received neo-adjuvant therapy; post-neoadjuvant tissue sections revealed invasive carcinoma and/or ductal carcinoma in situ. 5-fold cross validation was used to evaluate performance on this dataset. Each image patch was confirmed to contain either tumor or healthy tissue by an expert pathologist. “Healthy” refers to patches which are absent of cancer cells but may contain healthy epithelial cells amongst other tissue structures such as stroma, fat etc. Results are reported over 100 epochs.
We also validated our method on a public dataset, BreaKHis [10] (section 5.3), which contains scanned images of benign (adenosis, fibroadenoma, phyllodes tumor, tubular adenoma) and malignant (ductal carcinoma, lobular carcinoma, mucinous carcinoma, papillary carcinoma) breast tumors at x40 objective. Images were resampled into patches of dimensions 228x228, suitable for a CNN, resulting in 11,800 image patches in total. BreaKHis was validated using 2-fold cross validation and across 30 epochs.
In section 5.2, results are reported for three different CNN architectures (AlexNet, ZFNet, Inception-v3), with transition modules introduced into AlexNet and ZFNet. All CNNs were trained from scratch with no pretraining. Transition modules in both implementations encompassed 3x3, 5x5 and 7x7 convolutions, thus producing three average pooling layers. Each convolutional layer has a stride of 2, and 1024 and 2048 filter units for AlexNet and ZFNet, respectively. Note that the number of filter units was adapted according to the size of the first FC layer following the transition module.
CNNs were implemented using Lasagne 0.2 [1]. A softmax function was used to obtain classification predictions and convolutional layers encompassed ReLU activations. 10 training instances were used in each batch in both datasets. We used Nesterov Momentum [12] to perform updates with a learning rate of 1.
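For readers unfamiliar with the update rule, the sketch below shows one common formulation of Nesterov momentum applied to a single scalar parameter on a toy quadratic loss. The learning rate and momentum values are assumptions chosen so that the toy problem converges; they are not the settings used in our experiments, and the function names are hypothetical rather than part of any library.

```python
def nesterov_step(param, velocity, grad, learning_rate, momentum):
    """One Nesterov momentum update for a scalar parameter."""
    # Update the velocity with the current gradient...
    velocity = momentum * velocity - learning_rate * grad
    # ...then move along the look-ahead direction.
    param = param + momentum * velocity - learning_rate * grad
    return param, velocity


def minimise_quadratic(steps, learning_rate=0.1, momentum=0.9):
    """Minimise f(w) = 0.5 * w**2 starting from w = 1 (toy problem)."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        grad = w  # derivative of 0.5 * w**2
        w, v = nesterov_step(w, v, grad, learning_rate, momentum)
    return w


print(minimise_quadratic(200))  # approaches the minimum at w = 0
```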
5.1 Experiment 1: Comparison with Regularizers
Our first experiment evaluated the performance of the transition module when compared to other commonly used regularizers including Dropout [11] and cross-channel local response normalization [4]. We evaluated the performance of each regularizer in AlexNet and report results for a) a single transition module added before the first FC layer, b) two Dropout layers, one added after each FC layer with p = 0.5, and lastly c) normalization layers added after each max-pooling operation, similar to how it was utilized in [4].
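As a reminder of what the Dropout baseline does to FC activations, the sketch below applies dropout with p = 0.5 to a list of values. It uses the "inverted" variant, which rescales the kept units during training so that nothing changes at test time; this is a common convention assumed here for simplicity and differs in bookkeeping from the test-time scaling described in [11]. The function is a hypothetical helper, not a library call.

```python
import random


def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p
    # Kept units are divided by the keep probability so the expected
    # activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


train_out = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
test_out = dropout([1.0, 2.0], p=0.5, training=False)
print(sum(1 for v in train_out if v != 0.0), "of 1000 units kept")
```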
The transition module achieved an overall accuracy rate of 91.5% which when compared to Dropout (86.8%) and local normalization (88.5%) showed considerable improvement, suggesting the transition module makes an effective regularizer compared to existing methods. When local response normalization was used in combination with the transition module in ZFNet (below), we achieved a slightly higher test accuracy of 91.9%.
Fig. 3. ROC curves for AlexNet [4] and ZFNet [18], with and without the proposed transition module, and Inception-v3 [15]. ROC curves are also shown for the transition module with and without average pooling. (Horizontal axis: 1 - Specificity.)
5.2 Experiment 2: Comparing Architectures
Next we evaluated the performance of the transition module in two different CNN architectures: AlexNet and ZFNet. We also report the performance of Inception-v3 which already has built-in regularizers in the form of 1x1 convolutions [15], for comparative purposes. ROC curves are shown in Fig. 3.
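For reference, the sketch below shows one standard way an ROC curve and its area (AUC) can be computed from classifier scores, with the false positive rate (1 - specificity) on the horizontal axis. The labels and scores are made-up toy values, not results from our experiments, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 = tumor, 0 = healthy; scores are predicted tumor probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```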
Both AlexNet and ZFNet benefited from the addition of a single transition module, improving test accuracy rates by an average of 4.3%, particularly at lower false positive rates. Smaller CNN architectures proved to be better for tumor classification in this case as overfitting was avoided, as shown by the comparison with Inception-v3. Surprisingly, the use of dimensionality reduction earlier in the architectural design did not prove to be effective for increasing classification accuracy. We also found that incorporating global average pooling in the transition module gave a further 3.1% improvement in overall test accuracy.
5.3 Experiment 3: BreaKHis
We used the same AlexNet architecture as above to evaluate the transition module on BreaKHis. ROC curves are shown in Fig. 4. There was a noticeable improvement (an AUC increase of 0.06) when the transition module was incorporated, suggesting that even when considerably more training data is available a smoother network reduction can be beneficial.
Fig. 4. ROC curves for BreaKHis [10] dataset with and without the proposed transition module.
The transition module achieved an overall test accuracy of 82.7%, which is close to the 81.6% achieved with an SVM in [10]; however, the two results were not obtained under the same conditions and the comparison should be interpreted with caution.
In this paper we propose a novel regularization technique for CNNs called the transition module, which captures filters at multiple scales and then collapses them via average pooling in order to ease network size reduction from convolutional layers to FC layers. We showed that in two CNNs (AlexNet, ZFNet) this design proved to be beneficial for distinguishing tumor from healthy tissue in digital slides. We also showed an improvement in a larger publicly available dataset, BreaKHis.
This work has been supported by grants from the Canadian Breast Cancer Foundation, Canadian Cancer Society (grant 703006) and the National Cancer Institute of the National Institutes of Health (grant number U24CA199374-01).
1. Lasagne. http://lasagne.readthedocs.io/en/latest/, Accessed: 2017-02-18
2. Buyssens, P., Elmoataz, A., Lézoray, O.: Multiscale convolutional neural networks for vision-based classification of cells, pp. 342–352. Springer Berlin Heidelberg (2013)
3. Jin, X., Chi, J., Peng, S., Tian, Y., Ye, C., Li, X.: Deep image aesthetics classification using inception modules and fine-tuning connected layer. In: 8th International Conference on Wireless Communications and Signal Processing (WCSP) (2016)
4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25. pp. 1097–1105 (2012)
5. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
6. Liao, Z., Carneiro, G.: Competitive multi-scale convolution. CoRR abs/1511.05635 (2015), http://arxiv.org/abs/1511.05635
7. Lin, M., Chen, Q., Yan, S.: Network in network. In: Proc. ICLR (2014)
8. Litjens, G., Sánchez, C.I., Timofeeva, N., Hermsen, M., Nagtegaal, I., Kovacs, I., Hulsbergen-van de Kaa, C., Bult, P., van Ginneken, B., van der Laak, J.: Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific Reports 6 (2016)
9. Spanhol, F., Oliveira, L.S., Petitjean, C., Heutte, L.: Breast cancer histopathological image classification using convolutional neural network. In: International Joint Conference on Neural Networks. pp. 2560–2567
10. Spanhol, F.A., Oliveira, L.S., Petitjean, C., Heutte, L.: A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering 63, 1455–1462 (2016)
11. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research pp. 1929–1958 (2014)
12. Sutskever, I.: Training recurrent neural networks (2013), Thesis
13. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016)
14. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition (CVPR) (2015)
15. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2818–2826 (2016)
16. Vivanti, R., Ephrat, A., Joskowicz, L., Karaaslan, O.A., Lev-Cohain, N., Sosna, J.: Automatic liver tumor segmentation in follow up CT studies using convolutional neural networks. In: Proc. Patch-Based Methods in Medical Image Processing Workshop, MICCAI
17. Xu, Y., Li, Y., Wang, Y., Liu, M., Fan, Y., Lai, M., Chang, E.I.C.: Gland instance segmentation using deep multichannel neural networks. CoRR abs/1611.06661 (2016)
18. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision and Pattern Recognition (CVPR) (2013)