This work was carried out for a workshop challenge, the Retinal Fundus Glaucoma Challenge (REFUGE). The goal of the challenge is to evaluate and compare automated algorithms for glaucoma detection and optic disc/cup segmentation on a common dataset of retinal fundus images. We proposed two solutions that achieved top performance on both the segmentation and classification tasks. The solutions could be extended into either a novel methodology or an application. Details can be found at https://refuge.grand-challenge.org/Home/ or in the challenge paper [7], published in Medical Image Analysis.
2.1 Segmentation
We employ a U-Net-like architecture [8] to learn pixel-level features. We modify the U-Net to take multiple inputs (three in our case) so that the network receives more of the original raw pixel information during training. This strategy can reduce the risk of overfitting and enhance the network’s learning capability. We refer to this architecture as X-Unet. Moreover, we embed squeeze-and-excitation blocks [6] into X-Unet to weight the channels of the features produced by different convolutional layers. This mechanism uses global information to let the network selectively amplify valuable channel-wise features and suppress less useful ones. In addition, we use deconvolution in the decoder to refine its decoding capability by fusing the encoded features at each level with the decoded features at the corresponding level. Fig. 1 shows the X-Unet architecture.
Fig. 1. X-Unet’s architecture includes squeeze-and-excitation blocks.
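The channel reweighting performed by a squeeze-and-excitation block can be illustrated with a minimal NumPy sketch. It is not the X-Unet implementation: the weights `w1` and `w2`, the reduction ratio of 4, and the omission of bias terms are assumptions made for illustration only.

```python
import numpy as np

def squeeze_excite(features, w1, w2):
    """Reweight the channels of a (H, W, C) feature map, SE-style.

    w1 with shape (C, C // r) and w2 with shape (C // r, C) stand in for
    learned weights; they are hypothetical here and biases are omitted.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = features.mean(axis=(0, 1))                 # shape (C,)
    # Excitation: bottleneck layer with ReLU, then sigmoid gates.
    hidden = np.maximum(z @ w1, 0.0)               # shape (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # shape (C,)
    # Scale: broadcast the per-channel gates over the spatial dimensions.
    return features * gates, gates

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))
w1 = rng.normal(size=(16, 4))   # reduction ratio r = 4 (assumed)
w2 = rng.normal(size=(4, 16))
out, gates = squeeze_excite(feats, w1, w2)
print(out.shape, gates.round(2))
```

Channels whose gate is close to 1 pass through almost unchanged, while channels whose gate is close to 0 are suppressed.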
Regression We treat the segmentation task as an image regression problem rather than a pixel classification problem. Pixel classification in deep learning usually transforms low-level pixel information into high-level features. For the binary disc and cup segmentation tasks, however, low-level pixel-wise features are more important. In contrast to learning to classify pixels, mapping a retinal image directly to its corresponding label preserves more low-level pixel-wise features.
Loss function Because the training images are largely similar at the pixel level, we adopt the mean absolute error (MAE) as our loss function to measure the pixel-wise difference between label and prediction:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
where $n$ is the number of pixels, $\hat{y}_i$ is the predicted value of pixel $i$, and $y_i$ is its actual value.
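As a small worked example of the MAE loss, the following NumPy sketch averages the absolute pixel-wise differences between a label map and a predicted map. The 2 × 2 arrays are made-up values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mean_absolute_error(label, prediction):
    """Average absolute difference over all pixels."""
    label = np.asarray(label, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    return np.abs(label - prediction).mean()

# Illustrative values: a binary label and a real-valued regression output.
label = np.array([[0.0, 1.0], [1.0, 1.0]])
pred = np.array([[0.1, 0.8], [1.0, 0.5]])

# (0.1 + 0.2 + 0.0 + 0.5) / 4
loss = mean_absolute_error(label, pred)
print(loss)
```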
Fig. 2. Segmentation treated as an image regression problem rather than a pixel classification problem.
2.2 Classification
The optic cup/disc region and its surroundings contain the key pixel-wise features used to distinguish glaucoma, such as the vertical disc diameter, the oval shape of the disc/cup, the ISNT rule [4], and the yellow-orange rim. For this task, pixel-wise features at various scales, including pixel color and location, are more important than the high-level features that very deep convolutional neural networks (e.g., ResNet [5]) learn. Atrous (dilated) convolution [9] is a key method for extracting features at different scales while preserving location information.
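To illustrate how atrous convolution widens the receptive field without losing positions, here is a minimal one-dimensional NumPy sketch. The signal, the kernel, and the zero "same" padding are assumptions for illustration; the networks discussed here use two-dimensional learned kernels.

```python
import numpy as np

def dilated_conv1d(signal, kernel, rate):
    """Zero-padded 'same' 1-D cross-correlation with dilation `rate`."""
    k = len(kernel)
    span = (k - 1) * rate                 # distance covered by the dilated kernel
    pad = span // 2
    padded = np.pad(signal, (pad, span - pad))
    out = np.zeros(len(signal))
    for i in range(len(signal)):
        # Taps are `rate` samples apart; output i stays aligned with input i.
        taps = padded[i : i + span + 1 : rate]
        out[i] = np.dot(taps, kernel)
    return out

def receptive_field(kernel_size, rate):
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_size - 1) * rate + 1

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])
y1 = dilated_conv1d(signal, kernel, rate=1)   # compares neighbours 1 step away
y4 = dilated_conv1d(signal, kernel, rate=4)   # compares samples 4 steps away
print(receptive_field(3, 1), receptive_field(3, 4))
```

The same three weights cover 3 samples at rate 1 and 9 samples at rate 4, and the output has the same length as the input in both cases, so each output value still corresponds to a specific input location.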
Fig. 3. DeepLabv3+ variant architecture for glaucoma classification. We replace the last upsampling layer with an average pooling layer and a fully connected layer.
We modify DeepLabv3+ [2] into a classifier by replacing the last layer with a global average pooling layer followed by a fully connected layer that predicts the probability of glaucoma. DeepLabv3+ consists of an encoder and a decoder. The encoder applies atrous spatial pyramid pooling (ASPP) [1] and cascaded convolutions to extract pixel-wise context information at various scales. The decoder fuses the low-level features learned by atrous convolutions with the multi-scale context features of the encoder.
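The replacement head (global average pooling followed by a fully connected layer) can be sketched in NumPy as follows. This is an illustration under assumptions, not the trained model: the channel count of 32, the random weights, and the sigmoid output activation are all hypothetical.

```python
import numpy as np

def classification_head(feature_map, weights, bias):
    """Map a (H, W, C) feature map to a single probability."""
    pooled = feature_map.mean(axis=(0, 1))    # global average pooling -> (C,)
    logit = float(pooled @ weights + bias)    # fully connected layer, one unit
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid (assumed) -> probability

rng = np.random.default_rng(0)
w = rng.normal(size=32)                       # hypothetical learned weights
probs = []
for size in (16, 23):                         # two different spatial sizes
    fmap = rng.normal(size=(size, size, 32))
    probs.append(classification_head(fmap, w, 0.0))
print(probs)
```

Because the pooling collapses the spatial dimensions, the same head accepts feature maps of any height and width, which is convenient when networks are trained on inputs of several sizes.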
Loss function We use the binary cross-entropy (BE) to measure the difference between the predicted class probability and the actual class:
$$\mathrm{BE} = -\left( y \log p + (1 - y) \log (1 - p) \right)$$
where BE is the value of the binary cross-entropy loss, $y$ is the binary indicator (0 or 1), and $p$ is the predicted probability.
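A short, self-contained sketch of the binary cross-entropy for a single prediction is given below. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; it is not part of the definition.

```python
import math

def binary_cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy for one label y in {0, 1} and probability p."""
    p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite (assumed)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)
undecided = binary_cross_entropy(1, 0.5)      # equals log(2)
print(confident_right, undecided, confident_wrong)
```

The loss grows as the predicted probability moves away from the true class, so a confident wrong prediction is penalized far more than an undecided one.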
Data preprocessing We reduce the variance between training and validation images by cropping 600 × 600 region-of-interest (ROI) patches with a pretrained model, the Disc-aware Ensemble Network (DE-Net) [3]. This preprocessing also lets the model focus on learning the most important pixel-wise information.
Fig. 4. The data preprocessing steps for the segmentation and classification training and testing data.
We use data augmentation techniques, such as rotating images by 90°, 180°, and 270° and flipping them, to increase the number of training images. In total, 3,200 ROI images are generated for segmentation and classification training.
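The rotation and flipping augmentation can be sketched with NumPy as below. The choice of a horizontal flip and the combination of every rotation with a flip are assumptions, since the exact flipping scheme is not specified. For segmentation, the same transform would have to be applied to the label map as well.

```python
import numpy as np

def augment(image):
    """Return rotated and flipped variants of an (H, W) or (H, W, C) array."""
    variants = []
    for k in range(4):                        # 0, 90, 180 and 270 degrees
        rotated = np.rot90(image, k)          # rotates in the (H, W) plane
        variants.append(rotated)
        variants.append(np.fliplr(rotated))   # horizontal flip (assumed)
    return variants

image = np.arange(16).reshape(4, 4)           # small asymmetric test image
variants = augment(image)
print(len(variants))
```

Under these assumptions each ROI yields eight variants, including the original.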
To ensure that the network’s receptive field is sufficient, we resize the training patches to a smaller size, 128 × 128, as the training inputs for the segmentation task. For the classification task, we use the same method as in segmentation (Fig. 3) to crop out the ROIs for training and testing. We resize the originally cropped regions to various sizes, such as 216 × 216, 256 × 256, 286 × 286, 324 × 324, and 360 × 360, to train multiple deep networks. We average the models’ outputs to obtain the final prediction.
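Averaging the outputs of several networks can be sketched in plain Python as follows. The probabilities are made-up values standing in for the outputs of three models trained at different input sizes; an unweighted mean is assumed.

```python
def ensemble_average(predictions):
    """Average per-image probabilities over models.

    predictions: one inner list of probabilities per model, all covering
    the same images in the same order.
    """
    n_models = len(predictions)
    return [sum(per_image) / n_models for per_image in zip(*predictions)]

# Hypothetical outputs of three models for three images.
preds = [
    [0.90, 0.20, 0.40],
    [0.80, 0.10, 0.60],
    [0.70, 0.30, 0.50],
]
avg = ensemble_average(preds)
print(avg)
```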
Fig. 5. The data preprocessing steps for the classification training data.
In addition, we need to handle the difference in image size between the training and validation images. Hence, when testing on validation images, we crop out 500 × 500 ROIs (rather than the 600 × 600 used in training) for the segmentation task and 800 × 800 ROIs (rather than the 1024 × 1024 used in training) for the classification task. This keeps the inputs to the network as similar as possible to the training images.
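Cropping a fixed-size ROI can be sketched as below. The sketch assumes the disc centre is already known (here it is simply passed in, whereas in this work the localization comes from a pretrained model), and the image dimensions are hypothetical. The window is shifted when needed so that it stays inside the image.

```python
import numpy as np

def crop_roi(image, center_row, center_col, size):
    """Crop a size x size window around a centre, kept inside the image."""
    h, w = image.shape[:2]
    if size > h or size > w:
        raise ValueError("ROI is larger than the image")
    # Clamp the top-left corner so the window never leaves the image.
    top = min(max(center_row - size // 2, 0), h - size)
    left = min(max(center_col - size // 2, 0), w - size)
    return image[top : top + size, left : left + size]

image = np.zeros((1000, 1200, 3), dtype=np.uint8)   # hypothetical image size
roi_center = crop_roi(image, 500, 600, 500)         # window fits as requested
roi_corner = crop_roi(image, 10, 1190, 500)         # window shifted inward
print(roi_center.shape, roi_corner.shape)
```

Changing only the `size` argument between training and testing is one way to compensate for a difference in image resolution before resizing the crop to the network input size.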
Others We train with Keras on TensorFlow under Python 2.7. We use the Adam optimizer with a learning rate of 0.0001.
For segmentation, on the training set the mean optic cup Dice is 0.9626, the mean optic disc Dice is 0.9876, and the MAE of the CDR is 0.0161. On the validation set, the mean optic cup Dice is 0.8498, the mean optic disc Dice is 0.9433, and the MAE of the CDR is 0.0444. Best rank (online results): 8th.
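For readers unfamiliar with these segmentation metrics, the following NumPy sketch shows one way to compute a Dice coefficient and a cup-to-disc ratio (CDR) from binary masks. It assumes the CDR is the vertical ratio, measured as the ratio of mask heights, and it uses toy masks; it is not the challenge's evaluation code.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0   # both masks empty: treated as agreement (a convention)
    return 2.0 * np.logical_and(pred, target).sum() / total

def vertical_diameter(mask):
    """Number of rows between the first and last row containing the mask."""
    rows = np.flatnonzero(np.asarray(mask, dtype=bool).any(axis=1))
    return 0 if rows.size == 0 else int(rows[-1] - rows[0] + 1)

def vertical_cdr(cup_mask, disc_mask):
    """Vertical cup-to-disc ratio; assumes a non-empty disc mask."""
    return vertical_diameter(cup_mask) / vertical_diameter(disc_mask)

disc = np.zeros((10, 10), dtype=bool)
disc[2:8, 2:8] = True                 # 6 rows tall
cup = np.zeros((10, 10), dtype=bool)
cup[4:7, 4:7] = True                  # 3 rows tall
shifted = np.zeros((10, 10), dtype=bool)
shifted[3:9, 2:8] = True              # disc prediction shifted down by one row

disc_dice = dice(shifted, disc)       # 2 * 30 / (36 + 36)
cdr = vertical_cdr(cup, disc)         # 3 / 6
print(disc_dice, cdr)
```

The MAE of the CDR is then the mean absolute difference between the predicted and reference ratios over all images.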
For classification, on the training set the AUC is 1.0 and the sensitivity is 1.0, which suggests potential overfitting. On the validation set, the AUC is 0.9708 and the sensitivity is 0.95. Latest rank (online results): 2nd.
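As a reminder of what these classification metrics measure, here is a plain-Python sketch of the AUC (computed as the fraction of positive/negative pairs ranked correctly) and of the sensitivity. The labels and scores are made up, and the decision threshold of 0.5 is an assumption; the operating point used by the challenge is not stated here.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity(labels, scores, threshold=0.5):
    """Fraction of positive cases scored at or above the threshold."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)

labels = [1, 1, 0, 0, 0]              # 1 = glaucoma, 0 = normal (toy data)
scores = [0.9, 0.4, 0.5, 0.2, 0.1]

toy_auc = auc(labels, scores)         # 5 of 6 pairs ranked correctly
toy_sens = sensitivity(labels, scores)
print(toy_auc, toy_sens)
```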
In this work, we proposed two deep learning networks for glaucoma segmentation and detection in retinal fundus images. To overcome the major challenge, namely the variance between training and testing images (in color and size, due to different acquisition equipment), we adopted pixel-wise learning and an attention strategy, which let the networks focus on learning the key features directly for accurate pixel-wise prediction. In particular, we proposed a multiple-input U-Net, named X-Unet, to provide more raw image pixel information for low-level feature regression and prediction. For classification, we showed how to learn pixel-wise features. In detail, a dilated (atrous) convolution based network can extract features at different scales while preserving location information. Atrous spatial pyramid pooling (ASPP) and convolutions extract pixel-wise context information at various scales. The encoder learns various pixel-wise features, and the decoder fuses the low-level features learned by dilated convolutions with the multi-scale context features of the encoder.
Our proposed methods can overcome the variance between training and testing data. However, we believe that the best way to obtain a robust model is to standardize image quality for any deep learning based model. This may require more effort from both the deep learning theory and data acquisition communities.
1. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2018)
2. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
3. Fu, H., Cheng, J., Xu, Y., Zhang, C., Wong, D.W.K., Liu, J., Cao, X.: Disc-aware ensemble network for glaucoma screening from fundus image. IEEE Transactions on Medical Imaging (2018)
4. Harizman, N., Oliveira, C., Chiang, A., Tello, C., Marmor, M., Ritch, R., Liebmann, J.M.: The ISNT rule and differentiation of normal from glaucomatous eyes. Archives of ophthalmology 124(11), 1579–1583 (2006)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
6. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
7. Orlando, J.I., Fu, H., Breda, J.B., van Keer, K., Bathula, D.R., Diaz-Pinto, A., Fang, R., Heng, P.A., Kim, J., Lee, J., et al.: Refuge challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Medical image analysis 59, 101570 (2020)
8. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
9. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)