Lung cancer is the leading cause of cancer death, regardless of gender or ethnicity. Only 19% of all people diagnosed with lung cancer will survive after 5 years, but this percentage improves dramatically when the disease is diagnosed at early stages (Noone et al., 2018).
Small lung nodules are the most common expression of early lung cancer. Their variability in size, texture, and morphology makes them difficult to detect, even for clinical specialists. The use of thin-slice helical chest computed tomography (CT), together with the recommendations established by clinical guidelines such as those of the Fleischner Society (MacMahon et al., 2017), has improved nodule detection rates and the identification of malignancy in incidental nodules. However, recommendations for borderline and complex cases are still vague and open to the judgment and experience of the clinicians.
Current clinical criteria for assessing pulmonary nodule changes rely on visual comparison and diameter measurements from the axial slices of the initial and follow-up CT images (Larici et al., 2017). Three-dimensional assessment provides more accurate and precise nodule measurements, especially for small nodules (Ko et al., 2012). However, it requires the segmentation of the nodule, which is a time-consuming process and highly subject to intra- and inter-observer variability. This is why it is rarely used in a typical clinical workflow.
Computer-assisted diagnosis (CAD) systems are expected to assist in clinical decision-making by providing relevant information such as accurate growth rates, increase in solid component, or change in density of the nodules. This information could help specialists to reduce the number of studies for a problematic nodule, decreasing the diagnostic time and, hopefully, reducing the classification of the neoplasm, which should lead to a reduction in morbidity and mortality (American College of Radiology, 2014).
Recent advances in deep neural networks (Goodfellow et al., 2016) have substantially improved on the performance reported by conventional image processing methods in nodule detection (Setio et al., 2017), segmentation (Messay et al., 2015), and malignancy classification (Ciompi et al., 2017). Some of the main advantages of deep neural networks lie in their ability to learn and extract intricate patterns from the raw data very effectively without any prior feature engineering, to reuse these patterns in different locations of the image, and even to transfer them to different domains (Weiss et al., 2016). Despite the recent explosion of methods based on deep neural networks in the lung cancer domain, most of them focus on the analysis of a single CT scan.
Few CAD systems (Ardila et al., 2019) have been proposed for the automatic support of lung cancer follow-up. Major developments in the field are mainly limited by the lack of open datasets with annotated series of CTs. To analyze series of CT scans, prior and follow-up lung exams first have to be registered to facilitate, for instance, the correct re-identification of pulmonary nodules. Several factors compromise the effectiveness of the registration process, such as the variability in image size and resolution caused by the use of different CT scanners, and the variability in the position and breathing cycle of the patients during scanning.
Although current medical image registration methods (Song et al., 2017), especially non-linear ones (Rühaak et al., 2017), report accurate CT alignments, they are still slow and introduce some distortions in the intrinsic structure of the lung, hindering their wide clinical acceptance (Viergever et al., 2016). In addition, regardless of the quality of the image registration, other complexities must be addressed to enable proper nodule re-identification, such as the existence of several nodules close to each other and the alteration in texture, size, and even location of the nodules due to disease progression. Therefore, more research is still needed to reliably incorporate nodule re-identification across different CT scans into automated tools that support physicians in the analysis of longitudinal studies of lung cancer.
This work aims to take a step in this direction, and proposes a novel approach for the re-identification of pulmonary nodules. In particular, we propose a 3D Siamese neural network (Koch, 2015) to predict the most likely matching nodules from a series of lung CT scans of the same patient. This approach does not require prior registration of the CT scans, avoiding some of the shortcomings that it entails. In addition, to demonstrate the value of this approach, we integrate it into an automated pipeline aimed at detecting the growth of pulmonary nodules over time.
The contributions of this paper with respect to previous works are two-fold. First, we investigate and provide several models for re-identifying lung nodules in series of CT scans, relying directly on 3D volumetric data, transfer learning, and siamese neural networks. To the best of our knowledge, this is the first time that the problem of pulmonary nodule re-identification has been addressed through deep learning techniques. Second, we build and evaluate an automatic pipeline that integrates the proposed models to predict nodule growth from longitudinal CTs.
2.1. Automated nodule re-identification
Lung nodule re-identification (i.e. matching) between current and former CT examinations is necessary for assessing nodule growth or shrinkage. While the majority of lung cancer CAD systems found in the literature focus on the nodule detection task (Loyman and Greenspan, 2019), relatively few automated nodule matching systems have been proposed, partly because of the limited availability of follow-up datasets.
In Lee et al. (2007) a matching rate of 67% was obtained in a sample of 30 patients with metastatic pulmonary nodules. In screening datasets, higher matching rates are usually reported. In 54 pairs of low-dose multi detector CTs, a CAD system successfully matched 91.3% of nodules 4mm (Beigelman-Aubry et al., 2007). In a low-dose multi detector CT screening study of 40 subjects with non-calcified nodules, a matching rate of 92.7% across three time points was found (Tao et al., 2009). Another CAD system (Koo et al., 2012) was evaluated for automated lung nodule matching using annotations from 4 experts in 57 patients. Performances obtained were between 79% and 92% of accuracy scores. These CAD systems relied on conventional computer vision techniques. Deep learning-based CAD systems for analysis of longitudinal lung cancer studies are practically nonexistent in literature. An exception is in (Ardila et al., 2019), where a CAD system for end-to-end lung cancer screening is proposed. However, nodule matching was not directly tackled in the study.
All these CAD systems rely on registration of the lungs in the different examinations. Performing an accurate registration of lung images is particularly challenging due to the high deformability of the lung tissue and the volume changes during the breathing cycle (Murphy et al., 2010). Several methods have been studied for the registration of lung CT series (Song et al., 2017). The choice of the right registration method and of the correct evaluation metric to assess its performance is of crucial importance, as both can affect the results of the analysis.
2.2. Siamese Neural Networks
The problem of nodule re-identification is closely related to that of recognizing the same object in different images. This type of problem has been successfully addressed by siamese neural networks (SNNs) (Bromley et al., 1994). They are designed as two sibling networks, connected by a distance layer at the top, trained to predict matching or mismatching between two input images. The original architecture, first introduced for the problem of signature verification, was later extended by Koch (2015) using convolutional layers and adjusting the optimization metric with a weighted L1 distance between the twin feature vectors of both networks.
SNNs have been extensively used in computer vision matching problems such as tracking objects in videos (Tao et al., 2016), matching pedestrians across multiple camera views (Varior et al., 2016), and matching corresponding patches in satellite images (Hughes et al., 2018).
In the medical image domain, SNNs have been used primarily to extract a latent representation for content-based image retrieval. For instance, Chung and Weng (2017) proposed a SNN, pre-trained on the ImageNet dataset and using a contrastive loss function (Hadsell et al., 2006) to retrieve similar images to the query, using a publicly available dataset of diabetic retinopathy fundus images. Another example is the work by Cai et al. (2019), which applied SNNs to retrieve similar images from several medical image databases of lung, pancreas, and brain. As far as we know, SNNs have not yet been applied to re-identify nodules in a series of lung CT scans.
3.1. Nodule re-identification
To solve the problem of nodule re-identification in a pair of CTs of the same patient taken at different time points, we propose building a SNN (Koch, 2015). An appealing characteristic of SNNs is that they rely on a distance metric computed on features extracted automatically by a deep learning network. This should greatly accelerate and simplify the nodule re-identification process, since it avoids introducing a registration technique as a source of variability and error in the analysis.
Siamese neural networks are composed of a feature extraction component in which two subnetworks (with shared architecture and weights) process a pair of images at a time to produce two embedding feature vectors directly from the images. A second component (i.e. the head of the network) aims to classify whether the two embedding feature arrays are similar or not. To assess this, the features are passed to a pairwise distance layer that computes a similarity score.
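The two-component structure described above can be illustrated with a minimal, dependency-free sketch. It is not the network used in this work: the single linear layer standing in for the 3D sibling network, the weights `alpha`, the bias value, and the toy patch size are all assumptions made for illustration. What it does show is the key property that both inputs pass through the same weights before a pairwise L1 distance is turned into a similarity score.

```python
import math
import random

def embed(patch, weights):
    """Toy sibling network (assumption): one linear layer + ReLU on a flat patch."""
    return [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in weights]

def l1_distance(a, b):
    """Element-wise absolute difference between two embedding vectors."""
    return [abs(x - y) for x, y in zip(a, b)]

def similarity(patch_t1, patch_t2, weights, alpha, bias=2.0):
    """Embed both patches with the SAME weights, then map a weighted
    L1 distance to a score in (0, 1) with a sigmoid."""
    e1 = embed(patch_t1, weights)
    e2 = embed(patch_t2, weights)
    z = sum(a * d for a, d in zip(alpha, l1_distance(e1, e2))) + bias
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
alpha = [-1.0] * 4  # hypothetical: larger distance lowers the score
patch = [rng.random() for _ in range(8)]
other_patch = [x + 0.5 for x in patch]

same_score = similarity(patch, patch, weights, alpha)
other_score = similarity(patch, other_patch, weights, alpha)
```

Because the weights are shared, the score is symmetric in its two inputs, and identical patches always receive the maximum score that the chosen bias allows.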
In a previous study (Bonavita et al., 2019), we trained a deep convolutional neural network (CNN) for nodule classification able to effectively reduce the number of false positives in the nodule detection problem. In the present work, we have adjusted that network, improving its final performance. In particular, we propose a 3D CNN based on a ResNet-34 architecture that expects nodule patches of 32x32x32. As described in the original paper, the patches are pre-processed crops taken around the center of the annotated nodules of the lung CT. The nodule classification network was trained from scratch using a large number of nodule candidates (> 750K) from the LUNA-16 challenge dataset (Setio et al., 2017). Further details on its architecture and performance are shown in the supplementary material (S1).
In the current study, we removed the fully connected layers of the nodule classification network to use it as the backbone of the sibling networks of the feature extraction component of the SNNs. Figure-1 shows the SNN architecture for the nodule re-identification problem. In this figure, we can observe the two components. The first is the feature extraction component, which pre-processes the input nodule patches (i.e. taken at different time points, T1 and T2) and uses the sibling network to extract the corresponding feature maps. The second is the classification component, composed of the head of the network, which predicts whether both feature maps are similar or not. These feature maps (solid arrows in Figure-1) come from different levels of the pre-trained sibling networks. Further details about the feature maps and the network heads are given in Subsections 3.1.2 and 3.1.3, respectively.
Figure 1: Siamese network proposed for lung nodule re-identification. The network is composed of a feature extraction and a basic head network to perform the prediction.
Figure 2: Alternative head networks to configure different siamese networks.
Different SNN configurations were proposed (Table-1) to gain further insights into the best parameterizations. To allow a fair comparison of the configurations, we trained the SNNs with the same parameter values. Concisely, the number of epochs was set to 150, the learning rate to 0.0001, the batch size to 8, and dropout to 0.3; early stopping was triggered after 10 epochs without any significant improvement, and Adam (Kingma and Ba, 2014) was used for optimization. Finally, random rotation, flip, and zoom were applied for data augmentation.
Table 1: List of the different siamese network configurations. The column with acronyms is the result of joining the first letter of the options placed in the next 4 columns.
Below we describe in more detail the main configurations and parameters used in the experiments.
3.1.1. Pre-trained network weights
Two configuration values were proposed for this setting: frozen and unfrozen. Usually, the weights of the pre-trained networks in a SNN remain frozen. In this study, the pre-trained network had a related but slightly different learning goal than the target (siamese) network. Thus, we also allowed the option of unfreezing the weights of the pre-trained network and updating them during the backpropagation steps of the siamese network training process.
3.1.2. Feature maps
We proposed two options: using the feature maps individually and combining the feature maps together. Feature maps extracted from the first layers of a CNN correspond to low-level and less domain-specific representations (e.g. lines, circles, spikes), whereas features extracted from deeper layers are generally higher-level and more domain-related representations (e.g. morphology, texture). To analyze the potential of both general and more specific nodule features, we propose to use features from different depths of the network (i.e., from the last layer of each of the 4 convolution blocks that make up the pre-trained ResNet-34 network). The resulting feature maps were obtained after a forward pass through the network for each of the nodule images of the whole dataset. Table-2 shows the layer name, the number of filters per layer, the output dimension of each filter, and the total number of parameters for each of the selected feature maps.
Table 2: Layers selected from the pre-trained part of the SNNs.
In total, we configured 4 experiments for the individual option (one for each of the 4 feature maps proposed), and 11 cases for the different possible combinations of the feature maps.
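The count of 4 individual and 11 combined cases follows directly from enumerating the non-empty subsets of four feature maps: 4 subsets of size one, and 6 + 4 + 1 = 11 subsets of size two or more. The short sketch below verifies this; the labels `fm1` to `fm4` are placeholders, not the layer names listed in Table-2.

```python
from itertools import combinations

# Placeholder labels for the four selected feature maps (hypothetical names).
feature_maps = ["fm1", "fm2", "fm3", "fm4"]

# Individual option: each feature map used on its own.
individual = [c for c in combinations(feature_maps, 1)]

# Combined option: every subset containing two or more feature maps.
combined = [
    c
    for size in range(2, len(feature_maps) + 1)
    for c in combinations(feature_maps, size)
]

n_individual = len(individual)  # 4
n_combined = len(combined)      # 6 + 4 + 1 = 11
```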
3.1.3. Siamese heads
We proposed four different head networks, one meant to follow a more conventional siamese architecture and the others with more exploratory purposes, more precisely:
1. A basic head network (Figure-1) composed of a flatten layer (to homogenize all features to one dimension) and a pairwise distance (i.e. L1) layer, just after the feature extraction part of the network.
2. A fully connected (FC) head network (Figure-2b) composed of a pairwise distance layer, a flatten layer, and a FC block. The FC block comprises a FC layer (with 64 units), a batch norm, a ReLU, a dropout layer and a final FC layer (with one unit). This classifier head aims at finding non-linear patterns among the merged features (from both sibling networks).
3. A CNN head network (Figure-2c) composed of a pairwise distance layer and a clean (without pre-trained weights) ResNet-34 CNN. Several arrows connect the pairwise distance layer with this clean ResNet-34. There are as many arrows as pre-trained layers used to extract the features. The arrows redirect the features to a specific part of the clean ResNet-34. The redirection had to match the dimensions of the extracted features with the input dimensions of the receiving layer. For instance, features extracted from the last layer of block1 were linked to the initial layer of block2, features from layer2 were linked to the initial layer of block3, and so on. This head network aimed at exploring non-linear patterns between features without losing the spatial dimension (i.e. no flattening of the features was done between the pairwise layer and the clean ResNet-34).
4. A multi-features combined (MFC) head network (Figure-2d) composed of a pairwise distance layer, a flatten layer, a concatenation layer (to merge all features), and a FC block (already described above). This head network aimed at exploring combinations of features from different parts of the network.
3.1.4. Loss functions
We explored two options: a contrastive loss and a binary cross entropy (BCE) loss function. Traditionally, SNNs are trained using a contrastive loss function (Hadsell et al., 2006). This function encodes both similarity and dissimilarity (between the feature maps) independently in a single loss. It ensures that semantically similar pairs are embedded close together while forcing dissimilar pairs apart from each other. Another option to train these networks is through a prediction error-based approach. For our case we adopted the binary cross entropy loss. This required applying a sigmoid function to the outputs to transform them into probability values (between 0 and 1).
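The difference between the two loss functions can be made concrete with a small per-pair sketch. It follows the general form of the contrastive loss of Hadsell et al. (2006) and the standard BCE on a sigmoid output; the margin value and the convention that label 1 denotes a matching pair are assumptions made here, not settings reported in this work.

```python
import math

def contrastive_loss(distance, is_match, margin=1.0):
    """Distance-based loss: matching pairs are pulled together, while
    non-matching pairs are pushed apart until they exceed the margin.
    The margin of 1.0 is an illustrative assumption."""
    if is_match:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2

def bce_loss(logit, is_match):
    """Probability-based loss: binary cross entropy applied to
    sigmoid(logit), written in a numerically stable form."""
    y = 1.0 if is_match else 0.0
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))

# A non-matching pair already farther apart than the margin costs nothing.
far_negative = contrastive_loss(2.0, is_match=False)
# A non-matching pair at distance 0.5 is still penalized: 0.5 * 0.5**2.
near_negative = contrastive_loss(0.5, is_match=False)
# An undecided prediction (probability 0.5) costs log(2) under BCE.
undecided = bce_loss(0.0, is_match=True)
```

The contrastive loss operates on the embedding distance itself, whereas the BCE operates on the probability produced after the sigmoid, which is why the latter needs the extra output transformation mentioned above.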
3.2. Nodule growth detection pipeline
In order to evaluate our nodule re-identification approach in a more realistic and practical scenario, we integrated it into a pipeline that, given a pair of CTs of the same patient taken at two different time points (T1, T2), automatically assesses the nodule growth.
The pipeline (Figure-3) comprises two components: 1) a nodule detector that, given a CT, generates a list of nodule candidates, and 2) a nodule matching component (embedding the siamese networks) that, given the list of nodule candidates of the CTs at the two time points, matches the nodules and computes the difference in diameter between them.
3.2.1. Nodule detector
To build the nodule detector, we followed the work of Liao et al. (2019), with which they won the Data Science Bowl lung cancer challenge. The authors proposed a 3D Faster-RCNN (Ren et al., 2015) scheme for nodule detection. The backbone of the network was similar to the U-Net (Ronneberger et al., 2015) architecture, in which the information flows not only in a classical bottom-up way but also between the encoder and decoder parts of the network thanks to symmetric links (or shortcuts) that connect both parts of the network. The outputs of this network were probability feature maps, useful for the lung cancer classification problem.
Figure 3: Nodule growth detection pipeline.
To the original network, we proposed attaching a double CNN head as in (Ren et al., 2015). One head was used for regression and the other for classification. The regression branch infers the center (x,y,z locations) and the diameter of the nodule, while the classification branch predicts the probability of being a nodule.
The input lung CT was pre-processed before entering the nodule detection network. The image was resampled to an isotropic resolution (1x1x1mm), pixel intensities clipped between [-1000, 600] HU and normalized. The image was then split in overlapping patches (due to memory constraints) of 128x128x128 with an overlap of 32 pixels per dimension. Following Liao et al. (2019), each patch was fed to the network together with a second input of size (32x32x32x3) which contains the relative locations of the patch image with respect to the whole scan. The final network architecture used for nodule detection as well as the performance obtained in LUNA-16 (Setio et al., 2017) dataset can be found in the supplementary material (S2).
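Two of the pre-processing steps above, intensity clipping and splitting into overlapping patches, can be sketched as follows. The clipping window, patch size, and overlap come from the text; rescaling to the range [0, 1] and shifting the last patch back so that it ends at the volume border are assumptions, since the exact normalization and border handling are not specified here.

```python
def normalize_hu(value, lo=-1000.0, hi=600.0):
    """Clip an intensity to [lo, hi] HU and rescale it to [0, 1].
    The [0, 1] target range is an assumption."""
    clipped = min(max(value, lo), hi)
    return (clipped - lo) / (hi - lo)

def patch_starts(axis_len, patch=128, overlap=32):
    """Start indices of overlapping patches along one axis.
    Consecutive patches share `overlap` voxels; the final patch is
    shifted back to end at the border (an assumed border policy)."""
    stride = patch - overlap
    if axis_len <= patch:
        return [0]  # a shorter axis would need padding, not shown here
    starts = list(range(0, axis_len - patch + 1, stride))
    if starts[-1] + patch < axis_len:
        starts.append(axis_len - patch)
    return starts

starts_320 = patch_starts(320)  # [0, 96, 192]
starts_300 = patch_starts(300)  # [0, 96, 172]
```

With a patch size of 128 and an overlap of 32, the stride is 96 voxels, so an axis of 320 voxels is covered exactly by three patches.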
3.2.2. Nodule matching
This component performs the re-identification of the nodules among all CT pairs. To do this, for each pair of CTs, we took each candidate found at T1 and paired it with each of the candidates found at T2. The pairs were pre-processed following the specifications described in Section 4.1, and then they were fed to the SNN. The network, trained off-line, provided a matching probability for each pair of candidates. The pairs with the highest probability were selected as the matching ones.
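One way to read the exhaustive pairing step is sketched below: every T1 candidate is scored against every T2 candidate, and the T2 candidate with the highest probability is kept. The function `toy_score` is a hypothetical stand-in for the trained SNN (it only compares diameters), and the candidate values are made up for illustration.

```python
def match_candidates(cands_t1, cands_t2, score):
    """For each T1 candidate, return (index of best T2 candidate, probability)."""
    matches = []
    for c1 in cands_t1:
        probs = [score(c1, c2) for c2 in cands_t2]
        best_j = max(range(len(probs)), key=lambda j: probs[j])
        matches.append((best_j, probs[best_j]))
    return matches

def toy_score(d1, d2):
    """Hypothetical stand-in for the SNN: closer diameters score higher."""
    return 1.0 / (1.0 + abs(d1 - d2))

# Made-up candidate diameters (mm) at the two time points.
cands_t1 = [4.0, 9.0]
cands_t2 = [8.5, 4.2, 15.0]
matches = match_candidates(cands_t1, cands_t2, toy_score)
matched_indices = [j for j, _ in matches]  # [1, 0]
```

The number of network evaluations grows with the product of the candidate counts at T1 and T2, which is why the number of candidates kept per scan affects the matching time.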
To assess the performance of this process, we computed for each pair of CTs, whether the candidate at T2 predicted with highest probability by the SNN, matched with the annotated nodule at T2. Additionally, we computed the time required for finding the matching nodules. We repeated this process for each of the SNN configurations.
Once all matching nodules have been predicted for each pair of CTs, the pipeline returns the nodule growth along with the location and diameter of the matching nodules. The nodule growth is calculated directly as the difference between the predicted nodule diameters at T1 and T2 for each pair of lung CTs.
To evaluate the nodule growth detection, we selected all the correctly matched CT pairs and checked whether the nodule growth difference had the same sign in both the ground truth and the prediction. True positive (TP) and true negative (TN) cases were those that had (in both ground truth and prediction) positive and negative growth differences, respectively. A false positive (FP) case was considered when the predicted growth difference was positive and the ground truth one was negative; and a false negative (FN) was considered in the opposite case.
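The sign-based evaluation rule can be written as a small function. Cases where ground truth and prediction are both positive count as true positives and cases where both are negative count as true negatives. Treating a difference of exactly zero as no growth is an assumption, since that case is not discussed in the text.

```python
def growth_outcome(true_t1, true_t2, pred_t1, pred_t2):
    """Classify one matched CT pair by the sign of its diameter change.
    A zero difference is treated as 'no growth' (assumption)."""
    true_growth = (true_t2 - true_t1) > 0
    pred_growth = (pred_t2 - pred_t1) > 0
    if true_growth and pred_growth:
        return "TP"
    if not true_growth and not pred_growth:
        return "TN"
    if pred_growth:
        return "FP"
    return "FN"

# Made-up diameters (mm): the nodule truly grew and the pipeline agrees.
outcome = growth_outcome(true_t1=6.0, true_t2=8.0, pred_t1=5.5, pred_t2=7.9)
```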
4.1. Evaluation datasets
4.1.1. LUNA-16
In this work we used an updated version of the LIDC dataset (Armato III et al., 2015) provided in the LUNA16 challenge (Setio et al., 2017), which includes only scans with at least one lesion of size ≥ 3 mm marked as a nodule by at least three of the four radiologists. The LUNA16 dataset consists of 888 CT scans comprising a total of 1186 nodules. Annotations with the coordinates of each nodule in the three spatial axes, inferred from the original LIDC annotations, are also provided.
4.1.2. VH-Lung
This dataset was designed specifically to identify and follow up lung nodules over time. Ethics approval was obtained from the Medication Research Ethics Committee of Vall d’Hebron University Hospital (Barcelona) with reference number PR(AG)111/2019 presented on 01/03/2019.
Inclusion criteria were patients without a previous neoplasia, with a confirmed diagnosis, and with visible nodules (5 mm) in at least two consecutive CT scans separated in time by more than six months. These nodules were located in the three spatial axes by two different specialists at each time point and quantified by another experienced radiologist.
In total the dataset contains 151 cases with two thoracic CT scans. For each case, the clinicians annotated only one relevant nodule in both CT scans. The dataset was divided into two subsets, one for training (113 patients) with 70 cancer and 43 benign cases, and the other for testing (38 patients) with 25 cancer and 13 benign cases.
4.2. Nodule re-identification
To train the different SNNs, we built a new balanced dataset from the VH-Lung including 302 instances. As positive cases (N=151, label=1), we used the annotated locations of the matching nodules at T1 and T2. As negative cases (N=151, label=0), we paired the nodule location at T1 with a random location drawn from the annotated nodule locations at T2 (excluding the correct one). Random stratified sampling was used to partition the data into a training set (75% of the whole data) with 212 CT pairs (113 positive matching nodules) and a testing set with 90 CT pairs (38 positive cases).
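The construction of the balanced pair dataset can be sketched as below. The location tuples are placeholders for the annotated nodule coordinates, and the random seed and the sampling routine are illustrative assumptions; only the pairing logic (one positive and one mismatched negative per case) follows the description above.

```python
import random

def build_pairs(t1_locs, t2_locs, seed=0):
    """Return (t1_location, t2_location, label) triples:
    label 1 for the true match, label 0 for a random mismatch."""
    rng = random.Random(seed)
    n = len(t1_locs)
    pairs = []
    for i in range(n):
        pairs.append((t1_locs[i], t2_locs[i], 1))
        # Draw a T2 location from any other case, never the correct one.
        j = rng.choice([k for k in range(n) if k != i])
        pairs.append((t1_locs[i], t2_locs[j], 0))
    return pairs

# Placeholder annotations: one nodule per case at each time point.
t1_locs = [("T1", case) for case in range(151)]
t2_locs = [("T2", case) for case in range(151)]
pairs = build_pairs(t1_locs, t2_locs)
n_positive = sum(label for _, _, label in pairs)
```

With 151 annotated cases this yields 302 instances, half positive and half negative, matching the dataset size stated above.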
We optimized the different SNNs (Table-1) with the training data using a stratified 10-fold cross-validation, and we tested them with the testing set. Results of the best SNN configurations are shown in Table-3.
4.3. Nodule growth detection pipeline
To evaluate the performance of the nodule growth detection pipeline, we used the VH-Lung test set. First, we analyzed whether the relevant nodules (one per CT) annotated by the radiologists were correctly found. As the nodule detector outputs several candidates with an associated nodule probability, we explored the minimum number of candidates that allowed detecting the maximum number of annotated nodules. Thus, we computed a FROC-curve (Setio et al., 2017) for both train and test datasets to inspect the sensitivity of finding the (only) annotated nodule per scan at different FP rates. As we can observe in Figure-4, the model achieves high sensitivity scores with very few FP. The FROC-curve allowed us to set the threshold at 32 FPs, with only 11 annotated nodules out of 226 missed in the train set, which represented a sensitivity of 0.9513. In the test set (at 32 FP as threshold) only 2 out of 76 nodules were missed, resulting in a sensitivity of 0.9736.
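The sensitivities quoted above follow from the counts of missed nodules, as the short check below shows. Only the counts stated in the text are used. Note that 74/76 is approximately 0.97368, which rounds to 0.9737; the value 0.9736 given above corresponds to truncation at four decimals.

```python
def sensitivity(total_nodules, missed_nodules):
    """Fraction of annotated nodules found among the retained candidates."""
    return (total_nodules - missed_nodules) / total_nodules

train_sensitivity = sensitivity(226, 11)  # 215 / 226
test_sensitivity = sensitivity(76, 2)     # 74 / 76
```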
Figure 4: FROC-curve of the malignant nodule detection algorithm for train and test partition.
To gain insight into the complexity of the re-identification problem, we computed how many candidates were located within a chosen Euclidean distance from the nodule ground truth position (Figure-5). We defined 5 different distance thresholds: radius squared Euclidean distance (as used in the LUNA-16 challenge to accept a nodule detection as correct) and 4 fixed Euclidean distances (30, 20, 15 and 10 mm). For every distance, we computed the number of CTs in which 0, 1, 2, 5 or more than 10 candidates fell within the distance. Moreover, we computed an accuracy of detection for every distance choice by dividing the number of CTs for which only one candidate is within the distance by the total number of CTs. Results are shown in Table-4.
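The distance-based accuracy defined above can be sketched for the fixed-distance thresholds as follows. The candidate and ground truth coordinates are made-up values in millimetres, and the helper names are hypothetical; the radius-based LUNA-16 criterion is not reproduced here.

```python
import math

def count_within(candidates, truth, max_dist_mm):
    """Number of candidates within a Euclidean distance of the ground truth."""
    return sum(1 for c in candidates if math.dist(c, truth) <= max_dist_mm)

def single_candidate_accuracy(scans, max_dist_mm):
    """Share of CTs in which exactly one candidate falls within the distance."""
    hits = sum(
        1 for candidates, truth in scans
        if count_within(candidates, truth, max_dist_mm) == 1
    )
    return hits / len(scans)

# Made-up scans: (list of candidate centers, ground truth center), in mm.
scans = [
    ([(0, 0, 0), (50, 0, 0)], (1, 0, 0)),  # one candidate nearby
    ([(0, 0, 0), (5, 0, 0)], (1, 0, 0)),   # two candidates nearby
    ([(100, 0, 0)], (0, 0, 0)),            # no candidate nearby
]
acc_10mm = single_candidate_accuracy(scans, 10.0)
acc_3mm = single_candidate_accuracy(scans, 3.0)
```

A scan counts toward this accuracy only when the distance criterion alone is enough to single out one candidate, so the measure indicates how often re-identification would be trivial at a given threshold.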
Next, we evaluated the performance of the best SNN (Table-3) for nodule re-identification using the location of the nodule candidates provided by the nodule detector. The best results were achieved by the FIFB network with only 4 CT-pairs incorrectly matched and an accuracy of 0.888. All results are presented in Table-5.
Table 3: Performance results of the different SNN configurations
Figure 5: Candidates predicted (i.e. yellow marks) at a maximum distance from the ground truth centroid (i.e. red circle).
Table 4: Detected candidates (N) per each CT at T2 at different Euclidean distances
Table 5: Results of the different nodule re-identification pipelines
Then, we evaluated the performance of the best pipeline (i.e. the pipeline configured with the FIFB network) for the nodule growth detection task. As explained in Section 4.2.2, a correct prediction was achieved when the differences in diameter of the predicted and ground truth nodules had the same sign. In this way, with 32 correctly identified cases (out of 36), we obtained a recall of 0.92, a precision of 0.88, and an F1-score of 0.90. The confusion matrix is shown in Figure-6.
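The relation between the three reported scores can be checked with the standard definitions. The counts used below (23 true positives, 3 false positives, 2 false negatives) are not listed in the text; they are hypothetical values chosen because they are consistent with the reported recall, precision, and F1-score for 32 matched cases. The actual confusion matrix is the one in Figure-6.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, inferred for illustration only (see lead-in).
precision, recall, f1 = precision_recall_f1(tp=23, fp=3, fn=2)
```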
Additionally, we assessed the quality of the diameter measure prediction. Agreement between the predicted and ground-truth nodule growth vectors was assessed with a Bland-Altman (Altman and Bland, 1983; jaketmp, 2018) plot (Figure-7). The mean difference between the two measurements was 0.17 mm with a 95% confidence interval (from -3.35 to 3.70 mm). The mean value of the difference was not significantly different from 0 on the basis of a 1-sample t-test (p-value = 0.99), for which a previous logarithm transformation and a data inspection were carried out to ensure the prerequisite assumptions of normality. Also, we computed the mean absolute error of the predicted nodule growths (1.38 ± 1.17 mm), their mean squared error (3.26 ± 5.30 mm) and their coefficient of determination (r² = 0.71). Finally, Figure-8 shows the predicted and real difference of diameters for all CT pairs of the test dataset. To support the interpretation of this figure, we have included the axial slice with major diameter taken at time points T1 and T2 of an illustrative subset of nodules.
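The agreement statistics used in a Bland-Altman analysis can be computed as in the sketch below: the mean of the paired differences and an interval of mean ± 1.96 standard deviations around it. The growth values are made up for illustration, and the use of the sample standard deviation is an assumption about the exact computation.

```python
import statistics

def bland_altman(predicted, ground_truth):
    """Mean difference and 95% limits (mean +/- 1.96 * SD of differences)."""
    diffs = [p - t for p, t in zip(predicted, ground_truth)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (assumed)
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff

def mean_absolute_error(predicted, ground_truth):
    """Average magnitude of the paired differences."""
    return statistics.mean(abs(p - t) for p, t in zip(predicted, ground_truth))

# Made-up nodule growth values (mm) for four CT pairs.
predicted_growth = [1.0, 2.0, 3.0, 4.0]
true_growth = [0.5, 2.5, 2.5, 4.5]
mean_diff, lower, upper = bland_altman(predicted_growth, true_growth)
mae = mean_absolute_error(predicted_growth, true_growth)
```

A mean difference close to zero indicates no systematic over- or under-estimation, while the width of the interval reflects how far individual predictions can deviate from the ground truth.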
Figure 6: Confusion matrix for nodule growth prediction
Figure 7: Bland-Altman plot for agreement between predicted and ground truth nodule growth
In this article, we provide a novel way to address the nodule re-identification problem. In particular, we propose a deep SNN that can directly re-identify nodules located in a series of pairs of CT scans without the need for any image registration.
The SNN allows matching pulmonary nodules in different CTs in a single stage by outputting a similarity score (i.e. the probability of being the same nodule). In contrast, standard techniques require at least two stages: first registering the image and then identifying matching nodules with some distance function. Moreover, with the proposed solution, no additional deformations/perturbations of the lung scan are performed, so that nodule measurements can be done directly from the image itself. Another advantage is that the re-identification process is fast since all weights of the network have already been calculated during the training phase.
We designed and tested several SNN architectures in order to fully understand the complexities of the problem and find the best network configuration. Results (Table-3) show that, in general (7 out of 8 experiments), the networks obtained high accuracy scores, above 85% in validation and 80% in test. Indeed, several of the SNN configurations (e.g. FIFB, UCMB) achieved accuracy scores in test above 92%, in agreement with the state-of-the-art performances reported for automated matching of pulmonary nodules (Tao et al., 2009; Koo et al., 2012). One of the main factors contributing to the good performance is the use of transfer learning, namely initializing the backbone of the different SNNs with the weights of a previously trained 3D network. This can explain why even the simplest network (FIBC), without the addition of extra layers and without any fine adjustments, nevertheless achieved good performances (77.5% in validation and 71% in test).
Regarding the loss functions configured in the different experiments, the methods using the BCE loss (which are based on probabilities) slightly outperformed the ones using the contrastive loss (which is based on distances). This can be seen in the difference in accuracy (3.5% in validation and 12% in test) obtained by the best network configured with probability-based loss function (FIFB) compared to the best network configured with loss function based on distance (UIBC).
Another finding was that unfreezing the weights of the pre-trained networks usually allowed for better performances. This is particularly evident in the UIBC case, which exceeded the corresponding frozen configuration (FIBC) by almost 10% in validation and testing. This finding was somewhat expected, as weights were transferred from networks trained in a different, although closely related, domain.
With respect to the features used by the networks, we can observe (Table-3) that, in almost all the methods, the best performance was achieved by using features extracted by layer1 and/or layer2, while only for two methods it was achieved using features from layer3 and avgpool (i.e. the global average pooling). This may suggest that features encoding simple patterns (from earlier layers) are preferred for this problem, whereas layers that contain more specific features (from the last layers) are less useful. It is also worth noticing that networks combining features from different layers did not clearly outperform networks using features from a single layer. This is the case of UCMB, for which the reported validation performances are just a bit lower (0.2%) than the performances reported by the FIFB configuration, although in test, UCMB outperformed FIFB by 0.4%.
Figure 8: Comparison between real and predicted cases. Upper panel: diameter differences for all test set. Lower panel: axial slices at two time points of different nodules.
Concerning the type of heads with which the networks were configured, the best option was using fully connected layers (FC head). Surprisingly, networks with extra convolution layers before the fully connected layers (CNN head) achieved worse performances (1% and 6% less in validation and test, respectively) than networks with FC heads. This might suggest that adding extra convolution layers to find patterns between locally connected features increases the complexity of the model, leading to more weights to adjust but with the same amount of training data.
To test these networks in a more realistic scenario, we integrated them into automatic two-stage pipelines that first re-identify pulmonary nodules and then predict nodule growth given a series of CTs of the same patient. In this setting, the performances achieved by the different nodule re-identification networks (Table-5) were slightly lower than in the training phase. This may be because the image patches were cropped around the position predicted by the nodule detector, whereas in training they were cropped around the centroid of the nodules. This may have occluded some parts of the nodules, making their correct matching more difficult. Nevertheless, 5 out of 8 networks reached a nodule matching accuracy above 80%. As in training, the best-performing network was FIFB, with an accuracy of 88.8%. This score is slightly worse than the current state of the art (> 92%). However, we must highlight that, in our approach, the positions of the nodules were provided automatically by the nodule detector (without any prior human intervention), whereas in (Tao et al., 2009; Koo et al., 2012) the position of the reference nodule to match is given by the radiologist.
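The re-identification stage of such a pipeline can be illustrated with a minimal sketch. It assumes that each nodule patch has already been mapped to an embedding vector by one branch of the SNN, and that the best match is the follow-up candidate with the smallest Euclidean distance to the reference nodule. The function names, the distance measure, the optional threshold and the toy vectors are all hypothetical and are not taken from the evaluated networks.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_nodule(reference, candidates, max_distance=None):
    """Rank follow-up candidates by distance to the reference embedding.

    Returns (index, distance) of the closest candidate, or None when there
    are no candidates or the closest one is farther than max_distance.
    """
    if not candidates:
        return None
    distances = [euclidean(reference, c) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    if max_distance is not None and distances[best] > max_distance:
        return None
    return best, distances[best]


# Toy embeddings (hypothetical values): one nodule in the initial CT and
# three detector candidates in the follow-up CT.
reference = [0.9, 0.1, 0.4]
candidates = [[0.1, 0.8, 0.3], [0.8, 0.2, 0.5], [0.5, 0.5, 0.5]]
result = match_nodule(reference, candidates)
```

In this sketch the second candidate is selected because its embedding lies closest to the reference. A distance threshold is one possible way to reject all candidates when the detector has missed the nodule in the follow-up scan.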
In terms of computational time, our approach performed satisfactorily: it re-identified the nodules of the complete test set in times ranging from half a minute (in the worst case, UIBC) to less than 10 seconds (for the best configuration, FIFB), as can be seen in Table-5. This is a particularly appealing feature of our method, since even the most recent techniques for registration of lung CT images, which any standard pipeline for nodule re-identification requires, take significantly more time per case, for instance 5 minutes according to Rühaak et al. (2017) or approximately 1 minute according to Zikri et al. (2019). These processing times fluctuate substantially depending on the technique and the quality of the image registration.
Although the focus of the paper is nodule re-identification, we also quantified and assessed nodule growth to explore the potential applications of the method. To do this, we used the best network for nodule re-identification (FIFB) and integrated it into the nodule growth pipeline. In total, nodule growth was correctly detected in 27 cases and erroneously in 5 cases. However, only 2 of these errors were false negatives (that is, the pipeline failed to predict growth); one of them was on a benign nodule (B01) with a growth difference of less than 1 mm, whereas the other was on a malignant nodule (C50) with a growth difference of 1.8 mm. As shown in Figure-7, predicted and real nodule growths agree: most of the measures fall within two standard deviations of the mean, the difference between them is not significant (p=0.99), and they show a good correlation score (r²=0.71). Despite these positive results, the values obtained for the 95% limits of agreement (> 3 mm) are still high. Thus, emphasis should be placed on improving the current nodule detector to provide more accurate diameters. Nevertheless, from a clinical point of view, the majority of the nodule differences were correctly classified (growth, no-growth), as shown in Figure-6, and we reported a mean absolute error of 1.38 ± 1.17 mm in diameter with respect to the ground truth, which is slightly less than the variability errors of 1.73 and 2.2 mm reported in different retrospective analyses (Revel et al., 2004; Kim et al., 2016) measuring changes in solid and subsolid nodules (< 2 cm) using only their diameter.
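The agreement statistics discussed above (bias, 95% limits of agreement and mean absolute error) can be computed as in the following minimal sketch of a Bland-Altman style analysis. The function name and the four measurement pairs are hypothetical and serve only to show the arithmetic; they are not data from this study.

```python
import statistics


def bland_altman(predicted, reference):
    """Agreement statistics between paired measurements (e.g. growth in mm).

    The 95% limits of agreement are taken as bias +/- 1.96 times the sample
    standard deviation of the paired differences.
    """
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return {
        "bias": bias,
        "lower_loa": bias - 1.96 * sd,
        "upper_loa": bias + 1.96 * sd,
        "mae": statistics.mean([abs(d) for d in diffs]),
    }


# Hypothetical paired measurements in mm (illustration only).
predicted_mm = [5.0, 7.5, 10.0, 12.0]
reference_mm = [6.0, 7.0, 9.0, 13.0]
stats = bland_altman(predicted_mm, reference_mm)
```

A bias close to zero with wide limits of agreement, as in this toy example, corresponds to the situation described in the text: no systematic difference between predictions and ground truth, but individual errors that can still be large.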
This study, however, is subject to several limitations. In the medical domain, data are scarce and difficult to obtain, and an insufficiently large dataset can negatively impact the performance of deep learning-based models. This is even more concerning for the re-identification of lung nodules, since twice as many images and annotations are needed for each patient. Another main limitation of the study is that the only expert annotation provided for nodule quantification was the major axial diameter. Although the diameter is the most common radiological measure used in practice for nodule growth assessment, 3D measurements could lead to a more accurate quantification. In addition, had nodule measurements from more experts been available, we could have better accounted for the clinical variability and reported more accurately the performance of our pipeline with respect to nodule growth prediction. Finally, in this work we focused on training and evaluating several SNNs to explore different configurations. Finer tuning of hyperparameters (e.g. the learning rates, batch sizes or dropout values) may lead to improved results.
Several lines of future work are envisaged to extend the research presented in this paper. Applying different feature fusion techniques, introducing different ways to weigh the feature maps, applying new techniques to reduce the dimensionality of the problem, and using segmentation are just some of the research lines that can be explored beyond the presented work.
In this paper, we address the problem of automatic re-identification of pulmonary nodules in lung cancer follow-up studies, using siamese neural networks (SNNs) to rank similarity between nodules, which bypasses the need for image registration. This change of paradigm avoids possible image disturbances and provides computationally faster results. Different configurations of the conventional SNN were examined, ranging from the application of transfer learning and the use of different loss functions to the combination of several feature maps from different network levels. The best results during the off-line training of the SNNs reached accuracies (0.89 in cross-validation and 0.92 in test) similar to those reported by state-of-the-art registration mechanisms. Finally, we embedded the best SNN into a two-stage nodule growth detection pipeline. Nodule re-identification results reported by the pipeline on an independent test set were fast (<10 seconds, matching 38 pairs of CTs) and precise (0.88 accuracy score). Nodule growth predictions were also accurate (0.92 sensitivity score), and the predicted and ground truth measurements were not significantly different (p=0.99).
This work was partially funded by the Industrial Doctorates Program (AGAUR) grant number DI079, and the Spanish Ministry of Economy and Competitiveness (Project INSPIRE FIS2017-89535-C2-2-R, Maria de Maeztu Units of Excellence Program MDM-2015-0502).
Altman, D.G., Bland, J.M., 1983. Measurement in medicine: the analysis of method comparison studies. Journal of the Royal Statistical Society: Series D (The Statistician) 32, 307–317.
Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G., et al., 2019. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine 25, 954.
Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, Anthony P, .C.L.P., 2015. Data from LIDC-IDRI. The Cancer Imaging Archive http://doi.org/10.7937/K9/TCIA.2015.LO9QL9SX.
Beigelman-Aubry, C., Raffy, P., Yang, W., Castellino, R.A., Grenier, P.A., 2007. Computer-aided detection of solid lung nodules on follow-up MDCT screening: evaluation of detection, tracking, and reading time. American Journal of Roentgenology 189, 948–955.
Bonavita, I., Rafael-Palou, X., Ceresa, M., Piella, G., Ribas, V., González Ballester, M.A., 2019. Integration of convolutional neural networks for pulmonary nodule malignancy assessment in a lung cancer classification pipeline. Computer Methods and Programs in Biomedicine 185, 1–9.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R., 1994. Signature verification using a "siamese" time delay neural network, in: Advances in Neural Information Processing Systems, pp. 737–744.
Cai, Y., Li, Y., Qiu, C., Ma, J., Gao, X., 2019. Medical image retrieval based on convolutional neural network and supervised hashing. IEEE Access 7, 51877–51885.
Chung, Y.A., Weng, W.H., 2017. Learning deep representations of medical images using siamese CNNs with application to content-based image retrieval. Advances in Neural Information Processing Systems. Workshop on Machine Learning for Health (ML4H) .
Ciompi, F., Chung, K., Van Riel, S.J., Setio, A.A.A., Gerke, P.K., Jacobs, C., Scholten, E.T., Schaefer-Prokop, C., Wille, M.M., Marchiano, A., et al., 2017. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Scientific Reports 7, 46479.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep learning. Cambridge: MIT press.
Hadsell, R., Chopra, S., LeCun, Y., 2006. Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), IEEE. pp. 1735–1742.
Hughes, L.H., Schmitt, M., Mou, L., Wang, Y., Zhu, X.X., 2018. Identifying corresponding patches in sar and optical images with a pseudo-siamese CNN. IEEE Geoscience and Remote Sensing Letters 15, 784–788.
jaketmp, 2018. jaketmp/pycompare: Looks both ways. URL: https://doi.org/10.5281/zenodo.1256204, doi:10.5281/zenodo.1256204.
Kim, H., Park, C.M., Song, Y.S., Sunwoo, L., Choi, Y.R., Im Kim, J., Kim, J.H., Bae, J.S., Lee, J.H., Goo, J.M., 2016. Measurement variability of persistent pulmonary subsolid nodules on same-day repeat CT: what is the threshold to determine true nodule growth during follow-up? PLoS One 11, e0148853.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations.
Ko, J.P., Berman, E.J., Kaur, M., Babb, J.S., Bomsztyk, E., Greenberg, A.K., Naidich, D.P., Rusinek, H., 2012. Pulmonary nodules: growth rate assessment in patients by using serial CT and three-dimensional volumetry. Radiology 262, 662–671.
Koch, G., 2015. Siamese neural networks for one-shot image recognition, in: International Conference on Machine Learning. Workshop on Deep Learning, vol. 2.
Koo, C.W., Anand, V., Girvin, F., Wickstrom, M.L., Fantauzzi, J.P., Bogoni, L., Babb, J.S., Ko, J.P., 2012. Improved efficiency of CT interpretation using an automated lung nodule matching program. American Journal of Roentgenology 199, 91–95.
Larici, A.R., Farchione, A., Franchi, P., Ciliberto, M., Cicchetti, G., Calandriello, L., del Ciello, A., Bonomo, L., 2017. Lung nodules: size still matters. European Respiratory Review 26, 170025.
Lee, K.W., Kim, M., Gierada, D.S., Bae, K.T., 2007. Performance of a computer-aided program for automated matching of metastatic pulmonary nodules detected on follow-up chest CT. American Journal of Roentgenology 189, 1077–1081.
Liao, F., Liang, M., Li, Z., Hu, X., Song, S., 2019. Evaluate the malignancy of pulmonary nodules using the 3-D deep leaky noisy-or network. IEEE Transactions on Neural Networks and Learning Systems 30, 3484–3495.
Loyman, M., Greenspan, H., 2019. Lung nodule retrieval using semantic similarity estimates, in: Medical Imaging 2019: Computer-Aided Diagnosis, International Society for Optics and Photonics. p. 109503P.
MacMahon, H., Naidich, D.P., Goo, J.M., Lee, K.S., Leung, A.N., Mayo, J.R., Mehta, A.C., Ohno, Y., Powell, C.A., Prokop, M., et al., 2017. Guidelines for management of incidental pulmonary nodules detected on CT images: from the fleischner society 2017. Radiology 284, 228–243.
Messay, T., Hardie, R.C., Tuinstra, T.R., 2015. Segmentation of pulmonary nodules in computed tomography using a regression neural network approach and its application to the lung image database consortium and image database resource initiative dataset. Medical Image Analysis 22, 48–62.
Murphy, K., Van Ginneken, B., Reinhardt, J., Kabus, S., Ding, K., Deng, X., Pluim, J., 2010. Evaluation of methods for pulmonary image registration: The EMPIRE10 study. Grand Challenges in Medical Image Analysis 2010, 11–22.
Noone, A., Howlader, N., Krapcho, M., Miller, D., Brest, A., Yu, M., Ruhl, J., Tatalovich, Z., Mariotto, A., Lewis, D., et al., 2018. Seer cancer statistics review, 1975-2015. Bethesda, MD: National Cancer Institute .
American College of Radiology, 2014. Lung CT screening reporting and data system (lung-RADS). Reston, VA: American College of Radiology .
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91–99.
Revel, M.P., Bissery, A., Bienvenu, M., Aycard, L., Lefort, C., Frija, G., 2004. Are two-dimensional CT measurements of small noncalcified pulmonary nodules reliable? Radiology 231, 453–458.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.
Rühaak, J., Polzin, T., Heldmann, S., Simpson, I.J., Handels, H., Modersitzki, J., Heinrich, M.P., 2017. Estimation of large motion in lung CT by integrating regularized keypoint correspondences into dense deformable registration. IEEE Transactions on Medical Imaging 36, 1746–1757.
Setio, A.A.A., Traverso, A., De Bel, T., Berens, M.S., van den Bogaard, C., Cerello, P., Chen, H., Dou, Q., Fantacci, M.E., Geurts, B., et al., 2017. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Medical Image Analysis 42, 1–13.
Song, G., Han, J., Zhao, Y., Wang, Z., Du, H., 2017. A review on medical image registration as an optimization problem. Current Medical Imaging Reviews 13, 274–283.
Tao, C., Gierada, D.S., Zhu, F., Pilgram, T.K., Wang, J.H., Bae, K.T., 2009. Automated matching of pulmonary nodules: evaluation in serial screening chest CT. American Journal of Roentgenology 192, 624–628.
Tao, R., Gavves, E., Smeulders, A.W., 2016. Siamese instance search for tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1420–1429.
Varior, R.R., Haloi, M., Wang, G., 2016. Gated siamese convolutional neural network architecture for human re-identification, in: European Conference on Computer Vision, Springer. pp. 791–808.
Viergever, M.A., Maintz, J.A., Klein, S., Murphy, K., Staring, M., Pluim, J.P., 2016. A survey of medical image registration under review. Medical Image Analysis 33, 140–144. URL: http://www.sciencedirect.com/science/article/pii/S1361841516301074, doi:https://doi.org/10.1016/j.media.2016.06.030. 20th anniversary of the Medical Image Analysis journal (MedIA).
Weiss, K., Khoshgoftaar, T.M., Wang, D., 2016. A survey of transfer learning. Journal of Big data 3, 9.
Zikri, Y.K.B., Helguera, M., Cahill, N.D., Shrier, D., Linte, C.A., 2019. Toward an affine feature-based registration method for ground glass lung nodule tracking, in: ECCOMAS Thematic Conference on Computational Vision and Medical Image Processing, Springer. pp. 247–256.
Table S1: Confusion matrix results for the 3D ResNet network.
Table S2: Classification results for the 3D ResNet network.
Figure S1: Architecture for the nodule detector.
Table S3: Performances of the lung nodules detector at different FP in average per scan.
Figure S2: FROC curve of the lung nodule detector computed for the test set.