In the field of radiotherapy (RT), nomenclature standardization is the process of imposing a unified and structured labeling system on anatomical structures [1, 2, 3]. This is a prerequisite for clinical data curation and data-driven research, especially in the era of big data and artificial intelligence [1, 4, 5, 6, 7]. However, because of differences in local policies, vendors, and language environments, structure labels are often inconsistent [8, 9]. A large number of retrospective RT datasets [10, 11] cannot be shared and reused without consistent labels, and manually cleaning RT data is very expensive and time-consuming [8, 9, 12, 13, 14]. Therefore, it is necessary to develop software tools to
automate nomenclature standardization to facilitate data-driven clinical research.
Previous works have proposed standardizing the nomenclature of anatomical structures via text-based methods that rely on label matching and clinicians’ intervention to correct mismatched labels at a single institution [8, 15, 16]. However, language constantly changes, and different naming conventions make the semantic information in labels difficult to recognize automatically. As a result, text-based methods cannot be applied to datasets collected even from a single institution, let alone to cross-institutional datasets.
The labels of organs at risk (OARs) have a one-to-one correspondence with the images (such as Computed Tomography [CT] scans and segmentation masks), and the image data contain invariant semantic information that can standardize nomenclature in multilingual environments. Methods that leverage this image information to tackle cross-institutional RT datasets are called image-based methods. Existing image-based methods try to automatically standardize nomenclature by exploiting semantic invariance in the image [17, 18, 19, 20]. Among these methods, algorithms that leverage atlas-based registration can also be used to determine the category of the structure and then relabel it [18, 19]. However, atlas-based registration is unstable and time-consuming. Other image-based methods convert the task of structure nomenclature standardization to OAR classification based on deep learning (DL) frameworks
[17, 20]. Nonetheless, these methods have largely overlooked the problems caused by imbalance and poor delineation in real RT datasets, especially for small-volume OARs with similar positions, shapes, and sizes, such as the pituitary and optic chiasm. RT datasets are imbalanced not only in the number of OARs but also in the size of each OAR. For example, in Fig. 1 (a), the volume of the brain is much larger than that of the pituitary. Models built on such datasets tend to be biased and inaccurate. Poor delineation of OARs increases the inter-class similarity and the intra-class variation. For example, in Fig. 1 (b), the pituitary and optic chiasm are very similar, but the larynx varies greatly across patients. Both imbalance and poor delineation will bias the classifier, which will lead to incorrect predictions for small-volume OARs.
FIGURE 1. Characteristics of OARs in real RT data. (a) The size of the OARs is extremely imbalanced. (b) The poor delineation in RT data: the first row indicates the similarity of small-volume OARs (inter-class similarity), and the second row shows examples of poor delineation for the same OAR (intra-class variation).
As mentioned earlier, some image-based methods treat the task of structure nomenclature standardization as an OAR classification task [17, 20]. In the field of computer vision, deep learning has led to a series of breakthroughs for classification tasks [21, 22] that have improved upon traditional classification methods [23, 24]. Related works seeking to improve the performance of DL-based classifiers have mainly focused on three aspects: 1) constructing deeper, wider, and more elaborate architectures [22, 25, 26, 27, 28, 29, 30] to increase the capacity for adapting data and training [31]; 2) enriching samples to get close to the actual distribution [32,
33, 34, 35, 36]; and 3) adding subjective constraints to make high-level features extracted within the network correspond to the domain knowledge required for specific tasks [37, 38, 39, 40]. Existing state-of-the-art networks [21, 22, 25, 26, 27, 28, 29, 30] for classification can be applied to the current task. ResNet [25] has a lower computational cost but better performance than other networks [41]. Therefore, we have made many attempts to use ResNet50 for this application, but these attempts have yielded results similar to previous reports [17, 20]: the true positive rate (TPR) of small-volume OARs (such as the pituitary and optic chiasm) cannot meet the requirements for clinical implementation. It is worth noting that clinicians can make quick and accurate decisions for small-volume OARs, even with poorly delineated samples, which means the images contain enough effective information for clinicians to apply their domain knowledge and recognition mechanisms. To date, there has been no relevant research on how clinicians make accurate decisions when classifying OARs, but we can simulate this process and, thus, incorporate the implicit domain knowledge and recognition mechanisms necessary for decision making into the target framework.
The main goal of our work is to explore ways to integrate clinicians’ domain knowledge and recognition mechanisms into a neural network to improve the classifier’s performance for categorizing small-volume OARs. To this end, we propose an automatic structure nomenclature standardization framework, 3D Non-local Network with Voting (3DNNV). This framework consists of an improved data processing strategy and an optimized feature extraction structure. The data processing strategy was proposed to provide the explicit information that clinicians use when labeling structures. The feature extraction structure simulates the observation process, which enhances the observational fineness of the region of interest (ROI) in the high-level features.
A. Improved data processing strategy
We propose a simple and effective adaptive data processing strategy: adaptive sampling and adaptive cropping (ASAC) with voting. ASAC simulates the process of clinicians observing images and collecting the information needed for decision making, and it generates multi-scale and multi-position inputs for a sample. ASAC constructs a set of augmented inputs, assists the model in mining the effective information implicit in the raw data, and extracts the domain knowledge that clinicians typically need to identify OARs. The voting strategy accounts for variations in a structure’s shape and location that may lead to poor delineation. This strategy is a weighted sum of all the predictive results of inputs for the same sample; this makes the final result closer to predicting the "ideal" semantic features. The voting strategy also agrees with the principle of clinicians making decisions based on comprehensive information.
B. Optimized feature extraction structure
The convolutional network only processes one local neighborhood at a time, and the common way to model the long-range dependency on semantic features is to increase the receptive field. To capture long-range dependency more directly and to enhance the observational fineness in the region of interest in the high-level semantic features, we added non-local blocks [42] to ResNet50 to optimize the feature extraction structure in the network (designated “NN” for Non-local Network). Non-local blocks apply a self-attention mechanism [43] to image sequence processing by calculating the similarity matrix for high-level semantic features, thereby capturing the long-range dependency and enhancing the representation of the semantic features.
By combining the ASAC/Voting strategy with the Non-local Network, we obtained the final framework, 3DNNV, which can standardize the nomenclature of structures in RT datasets. 3DNNV integrates clinicians’ domain knowledge and recognition mechanisms into the final model from a new perspective, mitigates the problems caused by imbalance and poor delineation in RT datasets, and improves the performance for identifying small-volume OARs. This framework allows us to categorize structures in cross-institutional RT data quickly and efficiently, then automatically relabel these structures with general labels recommended in AAPM TG-263 [1]. Furthermore, 3DNNV is extensible and can be easily transferred to other anatomic sites after fine-tuning on a few samples.
The rest of this paper is organized as follows: Section II introduces related works that have sought to automate nomenclature standardization of OARs in recent years. Section III describes our 3DNNV framework. Section IV shows the results of experiments evaluating 3DNNV’s performance and comparing it with other state-of-the-art methods. Section V discusses the limitations of this study and the future prospects of our work. Section VI summarizes our main findings and provides future directions.
EXAMPLES OF INCONSISTENT LABELS IN RT DATASETS. THERE ARE VERY DIFFERENT LABELS FOR THE SAME OARS, SUCH AS “NOD” AND “NERF OPT
A. Text-based methods
Text-based methods standardize structure nomenclature mainly by using structured naming templates or label mapping dictionaries. Mayo et al. [15] built software containing structured templates, which allows clinicians to relabel structures interactively. The fixed template helps to unify labels better than free-text interactive tools. Nyholm et al. [16] mapped the main structure labels in local clinical centers to the name list of the general naming convention, then manually corrected the mismatched labels through the interactive interface. The authors used the tool to aggregate RT data from 15 medical centers in Sweden. More recently, Schuler et al. [8] pointed out that, when standardizing radiotherapy data, it is difficult to distinguish between typographic name variations and fundamental semantic differences in the same structure. Therefore, they developed a tool called Stature that maps a local standard structure name (LSSN) to the AAPM TG 263 naming table by creating a lookup dictionary. The above methods map the original labels to standardized labels based on a dictionary and manual intervention. These kinds of methods can establish the mapping between the original labels and standardized labels to quickly solve the problem of inconsistent labels in the local RT dataset. However, language constantly changes, as shown in Table I, which limits these methods’ applicability to cross-institutional datasets. In addition, the text-based methods cannot handle large-scale retrospective datasets.
B. Image-based methods
Image-based methods, which are based on the invariant semantic information in medical images, are learnable automatic recognition methods that overcome the problems inherent in text-based methods. Label propagation, implemented with an atlas-based deformable image registration (DIR) algorithm, registers atlases with known labels to the input and then relabels the input with the label of the atlas whose mask overlaps it most [19]. In this way, unknown datasets can be standardized by labels in the atlas. However, the DIR’s performance is unstable [18]. This method is also highly time-consuming, so it falls well short of practical requirements.
Our previous work departed from these methods, as it converted label standardization to the task of automatically categorizing structures in RT data and modeled the process with a deep neural network, which used the weighted mask of OARs to construct a composite mask as 2D input [17]. This work demonstrated the excellent performance of deep learning networks in standardizing OAR labels, but the experiment did not make full use of the three-dimensional shape and location information on the CT. The classes of OARs in the training dataset were clean and sufficient, but the real dataset contained many other challenges, such as heavy data imbalance, inter-class similarity, and intra-class variation, that could limit the method when extending it to other anatomic sites. More recently, Rhee et al. [20] extended the number of categories to 19 OARs in the head-and-neck region and loosely utilized the encoder of V-Net [44] to construct their framework, TG263-Net. This framework leveraged 3D inputs and achieved high accuracy in identifying 19 OARs, but it did not take into account imbalance and poor delineation in RT datasets, so its performance in identifying small-volume OARs is insufficient for practical clinical needs.
A. Overview of 3DNNV
This section outlines the workflow of 3DNNV (Fig. 2). 3DNNV consists of two parts in the inference phase for standardizing structure nomenclature: 1) the ASAC/Voting strategy and 2) the Non-local Network. For any OAR in given Digital Imaging and Communications in Medicine (DICOM) data, the CT and corresponding mask are extracted to form a raw data pair. Then, ASAC generates multi-scale and multi-position inputs for each sample. During training, each input generated by ASAC is regarded as an independent sample, and the parameters of the non-local network are updated and optimized based on the samples in each mini-batch. In the inference phase, multiple inputs for a sample are fed into the network, which outputs one 256-d vector per input (Vectors in Fig. 2). Sharing weights here allows the consistent representation of multi-scale/multi-position inputs in feature space so that we can leverage the same model with the same parameters to extract high-level features for each input. The 256-d vectors vote for a final predictive result as the output of 3DNNV, and the sample is renamed with a standardized label.
B. Data
In accordance with Brouwer et al.’s suggestion [45], we selected the 28 categories of head-and-neck OARs shown in Fig. 1 (a) to train our model. We compared our model’s performance in standardizing structure nomenclature against other models by testing them on three different head-and-neck image datasets.
1) HN_PETCT
HN_PETCT [46, 47] is an open-source head-and-neck RT dataset released on The Cancer Imaging Archive (TCIA) [48] that includes data collected from four different French medical institutions comprising 298 patients. We collected 4372 samples in total for the 28 OAR categories. Then, we divided the samples into three subsets for training, validation, and testing in a ratio of 3:1:1. It should be noted that the number of samples in the dataset is extremely imbalanced. For Glnd_Lacrimal_L/R and Pituitary, only 9 samples for each were used as training data.
2) PDDCA
PDDCA [49] is an open-source RT dataset containing data from 48 patients that was released by the MICCAI 2015 Segmentation Challenge. This dataset contains only 9 categories of head-and-neck OARs (Parotid_L, Parotid_R, Glnd_Submand_L, Glnd_Submand_R, Bone_Mandible, Brainstem, OpticChiasm, OpticNrv_L, and OpticNrv_R). All contours for OARs were re-delineated by trained radiologists. We collected 408 samples in total and used all of them as a test set.
3) HN_UTSW
HN_UTSW is an RT dataset collected by our team at UT Southwestern that contains data for 408 patients. We collected a total of 5153 samples for 28 OARs (the same categories as HN_PETCT) and used all of them for testing to show our model’s generalizability.
FIGURE 2. Overview of 3DNNV in the inference phase.
C. Preprocessing
For each patient’s data in given DICOM files, 3D CT volumes and corresponding masks were extracted to form raw data, then the voxel size of the 3D volumes in the raw data was normalized. To ensure that the small-volume OARs do not lose any information through down-sampling, we chose to use the same voxel dimension ratio z:y:x = 0.77:1:1 from the training dataset HN_PETCT for all other datasets. We performed trilinear interpolation for resizing and reshaping. Because the maxima and minima of Hounsfield unit (HU) values differ between patients, the range of HU values was truncated to [-1000, 2500], then normalized to [0, 1]. We directly used a binary [0, 1] matrix to represent the mask. For some patients, OAR contours are missing in some intermediate slices; these contours were generated from the nearest available slices.
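The intensity truncation and the slice filling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, volumes are represented as plain Python lists, and the tie-breaking rule for equally distant slices (prefer the lower index) is an assumption.

```python
def normalize_hu(values, lo=-1000.0, hi=2500.0):
    """Truncate Hounsfield units to [lo, hi], then rescale linearly to [0, 1]."""
    span = hi - lo
    return [(min(max(v, lo), hi) - lo) / span for v in values]


def fill_missing_slices(slices):
    """Fill missing contours (None) with the nearest available slice.

    Only gaps between the first and last available slices are filled;
    slices outside the contoured range stay None.
    Assumption: on a tie, the slice with the lower index is used.
    """
    known = [i for i, s in enumerate(slices) if s is not None]
    if not known:
        return list(slices)
    filled = list(slices)
    for i in range(known[0], known[-1] + 1):
        if filled[i] is None:
            nearest = min(known, key=lambda k: (abs(k - i), k))
            filled[i] = slices[nearest]
    return filled


print(normalize_hu([-2000, -1000, 750, 2500, 4000]))
print(fill_missing_slices([None, "A", None, None, "B", None]))
```

With NumPy arrays, the same truncation can be written as `(np.clip(ct, lo, hi) - lo) / (hi - lo)`.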
D. 3DNNV: 3D Non-local Network with Voting
ASAC/Voting is an essential part of 3DNNV. It is worth noting that ASAC is a data processing strategy that can be applied in all stages, but the voting strategy is applied only in the inference phase.
1) ASAC: ADAPTIVE SAMPLING AND ADAPTIVE CROPPING
For each OAR, a pair of pre-processed 3D CT and mask volumes (Raw Data in Fig. 2) are cropped into smaller volumes using sliding cubes of size n × m × m (the blue and orange cubes in the ASAC part of Fig. 2) along the patient long axis, which are then used as inputs for the non-local network. Here, m and n are the sizes of the sliding cube in the axial plane and along the patient long axis, respectively. In our experiments, we use 5 different sizes for the cropping cubes: 12×128×128, 18×192×192, 24×256×256, 30×320×320, and 36×384×384. The cubes slide at a step size of . The cropped image volumes using cubes of different sizes are resized to 12×128×128 before being fed into the non-local network. ASAC is not only a way to extract clinicians’ domain knowledge but also a way to deal with the issues related to limited computational resources and imbalanced training datasets. For some oversized OARs, such as Brain, the contour in the axial slice cannot be entirely captured by small-volume cubes (such as 12×128×128). Therefore, it is necessary to adaptively resize the CT and mask first to fit them into the cubes, then perform the sampling. By performing ASAC, we gained multi-scale and multi-position inputs for each sample.
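To make the multi-scale, multi-position cropping concrete, the sketch below enumerates crop boxes for the five cube sizes listed above. It is an illustration under stated assumptions, not the authors' implementation: the step size (here half of the cube depth), the choice of the axial center (for example the mask centroid), and all function names are hypothetical.

```python
# (n, m) pairs taken from the text: depth along the long axis, axial size.
CUBE_SIZES = [(12, 128), (18, 192), (24, 256), (30, 320), (36, 384)]
TARGET = (12, 128, 128)  # every crop is resized to this shape


def slide_starts(extent, size, step):
    """Start indices of a window of length `size` sliding over [0, extent)."""
    if size >= extent:
        return [0]  # a single window; the volume must be padded or resized
    starts = list(range(0, extent - size + 1, step))
    if starts[-1] != extent - size:
        starts.append(extent - size)  # make sure the far end is covered
    return starts


def asac_windows(depth, center_yx, step_fraction=0.5):
    """Return crop boxes (z0, y0, x0, n, m) for every cube size.

    Assumptions: the step is a fraction of the cube depth n, and each cube
    is centered in the axial plane on `center_yx`. Boxes may extend past the
    volume border, in which case padding is needed before cropping.
    """
    cy, cx = center_yx
    boxes = []
    for n, m in CUBE_SIZES:
        step = max(1, int(n * step_fraction))
        for z0 in slide_starts(depth, n, step):
            boxes.append((z0, cy - m // 2, cx - m // 2, n, m))
    return boxes


def resize_factors(n, m):
    """Zoom factors that map an n x m x m crop onto the target shape."""
    return (TARGET[0] / n, TARGET[1] / m, TARGET[2] / m)


boxes = asac_windows(depth=36, center_yx=(256, 256))
print(len(boxes), boxes[0], boxes[-1])
```

Small cubes yield several local views at different positions, while the largest cube yields a single global view, which is the combination that the voting step later aggregates.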
2) VOTING
As mentioned above, ASAC generates multi-scale and multi-position inputs, which contain global and local information. The outputs of the non-local network, corresponding to different inputs, will be summed up to vote for the final recognition result. This voting strategy is used at the inference phase.
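A minimal sketch of the voting step follows. It assumes that each network output can be treated as a vector of per-class scores and that all inputs are weighted equally unless weights are supplied; neither the weights nor the function name come from the text.

```python
def vote(score_vectors, weights=None):
    """Sum the (optionally weighted) score vectors and return the winning class.

    score_vectors: one list of per-class scores for each ASAC input.
    weights: optional per-input weights (assumed uniform when omitted).
    Returns (index of the winning class, list of summed scores).
    """
    if not score_vectors:
        raise ValueError("at least one score vector is required")
    if weights is None:
        weights = [1.0] * len(score_vectors)
    if len(weights) != len(score_vectors):
        raise ValueError("one weight per score vector is required")
    n_classes = len(score_vectors[0])
    total = [0.0] * n_classes
    for w, vec in zip(weights, score_vectors):
        for k, score in enumerate(vec):
            total[k] += w * score
    winner = max(range(n_classes), key=total.__getitem__)
    return winner, total


# Three inputs for one sample, two classes: the sum favors class 1.
winner, total = vote([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
print(winner, total)
```

Because the scores are summed before the decision is made, a single poorly positioned crop is less likely to determine the final label.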
3) NON-LOCAL NETWORK
We set vanilla ResNet50 as the backbone for our 3DNNV network and also as the baseline for our performance comparisons (Table II).
Then, we added non-local blocks [42] to the backbone network to form the final 3D non-local network (Fig. 3). Inspired by the self-attention mechanism [43], Wang et al. [42] proposed the non-local block to capture the global dependence on semantic features. It was designed to handle sequential data, so we stacked it into our framework. In this work, we are committed to enhancing the position information’s dependence on the CT image and the shape information’s dependence on the mask image, so the pairwise function may be implemented using the concatenated form. The non-local block used in our network is defined as follows:
x and z are set as the input and output, respectively, of the non-local block. Both are of the same size: B × C × D × H × W. B denotes the batch size of the input, and C represents the number of channels. D, H, and W are depth, height, and width, respectively. Here, i is the index of an output position whose response is to be computed, j is the index of all possible positions, and y is an intermediate output with the same size as x. The embedding functions are all 1 × 1 × 1 convolution layers. Operator [.,.] indicates the concatenation operation, and a mapping matrix converts the concatenated vector to the scalar output. The residual connection indicates identity mapping: the input is added to the transformed y to get the final output z of the non-local block. C(x) is a regularization term.
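The block described above follows the concatenation form of the non-local operation in [42]. As a sketch, using the notation of that paper (the symbol names θ, φ, g, w_f, and W_z, and the choice C(x) = N, are assumptions taken from [42] rather than from this text), it can be written as:

```latex
\begin{align}
f(x_i, x_j) &= \mathrm{ReLU}\!\left( w_f^{\top} \left[ \theta(x_i),\, \phi(x_j) \right] \right) \\
y_i &= \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \\
z_i &= W_z\, y_i + x_i \\
C(x) &= N
\end{align}
```

Here θ, φ, g, and W_z stand for the 1 × 1 × 1 convolutions, w_f projects the concatenated vector to a scalar, and N is the number of positions in x (D · H · W for one sample).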
FIGURE 3. Non-local Network. To construct the Non-local Network in our work, we stacked one Non-local block at the end of each Bottleneck block in Res2 and Res3. Details for the Bottleneck block and Non-local block are shown in the figure. The self-attention mechanism [43] is applied in the Non-local block to capture the long-range dependency within semantic features. Each input to the network is 2-channel 3D data (12×128×128 CT and 12×128×128 mask), and the corresponding output is a 256-d vector.
A. Experimental setting
1) TRAINING DETAILS
Using the training data outlined in Section III.B and the preprocessing outlined in Section III.C above, we trained our deep learning models as described below. To account for imbalance in the number of images for each OAR in the dataset, we applied a non-uniform sampling method: under-represented OARs were sampled more often, in inverse proportion to their frequency. We augmented the training data by performing affine transformations, including random translation, rotation, shearing, and scaling. Finally, the central cube of the sample was cropped as input data. The final input data size was 2 × 12 × 96 × 96, which is two-channel 3D data that includes the 3D CT volume and the corresponding mask on the same slices. All architectures used in this work were initialized as described by He et al. [51]. The Adam optimization algorithm [52] was applied to optimize the networks with an initial learning rate of 1e-4, and cross-entropy was set as the loss function. The batch size was set to 16. For samples generated by ASAC, we set the total number of epochs to 20, and the learning rate dropped by a factor of 10 after 2, 5, and 10 epochs. For other architectures without ASAC, we set the total number of epochs to 200, and the learning rate decreased by a factor of 10 after 10, 20, and 30 epochs. 3DNNV was implemented in the PyTorch 1.0 [50] framework and trained on a single NVIDIA Tesla K80 GPU.
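The non-uniform sampling and the step-wise learning rate schedule can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the weight of a sample is assumed to be exactly the reciprocal of its class count, and epochs are assumed to be counted from zero.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_epoch(labels, k, seed=0):
    """Draw k sample indices with replacement, favoring rare classes."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=k)


def step_lr(epoch, base_lr=1e-4, milestones=(2, 5, 10), gamma=0.1):
    """Learning rate after dropping by `gamma` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


labels = ["Brain"] * 3 + ["Pituitary"]
print(inverse_frequency_weights(labels))
print([step_lr(e) for e in (0, 2, 5, 10)])
```

With these weights, every class has the same total probability of being drawn regardless of its size. In PyTorch, `torch.utils.data.WeightedRandomSampler` and `torch.optim.lr_scheduler.MultiStepLR` offer comparable functionality; passing `milestones=(10, 20, 30)` to `step_lr` gives the schedule described for the architectures without ASAC.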
2) EVALUATION
For this multi-class classification task, we used true positive rate (TPR), F1 score, and area under the receiver operating characteristic curve (AUC) to evaluate the performance of our models. These metrics are defined as follows:
Multi-class classification can be considered as multiple binary classifications, so true positive (TP), false negative (FN), and false positive (FP) values can be calculated for each category separately. F1 score is the harmonic mean of the positive predictive value (PPV) and TPR. In (7), the ranking term refers to the i-th positive sample when samples are sorted by predicted probability. AUC indicates how well the model distinguishes between different classes. AUC is not sensitive to imbalance in the test samples.
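The one-vs-rest computation of these metrics can be sketched as below. The standard definitions TPR = TP / (TP + FN), PPV = TP / (TP + FP), and F1 = 2 · PPV · TPR / (PPV + TPR) are used; the rank-based form of AUC is an assumption about the formula referred to in the text, and the function names are hypothetical.

```python
def per_class_counts(y_true, y_pred, cls):
    """TP, FN, and FP for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    return tp, fn, fp


def tpr_ppv_f1(tp, fn, fp):
    """TPR, PPV, and their harmonic mean (0.0 when undefined)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return tpr, ppv, f1


def rank_auc(scores, is_positive):
    """AUC = (sum of ranks of positives - M(M+1)/2) / (M * N).

    Ranks are 1-based in ascending score order; tied scores share the
    average rank. M and N are the numbers of positives and negatives.
    """
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    m = sum(bool(p) for p in is_positive)
    n = len(scores) - m
    if m == 0 or n == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    positive_rank_sum = sum(r for r, p in zip(ranks, is_positive) if p)
    return (positive_rank_sum - m * (m + 1) / 2) / (m * n)


y_true = ["Pituitary", "Pituitary", "OpticChiasm", "OpticChiasm"]
y_pred = ["OpticChiasm"] * 4
print(tpr_ppv_f1(*per_class_counts(y_true, y_pred, "Pituitary")))
print(rank_auc([0.1, 0.4, 0.35, 0.8], [False, False, True, True]))
```

The toy labels show how a class whose samples are all assigned to a similar class ends up with an F1 score of zero.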
B. Comparisons among ResNet models
We developed the 3DNNV model in a step-wise manner. First, we set vanilla 3D ResNet50 as the backbone network; then, we optimized the architecture; and finally, we integrated domain knowledge into the network. We evaluated and compared the performance of the models obtained at each step.
Our first goal was to determine an initial preprocessing strategy for the raw data. Beginning with the baseline network, we tested three different strategies: taking global samples without voxel size normalization (GS), global samples with voxel size normalization (VN-GS), and local samples with voxel size normalization (VN-LS) as inputs. Samples collected at the scale of 36×384×384 were marked as global samples (GS), and samples collected at the scale of 12×128×128 were marked as local samples (LS). “VN” means voxel normalization. Accordingly, we designated the architectures as Baseline (GS), Baseline (VN-GS), and Baseline (VN-LS). We found that, for the error-prone small-volume OARs in the head-and-neck region, detailed information contained in the local sample plays an important role in recognition (see Fig. 4 and Table II), so incorporating local details benefits models in classifying small-volume OARs. However, the model trained only on local samples, Baseline (VN-LS), could not distinguish between BrachialPlex_L (BP_L) and BrachialPlex_R (BP_R) (Fig. 5). Without the global location information, the model failed to indicate on which side the OAR should be.
FIGURE 4. Average F1 Scores (%) for ResNet-based models’ classification of small-volume OARs in HN_UTSW.
To enhance the representation of small-volume OARs in high-level feature space, we added non-local blocks to the backbone networks with voxel normalization and compared the performance of the non-local network (NN) with the baselines. We designated the non-local network architectures as NN (VN-GS) and NN (VN-LS). The NN architectures performed slightly better than the baselines over all categories, especially for Pituitary and OpticChiasm (Table II). However, their performance on other small-volume OARs was barely satisfactory. Like the Baseline (VN-LS) architecture, the non-local network architecture trained on local samples, NN (VN-LS), could not distinguish between BrachialPlex_L (BP_L) and BrachialPlex_R (BP_R).
Finally, we applied the ASAC/Voting strategy to generate multiple inputs for a sample and combine the information through voting. We constructed and trained the 3DNNV network on the samples generated by ASAC. In the inference phase, all output vectors for the same sample voted for the final predictive result. We found that 3DNNV performed well in identifying small-volume OARs, even those similar in shape, size and location, such as the pituitary and optic chiasm.
When we compared the performance of the six ResNet-based models in classifying the 28 OARs across all three institutional test sets, we found that 3DNNV was superior to the baseline methods for classifying OARs and had good generalizability across different institutional datasets (Table III).
FIGURE 5. Confusion matrix of Baseline (VN-LS) on BrachialPlex_L and BrachialPlex_R.
C. Comparisons with previous works
To further test 3DNNV’s ability to standardize structure nomenclature, we compared its performance with that of other image-based methods: specifically, atlas-based registration and several deep learning-based methods.
1) ATLAS-BASED REGISTRATION
Atlas-based registration can standardize structure nomenclature by matching OARs with an atlas in the database and renaming the input with the atlas label that has the largest overlap mask. To test atlas-based registration for this application, we first constructed a 2D single-atlas database for the 28 OARs, in which each sample contained a CT slice and a mask for the OAR to be identified in the same slice. Second, for each pair of CT and mask of the OAR to be identified (noted as fixed CT and fixed mask), the moving CT in each atlas was registered to the fixed CT, and the transformation (warping parameters) was learned. The transformation was then applied to the moving mask of the atlas, and the overlap between the deformed moving mask and the fixed mask was measured with the Dice Similarity Coefficient (DSC). The DSC is given by the following formula, with X and Y denoting the fixed mask in the given data and the moving mask in the atlas database, respectively.
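The DSC is conventionally defined as 2|X ∩ Y| / (|X| + |Y|). The sketch below applies it to the relabeling step, assuming that registration has already produced one deformed moving mask per atlas label. Masks are represented as sets of pixel coordinates, and the function names are hypothetical.

```python
def dice(x, y):
    """Dice Similarity Coefficient of two masks given as coordinate sets."""
    if not x and not y:
        return 0.0  # convention chosen here: two empty masks do not match
    return 2 * len(x & y) / (len(x) + len(y))


def best_atlas_label(fixed_mask, warped_atlas_masks):
    """Return the atlas label whose deformed mask overlaps the input most.

    warped_atlas_masks: dict mapping an atlas label to its moving mask
    after the learned transformation has been applied.
    """
    return max(
        warped_atlas_masks,
        key=lambda label: dice(fixed_mask, warped_atlas_masks[label]),
    )


fixed = {(0, 0), (0, 1), (1, 0), (1, 1)}
atlas = {
    "Brainstem": {(0, 0), (0, 1), (1, 0)},
    "Parotid_L": {(1, 1), (2, 2)},
}
print(best_atlas_label(fixed, atlas), dice(fixed, atlas["Brainstem"]))
```

Every atlas entry requires its own registration before this comparison can run, which is where most of the running time of the method is spent.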
For the experiment comparing atlas-based registration with 3DNNV, every structure processed by 3DNNV was first run through an early-match module to avoid processing
standardized structures repeatedly. The early-match module performed string matching between the original label and the standardized label: if and only if the original label fully matched one of the standardized labels in the dictionary, then the original label was treated as an already standardized label. Otherwise, the structure was fed into 3DNNV to obtain the prediction result. To limit the running time, we tested both methods on data from two randomly selected patients in the HN_UTSW dataset (Table IV).
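The early-match rule amounts to an exact dictionary lookup placed in front of the classifier. A minimal sketch follows, in which the label set is a small illustrative subset, the comparison is assumed to be case-sensitive, and `classify` is a hypothetical stand-in for the trained 3DNNV model.

```python
# Illustrative subset of standardized labels; the real dictionary is larger.
STANDARD_LABELS = {"Parotid_L", "Parotid_R", "Brainstem", "OpticChiasm", "Pituitary"}


def standardize(label, structure, classify, standard_labels=STANDARD_LABELS):
    """Return a standardized label for one structure.

    The original label is kept only when it fully matches a standardized
    label; otherwise the structure is passed to the classifier.
    """
    if label in standard_labels:
        return label
    return classify(structure)


def fake_classifier(structure):
    """Hypothetical stand-in that always predicts the same label."""
    return "OpticChiasm"


print(standardize("Brainstem", None, fake_classifier))
print(standardize("chiasma", None, fake_classifier))
```

Partial or fuzzy matches are deliberately not accepted in this sketch, so a near-miss such as a differently cased label is still sent to the classifier.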
The results show that the atlas-based registration algorithm is very time-consuming and unstable on different patient datasets, and its running time is almost 30 times longer than 3DNNV’s, which makes atlas-based registration unacceptable for this application. The registration effect of atlas-based deformable image registration often depends on the atlas dataset, the deformation model, and the objective function. However, it is difficult to construct an optimal single-atlas database. Multi-atlas datasets could be applied to make up for this deficiency, but this would be even more time-consuming.
2) DL-BASED METHODS
We also applied and analyzed other DL-based methods for structure nomenclature standardization and compared their performance with 3DNNV. Taking into account the massive impact of different sampling strategies and networks on performance, we set several different architectures for the experiment. For the various inputs used in this section, 1c2d is a 1-channel composite mask [17], 2c2d is a 2-channel input combining 2D CT and the corresponding mask, and 2c3d is a 2-channel 3D CT and mask [20]. For the different networks, we trained and tested 5-layer CNN [17], vanilla 2D ResNet50 [25], and TG263-Net [20] on the same datasets and compared their performances with 3DNNV. To fairly compare different methods with different inputs, we set 4 architectures—5-layer CNN (1c2d), ResNet50 (1c2d), 5-layer CNN (2c2d) and ResNet50 (2c2d)—to determine the
best combination of network and inputs. We found that the ResNet50-based models performed far better than the 5-layer CNN models (Fig. 6, Table V, and Table VI), even though ResNet50 has fewer parameters and a lower computational cost. The 2-channel inputs include more information, which generally improves the overall performance across the 28 categories (Table VI). Of note, Pituitary got an F1 score of 0.0% (Table V) because of the extremely imbalanced training samples: not only were there many more samples for OpticChiasm than for Pituitary, but the two OARs are also quite similar and easily confused. As a result, all test samples for Pituitary were predicted as OpticChiasm.
FIGURE 6. Average F1 Scores (%) for deep learning-based models’ classification of small-volume OARs in HN_UTSW.
Based on the results of the above experiments, we set three more architectures—TG263-Net (2c3d), NN (2c3d), and 3DNNV—to determine the optimal sampling strategy and structure for the framework. For TG263-Net [20], we loosely used the encoder in V-Net [44] to construct the classifier. Then, we normalized the voxel size of CT and mask volumes in raw data to 2 mm : 2 mm : 2 mm, and we cropped the central 64 × 64 × 64 cubes (on CT and mask) to construct the 2-channel input. We randomly translated the center-of-mass by 10 mm to gain 9 inputs for each sample. In the inference phase, the 9 vectors extracted from 9 inputs vote for a final prediction result. This sampling strategy is similar to 3DNNV’s, so we applied the TG263-Net’s sampling strategy (along with the voting strategy) to our Non-local Network
(NN) and designated the architecture as NN (2c3d). When we compared these two architectures, NN (2c3d) performed slightly worse than TG263-Net (2c3d). Nevertheless, after replacing the NN (2c3d)’s sampling and voting strategy with ASAC/Voting, we arrived at the framework of 3DNNV, which includes the improved data processing strategy and the optimized feature extraction structure. The average TPR, F1, and AUC for 3DNNV and the other DL-based methods over all categories on all test datasets are shown in Table VI. Although TG263-Net performed slightly better than NN, it required a longer running time. Most importantly, 3DNNV outperformed the other DL-based methods and had better generalizability across institutional datasets.
D. 3DNNV’s extensibility
To demonstrate the 3DNNV’s extensibility, we fine-tuned the model on other anatomical sites.
Data from 8 lung region patients and 5 prostate region patients were selected to fine-tune the model: we used the parameters of 3DNNV pre-trained on the 28 head-and-neck OAR data for initialization, then we froze all parameters
except those on the fourth residual block (Res4 shown in Fig. 3) and the fully-connected layer, and we set the learning rate as 1e-5 for the trainable layers. Next, we tested the fine-tuned model on data from 29 lung region patients and 28 prostate region patients. Other training settings were the same as for 3DNNV.
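The freezing step can be sketched with a small helper that works on any object exposing `named_parameters()`, as PyTorch modules do. The parameter name prefixes `res4.` and `fc.` are hypothetical; the actual names depend on how the model is defined.

```python
def freeze_except(model, trainable_prefixes=("res4.", "fc.")):
    """Freeze every parameter whose name lacks one of the given prefixes.

    `model` only needs a `named_parameters()` method that yields
    (name, parameter) pairs, where each parameter has a `requires_grad`
    attribute. Returns the names of the parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a PyTorch model, the remaining trainable parameters could then be passed to the optimizer, for example `torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-5)`.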
The experimental results are shown in Table VII. The model only needed 20 epochs to transfer to recognizing OARs in other anatomical regions with a small amount of data, and it obtained a good recognition accuracy. This means that, with very little data and a short amount of time, we can easily transfer the pre-trained model to the target anatomical sites to meet the needs of the new application.
A. Effectiveness
As mentioned before, 3DNNV yields better performance at identifying small-volume and error-prone OARs than all other deep learning-based models we investigated. To some degree, the sampling/voting strategies applied in TG263-Net and our framework are similar: both generate many inputs for a sample and vote for a final result at the inference phase. Here, we try to explain why 3DNNV works for error-prone OARs, and compare it with TG263-Net. The 256-d vectors (outputs of the network) for error-prone OARs are visualized in Fig. 7. The head-and-neck region contains several small-volume OARs, the data for which are often poorly delineated and imbalanced; some of these small-volume OARs are similar in location and shape, such as Pituitary and OpticChiasm. Fig. 7 (a) and Fig. 7 (b) show the predictive results of TG263-Net on small-volume and error-prone OARs; most of these OARs are hard to identify without the voting strategy. Even after applying the voting strategy, OARs with similar shapes, locations, or sizes and imbalanced training samples, such as Pituitary and OpticChiasm, still tend to be confused. 3DNNV’s improved data processing strategy and optimized feature extraction structure solved this problem, as shown in Fig. 7 (c) and Fig. 7 (d): the clear boundaries between different categories allow easier and more reliable classification.
(a) TG263-Net (without voting) (b) TG263-Net (with voting)
(c) 3DNNV (without voting) (d) 3DNNV (with voting)
FIGURE 7. Visualization of the predictive results on the test dataset (HN_UTSW). To show the performance of 3DNNV, we compared it with TG263-Net [20]. For each category of small-volume OARs shown in the top-right legend, 9 samples were selected from dataset HN_UTSW and fed into networks to extract high-level features (256-d vectors). Then, we reduced the dimensionality of the high-level features by using Principal Component Analysis (PCA) [53], and the result is illustrated in the figure. We highlighted the results of classifying Pituitary and OpticChiasm. (a) and (b) are the results of TG263-Net, which still confused Pituitary with OpticChiasm. (c) and (d) show the clear boundaries between different small-volume OARs when using the ASAC/Voting strategy.
B. Statistical significance of the performance improvement
To illustrate the statistical significance of 3DNNV's performance improvement over the Baseline (GS) model, we performed a one-way analysis of variance (ANOVA) test on the results of the Baseline (GS) and 3DNNV models over all 28 categories of OARs in the head-and-neck datasets. The mean difference denotes the difference between the average TPR/F1/AUC values over six sets of models tested on the datasets (Table VIII). Positive numbers in the mean difference indicate that 3DNNV performed better than Baseline (GS), and negative numbers indicate that 3DNNV performed worse. We set the p-value to 1.0 for samples whose variances were identical between the Baseline (GS) and 3DNNV. We found that 3DNNV significantly outperformed Baseline (GS) (p-value < 0.05) in identifying small-volume and error-prone OARs, especially in HN_UTSW (Table VIII).
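For readers unfamiliar with the test, the one-way ANOVA F statistic and the mean difference can be computed as in the sketch below. The scores are toy numbers for illustration only, not results from this study; the function name is hypothetical, and returning `None` when the within-group variation is zero is an assumption of this sketch. In practice, a library routine such as `scipy.stats.f_oneway` would return both the F statistic and the p-value.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic, or None when it is undefined."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return None  # no within-group variation, F cannot be computed
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy scores (illustrative only): one value per trained model.
baseline_scores = [1.0, 2.0, 3.0]
model_scores = [2.0, 3.0, 4.0]

f_stat = one_way_anova_f(baseline_scores, model_scores)
mean_difference = mean(model_scores) - mean(baseline_scores)
```

A positive `mean_difference` corresponds to the convention above, in which positive values favor 3DNNV.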
ONE-WAY ANALYSIS OF VARIANCE (ANOVA). TO ILLUSTRATE THE IMPROVEMENT DIRECTLY, WE COMPARE THE RESULTS OF 3DNNV WITH BASELINE (GS) IN TERMS OF TPR, F1, AND AUC. BOLDFACE INDICATES STATISTICALLY SIGNIFICANT IMPROVEMENT (THRESHOLD P-VALUE < 0.05, THE MEAN DIFFERENCE >
C. Limitations
1) RUNNING TIME
To reduce the running time and improve the performance of 3DNNV, we added an early-match module (Fig. 8) to the framework and maintained a locally standardized label dictionary. The early-match module performs string matching [54] between the original label and the standardized label: if and only if the original label fully matches one of the standardized labels in the dictionary, the standardized label is used to rename the given structure. This reduces the number of structures to be processed by 3DNNV and allows the framework to process unknown structures not included in the training dataset.
FIGURE 8. Diagram of the early-match module.
Originally, 3DNNV was used to process one patient's data, which contained 38 structures. A running time of 7 min 41.83 s was required to obtain all the recognition results, which is too long to be acceptable for further applications. To solve this problem, we added the early-match module before feeding the input into 3DNNV. This module relies on a pre-stored dictionary as the basis for string matching. After adding the early-match module to the framework, 3DNNV only needed to process 17 of the 38 structures for this patient, so the total running time dropped to 3 min 36.05 s. Timely updates and maintenance of the dictionary will help to optimize the automatic identification process and avoid reprocessing labels that have already been standardized. However, the limitation is that the dictionary can only handle one-to-one mapping. When given RT data collected from a multi-language environment, the dictionary mapping method will not significantly reduce the running time of standardization. This is why a single-dictionary mapping method cannot handle cross-institutional data.
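The early-match step can be sketched as an exact dictionary lookup that runs before the classifier. This is a minimal illustration under stated assumptions: the function name, the dictionary contents, and the example label "Lt Parotid" are hypothetical, and the dictionary is modeled as a one-to-one map from a label to its standardized name.

```python
def early_match(labels, label_dictionary):
    """Split labels into those resolved by exact lookup and those left over.

    `label_dictionary` maps a known label to its standardized name
    (one-to-one). Only full, exact matches are accepted.
    """
    matched, unmatched = {}, []
    for label in labels:
        if label in label_dictionary:
            matched[label] = label_dictionary[label]
        else:
            unmatched.append(label)  # must still go through the classifier
    return matched, unmatched


# Hypothetical local dictionary of already standardized labels.
dictionary = {"SpinalCord": "SpinalCord", "Pituitary": "Pituitary"}

matched, unmatched = early_match(
    ["SpinalCord", "Lt Parotid", "Pituitary"], dictionary
)
```

Only the entries in `unmatched` would be passed to the image-based model, which is how the module shortens the running time for patients whose labels are mostly standardized already.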
2) MULTIPLE LABELS FOR THE SAME STRUCTURE
The original 3DNNV model was trained and tested on only 28 OARs in head-and-neck datasets, which limits the model’s recognition range to these 28 categories. To make the model generalizable to more structures, we tried to extend it to other anatomical sites, and it worked well. However, like Schuler et al. [8], we found that the model cannot distinguish typographic name variations from fundamental semantic differences in the same structure. In this work, we mainly discuss standardizing OAR labels, but in practice, the structures in individual RT data will be labeled differently for different treatment purposes. For example, the same structures might be labeled CavityOral_avoid or CavityOral, SpinalCord or SpinalCord_5mm, IL_Parotid, CL_Parotid, Parotid_L, or Parotid_R, depending on the specific application for which the labels are being used. These inputs have similar semantic features in images, so it is very difficult to distinguish these structures based on image information alone. In addition, some non-target structures will have multi-level labels for a single OAR (such as Musc_Constrict_M, Musc_Constrict_S, Musc_Constrict_I, and Musc_Constrict, or OpticChiasm_aaa and OpticChiasm_bbb, where aaa is the resident’s name and bbb is the actual attending physician’s name), depending on different RT plans and local policies. These labeling conventions may vary across medical institutions and treatment plans. The standardization of target volumes also warrants attention. The target volume often overlaps with OARs and could be misidentified as an OAR. Additional information can be used to help identify target volumes, such as positron emission tomography (PET), which is widely used in clinical practice and can accurately define the biological target volume (BTV). Using the BTV and the gross tumor volume (GTV) will improve the accuracy of identifying the clinical target volume (CTV) [55, 56]. Adding text information may also help us to improve the performance of 3DNNV and meet the requirements of clinical applications.
3) OUTLIERS
In previous experiments, we found that the masks collected from different clinical centers may have inconsistent contours. These inconsistencies result from differences in physician experience and in how the local institution defines delineation for OARs. Moreover, there are outliers in many datasets: some lack masks in some slices; in other cases, the label does not always match the contour in the mask because of inaccurate delineation or partial depiction. We believe that detecting delineation outliers also presents a challenge to standardizing nomenclature for RT data.
In this paper, we propose a novel framework, 3DNNV, that combines an ASAC/Voting strategy and a non-local network to integrate clinicians’ domain knowledge and recognition mechanisms into our deep learning architecture. To the best of our knowledge, our work is the first to propose an architecture that integrates domain knowledge to solve the recognition problems caused by imbalance and poor delineation. Our model had a significantly higher average true positive rate than the baseline model across three test datasets (+ 8.27%, + 2.39%, and + 5.53%). More importantly, our model outperformed the baseline in terms of the F1 Score of the
Pituitary (28.63% to 91.17%) with only 9 training samples, when tested on the HN_UTSW dataset.
We visualized the vectors of our predictive results to evaluate the effectiveness of 3DNNV. One-way ANOVA tests showed the statistical significance of 3DNNV’s performance improvement over Baseline (GS). Finally, we discussed limitations of the model that could impede application, and we suggested future work for automatically standardizing anatomical structure nomenclature in radiotherapy.
Our findings in this work will advance efforts to automate the standardization of organ labels in DICOM RT data, which will facilitate and improve data-driven research.
We would like to thank Dr. Jonathan Feinberg for editing the manuscript.
[1] C. S. Mayo, et al. “American association of physicists in medicine task group 263: Standardizing nomenclatures in radiation oncology,” International Journal of Radiation Oncology*Biology*Physics, vol. 100, no. 4, pp. 1057-1066, Mar. 2018, DOI: 10.1016/J.IJROBP.2017.12.013, [Online].
[2] D. Hultstrom, “Standards for cancer registries volume II: Data standards and data dictionary,” North American Association of Central Cancer Registries, 7th ed., version 10, Mar. 2002, [Online]. Available: https://www.naaccr.org/wp-content/uploads/2016/12/NAACCR-Volume-II-REVISED-5-14-02.pdf
[3] L. Santanam, et al. “Standardizing naming conventions in radiation oncology,” International Journal of Radiation Oncology*Biology*Physics, vol. 83, no. 4, pp. 1344-1349, Jul. 2012, DOI: 10.1016/J.IJROBP.2011.09.054, [Online].
[4] S. P. Robertson, et al. “A data-mining framework for large scale analysis of dose-outcome relationships in a database of irradiated head and neck cancer patients,” Medical Physics, vol. 42, no. 7, pp. 4329-4337, Jun. 2015, DOI: 10.1118/1.4922686, [Online].
[5] J. O. Deasy, et al. “Improving normal tissue complication probability models: The need to adopt a ‘data-pooling’ culture,” International Journal of Radiation Oncology*Biology*Physics, vol. 76, no. 3, pp. S151-S154, Mar. 2010, DOI: 10.1016/J.IJROBP.2009.06.094, [Online].
[6] R. C. Chen, et al. “How will big data impact clinical decision making and precision medicine in radiation therapy?” International Journal of Radiation Oncology • Biology • Physics, vol. 95, no. 3, pp. 880 – 884, Jul. 2016, DOI: 10.1016/J.IJROBP.2015.10.052, [Online].
[7] T. Skripcak, et al. “Creating a data exchange strategy for radiotherapy research: Towards federated databases and anonymised public datasets,” Radiotherapy and Oncology, vol. 113, no. 3, pp. 303-309, Dec. 2014, DOI: 10.1016/J.RADONC.2014.10.001, [Online].
[8] T. Schuler, et al. “Big data readiness in radiation oncology: An efficient approach for relabeling radiation therapy structures with their tg-263 standard name in realworld data sets,” Advances in Radiation Oncology, vol. 4, no. 1, pp. 191-200, 2019, DOI: 10.1016/J.ADRO.2018.09.013, [Online].
[9] E. Roelofs, et al. “International data-sharing for radiotherapy research: An open-source based infrastructure for multicentric clinical data mining,” Radiotherapy and Oncology, vol. 110, no. 2, pp. 370-374, Feb. 2014, DOI: 10.1016/J.RADONC.2013.11.001, [Online].
[10] L. Potters, E. Ford, S. Evans, T. Pawlicki, and S. Mutic, “A systems approach using big data to improve safety and quality,” International Journal of Radiation Oncology • Biology • Physics, vol. 95, no. 3, pp. 885-889, Jul. 2016, DOI: 10.1016/J.IJROBP.2015.10.024, [Online].
[11] C. S. Mayo, et al. “The big data effort in radiation oncology: Data mining or data farming?” Advances in Radiation Oncology, vol. 1, no. 4, pp. 260-271, 2016, DOI: 10.1016/J.ADRO.2016.10.001, [Online]
[12] W. Zhu, et al. “AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy,” Medical physics, vol. 46, no. 2, pp. 576-589, Nov. 2018, DOI: 10.1002/MP.13300, [Online].
[13] H. Wickham, “Tidy data,” Journal of Statistical Software, vol. 59, no. 10, pp. 1-23, Aug. 2014. [Online], Available: https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
[14] T. Dasu and T. Johnson, “Data Quality,” in Exploratory Data Mining and Data Cleaning, New York, NY, USA: John Wiley & Sons, Inc., 2003
[15] C. S. Mayo, et al. “Establishment of practice standards in nomenclature and prescription to enable construction of software and databases for knowledge-based practice review,” Practical Radiation Oncology, vol. 6, no. 4, pp. e117-e126, 2016. DOI: 10.1016/J.PRRO.2015.11.001, [Online].
[16] T. Nyholm, et al. “A national approach for automated collection of standardized and population-based radiation therapy data in Sweden,” Radiotherapy and Oncology, vol. 119, no. 2, pp. 344-350, 2016, DOI: 10.1016/J.RADONC.2016.04.007, [Online].
[17] T. Rozario, T. Long, M. Chen, W. Lu, and S. Jiang, “Towards automated patient data cleaning using deep learning: A feasibility study on the standardization of organ labeling,” arXiv.org, Dec. 2017, [Online], Available: https://arxiv.org/abs/1801.00096
[18] A. Sotiras, et al., “Deformable medical image registration: A survey,” IEEE Transactions on Medical Imaging, vol. 32, no. 7, pp. 1153-1190, 2013, DOI:10.1109/TMI.2013.2265603, [Online].
[19] H. Duc, “Atlas-Based Methods in Radiotherapy Treatment of Head and Neck Cancer” Doctoral thesis, University College London, London, UK, 2013.
[20] D. Rhee, et al., “TG263-Net: A Deep Learning Model for Organs-At-Risk Nomenclature Standardization” Presented as e-poster at the AAPM 61st Annual Meeting. San Antonio, TX, USA, Jul. 14-18, 2019
[21] Y. LeCun, et al., “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. DOI: 10.1109/5.726791, [Online].
[22] A. Krizhevsky, et al., “ImageNet Classification with Deep Convolutional Neural Networks” in NIPS, Lake Tahoe, Nevada, USA, 2012, pp. 1097-1105.
[23] J. Sivic, et al., “Discovering object categories in image collections,” in ICCV, Beijing, China, 2005, pp. 2254-2261.
[24] B. E. Boser, et al., “A training algorithm for optimal margin classifiers,” in Fifth Annual Workshop on COLT, Pittsburgh, Pennsylvania, USA, 1992, pp. 144-152.
[25] K. He, et al. “Deep residual learning for image recognition,” in CVPR, Las Vegas, Nevada, USA, 2016, pp. 770–778.
[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” In ICLR, San Diego, CA, USA, 2015.
[27] S. Xie, et al., “Aggregated Residual Transformations for Deep Neural Networks” in CVPR, Honolulu, Hawaii, USA, 2017, pp. 1492-1500.
[28] S. Gao, et al., “Res2Net: A New Multi-Scale Backbone Architecture” in IEEE TPAMI, 2020, DOI: 10.1109/TPAMI.2019.2938758, [Online].
[29] G. Huang, et al., “Densely Connected Convolutional Networks” in CVPR, Honolulu, Hawaii, USA, 2017, pp. 4700-4708.
[30] Q. Dou, et al., “Multilevel Contextual 3-D CNNs for False Positive Reduction in Pulmonary Nodule Detection” IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1558-1567, Jul. 2017, DOI: 10.1109/TBME.2016.2613502, [Online].
[31] R. Novak, et al., “Sensitivity and generalization in neural networks: an empirical study,” In ICLR, Vancouver, BC, Canada, 2018.
[32] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi. “BAGAN: Data augmentation with balancing GAN,” arXiv preprint arXiv:1803.09655, 2018.
[33] Ekin D. Cubuk, et al., “AutoAugment: Learning Augmentation Policies from Data” in CVPR, Long Beach, CA, USA, 2019, pp. 113-123.
[34] C. Han et al., "Combining Noise-to-Image and Image-to-Image GANs: Brain MR Image Augmentation for Tumor Detection" IEEE Access, vol. 7, pp. 156966-156977, 2019. DOI: 10.1109/ACCESS.2019.2947606
[35] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D.A. Dickie, M.V. Hernández, J. Wardlaw, and D. Rueckert. “GAN augmentation: augmenting training data using generative adversarial networks,” arXiv.org, Oct. 2018, [Online]. Available: https://arxiv.org/abs/1810.10863
[36] V. Sandfort, K. Yan, P.J. Pickhardt, and R.M. Summers. “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks,” Scientific Reports, vol. 9, pp. 16884, Nov. 2019. DOI: 10.1038/s41598-019-52737-x.
[37] R. Geirhos, et al., “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,” in ICLR, New Orleans, Louisiana, USA, 2019.
[38] Tiange Luo, et al., “Few-Shot Learning with Global Class Representations” in ICCV, Seoul, Korea, 2019.
[39] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal Loss for Dense Object Detection,” in ICCV, Venice, Italy, 2017, pp. 2980-2988
[40] Y. Chen, Y. Bai, W. Zhang, and T. Mei. “Destruction and Construction Learning for Fine-Grained Image Recognition,” in CVPR, Long Beach, California, USA, 2019, pp. 5157-5166.
[41] S. Bianco, et al., “Benchmark Analysis of Representative Deep Neural Network Architectures” IEEE Access, vol. 6, pp. 64270-64277, Oct. 2018, DOI: 10.1109/ACCESS.2018.2877890, [Online].
[42] X. Wang, et al., “Non-local neural networks,” in CVPR, Salt Lake City, Utah, USA, 2018, pp. 7794–7803.
[43] A. Vaswani, et al., “Attention is all you need,” in NeurIPS, Long Beach, CA, USA, 2017, pp. 5998-6008.
[44] F. Milletari, N. Navab, and S. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 2016, pp. 565-571, DOI: 10.1109/3DV.2016.79.
[45] C. L. Brouwer, et al., “CT-based delineation of organs at risk in the head and neck region: DAHANCA, EORTC, GORTEC, HKNPCSG, NCIC CTG, NCRI, NRG Oncology and TROG consensus guidelines,” Radiotherapy and Oncology, vol. 117, no. 1, pp. 83–90, 2015, DOI: 10.1016/J.RADONC.2015.07.041, [Online].
[46] M. Vallières, et al. “Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer,” Scientific Reports, no. 7:10117, 2017, DOI: 10.1038/S41598-017-10371-5, [Online].
[47] M. Vallières, et al. “Data from Head-Neck-PET-CT,” the Cancer Imaging Archive, 2017, DOI: 10.7937/K9/TCIA.2017.8OJE5Q00, [Online], Available: https://wiki.cancerimagingarchive.net/display/Public/Head-Neck-PET-CT
[48] K. Clark, et al., “The cancer imaging archive (TCIA): Maintaining and operating a public information repository,” Journal of Digital Imaging, vol. 26, no. 6, pp. 1045-1057, 2013, DOI: 10.1007/S10278-013-9622-7, [Online].
[49] P. F. Raudaschl, et al., “Evaluation of segmentation methods on head and neck CT: Auto-segmentation challenge 2015,” Medical Physics, vol. 44, no. 5, pp. 2020-2036, Jun. 2017, DOI: 10.1002/MP.12197, [Online].
[50] A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, Red Hook, New York, USA, 2019, pp. 8024-8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[51] K. He, et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” in ICCV, Santiago, Chile, 2015, pp. 1026-1034.
[52] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR, San Diego, CA, USA, 2015.
[53] I.T. Jolliffe, “Principal Component Analysis” Springer Series in Statistics, 2nd ed., New York, NY, USA, 2002.
[54] Fuzzywuzzy [Online]. Available: https://github.com/seatgeek/fuzzywuzzy
[55] Q. Song, et al. “Optimal co-segmentation of tumor in PET-CT images with context information” IEEE Transactions on Medical Imaging, vol. 32, no. 9, pp. 1685-1697, Sep. 2013, DOI: 10.1109/TMI.2013.2263388, [Online].
[56] L. Rundo, et al. “A fully automatic approach for multimodal PET and MR image segmentation in gamma knife treatment planning,” Computer Methods and Programs in Biomedicine, vol. 144, pp. 77-96, Jun. 2017, DOI: 10.1016/j.cmpb.2017.03.01, [Online].
Qiming Yang received a B.S. degree in software engineering from the Sun Yat-sen University, Guangzhou, China, in 2017. Her previous works focused on biomedical image processing. She is currently pursuing an M.S. degree in Sun Yat-sen University and was visiting the University of Texas Southwestern Medical Center in Dallas, Texas, from September 2018 to August 2019. Her research interests include image processing, deep learning, and computer vision.
Hongyang Chao received B.S. and Ph.D. degrees in computational mathematics from Sun Yat-sen University, Guangzhou, China. In 1988, she joined the Department of Computer Science, Sun Yat-sen University, where she was initially an Assistant Professor and later became an Associate Professor. She is currently a Full Professor in the School of Data and Computer Science. She has published extensively in the area of image/video processing and holds 3 U.S. patents and 4 Chinese
patents in the related area. Her current research interests include image and video processing, image and video compression, massive multimedia data analysis, and content-based image (video) retrieval. She was visiting the University of Texas Southwestern Medical Center in Dallas, Texas, from September 2018 to August 2019.
Dan Nguyen is currently an Assistant Professor in the Medical Artificial Intelligence and Automation (MAIA) Laboratory at the University of Texas Southwestern Medical Center in Dallas, Texas. He received a B.S. in Physics at the University of Texas at Austin in 2012 and a Ph.D. in Biomedical Physics at the University of California, Los Angeles in 2017. His current research in MAIA Lab includes using artificial intelligence technologies and advanced optimization algorithms for radiation therapy treatment planning. In particular, he is
tackling problems involving clinical volumetric dose prediction, Pareto surface navigation, incorporating human and learned domain knowledge, dose calculation, beam orientation optimization, and uncertainty estimation.
Steve Jiang received his Ph.D. in Medical Physics from Medical College of Ohio in 1998. After completing his postdoctoral training at Stanford University, he joined Massachusetts General Hospital and Harvard Medical School in 2000 as an Assistant Professor of Radiation Oncology. In 2007, Dr. Jiang was recruited to University of California San Diego as a tenured Associate Professor to build the Center for Advanced Radiotherapy
Professor with tenure in 2011. In October 2013, Dr. Jiang joined University of Texas Southwestern Medical Center as a tenured Full Professor, Barbara Crittenden Professor in Cancer Research, Vice Chair of Radiation Oncology Department, and Director of Medical Physics and Engineering Division. He is the founding director of the Medical Artificial Intelligence and Automation Laboratory. Dr. Jiang is a Fellow of Institute of Physics and American Association of Physicists in Medicine. His current research interest is to develop and deploy artificial intelligence technologies to solve medical problems.