The purpose of this study is to develop an automated, validated algorithm for vertebral body segmentation on lateral chest radiographs using deep learning. Successful, automated segmentation of vertebral bodies could lead to the detection of spinal fractures and the precise quantification of vertebral body heights. Automated vertebral body height measurements appropriately applied to a large, diverse dataset could stratify mean values and standard deviations of vertebral heights by patient age, height, sex, and other clinical parameters.
In the United States, an estimated 750,000 new spinal compression fractures occur every year. On routine lateral chest radiography, vertebral fractures and compression deformities may be under-diagnosed and are sometimes difficult to appreciate. The progression of undetected fractures places patients at an increased risk for complications associated with significant morbidity and mortality. [1] An automated, validated solution for the detection of vertebral body compression fractures on lateral chest radiographs may facilitate early diagnosis and allow for timely intervention to unburden patients of preventable compression fracture sequelae.
Prior automated solutions for vertebral segmentation using traditional computer-aided detection (CAD) approaches demonstrated modest results, likely due to the sheer heterogeneity of patient anatomy. A study by Mysling et al. outperformed previous vertebral segmentation attempts but demonstrated a 52% error rate in the segmentation of fractured vertebrae and a 10% overall vertebral segmentation error rate. The efficacy of this approach was analyzed incompletely, but segmentation error was defined in that study as a point-to-contour distance greater than 2 millimeters, a length metric between the ground truth manual segmentation and the algorithm-based segmentation. Wong et al. proposed a live fluoroscopic vertebral segmentation strategy but reported no performance data with which to assess the technique's real-world performance. [2]
Deep learning techniques have demonstrated success in pixel-level labeling tasks using semantic segmentation, which is the partition of an image into unique parts or objects, such as identifying all cars or people within an image. [3] In medical images, this has been applied to the segmentation of pancreatic tumors on CT [4], brain tumors on MRI [5], and stroke lesions on MRI. [6] Deep learning methods for semantic segmentation include fully convolutional neural networks (FCN) [7], convolutional autoencoders such as the U-Net [8], and DeepLab [9]. As opposed to semantic segmentation, there also exist deep learning solutions for instance segmentation, where each object instance is identified within an image (e.g. car 1, car 2, car 3, etc.). With regard to vertebrae, an instance segmentation solution could identify each vertebral body, such as T1, T2, and T3, and color-code them. One popular instance segmentation solution is Mask R-CNN. [10]
In this study, we chose to employ semantic segmentation using a standard U-Net [8], as it has been shown to be relatively accurate with regard to segmentation in medical imaging, including segmentation of brain tumors, kidneys and pulmonary nodules. [11–13] The U-Net network architecture is structured into an encoder and a decoder. The encoder follows the classic architecture of the convolutional neural network, with convolutional blocks each followed by a rectified linear unit (ReLU) and a max pooling operation to encode image features at different levels of the network. The decoder up-samples the feature map with subsequent up-convolutions and concatenations with the corresponding encoder blocks. This network architecture helps to better localize and extract image features and assembles a more precise output based on encoder information.
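The shape bookkeeping of a single encoder/decoder level can be illustrated with a minimal NumPy sketch. It is not the network used in this study: the learned convolutions and ReLU activations are omitted, nearest-neighbour up-sampling stands in for the learned up-convolution, and the feature-map size (16 x 16 with 4 channels) and all function names are assumptions chosen for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) feature map (H, W even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour 2x up-sampling; a stand-in for a learned up-convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_concat(encoder_map, decoder_map):
    """Concatenate encoder and decoder feature maps along the channel axis."""
    return np.concatenate([encoder_map, decoder_map], axis=-1)

rng = np.random.default_rng(0)
encoder_map = rng.random((16, 16, 4))         # hypothetical encoder feature map
pooled = max_pool_2x2(encoder_map)            # encoder step: halve spatial size
upsampled = upsample_2x(pooled)               # decoder step: restore spatial size
merged = skip_concat(encoder_map, upsampled)  # skip connection doubles the channels
print(pooled.shape, upsampled.shape, merged.shape)
```

The concatenation is what lets the decoder combine coarse, pooled features with the finer spatial detail retained by the matching encoder block.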
The model is first fit on the training dataset: predictions on the training data are compared with the expected output, and the weights are updated by back-propagation. Validation data are then used to tune hyperparameters, such as the number of hidden units, and to determine a stopping point for back-propagation, both of which are essential for model selection. The test dataset is used to estimate the accuracy of the fully trained final model. The test and validation datasets are kept separate because the final model is biased toward the validation data used for model selection.
Fig. 1 U-Net Architecture
In Figure 1, each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
As vertebrae typically represent less than 20% of the total area in each lateral radiograph, vertebral segmentation data are highly imbalanced. Previous studies demonstrate that neural network performance deteriorates as class imbalance increases. [14] A model trained on such imbalanced data would learn the features of the majority class well but the features of the minority class poorly, causing instability in early training. Although such a model would likely report a high pixel-wise accuracy, its output segmentations would be poor.
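The gap between pixel-wise accuracy and segmentation quality can be made concrete with a small synthetic example. The mask below is an assumption for illustration only: a single rectangle covering roughly 16% of a 512 x 512 image stands in for the vertebrae, and the "model" predicts background everywhere.

```python
import numpy as np

# Synthetic ground truth: True marks vertebra pixels (hypothetical rectangle).
truth = np.zeros((512, 512), dtype=bool)
truth[100:400, 200:340] = True

# A degenerate prediction that labels every pixel as background.
pred = np.zeros_like(truth)

foreground_fraction = truth.mean()
accuracy = (pred == truth).mean()

intersection = np.logical_and(truth, pred).sum()
dice = 2.0 * intersection / (truth.sum() + pred.sum())

print(f"foreground fraction: {foreground_fraction:.3f}")
print(f"pixel accuracy:      {accuracy:.3f}")
print(f"dice coefficient:    {dice:.3f}")
```

Under these assumptions the all-background prediction is correct on more than 80% of pixels yet has no overlap with the vertebrae, which is why an overlap-based measure is preferred over accuracy for this task.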
A dice loss function is designed to mitigate dataset class imbalance and is frequently used for medical imaging algorithms. [7] The dice score measures the overlap of the output mask with the ground truth to assess each segmentation task and was introduced for volumetric segmentation in medical imaging. [7] The coefficient measures the overlap between set X, the ground truth, and Y, the predicted mask. For binary class segmentation, the dice score is expressed as the following:

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
Intersection-over-union (IoU), also known as the Jaccard index, is a coefficient that calculates overlap in segmentation tasks. [15] IoU also measures the overlap between set X, the ground truth, and Y, the predicted mask. Some studies indicate that dice coefficients typically exceed corresponding IoU values except when coefficient values equal 0 or 1. [16] The IoU is calculated as the intersection between two images divided by their union, expressed as the following:

$$\mathrm{IoU}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

If the ground truth and predicted mask are identical, then the dice coefficient and IoU both equal 1. If the ground truth and predicted mask share no elements, then both coefficients equal 0. If the ground truth and predicted mask are neither identical nor absolutely incongruent, then each coefficient value falls between 0 and 1.
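Both coefficients can be computed directly from two binary masks. The following is a minimal NumPy sketch; the function names and the 4 x 4 example masks are assumptions for illustration, and returning 1 when both masks are empty is a convention chosen here, not one stated in this study.

```python
import numpy as np

def dice_coefficient(truth, pred):
    """Dice = 2|X n Y| / (|X| + |Y|) for two binary masks."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    total = truth.sum() + pred.sum()
    if total == 0:                      # both masks empty: treated as a perfect match
        return 1.0
    return 2.0 * np.logical_and(truth, pred).sum() / total

def iou(truth, pred):
    """IoU = |X n Y| / |X u Y| for two binary masks."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    union = np.logical_or(truth, pred).sum()
    if union == 0:                      # both masks empty: treated as a perfect match
        return 1.0
    return np.logical_and(truth, pred).sum() / union

truth = np.zeros((4, 4), dtype=bool)
pred = np.zeros((4, 4), dtype=bool)
truth[1:3, :] = True                    # rows 1-2 are foreground (8 pixels)
pred[2:4, :] = True                     # rows 2-3 are predicted (8 pixels)

d, j = dice_coefficient(truth, pred), iou(truth, pred)
print(d, j)                             # overlap is row 2 only (4 pixels)
```

In this example the dice coefficient is 0.5 and the IoU is 1/3, consistent with the identity Dice = 2 IoU / (1 + IoU), which also shows that the dice coefficient is never smaller than the IoU.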
An IRB-exempt study using 124 de-identified, HIPAA-compliant lateral chest radiographs from unique patients was performed. Images were pre-processed using contrast-limited adaptive histogram equalization (CLAHE) to standardize the appearance and improve image contrast. Images were subsequently down-sampled from the original size (4238 x 3480 pixels) to 512 x 512 pixels using bilinear interpolation. Down-sampling was performed to aid back-propagation and neural network learning within graphics processing unit (GPU) memory constraints. Segmentations of visible vertebrae were manually performed using ImageJ version 1.50i (National Institutes of Health, USA) by both a medical student (JL) and a board-certified radiologist (PL). All segmentations of the images were verified and adjusted as needed by a board-certified radiologist (PL).
The resulting binary mask was likewise down-sampled to 512 x 512 pixels. Ground-truth labels were color-coded with black for vertebrae and white for the image background, as seen in Figure 2. Class imbalance in the vertebral segmentations was addressed in this study with the use of a dice loss function.
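The down-sampling step can be sketched in plain NumPy as follows. This is an illustrative implementation of bilinear interpolation only: it is not the code used in this study, it does not implement CLAHE, and the function name, the half-pixel coordinate convention, and the small synthetic 848 x 696 gradient image are assumptions. In practice a library routine (for example OpenCV's `cv2.createCLAHE` and `cv2.resize` with `cv2.INTER_LINEAR`) would typically be used; which tools the authors used is not stated.

```python
import numpy as np

def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D grayscale image to (out_h, out_w) by bilinear interpolation."""
    in_h, in_w = image.shape
    img = image.astype(np.float64)

    # Map each output pixel centre back into input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)

    # Indices of the neighbouring input pixels and the fractional weights.
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]

    # Interpolate along x on the two bracketing rows, then along y.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Synthetic stand-in for a radiograph: a horizontal intensity ramp.
radiograph = np.tile(np.arange(696, dtype=np.float64), (848, 1))
small = resize_bilinear(radiograph, 512, 512)
print(small.shape)
```

Note that plain bilinear interpolation samples only the four nearest input pixels, so for large reduction factors some library routines apply additional smoothing to limit aliasing.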
Fig. 2 (a) Down-sampled lateral chest radiograph (b) Binary ground truth label with black vertebrae and white background
The U-Net-based convolutional neural network was employed to segment vertebrae from lateral chest radiographs. Of the 124 images, 74 (59.68%) were used for the training dataset, 10 (8.06%) for the validation dataset, and 40 (32.26%) for the test dataset. The model was built using Keras 2.06 (https://keras.io/) with TensorFlow 1.1 (Google LLC, Mountain View, CA) and CUDA 8.1 (Nvidia Corporation, Santa Clara, CA).
Different learning rates and optimizers were trialed to improve the updates of the weight and bias values and thereby maximize model performance. Based on multiple experiments, the best performance resulted from using the Adam optimizer with a learning rate of 0.0001. The dice score and IoU were computed on the validation dataset to inform model selection. The model was trained until a plateau in validation loss, which occurred at 10 epochs, on a CUDA-enabled Nvidia 1080Ti 11GB graphics processing unit (Nvidia Corporation, Santa Clara, CA).
The dice score and IoU were used to assess model performance. As the segmentation task was binary, the loss function was defined as the sum of the dice loss and binary cross-entropy.
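A partition with these dataset sizes can be sketched as follows. How the images were actually assigned to each subset is not described here, so the seeded random shuffle, the seed value, and the function name are assumptions for illustration.

```python
import random

def split_indices(n_images=124, n_train=74, n_val=10, seed=0):
    """Shuffle image indices and cut them into train/validation/test subsets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility (assumed)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices()
for name, subset in (("train", train), ("validation", val), ("test", test)):
    print(f"{name}: {len(subset)} images ({100 * len(subset) / 124:.2f}%)")
```

Keeping the three index lists disjoint ensures that no radiograph used for training or model selection is also used to report test performance.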
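Training until a plateau in validation loss is commonly implemented with a patience rule. The sketch below is one such rule in plain Python and is not necessarily the criterion used in this study: the `patience` and `min_delta` values, the function name, and the validation-loss values are all invented for illustration. Keras provides a comparable `EarlyStopping` callback.

```python
def epoch_of_plateau(val_losses, patience=3, min_delta=1e-4):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch     # meaningful improvement
        elif epoch - best_epoch >= patience:
            return epoch                            # plateau reached
    return len(val_losses)                          # never plateaued

# Made-up validation losses: improvement stalls after the fourth epoch.
history = [0.60, 0.45, 0.38, 0.35, 0.35, 0.35, 0.35]
stop_epoch = epoch_of_plateau(history)
print(stop_epoch)
```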
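A loss of this form can be sketched in NumPy as follows. This is an illustration rather than the training code: the equal weighting of the two terms follows the description above, while the "soft" dice formulation on predicted probabilities, the smoothing constant `eps`, and the function names are assumptions. Here 1 denotes a vertebra pixel and 0 denotes background.

```python
import numpy as np

def soft_dice_loss(y_true, y_prob, eps=1e-7):
    """1 - dice, computed on predicted probabilities instead of hard labels."""
    intersection = np.sum(y_true * y_prob)
    denominator = np.sum(y_true) + np.sum(y_prob)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean pixel-wise binary cross-entropy with clipping for numerical safety."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

def combined_loss(y_true, y_prob):
    """Sum of the dice loss and binary cross-entropy."""
    return soft_dice_loss(y_true, y_prob) + binary_cross_entropy(y_true, y_prob)

y_true = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
good = np.array([[0.9, 0.8, 0.1, 0.1],
                 [0.1, 0.0, 0.0, 0.1]])
bad = 1.0 - good

print(combined_loss(y_true, good), combined_loss(y_true, bad))
```

The cross-entropy term gives a smooth per-pixel gradient, while the dice term depends on overlap with the minority class, so the sum is less dominated by the background pixels than cross-entropy alone.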
In the holdout test dataset, an average dice coefficient value of 90.5 and an average IoU of 81.75 were obtained. Ground truth masks, predicted masks, and overlay segmentation masks were successfully generated. In Figure 3, the ground truth mask represents the manual segmentation and the predicted mask is the resulting output of the deep learning model.
Fig. 3 (a) Original image (b) Ground truth mask (c) Predicted mask
In the overlay-segmented mask seen in Figure 4, the original radiograph is superimposed with the ground truth mask in red and the predicted mask in blue.
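An overlay of this kind can be produced by alpha-blending colored masks onto the grayscale radiograph. The NumPy sketch below is one way to do so; the blending weight `alpha`, the function name, and the tiny 4 x 4 example are assumptions, and the method used to generate Figure 4 is not specified.

```python
import numpy as np

def overlay_masks(gray, truth, pred, alpha=0.4):
    """Blend the ground truth (red) and prediction (blue) onto a grayscale image."""
    rgb = np.stack([gray, gray, gray], axis=-1).astype(np.float64)
    red = np.array([255.0, 0.0, 0.0])
    blue = np.array([0.0, 0.0, 255.0])
    rgb[truth] = (1 - alpha) * rgb[truth] + alpha * red    # ground truth in red
    rgb[pred] = (1 - alpha) * rgb[pred] + alpha * blue     # prediction in blue
    return rgb.round().astype(np.uint8)

gray = np.full((4, 4), 100, dtype=np.uint8)   # uniform stand-in for a radiograph
truth = np.zeros((4, 4), dtype=bool)
pred = np.zeros((4, 4), dtype=bool)
truth[0, 0] = True                            # ground truth only
pred[3, 3] = True                             # prediction only
truth[1, 1] = pred[1, 1] = True               # both masks agree

overlay = overlay_masks(gray, truth, pred)
print(overlay[0, 0], overlay[3, 3], overlay[1, 1])
```

Pixels where both masks agree receive both tints and appear purple, so regions that remain purely red or purely blue mark under- and over-segmentation, respectively.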
Figure 5 demonstrates an example of an inferior segmentation result. Two adjacent thoracic vertebrae, T3 and T4, as indicated by the blue arrows, are poorly visualized on the original radiograph and poorly segmented by the U-Net model.
Fig. 4 Overlay segmented mask
Fig. 5 (a) Original Image (b) Overlay segmented mask
The U-Net segmentation network resulted in a model with a dice score of 90.12 and an IoU of 82.10 for automated segmentation of vertebrae on lateral chest radiography. These findings demonstrate that deep learning techniques for pixel-wise segmentation may require a relatively small number of images to accurately segment vertebral bodies on radiography, compared to deep learning for whole-image classification, which typically needs a thousand or more images per class. [17] This solution represents a novel success with regard to a vertebral segmentation algorithm on lateral chest radiographs, though there has been previous success in the classification and identification of spine fractures on 3D CT. [18] Previous failures in automated radiographic vertebral segmentation may be due to the comparatively low resolution of radiographs when compared with CT, especially in patients with osteoporosis and other conditions that limit vertebral visualization and background contrast.
While the U-Net was relatively successful in segmenting the thoracic vertebrae, there were some cases in which the segmentation failed or was suboptimal. For example, Figure 5 demonstrates one lateral radiograph in which some of the upper thoracic vertebrae were improperly segmented. This was most likely due to the poor contrast resolution and relative obscurity of the upper thoracic spine on that image, related to surrounding structures. As there were few cases with complex low-level spinal features, it is likely that a larger, more diverse dataset will improve the model's ability to train on more complex cases and apply that learning to complex test cases.
Because vertebrae typically represent a minority of the total pixels on a lateral radiograph, there was a high risk for class imbalance between the segmented vertebrae and background. To combat this risk, the dice coefficient was employed as a loss function, and data augmentation strategies and appropriate loss functions were utilized to mitigate model overfitting. Though not explicitly assessed, it is likely that the use of the dice coefficient and data augmentation strategies improved model performance. One can consider other strategies to improve performance, including the use of a U-Net variant with Inception-inspired architecture, other FCNs, or RCNNs. [19] A larger, more clinically diverse dataset would also likely improve performance.
In the future, we plan to use this model in conjunction with image processing techniques that count the number of uniquely segmented vertebrae, starting from the first readily visible vertebra, likely T3 or T4, and ending at the most caudal vertebra visible on the image. It would also be interesting to try an instance segmentation solution, such as Mask R-CNN, in which each vertebra is labeled and separately segmented (e.g. T3, T4, T5, etc.). This has the advantage of being a one-step deep learning solution that can identify and segment each vertebra. [10]
As two-view chest radiography represents the most frequently utilized imaging modality in the United States, an automated screening solution to detect vertebral fractures would provide tremendous benefit to patients and could mitigate systemic healthcare expenditure. [20, 21] The retrospective application of this algorithm to a large database of lateral chest radiographs could also determine normative values for mean vertebral body height, stratified by pertinent patient data, such as age, sex, height, weight and other demographic and clinical parameters. This could help to individualize patient care and prevent future spinal fractures.
Deep learning using a U-Net demonstrates promise in the automated segmentation of vertebrae on lateral chest radiographs.
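One simple image processing technique for counting separately segmented vertebrae is connected-component labeling of the predicted mask. The pure-Python sketch below uses a breadth-first flood fill with 4-connectivity; it illustrates the idea and is not the planned implementation. The toy mask, the function name, and the convention that 1 marks a vertebra pixel are assumptions. Library routines such as `scipy.ndimage.label` perform the same task.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground regions in a 2-D binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1                      # found a new, unvisited region
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:                    # flood-fill the whole region
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

# Toy predicted mask with three separated blobs standing in for vertebrae.
mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
]
n_vertebrae = count_components(mask)
print(n_vertebrae)
```

A limitation of this approach is that vertebrae whose predicted masks touch are merged into a single component, which is one motivation for the instance segmentation alternative.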
1. Adams JE, Lenchik L, Roux C, Genant HK. Radiological assessment of vertebral fracture. International osteoporosis foundation vertebral fracture initiative resource document part II; 2010. Available from: https://www.iofbonehealth.org/sites/default/files/PDFs/ Vertebral%20Fracture%20Initiative/IOF_VFI-Part_II-Manuscript.pdf.
2. Wong SF, Wong KYK, Wong WNK, Leong CYJ, Luk DKK. Tracking Lumbar Vertebrae in Digital Videofluoroscopic Video Automatically. Medical Imaging and Augmented Reality. 2004;p. 154–162.
3. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation.
4. Roth HR, Lu L, Farag A, Shin HC, Liu J, Turkbey EB, et al. DeepOrgan: Multi-level Deep Convolutional Networks for Automated Pancreas Segmentation. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Cham: Springer International Publishing; 2015. p. 556–564.
5. Pereira S, Pinto A, Alves V, Silva C. Brain Tumor Segmentation Using Convolutional Neural Networks in MRI Images. IEEE Transactions on Medical Imaging. 2016 03;35:1–1. Available from: 10.1109/TMI.2016.2538465.
6. Wang Y, Katsaggelos AK, Wang X, Parrish TB. A deep symmetry convnet for stroke lesion segmentation. 2016 IEEE International Conference on Image Processing (ICIP). 2016;p. 111–115.
7. Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(4):640–651. Available from: 10.1109/TPAMI.2016.2572683.
8. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR. 2015;abs/1505.04597.
9. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis & Machine Intelligence. 2018 April;40(4):834–848. Available from: 10.1109/TPAMI.2017.2699184.
10. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN; 2017. p. 2961–2969.
11. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016. Cham: Springer International Publishing; 2016. p. 424–432.
12. Dong H, Yang G, Liu F, Mo Y, Guo Y. Automatic Brain Tumor Detection and Segmentation Using U-Net Based Fully Convolutional Networks; 2017. p. 506–517. Available from: 10.1007/978-3-319-60964-5_44.
13. Tong G, Li Y, Chen H, Zhang Q, Jiang H. Improved U-NET network for pulmonary nodules segmentation. Optik. 2018;174:460–469. Available from: https://doi.org/10.1016/j.ijleo.2018.08.086.
14. Mazurowski MA, Habas PA, Zurada JM, Lo JY, Baker JA, Tourassi GD. Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Networks: the official journal of the International Neural Network Society. 2007;21:427–36.
15. Jaccard P. The distribution of the flora in the alpine zone. New Phytologist. 1912;11(2):37–50. Available from: https://doi.org/10.1111/j.1469-8137.1912.tb05611.x.
16. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Medical Imaging. 2015;15(29). Available from: 10.1186/s12880-015-0068-x.
17. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–255.
18. Burns JE, Yao J, Summers RM. Vertebral Body Compression Fractures and Bone Density: Automated Detection and Classification on CT Images. Radiology. 2017;284(3):788–797.
19. Milletari F, Navab N, Ahmadi SA. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 2016 Fourth International Conference on 3D Vision (3DV). 2016;p. 565–571.
20. Dixon S. Diagnostic Imaging Dataset Statistical Release. NHS England. 2016;July:1–16.
21. Maehara C, Jacobson F, Andriole K, Khorasani R. Utilization Effect of Integrating a Chest Radiography Room into a Thoracic Surgery Ward. Journal of the American College of Radiology. 2012;9(6):421–425.