The authors have declared that no competing interests exist.
Recent years have witnessed a wider prevalence of vertebral column pathologies due to lifestyle changes, sedentary behaviors, or injuries. Spondylolisthesis and scoliosis are two of the most common ailments, with an incidence of 5% and 3% in the United States population, respectively. Both of these abnormalities can affect children at a young age and, if left untreated, can progress into severe pain. Moreover, severe scoliosis can even lead to lung and heart problems. Thus, early diagnosis can make it easier to apply remedies/interventions and prevent further disease progression. Current diagnosis methods are based on visual inspection of radiographs by physicians and/or calculation of certain angles (e.g., the Cobb angle). Traditional artificial intelligence-based diagnosis systems utilized these parameters to perform automated classification, which enabled fast and easy diagnosis-support tools. However, they still require the specialists to perform tedious, error-prone measurements. To this end, automated measurement tools based on X-ray image processing techniques were proposed. In this paper, we utilize advances in deep transfer learning to diagnose spondylolisthesis and scoliosis from X-ray images without the need for any measurements. We collected raw data from real X-ray images of 338 subjects (i.e., 188 scoliosis, 79 spondylolisthesis, and 71 healthy). Deep transfer learning models were developed to perform three-class classification as well as pair-wise binary classifications among the three classes. The highest mean accuracy and maximum accuracy for three-class classification were 96.73% and 98.02%, respectively. Regarding pair-wise binary classification, high accuracy values were achieved for most of the models (i.e., > 98%). These results and other performance metrics reflect a robust ability to diagnose the subjects' vertebral column disorders from standard X-ray images.
The current study provides a supporting tool that can reasonably help physicians make the correct early diagnosis with less effort and fewer errors, and reduce the need for surgical interventions.
The spinal column comprises 33 small bones called vertebrae, which are grouped into five distinct regions: cervical, thoracic, lumbar, sacral, and coccygeal. It is essential for human body motion and stability. More importantly, the spinal column provides protection for the spinal cord and nerve roots. The spinal cord is part of the central nervous system (CNS) and is responsible for carrying sense and movement information from and to the brain. Hence, the degeneration of the spine results in a wide range of ailments (e.g., restricted motion, pain, numbness, etc.) and reduces the quality of life in general [
Several pathologies can affect the vertebral column. In this paper, we examine two types of degenerative pathologies: scoliosis and spondylolisthesis. Scoliosis is a curvature of the thoracic or lumbar spine in the coronal plane (i.e., sideways). It is diagnosed by the specialist using X-ray images of the spine and possibly magnetic resonance imaging (MRI) to rule out tumors [
The process of diagnosing the spinal column disorders starts with a physical examination. In this step, the doctor investigates the patient’s medical history, participation in sports/physical activity, and involvement in accidents. Moreover, the back and spine need to be carefully examined for signs of abnormal shape, restricted range of motion, or muscle weakness/spasm. In addition, the examination involves performing posture and gait analysis [
The medical literature in relation to the health of the vertebral column has focused primarily on extracting biomechanical parameters that objectively determine and quantify the disease state of the spine. To this end, scoliosis and its severity can be diagnosed using the Cobb angle, which was described by John Cobb in 1948 and represents the gold standard. However, it has some shortcomings relating to measurement difficulties and to 3D deformities [
The research landscape using machine learning (ML) and artificial intelligence (AI) followed a similar path to that of the medical literature by designing algorithms that can automatically extract the aforementioned biomechanical markers of disease from medical images [
Recently, deep learning AI architectures have enabled more innovation in disease diagnosis from medical images. For example, Mahajan et al. [
The Cobb angle is typically measured using X-ray images. Hence, Tan et al. [
A similar path was taken in the literature for spondylolisthesis identification. Neto et al. [
The contributions of this paper are as follows:
- Develop a reliable artificial intelligence system for the diagnosis of scoliosis and spondylolisthesis based on radiographic X-ray images of the vertebral column. Such a system can support clinical diagnosis decisions and reduce errors and overhead.
- Collect X-ray images of subjects suffering from scoliosis and spondylolisthesis, as well as healthy subjects, as diagnosed by the specialists in the hospital. This dataset will expand and enrich comparable publicly available datasets, enable the development of automated machine learning and AI algorithms for the detection of vertebral ailments, and can be used for training and educating medical students, residents, and specialists.
- Investigate several deep learning convolutional neural network models for the classification of scoliosis, spondylolisthesis, and normal X-ray images using transfer learning. We evaluate the performance of the deep learning models for the three-class (scoliosis vs spondylolisthesis vs normal) and pair-wise binary classification problems (scoliosis vs spondylolisthesis, scoliosis vs normal, and spondylolisthesis vs normal). The cost of each model in terms of training and testing times is also evaluated.
- Share, through a public data repository, the original images and resized versions that match the input requirements of the deep learning models in five sizes: [224 224 3], [227 227 3], [256 256 3], [299 299 3], and [331 331 3].
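As an illustration of the resizing step, the following dependency-free Python sketch produces the five input sizes listed above. Nearest-neighbour sampling is an assumption here; the paper (whose pipeline ran in MATLAB) does not state the interpolation method used.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W x C image stored as nested lists."""
    in_h, in_w = len(img), len(img[0])
    return [[img[(y * in_h) // out_h][(x * in_w) // out_w]
             for x in range(out_w)]
            for y in range(out_h)]

# The five input sizes required by the evaluated models (height, width).
TARGET_SIZES = [(224, 224), (227, 227), (256, 256), (299, 299), (331, 331)]

# Toy 4 x 4 RGB "image"; each pixel is a 3-element channel list.
img = [[[r, c, 0] for c in range(4)] for r in range(4)]
resized = {size: resize_nearest(img, *size) for size in TARGET_SIZES}
```

Each entry of `resized` has the spatial dimensions of one target size while keeping the three color channels per pixel.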
The rest of this paper is organized as follows. In the materials and methods section, we present the data collection procedure, subjects, deep learning models, performance evaluation setup, and performance metrics. The results section presents the results in detail and discusses the various observations. The conclusion section outlines future work and concludes the paper.
The work in this paper exploits the abilities of generically pre-trained convolutional neural network models to automatically classify X-ray images into three possible spine-related conditions: scoliosis, spondylolisthesis, or normal (i.e., healthy). The approach achieves high performance metrics while requiring neither manual nor automated measurements, nor explicit feature extraction, as features are inherently extracted by the deep learning architecture. In addition, no elaborate image processing or modeling is required.
The current study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board (IRB) at King Abdullah University Hospital (KAUH), Deanship of Scientific Research at Jordan University of Science and Technology in Jordan (Ref. 19/144/2021). X-ray images of the vertebral column were collected locally at King Abdullah University Hospital, Jordan University of Science and Technology, Irbid, Jordan. Written informed consent was obtained from all subjects involved in the study (or their parents in case of minors). The diagnosis was determined by two orthopedic specialists at the KAUH.
The dataset included 338 subjects (240 females, 98 males) with an age range from 9 months to 79 years and mean ± SD of 24.9 ± 18.58 years. The number of subjects with normal X-ray images was 71 (40 females, 31 males) with an age range of 9 months to 56 years and mean ± SD of 19.41 ± 11.19 years. The number of subjects diagnosed with spondylolisthesis was 79 (49 females, 30 males) with an age range of 15-79 years and mean ± SD of 53.59 ± 14.02 years. The number of subjects diagnosed with scoliosis was 188 (151 females, 37 males) with an age range of 5-35 years and mean ± SD of 14.73 ± 3.36 years.
Typically, the main input to the diagnosis of vertebral column diseases is medical images (i.e., X-ray, CT, or MRI). Hence, convolutional neural networks (CNNs) were used to classify the input into the possible disease states. CNNs are a type of feedforward neural network with a deep architecture and form the basis for a major part of the deep neural network (DNN) models in the literature. Other types include recurrent neural networks (RNNs) and their variations (e.g., long short-term memory (LSTM) networks), transformers, and generative adversarial networks (GANs). CNNs have been found to be useful for image processing and classification as they are able to extract patterns and features in images regardless of scaling, mirroring, rotation, or translation.
The CNN generally comprises several types of layers and takes a tensor of order 3 as input (i.e., an image with N rows, M columns, and 3 (RGB) color channels). Convolution layers scan the image looking for correlated regions (e.g., a vertebra). The input image is divided into small subparts called receptive fields, which in turn are grouped into feature maps. Each feature map has a corresponding weight matrix (i.e., kernel), which is learned/updated during training. A rectified linear unit (ReLU) usually follows the convolution layer and introduces nonlinearity into the CNN. Pooling layers reduce the dimensionality of the feature maps feeding into subsequent layers by considering subparts of the feature map and taking the maximum (i.e., max-pooling), average (i.e., average-pooling), or another statistical measure. Fully connected layers are similar to multilayer perceptron (MLP) networks and ensure that all elements in the previous layer contribute to the output or following layer. Dropout layers remove certain elements of the network in order to prevent overfitting and improve model generalization. The mathematical foundations, benefits, alternatives, and tradeoffs are well established in the literature and beyond the scope of this work [
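To make the layer operations above concrete, the following dependency-free Python sketch applies a single hand-picked convolution kernel, ReLU, and 2x2 max-pooling to a toy grayscale image. This is illustrative only; real CNNs learn many kernels per layer during training.

```python
def conv2d_valid(img, kernel):
    """2-D 'valid' convolution (really cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def relu(fmap):
    """Element-wise rectified linear unit: negative responses become 0."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling with window `size` x `size`."""
    return [[max(fmap[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]

# A vertical-edge kernel applied to a 6x6 image whose right half is bright.
img = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]   # responds to left-to-right edges
fmap = max_pool(relu(conv2d_valid(img, kernel)))
```

The pooled feature map is strongly activated everywhere, because the dark-to-bright edge runs through every receptive field column near the image center.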
Transfer learning utilizes pre-trained deep learning models, which were developed using millions of images from the ImageNet [
The following is a short description of the 14 convolutional neural network models used in this paper:
SqueezeNet is 18 layers deep with an image input size of [227 227 3]. It was designed on the premise that smaller deep neural networks can offer accuracy comparable to that of large architectures, but with the advantages of less inter-process communication, faster deployment on end-user machines, and better suitability for resource-limited environments. The model was pre-trained using the ImageNet database [

GoogLeNet is 22 layers deep with an image input size of [224 224 3]. It is part of the family of Inception deep learning models and is marked by improved utilization of computing resources, which allowed the depth and width of the network to be increased without additional computational cost [

Inception-v3 is the third version of the Inception models, which improves on the previous two by having more parameters (e.g., utilizing three different filter sizes in the parallel convolution layers). The model is 48 layers deep with an image input size of [299 299 3], pre-trained on images from ImageNet [

DenseNet-201, as the name suggests, is 201 layers deep with an image input size of [224 224 3]. The model represents a big jump in the number of layers compared to the others. This was made possible by shortening the connections between layers close to the input/output. Connections between layers are made such that each layer feeds into later layers, which improves feature propagation/reuse and drastically reduces the number of parameters [

MobileNet-v2 is 53 layers deep with an image input size of [224 224 3]. It is a network designed for mobile environments, so the model must be efficient and small, with reduced memory requirements. This is achieved by inverted residual bottleneck layers whose computation can be scheduled with a minimum working set (i.e., the number of tensors concurrently stored in memory) [

ResNet-101, ResNet-50, and ResNet-18: the ResNet family of models with the corresponding layer depths require the same image input size of [224 224 3] and were pre-trained on the ImageNet database. The architecture is characterized by a network-in-network scheme that employs learning residual functions with reference to layer inputs [

The Xception model is 71 layers deep with an image input size of [299 299 3]. It is trained on images from the ImageNet database. The architecture improves on the Inception network by replacing the standard Inception modules with depthwise separable convolutions [

The Inception-ResNet-v2 model is 164 layers deep with an image input size of [299 299 3]. It is trained on images from the ImageNet database. The architecture is a hybrid of the Inception model and residual connections, which results in faster training [

ShuffleNet is another model designed for resource-limited deployment environments. It is based on pointwise group convolutions and channel shuffling, which drastically reduce the computational overhead without sacrificing classification accuracy [

NASNet-Mobile is the mobile version of the Neural Architecture Search Network (NASNet) model. The main idea of this type of model is to learn the network architecture during training on the specific dataset using reinforcement learning search. Converging to the best model is reduced to finding the optimal cell structure (i.e., convolutional layer), which is duplicated across the network but with different weights [

DarkNet-53 is pre-trained on the ImageNet database and requires an input image of size [256 256 3]. The model is 53 layers deep and was designed with speed and object detection as primary objectives [

EfficientNet-b0 is the baseline EfficientNet architecture, from which scaled models up to EfficientNet-b7 are derived. The architecture design is based on the idea of compound scaling, which uniformly scales the network depth, width, and input resolution by fixed scaling coefficients [
The deep learning models were modified, trained, and evaluated using MATLAB R2021a software running on an HP OMEN 30L desktop GT13 with 64 GB RAM, NVIDIA®GeForce RTX™ 3080 GPU, Intel®Core™ i7-10700K CPU @ 3.80GHz, and 1TB SSD.
To prevent the models from overfitting to specific image details, pixel translation (i.e., shifting the image) by up to 30 pixels vertically and horizontally was performed on the X-ray images used for training. Moreover, training images were randomly flipped along the x-axis (i.e., reflection) and rescaled by a random factor drawn from the range [0.9, 1.1]. The model training options were set such that the mini-batch size was 10 (except for NASNet-Mobile, which had the size set to 2 due to slowness), the maximum number of epochs was set to 6, and the initial learning rate was 0.003. Moreover, the stochastic gradient descent with momentum (SGDM) optimizer was used for training due to its popularity and fast convergence [
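The augmentation settings above can be illustrated with a small dependency-free Python sketch. Nearest-neighbour sampling and zero-fill for out-of-range pixels are assumptions here; the paper's pipeline relied on MATLAB's built-in augmentation.

```python
import random

random.seed(1)  # for reproducibility of the sketch

def translate(img, dy, dx, fill=0):
    """Shift the image by (dy, dx) pixels, zero-filling exposed regions."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)]
            for y in range(h)]

def hflip(img):
    """Reflect the image left-to-right (x-axis reflection)."""
    return [row[::-1] for row in img]

def scale(img, s, fill=0):
    """Rescale about the top-left corner using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[int(y / s)][int(x / s)] if int(y / s) < h and int(x / s) < w else fill
             for x in range(w)]
            for y in range(h)]

def augment(img):
    """One random augmentation: translation, optional reflection, rescale."""
    dy, dx = random.randint(-30, 30), random.randint(-30, 30)
    out = translate(img, dy, dx)
    if random.random() < 0.5:
        out = hflip(out)
    return scale(out, random.uniform(0.9, 1.1))

img = [[(y * 64 + x) % 256 for x in range(64)] for y in range(64)]
aug = augment(img)
```

Each call to `augment` yields a different plausible variant of the same X-ray, so the network sees shifted, mirrored, and rescaled copies rather than memorizing pixel positions.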
The performance of the models was evaluated using five metrics: precision, recall, specificity, F1 score, and accuracy. Precision is the ratio of true positives to all images identified as positive (i.e., including false positives). Recall (i.e., sensitivity) is the ratio of true positives to all relevant elements (i.e., the actual positives). Specificity, or the true negative rate, measures the ability to identify negative elements. The F1 score is the harmonic mean of recall and precision and expresses the accuracy of classification in unbalanced datasets. The accuracy is defined as the ratio of correctly classified images across all classes to the total number of instances (i.e., total images in the testing set). The five measures are defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
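As a minimal illustration, the five metrics can be computed from binary confusion-matrix counts as follows. The counts used in the example are arbitrary, not taken from the paper's results.

```python
def metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, specificity, f1, accuracy

# Hypothetical counts for illustration only.
p, r, s, f1, acc = metrics(tp=90, fp=5, tn=85, fn=10)
```

For multi-class problems such as the three-class task in this paper, these quantities are computed per class (one-vs-rest) and then averaged.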
The purpose of the experiments was to evaluate the effectiveness of the pre-trained models, after customization and training, in identifying the correct disease diagnosis from the X-ray image. Moreover, since deep learning algorithms incur high overhead, the training and testing times were recorded as well. Depending on the classification problem (three classes or two, and the type of disease), the number of testing images ranged from 45 to 101.
Tables
The results are reported for 40 runs of each model. SD stands for standard deviation.
Model | Mean accuracy | Max. accuracy | Min. accuracy | SD |
---|---|---|---|---|
SqueezeNet | 91.29% | 95.05% | 87.13% | 2.94% |
GoogLeNet | 93.76% | 96.04% | 91.09% | 1.40% |
Inception-v3 | 92.97% | 95.05% | 89.11% | 1.83% |
DenseNet-201 | 96.34% | 99.01% | 94.06% | 1.48% |
MobileNet-v2 | 91.39% | 95.05% | 88.12% | 1.75% |
ResNet-101 | 93.27% | 95.05% | 86.14% | 2.71% |
ResNet-50 | 94.36% | 96.04% | 91.09% | 1.98% |
ResNet-18 | 94.26% | 95.05% | 92.08% | 1.02% |
Xception | 88.22% | 92.08% | 85.15% | 2.58% |
Inception-ResNet-v2 | 90.30% | 94.06% | 83.17% | 3.05% |
ShuffleNet | 92.38% | 96.04% | 89.11% | 2.38% |
NASNet-Mobile | 90.30% | 95.05% | 78.22% | 4.80% |
DarkNet-53 | 91.58% | 95.05% | 86.14% | 2.85% |
EfficientNet-b0 | 87.92% | 91.09% | 83.17% | 2.18% |
Model | F1 Score | Precision | Recall | Specificity |
---|---|---|---|---|
SqueezeNet | 89.98% | 94.10% | 88.00% | 94.54% |
GoogLeNet | 93.24% | 95.52% | 91.55% | 95.88% |
Inception-v3 | 92.32% | 94.66% | 90.93% | 94.97% |
DenseNet-201 | 95.97% | 97.61% | 94.62% | 97.89% |
MobileNet-v2 | 90.35% | 93.74% | 88.13% | 94.45% |
ResNet-101 | 92.55% | 96.15% | 90.16% | 96.38% |
ResNet-50 | 93.84% | 96.74% | 91.79% | 96.91% |
ResNet-18 | 93.82% | 96.65% | 91.73% | 96.80% |
Xception | 86.71% | 93.08% | 83.47% | 93.32% |
Inception-ResNet-v2 | 89.34% | 92.99% | 87.36% | 93.29% |
ShuffleNet | 91.76% | 94.48% | 90.28% | 94.55% |
NASNet-Mobile | 89.77% | 90.99% | 89.62% | 91.00% |
DarkNet-53 | 90.62% | 94.77% | 88.23% | 95.01% |
EfficientNet-b0 | 86.41% | 92.98% | 82.62% | 93.47% |
Tables
The results are reported for 40 runs of each model.
Model | Mean accuracy | Max. accuracy | Min. accuracy | SD |
---|---|---|---|---|
SqueezeNet | 95.71% | 98.70% | 92.21% | 2.13% |
GoogLeNet | 97.01% | 97.40% | 93.51% | 1.23% |
Inception-v3 | 96.23% | 97.40% | 93.51% | 1.43% |
DenseNet-201 | 97.01% | 98.70% | 96.10% | 0.88% |
MobileNet-v2 | 95.45% | 97.40% | 94.81% | 0.92% |
ResNet-101 | 97.66% | 98.70% | 97.40% | 0.55% |
ResNet-50 | 97.14% | 97.40% | 96.10% | 0.55% |
ResNet-18 | 97.66% | 98.70% | 96.10% | 1.02% |
Xception | 91.17% | 94.81% | 88.31% | 2.36% |
Inception-ResNet-v2 | 93.38% | 96.10% | 88.31% | 2.16% |
ShuffleNet | 95.97% | 98.70% | 90.91% | 2.25% |
NASNet-Mobile | 92.73% | 97.40% | 83.12% | 4.55% |
DarkNet-53 | 97.27% | 98.70% | 96.10% | 0.74% |
EfficientNet-b0 | 92.99% | 96.10% | 88.31% | 2.46% |
Model | F1 Score | Precision | Recall | Specificity |
---|---|---|---|---|
SqueezeNet | 94.33% | 96.26% | 93.04% | 97.50% |
GoogLeNet | 96.18% | 97.52% | 95.12% | 98.32% |
Inception-v3 | 95.01% | 97.38% | 93.24% | 98.28% |
DenseNet-201 | 96.12% | 97.69% | 94.82% | 98.48% |
MobileNet-v2 | 93.97% | 96.69% | 91.96% | 97.84% |
ResNet-101 | 96.97% | 98.45% | 95.71% | 98.98% |
ResNet-50 | 96.27% | 98.11% | 94.76% | 98.75% |
ResNet-18 | 97.01% | 97.80% | 96.31% | 98.55% |
Xception | 87.54% | 93.54% | 84.55% | 95.86% |
Inception-ResNet-v2 | 90.90% | 95.24% | 88.30% | 96.92% |
ShuffleNet | 94.70% | 96.37% | 93.66% | 97.56% |
NASNet-Mobile | 89.45% | 95.41% | 86.82% | 96.98% |
DarkNet-53 | 96.51% | 97.29% | 95.89% | 98.20% |
EfficientNet-b0 | 90.34% | 94.76% | 87.74% | 96.63% |
Tables
The results are reported for 40 runs of each model.
Model | Mean accuracy | Max. accuracy | Min. accuracy | SD |
---|---|---|---|---|
SqueezeNet | 98.00% | 100.00% | 95.56% | 1.95% |
GoogLeNet | 96.00% | 100.00% | 93.33% | 2.93% |
Inception-v3 | 96.22% | 100.00% | 91.11% | 2.78% |
DenseNet-201 | 98.00% | 100.00% | 95.56% | 1.64% |
MobileNet-v2 | 96.22% | 97.78% | 93.33% | 1.83% |
ResNet-101 | 99.33% | 100.00% | 97.78% | 1.07% |
ResNet-50 | 98.44% | 100.00% | 97.78% | 1.07% |
ResNet-18 | 98.22% | 100.00% | 97.78% | 0.94% |
Xception | 96.00% | 97.78% | 91.11% | 2.73% |
Inception-ResNet-v2 | 96.22% | 100.00% | 95.56% | 1.50% |
ShuffleNet | 97.56% | 100.00% | 95.56% | 1.64% |
NASNet-Mobile | 86.44% | 93.33% | 80.00% | 5.18% |
DarkNet-53 | 96.89% | 100.00% | 84.44% | 4.94% |
EfficientNet-b0 | 96.44% | 97.78% | 88.89% | 2.81% |
Model | F1 Score | Precision | Recall | Specificity |
---|---|---|---|---|
SqueezeNet | 98.00% | 98.01% | 98.12% | 97.87% |
GoogLeNet | 95.99% | 96.27% | 96.16% | 95.82% |
Inception-v3 | 96.21% | 96.31% | 96.25% | 96.18% |
DenseNet-201 | 97.98% | 98.17% | 97.89% | 98.11% |
MobileNet-v2 | 96.18% | 96.62% | 96.04% | 96.41% |
ResNet-101 | 99.33% | 99.37% | 99.32% | 99.34% |
ResNet-50 | 98.43% | 98.60% | 98.33% | 98.55% |
ResNet-18 | 98.21% | 98.40% | 98.10% | 98.34% |
Xception | 95.96% | 96.36% | 95.89% | 96.11% |
Inception-ResNet-v2 | 96.19% | 96.55% | 96.07% | 96.37% |
ShuffleNet | 97.54% | 97.73% | 97.47% | 97.65% |
NASNet-Mobile | 85.84% | 89.39% | 85.68% | 87.38% |
DarkNet-53 | 96.84% | 97.13% | 96.79% | 97.00% |
EfficientNet-b0 | 96.43% | 96.64% | 96.37% | 96.51% |
Tables
The results are reported for 40 runs of each model.
Model | Mean accuracy | Max. accuracy | Min. accuracy | SD |
---|---|---|---|---|
SqueezeNet | 93.88% | 97.50% | 86.25% | 3.14% |
GoogLeNet | 94.75% | 96.25% | 91.25% | 1.75% |
Inception-v3 | 94.00% | 98.75% | 88.75% | 3.43% |
DenseNet-201 | 97.00% | 98.75% | 96.25% | 1.05% |
MobileNet-v2 | 92.62% | 97.50% | 88.75% | 2.73% |
ResNet-101 | 93.12% | 95.00% | 90.00% | 1.89% |
ResNet-50 | 94.50% | 96.25% | 90.00% | 2.30% |
ResNet-18 | 95.13% | 97.50% | 92.50% | 1.90% |
Xception | 88.38% | 92.50% | 86.25% | 2.13% |
Inception-ResNet-v2 | 89.50% | 95.00% | 78.75% | 4.97% |
ShuffleNet | 93.62% | 96.25% | 90.00% | 2.24% |
NASNet-Mobile | 89.00% | 96.25% | 73.75% | 6.12% |
DarkNet-53 | 93.75% | 98.75% | 88.75% | 2.50% |
EfficientNet-b0 | 91.25% | 93.75% | 90.00% | 1.32% |
Model | F1 Score | Precision | Recall | Specificity |
---|---|---|---|---|
SqueezeNet | 92.06% | 95.91% | 89.91% | 97.00% |
GoogLeNet | 93.39% | 96.23% | 91.49% | 97.25% |
Inception-v3 | 92.22% | 96.00% | 90.12% | 97.05% |
DenseNet-201 | 96.31% | 97.95% | 95.00% | 98.50% |
MobileNet-v2 | 90.66% | 93.66% | 88.90% | 95.44% |
ResNet-101 | 91.15% | 95.40% | 88.66% | 96.63% |
ResNet-50 | 92.98% | 96.40% | 90.83% | 97.35% |
ResNet-18 | 93.88% | 96.45% | 92.11% | 97.42% |
Xception | 84.05% | 92.91% | 80.62% | 94.86% |
Inception-ResNet-v2 | 85.44% | 93.03% | 82.98% | 94.87% |
ShuffleNet | 92.09% | 94.19% | 90.92% | 95.69% |
NASNet-Mobile | 87.72% | 87.86% | 89.76% | 88.28% |
DarkNet-53 | 91.95% | 95.95% | 89.58% | 97.02% |
EfficientNet-b0 | 88.61% | 93.88% | 85.77% | 95.58% |
Since deep learning models are computation intensive, we have compared the time required to train and test each model.
All times are in seconds.
Model | Normal vs Scoliosis vs Spondylolisthesis | Normal vs Scoliosis | Normal vs Spondylolisthesis | Scoliosis vs Spondylolisthesis |
---|---|---|---|---|
SqueezeNet | 16.35 | 14.08 | 10.99 | 14.69 |
GoogLeNet | 34.1 | 26.13 | 19.24 | 26.9 |
Inception-v3 | 84.58 | 68.07 | 47.6 | 69.5 |
DenseNet-201 | 243.52 | 199.25 | 126 | 196.4 |
MobileNet-v2 | 151.47 | 86.1 | 66.91 | 99.77 |
ResNet-101 | 367.83 | 284.4 | 169.87 | 288.16 |
ResNet-50 | 161.04 | 131.47 | 78.1 | 126.98 |
ResNet-18 | 62.58 | 50.28 | 30.7 | 49.97 |
Xception | 337.25 | 221.25 | 135.68 | 256.77 |
Inception-ResNet-v2 | 246.48 | 201.2 | 135.53 | 210.3 |
ShuffleNet | 97.43 | 78.8 | 47.68 | 81.6 |
NASNet-Mobile | 2271.3 | 1804.2 | 1024.5 | 1764 |
DarkNet-53 | 57.5 | 47.38 | 31.2 | 46.69 |
EfficientNet-b0 | 215.47 | 166.25 | 101.86 | 170.32 |
- Including images of more vertebral column diseases (e.g., disc degeneration, spondylitis, osteoporosis, etc.) in a global image data store similar to ImageNet.
- Developing algorithms and using transfer learning to pinpoint faulty vertebrae or the exact location of the spine anomaly.
- Multistage classification, where images are first classified into the corresponding disease state, followed by localization or severity grading.
- Continual learning through the development and deployment of mobile applications to aid physicians, collect data, and refine the AI models.
*healthy, disk herniation, or spondylolisthesis. **Pair-wise permutation of healthy, disk herniation, and spondylolisthesis.
Study | Classification problem | Dataset | Accuracy |
---|---|---|---|
Alafeef et al. [ |
Three-class classification* | 422 subjects | 99.5% |
Reshi et al. [ |
Three-class classification* | 310 records | 99.5% |
Unal et al. [ |
Pair-wise** | 310 records | 96.0% |
Colombo et al. [ |
Healthy vs scoliosis | 272 scoliosis and 20 healthy | 85% |
Wang et al. [ |
Progressive vs non-progressive scoliosis | 490 subjects | 76% |
Yang et al. [ |
Four classes for scoliosis severity | 3640 back images | 80% |
This work | Three-class and pair-wise classification | 338 subjects | 96.34%-99.33% |
Artificial intelligence-aided diagnosis systems are being proposed and deployed in many medical areas. These systems have many advantages, such as aiding undermanned remote areas, reducing human errors, and optimizing costs. In this paper, it has been shown that deep transfer learning using locally collected X-ray images is able to achieve high performance in terms of correctly distinguishing normal subjects from those suffering from scoliosis or spondylolisthesis. The highest mean accuracy values ranged from 96.34% for three-class classification to > 97% for the other classification problems. Even though deep learning incurs high overhead, the results show that training and validation can be performed in a reasonably low time using off-the-shelf hardware resources.
Deep transfer learning can be used to perform spondylolisthesis and scoliosis screening in order to improve the selection of patients who would require further costly CT or MRI imaging. Moreover, the work in this paper can be further improved and made more robust with larger databases containing more images and more diseases. In addition, field deployment will allow practical benefits and continuous improvements.