Alzheimer’s disease diagnosis from diffusion tensor images using convolutional neural networks

Machine learning algorithms are increasingly being applied to classify and/or predict the onset of neurodegenerative diseases, including Alzheimer's Disease (AD); this can be attributed to the abundance of data and of powerful computing resources. The objective of this work was to deliver a robust classification system for AD and Mild Cognitive Impairment (MCI) against healthy controls (HC) using a low-cost network with a shallow architecture and modest processing requirements. The dataset included in this study was downloaded from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The classification methodology was a convolutional neural network (CNN), where the diffusion maps and gray-matter (GM) volumes were the input images. The number of scans included was 185, 106, and 115 for HC, MCI, and AD respectively. A ten-fold cross-validation scheme was adopted, and the stacked mean diffusivity (MD) and GM volume produced an AUC of 0.94 and 0.84, an accuracy of 93.5% and 79.6%, a sensitivity of 92.5% and 62.7%, and a specificity of 93.9% and 89% for AD/HC and MCI/HC classification respectively. This work elucidates the impact of incorporating data from different imaging modalities, i.e. structural Magnetic Resonance Imaging (MRI) and Diffusion Tensor Imaging (DTI), where deep learning was employed for classification. To the best of our knowledge, this is the first study assessing the impact of having more than one scan per subject and proposing the proper maneuver to confirm the robustness of the system. The results were competitive with the existing literature, which paves the way for developing medications that could slow down the progression of AD or prevent it.


Introduction
Neurodegenerative diseases have gained increasing attention in the past few decades; these include Alzheimer's Disease (AD), Mild Cognitive Impairment (MCI), and others. Several research groups have applied machine learning algorithms for the detection, localization, or prediction of disease, or for clustering different diseases or disease stages [1][2][3]. Image-based diagnosis of AD is important mainly to avoid subjective assessments [4]. Deep learning-based methods give successful results, particularly in medical image analysis [5], due to their flexible and efficient formulations [6].
In 2017, over 121 thousand people died from AD in the United States, making it the sixth leading cause of death. Between 2000 and 2017, the number of deaths due to AD increased by 145% [7]. By 2050, the number of people older than 60 years will have increased by 1.25 billion, equivalent to 22% of the global population, with 79% living in the world's less developed countries [8]. The annual cost of the disease is around $868 and $3,109 per person in low-income and lower-to-middle-income countries respectively [9].
MCI can be an antecedent to several neurodegenerative diseases [1,10]. Prominently, MCI is considered to be the prodromal phase of AD [7,11]. Around 15-20% of people older than 65 years have been diagnosed with MCI due to different pathologies. In a two-year follow-up, 15% of subjects with MCI developed dementia, and in a five-year follow-up, 32% of subjects with MCI developed AD [7].
In this work, the objective was to classify AD and MCI from healthy controls (HC) using a convolutional neural network (CNN), where DTI and MRI were employed. All diffusion maps were investigated and compared: Mean Diffusivity (MD), Fractional Anisotropy (FA), and Mode of Anisotropy (MO). Moreover, the effect of the time interval between two subsequent scans was investigated to determine an appropriate interval between a subject's scans that avoids overfitting.

Materials and methods
The dataset employed in this study is owned by a third-party organization, the Alzheimer's Disease Neuroimaging Initiative (ADNI). A complete description of ADNI and up-to-date information is available at http://adni.loni.usc.edu/ and data access requests are to be sent to http://adni.loni.usc.edu/data-samples/access-data/. Detailed inclusion criteria for the diagnostic categories can be found at the ADNI website (http://adni.loni.usc.edu/methods, ADNI2 manual page 27). All ADNI studies are conducted according to the Good Clinical Practice guidelines, the Declaration of Helsinki, and U.S. 21 CFR Part 50 (Protection of Human Subjects) and Part 56 (Institutional Review Boards). Written informed consent was obtained from all participants before protocol-specific procedures were performed. The ADNI protocol was approved by the Institutional Review Boards of all of the participating institutions. The ethics committees/institutional review boards that approved the ADNI study are listed within S1 File. The dataset employed comprises 406 subjects: 185, 106, and 115 subjects with HC, MCI, and AD respectively. The subjects' characteristics are listed in Table 1.
The preprocessing of the scans was adopted from the pipeline introduced in [19,20]. MRI T1 scans were spatially segmented and normalized to the Montreal Neurological Institute (MNI) template using the Statistical Parametric Mapping (SPM12) software; specifically, the Computational Anatomy Toolbox (CAT12) was utilized and the Diffeomorphic Anatomical Registration using Exponentiated Lie algebra (DARTEL) algorithm was implemented [21]. Linear regression was implemented to remove the effect of the Total Intracranial Volume (TIV) [22]. This pipeline outputs several files, such as the White Matter (WM) volume, Gray Matter (GM) volume, TIV values, and the deformation fields (to and from the MNI space). On the other hand, the DTI scans were preprocessed as per the guidelines of the FMRIB Software Library (FSL) [23]: the eddy currents were corrected, the skull was stripped, the diffusion tensor was calculated, and the diffusion maps were calculated. Finally, the DTI maps were co-registered with the normalized T1 scans of the same subject at the same time point via the SPM coregister toolbox [24].
Three main diffusion maps can be calculated from DTI, namely MD, FA, and MO [25,26]. MD is the average of the eigenvalues of the diffusion tensor ellipsoid [27]. FA measures the degree of anisotropy of the diffusion (0 is perfect isotropy and 1 is perfect anisotropy). MO reflects the shape of the anisotropy, i.e. whether the diffusion is closer to planar (-1) or tubal/linear (+1), with intermediate shapes near 0. In neurodegenerative diseases, including Alzheimer's disease, demyelination can be perceived as an increase in Radial Diffusivity (RD) and a decrease in FA [28,29].
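These per-voxel definitions can be made concrete in a short sketch (illustrative Python, not part of the study's MATLAB pipeline); the function below computes MD, FA, and MO from the three eigenvalues of the diffusion tensor using the standard formulas.

```python
import math

def diffusion_metrics(l1, l2, l3):
    """Compute MD, FA, and MO from the eigenvalues of the diffusion tensor."""
    md = (l1 + l2 + l3) / 3.0                     # Mean Diffusivity: average eigenvalue
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0  # Fractional Anisotropy in [0, 1]
    # Mode of anisotropy: normalized determinant of the deviatoric tensor,
    # -1 for planar and +1 for tubal/linear diffusion shapes.
    d1, d2, d3 = l1 - md, l2 - md, l3 - md
    norm = math.sqrt(d1 * d1 + d2 * d2 + d3 * d3)
    mo = 3.0 * math.sqrt(6.0) * (d1 * d2 * d3) / norm ** 3 if norm > 0 else 0.0
    return md, fa, mo

# Example: a strongly tubal (linear) tensor yields FA near 1 and MO near +1
md, fa, mo = diffusion_metrics(2.0, 0.1, 0.1)
```

A purely planar tensor (two equal eigenvalues, one zero) yields MO = -1 under the same formula, matching the interpretation above.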
The hippocampus and the entorhinal cortex are the main and earliest regions to develop anatomical atrophy in AD [26,28,30-36]. Thus, a bounding box including the hippocampus and the entorhinal cortex was identified via the Harvard-Oxford [37][38][39][40] and Juelich [41] atlases respectively; the original scans were 121×145×121 voxels, and after selecting the Volume of Interest (VOI) they became 61×37×38 (Fig 1).
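The VOI selection amounts to a fixed crop of the normalized volume. The sketch below (hypothetical Python; the start indices are illustrative placeholders, not the atlas-derived coordinates used in the paper) shows the bookkeeping that turns a 121×145×121 volume into a 61×37×38 box.

```python
def crop_voi(vol, start, size):
    """vol is a nested list indexed [x][y][z]; return the sub-volume of the given size."""
    x0, y0, z0 = start
    dx, dy, dz = size
    return [[row[z0:z0 + dz] for row in plane[y0:y0 + dy]]
            for plane in vol[x0:x0 + dx]]

# Dummy volume with the original normalized dimensions
volume = [[[0.0] * 121 for _ in range(145)] for _ in range(121)]

# start=(30, 54, 40) is a placeholder; the paper derives the box from the atlases
voi = crop_voi(volume, start=(30, 54, 40), size=(61, 37, 38))
```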
To address the classification task, a 2D CNN was employed, since it has the advantage of taking into consideration the spatial relationships between pixels, which is especially valuable with a pathology that progresses over time and across brain regions [19,42-44].
The proposed CNN consisted of the image input layer, convolutional filters as described below, a batch normalization layer, a ReLU layer [45], a max-pooling layer, a fully connected layer, a softmax layer, and the classification layer (Fig 2). The weights of the network were calculated using gradient descent optimization with the Root Mean Square Propagation (RMSProp) algorithm [46]. Recently, Sobolev gradient-based optimization has been used in deep network-based methods to diagnose AD [47,48]; however, standard gradient descent optimization is computationally efficient in the proposed approach. The concept of 1×1 convolution was first introduced by Lin et al. [49], whereas its usage has been scarce in medical applications [50,51]. Its role in decreasing the complexity while increasing the nonlinearities (and hence the discriminative ability) was later clarified in [52].
In order to select the optimal network hyperparameters, such as the network depth and the filters' size and number, iterative experiments were performed in which one layer was added and its filter size optimized before adding the next layer. The optimal sizes were selected when the highest performance measures were met. It is worth noting that the weights were learnt via a mini-batch scheme, where the batch size, the learning rate, and the number of epochs were 20, 0.001, and 60 respectively. The ten-fold cross-validation implies a 10% test set and a 90% training set; further, the training set was split into 75% for network training and 25% for validation of the parameters. Validation was performed once per epoch.
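The nested split described above can be sketched as follows (illustrative Python, not the study's MATLAB code). Using 400 scans for illustration, each fold yields roughly 40 test, 270 training, and 90 validation scans.

```python
import random

def fold_split(indices, fold, n_folds=10, train_frac=0.75, seed=0):
    """Return (train, validation, test) index lists for one cross-validation fold.

    The test set is 1/n_folds of the data; the remainder is split
    train_frac / (1 - train_frac) between training and validation.
    """
    rng = random.Random(seed)          # fixed seed so all folds share one shuffle
    idx = list(indices)
    rng.shuffle(idx)
    fold_size = len(idx) // n_folds
    test = idx[fold * fold_size:(fold + 1) * fold_size]
    rest = idx[:fold * fold_size] + idx[(fold + 1) * fold_size:]
    n_train = int(round(train_frac * len(rest)))
    return rest[:n_train], rest[n_train:], test

train, val, test = fold_split(range(400), fold=0)
```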
In this work, three experiments were performed:
• Analysis of individual and cascaded maps: The MD volume was fed to the CNN and the performance measures of the test set were calculated; the same was done for FA and MO, and the optimal CNN parameters were selected. In addition, the three diffusion volumes were cascaded and fed to the CNN, and the optimal CNN parameters were selected for the cascaded volume. The same setting was used for the cascaded MD and GM volumes. Cascading in this study was done by concatenating the diffusion map volumes following each other in depth. Thus, the original size of 61×37×38 for one map increases to 61×37×76 and 61×37×114 in the case of two and three maps respectively, where the third dimension is normal to the axial plane as described in Fig 2.
• Analysis while including a single scan per year for the same subject: The impact of excluding temporally-close scans (less than a year apart), i.e. portions vs. annual, was assessed. In particular, the dataset comprised subjects who had been scanned more than once, and the interval between two subsequent scans was not fixed. First, all scans were explored altogether (denoted in this study by portions). Second, the scans that remained after excluding those belonging to the same subject but acquired within less than a year of a preceding or succeeding scan (denoted by annual) were explored as well. In other words, if subject X has scans XS1, XS2, and XS3 sorted by acquisition date, where XS1 and XS2 were taken within less than a year of each other while XS2 and XS3 were taken a year or more apart, only XS1 and XS3 would be kept for subject X; this is referred to as annual. If all scans for X were retained irrespective of the interval, i.e. keeping XS1, XS2, and XS3, this is referred to as portions.
• Analysis of segregated versus mixed training and test datasets: Separating the cross-validation folds by subject ID versus random assignment to any fold (segregated vs. mixed) was evaluated. In this analysis, the presence of multiple scans per subject was exploited in two ways: either all scans belonging to one subject were placed in the training set or the test set of each cross-validation fold (referred to as segregated), or scans were assigned to folds at random irrespective of the subject's ID (referred to as mixed).
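One greedy reading of the annual rule (our interpretation for illustration, not the authors' code): sort a subject's scans by date and keep a scan only if it was acquired a year or more after the last kept scan. This reproduces the XS1/XS3 example above.

```python
from datetime import date

def annual_filter(scan_dates, min_days=365):
    """Keep scans at least min_days after the previously kept scan (greedy)."""
    kept = []
    for d in sorted(scan_dates):
        if not kept or (d - kept[-1]).days >= min_days:
            kept.append(d)
    return kept

# Subject with three scans: the second is <1 year after the first and is dropped
scans = [date(2010, 1, 10), date(2010, 7, 1), date(2011, 8, 1)]
kept = annual_filter(scans)
```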
Five performance measures were calculated: Area Under the Curve (AUC), accuracy, sensitivity, specificity, and F1-measure [53]. Since the mean of the performance measures for MD was generally the highest among the maps, the statistical significance of the difference between each of the other maps and MD was analyzed. The Sign test was employed [54,55] since the distributions were not always normal, as assessed by the Shapiro-Wilk test [56], and there were only ten observations (the number of cross-validation folds).
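Four of these measures follow directly from the confusion-matrix counts of each fold (AUC is computed separately from the ROC curve); a minimal sketch, with illustrative counts:

```python
def performance_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)        # true positive rate (recall)
    specificity = tn / (tn + fp)        # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f1

# Illustrative fold counts, not the study's actual confusion matrix
acc, sen, spe, f1 = performance_measures(tp=37, tn=54, fp=4, fn=3)
```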
To plot the Receiver Operating Characteristic (ROC) curve for any experiment, the ten cross-validation passes, each having different x-y pairs (1-specificity/FPR and sensitivity/TPR), were utilized. The ten curves were interpolated to a common x-axis, the False Positive Rate (FPR), and the average of the other axis, the True Positive Rate (TPR), was calculated.
For each fold of the cross-validation, the ROC curve is formed by a set of (FPR, TPR) points. First, a common arbitrary grid of FPR values was chosen. In order to calculate the average ROC curve of the ten folds, the (FPR, TPR) pairs of each curve were sorted monotonically increasing with respect to FPR. To avoid the problem of multiple TPR values for the same FPR value (i.e. vertical segments), only the pair with the largest TPR for each FPR value (denoted TPRmax) was retained. Then, the (FPR, TPRmax) points were interpolated onto the previously-selected common FPR grid. The same procedure was applied to all ten curves, such that all of them coincide on the same FPR grid, and then the average was calculated.
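Our reading of this averaging procedure can be sketched as follows (illustrative Python with two toy folds, not the authors' implementation): keep the maximum TPR per FPR value, interpolate each curve onto the common grid, and average.

```python
def average_roc(curves, grid):
    """curves: list of [(fpr, tpr), ...] per fold; grid: common FPR values (sorted)."""
    mean_tpr = [0.0] * len(grid)
    for curve in curves:
        # keep the maximum TPR for each FPR (the last point of any vertical segment)
        best = {}
        for fpr, tpr in curve:
            best[fpr] = max(best.get(fpr, 0.0), tpr)
        xs = sorted(best)
        ys = [best[x] for x in xs]
        for i, g in enumerate(grid):            # piecewise-linear interpolation
            if g <= xs[0]:
                t = ys[0]
            elif g >= xs[-1]:
                t = ys[-1]
            else:
                j = next(k for k in range(1, len(xs)) if xs[k] >= g)
                w = (g - xs[j - 1]) / (xs[j] - xs[j - 1])
                t = ys[j - 1] + w * (ys[j] - ys[j - 1])
            mean_tpr[i] += t / len(curves)
    return mean_tpr

grid = [0.0, 0.5, 1.0]
folds = [[(0.0, 0.0), (0.0, 0.6), (1.0, 1.0)],   # vertical segment at FPR = 0
         [(0.0, 0.4), (1.0, 1.0)]]
mean = average_roc(folds, grid)
```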
The aim of this work was to provide an automatic classification of MCI and AD versus HC. For subjects having multiple scans at different timepoints, the effect of selecting only scans taken a year or more after the previous scan of the same subject was investigated. In addition, the impact of having scans of the same subject at different timepoints in both the training and test sets, versus separating them by subject, was assessed in terms of the overall performance.
The implementation was done on a 64-bit Windows server 2019 machine, Intel Xeon CPU E5-2650 @ 2 GHz processor, eight cores, and 384 GB RAM. The CNN architecture was built using MATLAB ver. R2018b.

Results
In this study, several objectives were addressed for detecting AD via a machine learning technique, namely a CNN. The first objective was to search for the CNN hyperparameter values that maximize performance. The second objective was to study whether the diffusion maps alone yield good discrimination between the classes or whether fusion with structural data boosts the performance. The third objective was to evaluate the impact of the time gap between two successive scans belonging to the same subject. Finally, the study assessed the effect of mixing the training and test sets versus segregating them such that all scans belonging to the same subject are in either the training set or the test set.
Upon evaluating the different hyperparameters of the 2D CNNs, the optimal CNN for one volume (MD, MO, FA, or GM), each of which is 61×37×38, consisted of a single convolutional layer with five filters, each 5×5×38. On the other hand, the optimal CNN for the cascaded-volume experiments, namely MD+MO+FA of size 61×37×114 and GM+MD of size 61×37×76, consisted of two layers: the first included thirty 1×1×114 or thirty 1×1×76 filters respectively, and the second included five 3×3×30 filters. Regarding 2D CNNs, it is worth pointing out that the depth of the filters must match that of the input volumes, and that the depth of the output of the convolution equals the number of filters [57].
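This dimension bookkeeping can be verified with a short sketch (illustrative Python): for a valid 2D convolution over a multi-channel volume, the filter depth must equal the input depth and the output depth equals the number of filters.

```python
def conv2d_output_shape(in_shape, filt, n_filters, stride=1):
    """Output (height, width, depth) of a valid 2D convolution, no padding."""
    h, w, depth = in_shape
    fh, fw, fd = filt
    assert fd == depth, "filter depth must match input depth"
    return ((h - fh) // stride + 1, (w - fw) // stride + 1, n_filters)

# Cascaded MD+MO+FA input: 61×37×114
s1 = conv2d_output_shape((61, 37, 114), (1, 1, 114), n_filters=30)  # thirty 1×1 filters
s2 = conv2d_output_shape(s1, (3, 3, 30), n_filters=5)               # five 3×3 filters
```

The 1×1 layer preserves the 61×37 spatial extent while compressing the depth from 114 to 30, which is exactly the complexity-reduction role attributed to 1×1 convolutions above.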

Analysis of individual and cascaded maps
Regarding the individual maps, MD performed better than the other diffusion maps and the three cascaded volumes, with near statistical significance (Tables 2 and 3). GM resulted in an accuracy of 91.3% and 75.7%, a sensitivity of 88.3% and 60.7%, a specificity of 92.8% and 84%, and an AUC of 0.96 and 0.80 for classifying AD and MCI respectively from HC (Table 2).
Further, incorporating the GM with the MD (cascading them as a deeper volume, denoted MD+GM) improved the results (Table 3), sometimes significantly depending on the performance measure, compared with either MD or GM alone. Specifically, MD+GM produced an accuracy of 93.5% and 79.6%, a sensitivity of 92.5% and 62.7%, a specificity of 93.9% and 89%, and an AUC of 0.94 and 0.84 for AD and MCI classification respectively. Cascading the three diffusion maps resulted in the poorest performance (Table 3); MD+MO+FA produced an accuracy of 78.6% and 70.8%, a sensitivity of 66.3% and 41.5%, a specificity of 85.6% and 87.3%, and an AUC of 0.86 and 0.74 for the classification of AD and MCI respectively versus HC.

Analysis while including a single scan per year for the same subject
Generally speaking, excluding the scans that belonged to the same subject that were carried out within less than a year resulted in an insignificant drop in the performance in terms of accuracy, AUC, sensitivity, specificity, and F1-score, as shown in Tables 2 and 3.

Analysis of segregated versus mixed training and test datasets
Mixing the scans of one subject between the training and test sets within one cross-validation yielded overfitting; in particular, the results were generally statistically significantly higher in the mixed portions experiments than in the segregated ones. The level of significance was lower when scans acquired within less than a year of another scan of the same subject were removed (Table 3). The accuracy and AUC for the cascaded mixed maps were 16.9% and 0.12 higher respectively than the corresponding segregated ones for HC/AD classification, and 22.2% and 0.25 higher respectively for HC/MCI classification; this highlights the severity of the overfitting encountered.
The ROC curves for all analyses are displayed in Fig 3, and a summary of the results is tabulated in Table 4.

Discussion
The authors of [29] found that the MD in the hippocampal and para-hippocampal areas was complementary to the GM volume for the classification of HC/AD. Firbank et al. [66] added that the clusters where MD was significantly higher in AD subjects than in controls were primarily in the left temporal lobe, paralleling the grey matter atrophy in these locations. Further, Rose et al. [67] reported that MD was elevated significantly in the hippocampus, amygdala, and entorhinal cortex, whereas FA was reduced significantly mainly in the thalamus. They also showed that the cortical areas with increased MD correlate with the regions of reduced gray matter density measured using structural MRI in patients with AD. It is worth pointing out that the results of this work coincide with [67]. In this work, FA yielded better results than MO. This seems to contrast with the low-sample study of Lee et al. [65], where MO yielded an accuracy ~7-10% higher than that driven by FA for HC/AD and HC/MCI classification respectively. It is worth noting that the sample size used in this study was at least five-fold that of Lee et al. [65]. The cascaded diffusion maps yielded worse performance, though not always significantly, than employing MD alone.
The GM volumes alone, in agreement with the literature, improved the results [17,68]; this is attributed to the fact that AD is prominently characterized by amyloid plaques and neurofibrillary tangles that deposit in the GM, which in turn lead to the death of neurons and the thinning of the cortex, i.e. atrophy [69][70][71][72]. Oishi et al. reported in their study that "DTI is useful for localizing and quantifying the anatomical abnormalities, but apparently not adequate to investigate the histopathological background of the diseases" [71]. They explained that the DTI measures could be affected by the pathology or other factors. For example, diffusion lasts for up to 100 ms over a radius of up to 10 μm, which is then averaged over a voxel of 2-3 mm in size; this indeed makes it more sensitive to the presence of multiple fiber bundles and the partial volume effect [71,73]. In addition, Henf et al. concluded that without applying partial volume correction, MD was not superior to gray matter volume in separating MCI and AD from HC [73].
It is important to note that in this work the volumes, namely MD, FA, and MO, and GM and MD, were cascaded, whereas Wen et al. [62] assessed the MD and FA values over the GM mask (Table 4); therefore, the performance of the two works cannot be properly compared. Dyrba et al. [17], using the European DTI Study on Dementia (EDSD) cohort, reported that combining the MD with GM extracted from structural MRI, using a Support Vector Machine (SVM), worsened the results of GM alone. Moreover, the authors reported that GM outperformed MD in terms of accuracy, sensitivity, and specificity (Table 4). In this work, incorporating the GM with the MD improved the results, though not always significantly.

One of the biggest hurdles encountered when dealing with machine learning in general, and neural networks in particular, is the limited size of the dataset, especially with medical data; this is the main cause of overfitting [74]. The batch normalization layer is used to reduce the problem of overfitting [75,76] due to its importance in deep learning [77]. In addition, the usage of small-sized filters usually enhances the test set performance compared to larger filters, as explained in the Methods section, by decreasing overfitting; this aligns with Pereira et al. [78], who advocated that small filter sizes of 3×3 would minimize the effect of overfitting since the number of parameters to be learnt decreases. Further, Simonyan and Zisserman [52] explained that the effective receptive field of two stacked 3×3 convolutional layers is equivalent to a single 5×5 layer, and that of three stacked 3×3 convolutional layers is equivalent to a single 7×7 layer. Stacking also increases the number of nonlinearities while decreasing the number of weights to be optimized, by 28% and roughly 45% for the two cases respectively.
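The weight savings in Simonyan and Zisserman's argument can be checked with a short sketch (illustrative Python; biases are ignored for simplicity, and the channel count is an assumption for illustration).

```python
def conv_weights(kernel, channels, n_layers=1):
    """Weights of n_layers stacked kernel×kernel conv layers with equal in/out channels."""
    return n_layers * kernel * kernel * channels * channels

C = 64                                     # illustrative channel count
two_3x3 = conv_weights(3, C, n_layers=2)   # receptive field equivalent to 5×5
one_5x5 = conv_weights(5, C)
three_3x3 = conv_weights(3, C, n_layers=3) # receptive field equivalent to 7×7
one_7x7 = conv_weights(7, C)

saving_5 = 1 - two_3x3 / one_5x5           # 18C² vs 25C²: 28% fewer weights
saving_7 = 1 - three_3x3 / one_7x7         # 27C² vs 49C²: ~45% fewer weights
```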
In addition, the proposed architecture comprised only one or two convolutional layers in depth to alleviate the problem of overfitting; this is in agreement with Ahmed et al. [76] and RStudio online tutorials [79]. Moreover, the ten-fold cross-validation technique was incorporated to give a good estimate of the generalizability of the classification [80,81].

It can be noticed that the drop in performance between portions (all scans of the subject included) and annual (only scans a year or more apart included) was minor; this could be attributed to the fact that the number of scans, under the annual constraint, fell to at most half of the unconstrained count (Table 1).
As shown previously in the Results section, segregating the scans of the same subject into either the training or the test data, versus randomly assigning scans with no constraints across the cross-validation folds, dropped the accuracy by around 17-22% and the AUC by 0.12-0.25. This could be interpreted as a case of overfitting: during a cross-validation pass, the network treats a temporal instance of a scan of the same subject as a previously-seen scan from the training stage, since there is a spatial dependency within the same subject as the disease progresses. This overfitting would inflate the apparent classification performance [44,82-84].
Further, the average execution time for the entire ten-fold cross-validation, training and testing, was 12.5 minutes, and the average inference time per scan was 0.005 seconds; this is quite competitive when a graphical processing unit (GPU) is restricted or unavailable.
It is important to highlight that all models proposed in this study had higher specificity than sensitivity; i.e. they are better at identifying true negatives than true positives. Consistent with this, some analyses have suggested the presence of a trade-off between these two measures [85][86][87]. This is mainly due to the fact that the number of healthy subjects used in this study was considerably larger than the number of MCI and AD subjects [85,86].
It is worth mentioning that incorporating CSF amyloid data could be considered in future work to assess its role in differentiating cognitive deficits. Longitudinal assessment of the cases should also be studied; this is a promising means of early detection of the onset of AD, which would aid AD drug discovery and testing.

Conclusion
In this paper, a CNN was handcrafted to classify MCI and AD from HC. The MD, FA, MO, GM, and MD+GM volumes were compared; MD was the best-performing diffusion map for classification, with an accuracy, specificity, and AUC of 88.9%, 91.7%, and 0.93 respectively for HC/AD classification, and 71.1%, 81.8%, and 0.68 respectively for HC/MCI classification. Combining GM with MD enhanced the performance, though below the 5% significance level, giving an accuracy, specificity, and AUC of 93.5%, 93.9%, and 0.94 respectively for HC/AD classification, and 79.6%, 89%, and 0.84 respectively for HC/MCI classification.
The dataset comprised more than one scan per subject, and based on this work, it is recommended that the training and test sets be split such that all of a subject's scans fall in the same partition; i.e. the subject IDs in the training set and the test set should not overlap.