Early diagnosis of Alzheimer’s disease using combined features from voxel-based morphometry and cortical, subcortical, and hippocampus regions of MRI T1 brain images

In recent years, several high-dimensional, accurate, and effective classification methods have been proposed for automatically discriminating subjects with Alzheimer’s disease (AD) or its prodromal phase (i.e., mild cognitive impairment (MCI)) from healthy control (HC) persons based on T1-weighted structural magnetic resonance imaging (sMRI). These methods emphasize only individual features from sMRI images for the classification of AD, MCI, and HC subjects, and their achieved classification accuracy is low. However, the latest multimodal studies have shown that combining multiple features from different sMRI analysis techniques can improve the classification accuracy for these types of subjects. In this paper, we propose a novel classification technique that precisely distinguishes individuals with AD, aAD (stable MCI, who had not converted to AD within a 36-month time period), and mAD (MCI caused by AD, who had converted to AD within a 36-month time period) from HC individuals. The proposed method combines three different features extracted from structural MR (sMR) images using voxel-based morphometry (VBM), hippocampal volume (HV), and cortical and subcortical segmented region techniques. Three classification experiments were performed (AD vs. HC, aAD vs. mAD, and HC vs. mAD) with 326 subjects (171 elderly controls and 81 AD, 35 aAD, and 39 mAD patients). For the development and validation of the proposed classification method, we acquired the sMR images from the dataset of the National Research Center for Dementia (NRCD). A five-fold cross-validation technique was applied to find the optimal hyperparameters for the classifier, and the classification performance was compared by using three well-known classifiers: K-nearest neighbor, support vector machine, and random forest. Overall, the proposed model with the SVM classifier achieved the best performance on the NRCD dataset.
Among the individual features, the VBM technique provided the best results, followed by the HV technique. However, the use of combined features improved the classification accuracy and predictive power for the early classification of AD compared to the use of individual features. The most stable and reliable classification results were achieved when combining all extracted features. Additionally, to analyze the efficiency of the proposed model, we used the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset to compare the classification performance of the proposed model with those of several state-of-the-art methods.


Introduction
Alzheimer's disease (AD) is a progressive neurodegenerative brain disorder. AD is considered to be the most common type of dementia (mental illness), accounting for 50%-80% of dementia cases. It is a complex disease characterized by the accumulation of β-amyloid (Aβ) plaques and neurofibrillary tangles [1], which are composed of tau amyloid fibrils and are linked to synaptic dysfunction and loss and to progressive neurodegeneration, leading to memory loss and other brain-related cognitive problems. The pathophysiological changes that cause cognitive, functional, and behavioral impairment in AD patients are thought to begin several years or even decades prior to the onset of clinical symptoms. AD is typically diagnosed in people above 65 years of age, and it has been stated that the number of AD patients tends to double with every 5-year increase in age beyond 65 years [2]. It is believed that one in every 85 people will be affected by AD by the year 2050 [3]. The average life expectancy of AD patients varies between 3 and 10 years, depending on the age at which they were diagnosed. The median lifespan is as long as 7 to 10 years for AD patients whose conditions were identified when they were in their 60s or early 70s. This number decreases to 3 years or less for AD patients who were diagnosed when they were in their 90s [4]. Recently, more specific research criteria have been proposed for the early and accurate diagnosis of AD in the prodromal stage of the disease, i.e., mild cognitive impairment (MCI) [5], which is of great importance for the timely treatment and possible delay of the disease.
MCI is often used to refer to patients with objective cognitive impairment who have normal capabilities for the activities of daily living and do not meet the criteria for cognitive decline or dementia [6,7]. Generally, two types of clinical changes are observed in MCI patients over time: some MCI subjects will eventually develop AD (i.e., MCI caused by AD (mAD) or MCI converters (MCIc)), whereas others will never develop AD (i.e., stable MCI (aAD) or MCI non-converters). About 35% of MCI patients progress to AD or dementia within a 3-year follow-up period, with a yearly conversion rate of 5%-10%. Predictors of this conversion include whether the patients are carriers of ε4 alleles of the apolipoprotein E (APOE) gene, brain atrophy, clinical severity, patterns of cerebrospinal fluid (CSF) biomarkers, cerebral glucose metabolism, and Aβ deposition [8,9]. aAD and mAD subjects are distinguished by the severity of amnestic impairment, with mAD requiring memory test performance 1.5 SD or more below standardized, age-adjusted norms and aAD requiring memory test performance between 1.0 and 1.5 SD below standardized norms [10,11]. aAD patients have milder verbal memory impairment than mAD patients and are thought to represent a very early stage of the disease that may be optimal for disease-modifying treatments. Differences between these MCI subtypes may have implications for biomarker abnormalities, likely clinical course, and treatment response in patients with cognitive impairment [10]. Because of the heterogeneity of the clinical presentation and underlying etiologies within the MCI group, there is no documented treatment for MCI subjects.
Furthermore, the clinical course of MCI is more heterogeneous than distinctive. There are many causes of MCI, and not all are related to progressive neurodegenerative disorders. Therefore, diagnosing the underlying etiology is very challenging for individuals with cognitive impairment, and there is a need for more accurate diagnostic tests to identify MCI patients in whom AD may be the underlying cause. Such diagnostic methods should be applied as early as possible in the course of the disease [12]. This is a qualitative prognosis problem that can be solved via the classification between aAD and mAD. Because AD is a progressive neurodegenerative disease, there are continuous changes between previously measured and current clinical scores (e.g., AD Assessment Scale Cognitive Subscale (ADAS-Cog) and Mini-Mental State Examination (MMSE)). Therefore, it is important to predict future clinical scores based on data from earlier time points, which is particularly helpful for monitoring disease progression. However, the classification between aAD and mAD is challenging, and benchmark results in [12,13] showed low accuracy, which, according to the authors, was likely caused by the heterogeneity of cortical thinning patterns in aAD subjects. This limitation can be overcome by increasing the number of training samples to cover all complex patterns or by choosing only those features that reflect the differences between these two groups.
Over the past few years, many high-dimensional pattern-based classification methods have been developed for AD and MCI. These methods have largely focused on individual biomarker modalities for the diagnosis of AD or MCI (e.g., structural brain atrophy can be measured by structural magnetic resonance imaging (sMRI) [13][14][15][16][17], metabolic brain alterations can be measured by fluorodeoxyglucose positron emission tomography (FDG-PET) imaging [18,19], and pathological amyloid depositions can be measured from CSF [20,21]). This focus may limit the overall classification performance because different biomarker modalities provide complementary information that is useful for the diagnosis of AD [20][21][22][23]. All the above criteria are based on clinical scores of early episodic memory impairment with the presence of at least one extra supportive feature, including abnormal sMRI results or abnormal CSF amyloid and tau biomarkers. Previous experiments have shown that the use of multiple biomarkers yields promising results for the early diagnosis of AD or MCI. Moreover, these studies have combined sMRI-based markers with biomarkers based on FDG-PET [22,24], CSF [21,23,25], APOE genotype [26], or combinations of these biomarkers [27,28] to achieve promising results. However, the availability of all four biomarkers (CSF, PET, sMRI, and APOE genotype) is limited in clinical practice because obtaining these measurements is laborious for both patients and doctors. Moreover, measurements obtained from CSF and FDG-PET are considered invasive [29,30]. Recent studies focusing only on the sMRI modality have achieved a high classification accuracy of 80%-99% in distinguishing healthy controls (HC) from AD subjects [14][15][16]29,[31][32][33] and 70%-90% when predicting aAD vs. mAD [13,29].
Methods in which features are extracted from sMR images for existing classification approaches can be divided into three groups: voxel-based morphometry [34][35][36], cortical thickness [27][28][29][30][35][36][37][38][39], and region of interest (ROI) (hippocampal volume) [40][41][42][43] methods. It has been shown that the most effective features for AD or MCI classification are extracted from ROIs around the hippocampus, entorhinal cortex, parahippocampal gyrus, amygdala, etc. [13,[44][45][46]. These ROI features can improve the classification accuracy and also help to reduce the number of false positive diagnoses.
In this study, we used images from the National Research Center for Dementia (NRCD) dataset to achieve the goals of our proposed method and to distinguish AD from the other groups. Three different features extracted from sMRI images were combined into one, namely, voxel-based morphometry (VBM), hippocampal volume (HV), and cortical and subcortical segmented region (volume, thickness, meancurve, Foldcurv, Curvind, and Gauscurv) (CSC) features, for the early classification of AD or MCI. This paper shows the benefit of combining all three features into one for the early diagnosis of AD or MCI patients. We used all 326 subjects and their corresponding T1-weighted 3-T MR images from the NRCD dataset to differentiate AD, MCI, and HC individuals. To analyze the impact of combined features compared to that of individual features on the given classification problems, we used three different types of classifier, i.e., K-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM), and evaluated the area under the receiver operating characteristic curve (AUC-ROC), classification accuracy (ACC), sensitivity (SEN), specificity (SPE), precision (PRE), and F1-score (F1) in each experiment. As statistical analysis tools, we also computed Cohen's kappa statistic and McNemar's chi-squared test value for each classification group. The experimental results showed that the use of combined features from the VBM, HV, and CSC techniques yielded superior performance in terms of AD or MCI classification compared to the use of individual features.
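As an illustration of how the threshold-based metrics listed above relate to one another, the following minimal sketch computes ACC, SEN, SPE, PRE, and F1 from predicted and true labels of a two-class problem. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study; AUC-ROC is omitted because it requires continuous classifier scores.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return ACC, SEN, SPE, PRE, and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Guard against empty denominators (e.g., no positive predictions).
        return num / den if den else 0.0

    sen = ratio(tp, tp + fn)   # sensitivity (recall of the positive class)
    pre = ratio(tp, tp + fp)   # precision
    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "SEN": sen,
        "SPE": ratio(tn, tn + fp),
        "PRE": pre,
        "F1": ratio(2 * pre * sen, pre + sen),
    }


# Toy labels for illustration only (1 = patient, 0 = control); not study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```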

Ethics statement
In this study, all procedures involving human participants were performed in accordance with the latest Declaration of Helsinki. Subjects were prospectively recruited from two centers: Chonnam National University Hospital and Chosun University Hospital, including the National Research Center for Dementia in Gwangju, Korea. All patients provided written informed consent for the use of data, samples, and images at the time of inclusion in the cohort, before data collection. For AD patients who were unable to consent, a family member gave consent before participation. Assessments or psychological tests were not used to determine whether a patient was able to provide written informed consent. The consent procedure and data acquisition were approved by the Institutional Review Board (IRB number 2013-12-018-070) of Chonnam National University Hospital and Chosun University Hospital, Gwangju, South Korea.

Subjects
In this study, the dataset was acquired from a pool of persons registered at the National Research Center for Dementia in Gwangju, Korea, from January 2014 to March 2018. All subjects were examined by skilled neurologists and received a full dementia screening test, which included past history, neurological examination, laboratory and neuropsychological tests, and brain MRI. All subjects were examined through clinical interview, which included an assessment of the Clinical Dementia Rating (CDR) [47]. All HC subjects received a CDR score of 0. This group had a normal range of cognitive function and good general health with no sign of brain atrophy changes on sMRI scans, which were analyzed with the Analyze 11.0 software on an iMac system running OS X Server. All aAD (stable MCI that did not convert to AD within a 36-month time period) subjects received a CDR score of 1, and their neuropsychological assessment z-scores were above -1.5 according to education-, age-, and gender-specific norms. They had good health overall with no widespread multiple lacunae or other focal brain lesions or brain atrophy other than supposed incipient AD on MRI scans. All the mAD (MCI that converted to clinical AD within a 36-month time period) subjects met [47] and received a CDR score of 2. Their neuropsychological assessment z-scores were below -1.5 on at least one of the memory assessments according to the education-, age-, and gender-specific norms. We did not consider MCI patients who had been followed for less than 36 months and had not converted within this time frame. All the AD subjects received a CDR score of 3. This group had impaired cognitive function and severe memory loss, was unable to make judgments or solve given problems, and showed evidence of brain atrophy changes in sMRI scans. The diagnosis of AD, aAD, mAD, and HC subjects was made according to the clinical criteria proposed by the NIA-AA or IWG-2 on AD, MCI, and HC [5][6][7]9].
The exclusion criteria were (1) severe vision or hearing loss; (2) illiteracy; (3) signs of focal brain lesions on sMRI, including multiple lacunae and WM hyperintensity lesions of grade 2 or more according to the Fazekas scale; (4) any other type of dementia; (5) any significant medical, neurological, or psychiatric disorders that could disturb cognitive function; and (6) present use of psychoactive medication. More detailed information about the participants has been reported in [48,49]. The subjects were between 49 and 87 years of age, spoke Korean, and had been studied sufficiently to give a reasonable evaluation of functionality. Following [5,6,50,51], objective memory loss in the aAD subject group was measured by the education-adjusted score on the Wechsler Memory Scale Logical Memory II subscale: 9-11 for 16 or more years of education, 5-9 for 8-15 years of education, and 3-6 for 0-7 years of education. For the mAD subject group, the corresponding education-adjusted scores were ≤ 8 for 16 or more years of education, ≤ 4 for 8-15 years of education, and ≤ 2 for 0-7 years of education. We followed these criteria to assign an education-adjusted score to each patient. Means and standard deviations were calculated to describe continuous variables. Table 1 shows the demographic information for the 326 subjects. We selected all the patients for whom processed images were available.
A total of 326 subjects were selected with 81 subjects belonging to the AD group (42 females, 39 males; age ± SD = 71.86 ± 7.09 years, range = 56-83 years; education level = 7.34 ± 4.88 years, range = 0-18 years), 171 subjects belonging to the HC group (88 females, 83 males; age ± SD = 71.66 ± 5.43 years, range = 60-85 years; education level = 9.16 ± 5.54 years, range = 0-22 years), 39 subjects belonging to the mAD group (14 females, 25 males; age ± SD = 73.23 ± 7.09 years, range = 49-87 years; education level = 8.20 ± 5.19 years, range = 0-18 years), and 35 subjects belonging to the aAD group (20 females, 15 males; age ± SD = 72.74 ± 4.82 years, range = 61-83 years; education level = 7.88 ± 6.30 years, range = 0-18 years).
Table 1 shows that the aAD (stable MCI) patients were between 61 and 83 years of age with an education level of 7.88 ± 6.30 years [0-18], whereas the mAD (converted to AD within a 36-month time period) patients were between 49 and 87 years of age with an education level of 8.20 ± 5.19 years [0-18]. Thus, the mAD group had a slightly higher mean education level than the aAD group and included younger subjects (minimum age of 49 vs. 61 years). Student's unpaired t-tests were applied to determine whether there were statistically significant differences in the demographics and clinical characteristics, with a standard p-value of 0.05 used as the significance threshold. No significant differences were found in any group. Except for the mAD group, all groups included more female than male subjects. Compared to the other groups, the mAD group had the highest mean age. As can be seen from Table 1, the AD group had the lowest mean level of education among the groups. To obtain unbiased estimations of the performance, we randomly split the dataset into two groups: a training

sMR image acquisition
The images used in this study were T1-weighted sMR images, which were acquired using a 3D magnetization-prepared rapid gradient-echo sequence with a voxel size of 1 × 1 × 1 mm. The 3-T T1-weighted axial images were obtained with the following parameters: slice thickness, 5.0 mm; interslice thickness, 1 mm; repetition time, 2000 ms; echo time, 20 ms; flip angle, 90°; matrix size, 324 × 244 pixels; and field of view, 183 × 220 mm².

Image analysis
An image preprocessing step was applied to each sMR image. It is well known that strong bias fields can cause serious mislabeling of voxel tissue types, which can compromise the accuracy of the techniques that rely heavily on tissue density (e.g., registration) and, in particular, on gray and white matter contrasts. Therefore, to minimize this effect, we applied the N4 bias field correction method from the Advanced Normalization Tools (ANTs) [52] toolbox to correct the inhomogeneity artifacts in each image. Then, a feature extraction process and a feature selection (selection of high-level features) process were applied. These features contain the most essential information for the early classification of AD or MCI. In this study, a 3D sMR input image was fed into a feature extraction toolbox to extract the clinical features (volume, thickness, etc.). Fig 1 shows a systematic block diagram for the proposed method. In this experiment, we used two different toolboxes for the extraction of features: Statistical Parametric Mapping (SPM version 12) and Freesurfer (version 6.0). Lastly, the extracted features were passed to the classifier to measure the classification performance between different groups.

Feature extraction
It is well known that ROI [45] features are the most effective features for the classification of AD or MCI. In the proposed technique, we used SPM12 and Freesurfer to extract three well-known features (VBM, CSC, and HV) from each sMR image. Both of these tools are fully automated. They were applied to the 3-T T1-weighted sMR images, which were acquired from the NRCD dataset.
Voxel-based morphometry. VBM is a neuroimaging analysis method that allows researchers to investigate focal differences in brain anatomy using statistical tools. Morphometry analysis has become an important tool for performing quantitative measurements and identifying structural differences throughout the brain. The significance of the VBM method is that it is not biased toward particular subjects and provides an unbiased and comprehensive score that represents the anatomical differences between groups throughout the brain [53]. The MRI data for VBM are provided as 3D volumetric T1-weighted images. VBM essentially applies statistical tests to all voxels in the brain images to identify the volume differences between groups. For example, to recognize the differences in the patterns of the regional anatomy between two groups of subjects, one can perform a series of t-tests on each voxel in each image. Additionally, sMRI measurements of brain atrophy are promising biomarkers for tracking disease progression in AD patients. Recently, VBM has been applied in a number of studies [54][55][56] for the detection of AD. It can be used to study the volumetric atrophy of the gray matter (GM) in the neocortex of the brain, which can be used to distinguish AD patients from HC individuals. There are many toolboxes available for performing VBM (SPM12, FSL, 3D Slicer, etc.); however, we selected SPM12 and its extension, the Computational Anatomy Toolbox (CAT version 12), which provides computational anatomy functions.
These tools can be downloaded from https://www.fil.ion.ucl.ac.uk/spm12/ and http://www.neuro.uni-jena.de/cat/, respectively. CAT12 provides diverse morphometric methods, such as VBM, surface-based morphometry, deformation-based morphometry, and ROI- or label-based morphometry. We chose the CAT12 VBM because it uses SPM12 segmentation by default. First, the sMRI data were anatomically standardized using the 12-parameter affine transformation provided by the SPM template to compensate for the differences in brain size. We chose the East Asian brains template and left all other parameters at their default settings. The sMR images were then segmented into GM, white matter (WM), and CSF images by using a unified tissue segmentation technique after image intensity nonuniformity correction was performed. The obtained linearly transformed and segmented images were then nonlinearly transformed using diffeomorphic anatomical registration (DARTEL) techniques and modulated to create a modified template for DARTEL based on the MNI152 template [57], followed by smoothing using an 8 mm full width at half maximum (FWHM) kernel. The final step consists of voxel-wise statistical tests. To construct a statistical parametric map, we computed contrast values from the regression parameters estimated by a general linear model. This technique implements a two-sample t-test approach to determine whether there are significant regional density differences between two groups of GM images. We obtained regional information values representing significant density differences between GM images after performing a false discovery rate (FDR) and a family-wise error rate correction. Based on this information, cluster values can be derived to create ROI binary masks, which were later used to acquire GM volumes from GM images for use as morphometric features. Fig 2 shows the segmented brain tissue images generated using the CAT12 VBM methods.
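To make the idea of an FDR correction over many voxel-wise tests concrete, the sketch below implements the Benjamini-Hochberg step-up procedure on a short list of p-values. This is an assumption made for illustration: the text does not state which FDR procedure was used, SPM12/CAT12 perform this step internally, and the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which tests survive FDR control at level q."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is declared significant.
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep


# Hypothetical voxel-wise p-values (not study data).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
significant = benjamini_hochberg(p_vals, q=0.05)
```

In a VBM setting, each p-value would come from the two-sample t-test at one voxel, and the surviving voxels would define the clusters from which ROI masks are built.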
Cortical and subcortical volumetric features. The 3D volumes from all 326 subjects were processed using Freesurfer [58][59][60] (freely available online at http://surfer.nmr.mgh.harvard.edu/), which provides measurable gray-level volume data for various brain structures. For preprocessing using Freesurfer, high-quality T1-weighted MRI data, such as Siemens MPRAGE or General Electric spoiled gradient recalled sequences with resolutions of approximately 1 mm³, are needed. We ran Freesurfer without user intervention (using the command "recon-all -i file-name.nii -all") because this is the process mode that would be used in a preprogrammed pipeline for processing patient data. Cortical and subcortical volumetric features were computed using the cross-sectional automated Freesurfer routine with the default parameters. Bilateral ROIs were joined during the process. In the cortical surface stream, Freesurfer constructs models of the boundary between WM and cortical GM and of the pial surface. Once these surfaces are known, an array of anatomical measurements becomes possible, including cortical thickness (CTH), folding, curvature, surface area, and surface normal calculations, for each part of the cortex. The cerebral cortex can be divided into four sections referred to as lobes. Measurements for all four lobes are extracted by Freesurfer. In this study, we adopted the Desikan-Killiany atlas, which parcellates the cortex into 34 labeled regions for each hemisphere (68 in total). To reduce the size of each feature vector, we combined the cortex regions into four lobes (discussed above) and a cingulate cortex according to the specifications on the Freesurfer website. In the automated subcortical segmentation process, each voxel (in the normalized brain volume) is assigned one of 40 labels representing 40 subcortical regions (e.g., amygdala, cerebellum, lateral ventricle, and thalamus). It can take up to 11-14 h to complete the subcortical volume segmentation for one subject.
Table 2 lists the features extracted from sMRI images using the Freesurfer toolbox.
Hippocampus volume. Segmentation of the HV was performed using Freesurfer [61,62]. This study was inspired by the fact that HV is the most widely used sMRI biomarker for the early detection of AD [63]. Additionally, because Freesurfer segments many ROIs for each subject, it is not restricted to a specific ROI. The left and right hippocampus were both segmented using Freesurfer. This technique estimates the probability that each voxel belongs to a certain structure based on a priori knowledge regarding spatial relationships, which is acquired by using a training set. It uses the differences in voxel intensity to locate and parcellate subcortical structures and to perform affine registration in the Talairach space. The Freesurfer processing stages are detailed in [59], and Fig 3 shows the deep segmented regions of the hippocampus.
In Fig 3, CA1 is the first region along the hippocampal path. A major output path originates from this region and travels to the fifth layer of the entorhinal cortex. The CA3 region receives inputs from the granule cells of the dentate gyrus via mossy fibers. It also receives inputs from cells in the entorhinal cortex via the perforant path. The CA3b region [64] occupies the central region between the fimbria and the fornix connection. The CA3c region [64] is located near the dentate and eventually inserts itself into the hilus. Much of the synchronous activity that is associated with interictal epileptiform activity appears to be generated in the CA3 region. The CA4 region is often referred to as a hilar region when considered as part of the dentate gyrus. Unlike the pyramidal neurons in the CA1 and CA3 regions, the neurons in the CA4 region include mossy cells, which receive inputs primarily from granule cells of the dentate gyrus in the form of mossy fibers [61].

Feature selection
After the feature extraction stage, a normalization process is performed, where all features are normalized to zero mean and unit variance to reduce data redundancy and improve data integrity between features, as shown in Fig 1. Specifically, given a data matrix $X$, where the rows represent the subjects and the columns represent the features, the normalized matrix $X_{\mathrm{norm}}$ with elements $x_{\mathrm{norm}}(i,j)$ is calculated as $x_{\mathrm{norm}}(i,j) = \frac{x(i,j) - \operatorname{mean}(X_j)}{\operatorname{std}(X_j)}$, where $X_j$ is the $j$th column of matrix $X$. Next, the principal component analysis (PCA) [65] method was applied to achieve dimensionality reduction. PCA simplifies complex high-dimensional data while retaining important trends and patterns. Here, it serves as the feature selection step, creating new features from linear combinations of the initial features. When performing PCA, the main goal is to find an orthonormal set of axes that point in the directions of maximum variance of the data. PCA maps d-dimensional data into a new k-dimensional subspace, where k < d. The new k variables are referred to as principal components (PCs), where each PC has the maximum variance that is not accounted for by the preceding components. PCA is an unsupervised learning method that provides a very powerful and reliable tool for data analysis. Once specific patterns in the data are identified, the data can be compressed into lower dimensions. Here, the number of PCs was determined by maintaining a variance level of at least 99%. From Fig 4, it can be seen that the first 61 PCs preserved 99% of the total variance. Therefore, the first 61 PCs were retained as features for AD vs. HC classification. The same procedure was used for the other classification problems.
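The two steps described above, column-wise normalization and choosing the number of PCs from the cumulative explained variance, can be sketched as follows. This is a minimal illustration under stated assumptions: the population standard deviation is used (the text does not say which estimator was applied), the eigenvalues of the covariance matrix are taken as given rather than computed, and the function names and toy numbers are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(X):
    """Normalize every column (feature) of X to zero mean and unit variance."""
    columns = list(zip(*X))
    mus = [mean(c) for c in columns]
    # Assumption: population standard deviation; constant columns are left unscaled.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in X]


def components_for_variance(eigenvalues, threshold=0.99):
    """Smallest number of PCs whose cumulative variance ratio reaches threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= threshold:
            return k
    return len(eigenvalues)


# Toy data: 3 subjects (rows) x 2 features (columns); not study data.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
X_norm = zscore_columns(X)

# Hypothetical covariance eigenvalues; three PCs are needed to reach 99%.
k = components_for_variance([6.0, 3.0, 0.95, 0.05], threshold=0.99)
```

With the real feature matrix, the same rule is what leads to retaining 61 PCs for the AD vs. HC problem.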

Classification methods
In this study, we used three different popular classifiers to evaluate classification performance based on single and combined features. Support vector machine. SVM is a discriminative classifier that is formally defined by a separating hyperplane [13,23,29]. In other words, it is a supervised learning method that uses a training dataset to find an optimal separating hyperplane in an n-dimensional space. The optimal hyperplane is the one that best separates the two target subject groups. Test subjects are then categorized according to their position relative to this hyperplane in the n-dimensional feature space. In our study, we used the LIBSVM library with a radial basis function (RBF) kernel. An RBF kernel performs better than a linear kernel for a small number of features. The SVM requires two hyperparameters: a regularization constant C and a kernel hyperparameter γ. These parameters are optimized by using a cross-validation (CV) method.
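The RBF kernel and the cross-validation partition used to tune C and γ can be sketched as follows. This is not the LIBSVM implementation; it only illustrates the kernel formula $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ and a five-fold split. The grid values, the round-robin fold assignment (no shuffling or stratification), and the function names are assumptions made for the example.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs; samples go to folds round-robin."""
    folds = [list(range(f, n_samples, n_folds)) for f in range(n_folds)]
    for f in range(n_folds):
        val = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, val


# Hypothetical search grid over (C, gamma); the values are not from the study.
param_grid = [(C, gamma) for C in (0.1, 1, 10) for gamma in (0.01, 0.1, 1)]

# Five train/validation partitions of 10 toy samples.
splits = list(k_fold_indices(10, n_folds=5))
```

For each (C, γ) pair, a classifier would be trained on every training partition and scored on the matching validation partition, and the pair with the best mean score would be kept.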
Random forest. The RF algorithm is a supervised method that uses an ensemble learning technique for classification [66,67]. It operates by constructing a multitude of decision trees during training and, during testing, outputting the class predicted by the majority of the individual trees. The RF algorithm is typically implemented using the methodology of classification and regression trees (CARTs), where a binary splitting operation recursively partitions the data into homogeneous or near-homogeneous terminal nodes. A desirable binary split pushes data from a parent node to its child nodes such that the homogeneity in the child nodes is better than that in the parent node. An RF is typically a group of 100 to 1000 trees, where each tree is constructed by using bootstrapped samples from the original data. RF trees differ from traditional CARTs because they are constructed non-deterministically according to a two-stage randomization method. The first layer of randomization is implemented by growing trees using bootstrapped samples from the original data, and the second layer of randomization is introduced at the node level when growing the tree. Rather than splitting a tree node using all variables (for each node in each tree), RF selects a random subset of variables, and only those variables are used as predictors to determine the best split for the node. The purpose of this two-step randomization technique is to de-correlate the trees so that the forest ensemble has low variance, which is the benefit of bagging. In our study, we used an RF classifier from the Scikit-learn 0.19.2 Python library.
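The two layers of randomization and the final vote can be illustrated with a few helper functions. This is a sketch, not the Scikit-learn implementation: the trees themselves are omitted, the square-root rule for the size of the variable subset is a common default assumed here, and all names and toy values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """First randomization layer: sample subjects with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, rng):
    """Second layer: variables considered at one node (assumed sqrt rule)."""
    k = max(1, int(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Forest output for one subject: the most frequent tree prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)                        # fixed seed for repeatability
sample_idx = bootstrap_indices(8, rng)        # rows used to grow one tree
feature_idx = random_feature_subset(16, rng)  # candidates for one split
forest_label = majority_vote(["AD", "HC", "AD", "AD", "HC"])
```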
K-Nearest neighbor. KNN is a simple algorithm that belongs to the family of instance-based, competitive learning, and lazy learning algorithms [68,69]. It stores all available labeled samples and classifies new samples according to a similarity measure. For real-valued data, Euclidean distance can be used as a similarity measure. For other types of data, such as categorical or binary data, Hamming distance can be used. KNN makes predictions using the training dataset directly. Predictions are made for a new sample x by searching through the entire training dataset for the K most similar samples (neighbors) and taking the most common output label for those K samples as the label for x. In other words, each neighbor votes for its class, and the class with the most votes is taken as the prediction. Class probabilities can be calculated as the normalized frequencies of the samples that belong to each class in the set of the K most similar samples. The KNN classifier from the Scikit-learn 0.19.2 Python library was used in our experiments.
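The prediction rule described above can be written in a few lines. The sketch below uses Euclidean distance, a majority vote, and normalized vote frequencies as class probabilities; it is an illustration rather than the Scikit-learn classifier, and the function name and toy samples are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label and class probabilities among the k nearest samples."""
    # Distance from x to every stored training sample, nearest first.
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    neighbors = [label for _, label in dists[:k]]
    counts = Counter(neighbors)
    label = counts.most_common(1)[0][0]
    # Normalized vote frequencies serve as class probabilities.
    probs = {c: n / len(neighbors) for c, n in counts.items()}
    return label, probs


# Toy two-feature samples for illustration only; not study data.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["HC", "HC", "HC", "AD", "AD", "AD"]

label_near_origin, probs_near_origin = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
label_far, probs_far = knn_predict(train_X, train_y, [4, 4], k=3)
```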

Statistical analysis
For each group (AD vs. HC, aAD vs. mAD, and HC vs. mAD), we calculated Cohen's kappa statistic [70], which measures the inter-annotator (rater) agreement for these classification groups. It is a metric that compares an observed accuracy with an expected accuracy (random chance). Cohen's kappa statistic is used not only to evaluate an individual classifier but also to compare classifiers with one another. Moreover, it accounts for random chance (agreement between random classifiers), which usually makes it less misleading than using accuracy alone as a metric. Classifiers built and assessed on datasets with different class distributions can be compared more consistently through the kappa statistic than through accuracy alone because kappa is scaled relative to the expected accuracy. Cohen's kappa statistic is also a better indicator of how well a classifier performs across all instances because a simple percentage accuracy can be misleading if the class distribution is skewed. It is defined as $\kappa = (p_0 - p_e) / (1 - p_e)$, where $p_0$ is the empirical probability of agreement (or observed accuracy) on the label assigned to any sample and $p_e$ is the expected agreement (expected accuracy) when both annotators assign labels randomly. $p_e$ is estimated using a per-annotator empirical prior over the class labels. The kappa statistic lies between -1 and 1. The maximum value means complete agreement, whereas zero or lower means chance agreement or worse.
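As a worked illustration of the definition, the sketch below computes the statistic directly from two label lists, with p_e estimated from each rater's empirical label frequencies. The function name cohen_kappa and the label vectors are hypothetical and are not taken from the study data.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_0 - p_e) / (1 - p_e)."""
    n = len(y_true)
    # p_0: observed agreement (plain accuracy)
    p_0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # p_e: chance agreement from the per-rater empirical class priors
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_0 - p_e) / (1 - p_e)

# Hypothetical labels: 1 = patient, 0 = control
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(y_true, y_pred)
```

In this toy case the accuracy is 0.80 and the chance agreement is 0.52, so kappa is 0.28 / 0.48, or about 0.58, which is noticeably lower than the raw accuracy.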

Experiments and results
In this section, the experimental results obtained with the combined and individual features are presented in the tables below. In this study, four classes of data were used: AD, aAD, mAD, and HC. The idea of the proposed technique is to combine three features, namely, VBM, CSC, and HV, extracted with SPM12 and Freesurfer to differentiate between AD and the other groups. We validated our proposed method on three binary classification problems (AD vs. HC, aAD vs. mAD, and HC vs. mAD), as shown in Fig 1. We used the three individual features (VBM, CSC, and HV) and their combination for each of the three classification problems. For these cases, we used the NRCD dataset, which is a private dataset. First, we extracted the voxel-based features from each sMR image using SPM12, and then we used Freesurfer to extract the cortical, subcortical, and hippocampus features from each sMR image. We used an early fusion scheme to concatenate the three features (VBM, CSC, and HV) into one. This is a simple method that combines the different feature types into a single feature vector, on which a classifier is then trained. Moreover, we applied a feature selection technique using PCA, which selects the effective features from the original features and passes them to a classifier, to measure the performance of each classification group.
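The early fusion and PCA steps can be sketched as follows. This is only an illustration under stated assumptions: the feature dimensions, the random placeholder data, the number of retained components, and the helper name pca_project are hypothetical, and PCA is written out through a singular value decomposition rather than through the library routine used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 20

# Placeholder feature blocks (random numbers, not real measurements)
vbm = rng.normal(size=(n_subjects, 5))   # VBM-derived ROI volumes
csc = rng.normal(size=(n_subjects, 8))   # cortical and subcortical features
hv = rng.normal(size=(n_subjects, 2))    # left and right hippocampal volumes

# Early fusion: concatenate the blocks into one feature vector per subject
fused = np.hstack([vbm, csc, hv])

def pca_project(X, n_components):
    """Project centered data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

reduced = pca_project(fused, n_components=4)
```

In a train/test setting, the PCA projection would normally be estimated on the training subjects only and then applied unchanged to the test subjects, so that no information from the test set leaks into the features.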
In this study, we used three different types of classifier: KNN, SVM, and RF. To obtain unbiased estimates of the performance, we randomly split the set of participants into two groups, a training dataset and a testing dataset, at a ratio of 70:30, respectively. On the training dataset, a five-fold stratified cross-validation technique was applied to obtain the optimal hyperparameter values: the cost parameter C and γ for SVM; Max_features, Criterion, and Max_depth for RF; and n_neighbors for KNN. These optimal hyperparameter values were found by using the grid-search CV library function from Scikit-learn 0.19.2 together with the five-fold stratified cross-validation on the training set. For each method, the optimized hyperparameter values were then used to train the classifier on the training data, and the performance of each resulting classifier was evaluated on the held-out 30% testing dataset. In this way, we achieved unbiased estimates for each classification group. To evaluate whether each classification group had an inter-annotator (rater) agreement, we calculated Cohen's kappa statistic. Moreover, we plotted the receiver operating characteristic (ROC) curves and then calculated the area under the curve (AUC) for each classification problem. The AUC is insensitive to the class distribution, which was an advantage since the number of control subjects was larger than the number of AD patients. We also calculated the classification accuracy, sensitivity, precision, specificity, and F1 values for the classification groups. We repeated the cross-validation procedure five times to obtain a more reliable cross-validation error and extracted the mean AUC value for each classification group. Our experiments were conducted in two stages. 
In the first stage, only individual features were selected, such as VBM-extracted ROI volumes, CSC-extracted feature volumes, and HV-extracted features. These features were then fed into the classifiers one at a time to measure the individual feature performance. In the second stage, the combination of all features (the single fused feature vector) was applied to the individual classifiers to measure the classification performance. Moreover, a 95% confidence interval for ACC was estimated according to multiple classification runs. To evaluate whether each technique performed significantly better than a random classifier, we used McNemar's chi-squared test with a significance threshold of 0.05, which is a typical benchmark value. This test analyzes the differences between proportions in paired observations. It was used to assess the differences between the proportions of correctly classified subjects (i.e., ACC). The corresponding contingency chart is provided in Table 3. We also used McNemar's chi-squared test to evaluate the differences between the proportions of incorrectly classified subjects. All experiments were conducted on a computer running Ubuntu Linux version 16.04 with Python version 3.6.
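For reference, McNemar's chi-squared statistic depends only on the two discordant cells of the paired contingency table. The sketch below is a minimal illustration with hypothetical counts; the function name mcnemar_chi2 is not from the study, and the use of the continuity correction is an assumption, since the text does not state which variant was applied.

```python
import math

def mcnemar_chi2(b, c, correction=True):
    """McNemar's chi-squared test on the discordant counts.

    b: subjects classified correctly by method A only
    c: subjects classified correctly by method B only
    Returns (statistic, p_value) for one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)   # continuity correction (assumed)
    stat = diff * diff / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical discordant counts, not taken from Table 3
stat, p_value = mcnemar_chi2(b=15, c=3)
```

With these counts the statistic is 121/18, about 6.72, and the p-value falls below the 0.05 threshold, so the two paired proportions would be judged different.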

AD vs. HC
We calculated the statistical values that represent the significance levels of clusters in the activation map, as shown in Table 4, after comparing the results of the statistical two-sample t-tests for the AD vs. HC group. Table 4 lists the main affected areas in the AD vs. HC group and gives detailed information on the obtained voxel clusters, including the peak coordinates in Montreal Neurological Institute space, the cluster-level p-value, and the peak intensity (T-value) of each cluster. We used an uncorrected threshold value of P uncorrected ≤ 0.001 at the voxel level, an FDR value of P FDR = 0.05, and an FWER value of P FWER = 0.05 at the cluster level to correct for multiple comparisons. An ROI binary mask was created from the five selected clusters, and the GM volume was then extracted from the two groups of images (AD and HC). The minimum cluster size for this group was set to 200 voxels because the two-sample t-test revealed a large number of differences in the GM regions of the two groups. Here, each cluster contains more than 200 adjacent voxels that show the significant variation of those diffusion parameters. The selected significant voxels are displayed with their T-values. The ROI was defined as the set of suprathreshold intensity voxels. Moreover, the integral of suprathreshold intensities within a cluster naturally combines both signal extent and signal intensity. Hence, Table 4 indicates that, for the AD vs. HC group, both positive and negative suprathreshold intensities revealed significantly affected regions when the AD subject images were overlaid on the HC subject images to extract the GM difference regions as clusters, as shown by their T-values. Increasing peak intensity (T-value) is displayed from dark to bright. As shown in Fig 5, the left brain shows significant differences in GM probability between the AD and the HC group. 
The dark region of positive correlation also shows differences in the hippocampus region, including significant atrophy in the AD group as compared to the HC group. Fig 5 also shows that the left hemisphere of the hippocampus region (2266 voxels) has a significant GM volume loss when comparing the AD group with the HC group, with a peak intensity (T-value) of 14.3469.
Table 5 lists the main affected areas in the aAD vs. mAD group and gives detailed information on the obtained voxel clusters, including the peak coordinates in Montreal Neurological Institute space, the cluster-level p-value, and the peak intensity (T-value) of each cluster. For the aAD vs. mAD group, we used an uncorrected threshold value of P uncorrected ≤ 0.001 at the voxel level, an FDR value of P FDR = 0.05, and an FWER value of P FWER = 0.05 at the cluster level to correct for multiple comparisons. The minimum cluster size for this group was set to 100 voxels because the two-sample t-test revealed few differences in the GM regions of the two groups. The obtained cluster information is listed in Table 5. Here, each cluster contains more than 100 adjacent voxels that show the significant variation of those diffusion parameters. The selected significant voxels are displayed with their T-values. Table 5 indicates that, for the aAD vs. mAD group, only the negative suprathreshold intensity revealed significantly affected regions when the aAD subject images were overlaid on the mAD patient images to extract the GM difference regions as clusters, as shown by their T-values, whereas no significant region was found for the positive suprathreshold intensity. That is why there is no region with a positive peak intensity, and, because the GM differences in this group are small, the peak intensity values look similar. Increasing peak intensity (T-value) is displayed from dark to bright. 
We selected nine clusters to construct an ROI brain mask.
Table 6 lists the main affected areas in the HC vs. mAD group and gives detailed information on the obtained voxel clusters, including the peak coordinates in Montreal Neurological Institute space, the cluster-level p-value, and the peak intensity (T-value) of each cluster. For the HC vs. mAD group, we applied an uncorrected threshold value of P uncorrected ≤ 0.001 at the voxel level, an FDR value of P FDR = 0.05, and an FWER value of P FWER = 0.05 at the cluster level to generate eight clusters. The obtained cluster information is listed in Table 6. The minimum cluster size for this group was set to 300 voxels because the two-sample t-test revealed a large number of differences in the GM regions of the two groups. Here, each cluster contains more than 300 adjacent voxels that show the significant variation of those diffusion parameters. The selected significant voxels are displayed with their T-values. Table 6 indicates that, for the HC vs. mAD group, both positive and negative suprathreshold intensities revealed significantly affected regions when the HC subject images were overlaid on the mAD patient images to extract the GM difference regions as clusters, as shown by their T-values. Increasing peak intensity (T-value) is displayed from dark to bright. Fig 7 shows that the right hemisphere of the cerebrum and hippocampus regions (4541 voxels and 3337 voxels) has a significant GM volume loss when comparing the HC group with the mAD group, with peak intensities (T-values) of 3.5911 and -5.4644, respectively.
Tables 7-9 and Figs 8-10 show the classification results for AD vs. HC, aAD vs. mAD, and HC vs. mAD. Fig 11 shows the Cohen's kappa statistic graph for the three classification groups (AD vs. HC, aAD vs. mAD, and HC vs. mAD). 
From this graph, we can see that our proposed method achieved a good level of agreement within each group when classifying the AD vs. HC, aAD vs. mAD, and HC vs. mAD groups using the combined features. In this study, the combined features with the SVM classifier achieved Cohen's kappa values of 0.9056, 0.7606, and 0.9468 for AD vs. HC, aAD vs. mAD, and HC vs. mAD, respectively. As can be seen from Tables 7-9, the SVM classifier with the combined features achieved a higher level of agreement than it did with individual features.

HC vs. mAD
Fig 12 shows the ROC curves for the AD vs. HC, aAD vs. mAD, and HC vs. mAD classification groups. For the AD vs. HC group, our proposed model achieved an AUC of 93.93%, which indicates that it performed very well when distinguishing AD subjects from HC subjects using combined features. For the aAD vs. mAD group, our proposed model distinguished the converted patients from the stable patients with an AUC of 87.08%. Likewise, our proposed model achieved an AUC of 95.83% for the HC vs. mAD group. Overall, for all classification groups, our proposed model performed well, and its predicted probabilities for the positive classes were well separated from those for the negative classes. From Tables 7-9, we can see that the SVM classifier with combined features achieved a better AUC than it did with individual features.

Comparison to State-of-the-Art methods
The NRCD dataset is not publicly available for research purposes. Therefore, to compare our method with other state-of-the-art methods, we applied our proposed method to the ADNI public dataset, which can be downloaded from http://adni.loni.usc.edu/. ADNI was established in 2003 as a public-private partnership with aid from several groups, including the National Institute on Aging and the National Institute of Biomedical Imaging and Bioengineering, as well as several non-profit organizations and private pharmaceutical companies. The primary objective of ADNI is to determine whether serial MRI, PET, various biological biomarkers, and clinical and neuropsychological tests can be combined to assess the progression of MCI and early AD symptoms. For up-to-date information, please visit www.adni-info.org. For this study, a total of 163 sMR images were acquired from the ADNI dataset, each belonging to one of four groups: AD, MCIc (MCI patients who had converted to AD within a 24-month time period), MCIs (stable MCI patients who had not converted to AD within a 24-month time period), and HC. Patients with MCI and HC subjects were chosen randomly from a database of participants in longitudinal studies who were monitored for at least 24 months. All subjects underwent several neuropsychological examinations to produce several clinical characteristic indicators in combination with MMSE results, functional assessment questionnaire scores, and clinical dementia ratings. Table 10 shows the demographic information for all subjects. Both male and female subjects were included in our experiments. All sMRI scans used in this study were acquired from 1.5-T MRI scanners. First, we extracted voxel-based features using SPM12, and then we extracted cortical, subcortical, and left and right hemisphere hippocampus features from each sMR image using Freesurfer. Here, we followed the same procedures as those for the NRCD dataset. 
For the classification between different groups, we chose the same classifiers as those for the NRCD dataset. Table 11 shows the obtained results for the AD vs. HC group. It can be seen from Table 11 that our proposed method achieved a p-value below the conventional threshold of 0.05. Likewise, Tables 12 and 13 show the results for the MCIc vs. MCIs and the HC vs. MCIc classifications, where some groups yielded a p-value below the conventional threshold of 0.05.
The proposed method achieved better results for all three classification groups. For the AD vs. HC group, the combined features with the SVM classifier achieved an AUC of 97.56% and an ACC of 96.42%, outperforming the best individual feature output, which was obtained by HV features using the SVM classifier. The obtained Cohen's kappa value for this group was 0.9465, which is close to 1 and shows a high level of agreement between these two groups. For the MCIs vs. MCIc group, our proposed method with combined features achieved better results (AUC of 96.89% and ACC of 97.36%), and the obtained Cohen's kappa value for this group was around 0.9345, which is close to 1 and demonstrates a high level of concurrence between the MCIs and the MCIc group. Likewise, for the HC vs. MCIc group, our proposed technique with combined features obtained better results (AUC of 95.43% and ACC of 94.73%), and the obtained Cohen's kappa value for this group was 0.9260, which is close to 1 and demonstrates a high level of concurrence between the HC and the MCIc group.
Recently, several studies have reported their classification results for distinguishing AD patients from HC individuals according to the MRI dataset. Zhang et al. [23] performed a multimodal classification of AD based on a combination of MRI, CSF, and PET data. By using MRI data, they achieved an ACC of 86.2% for the AD vs. HC classification. By combining all the aforementioned biomarkers, they achieved a higher ACC of 93.2%. Westman et al. [27] reported an ACC of 87% when using MRI data and an increased ACC to 91% when combining MRI data with CSF measures. Lama et al. [32] adopted Freesurfer to compute the CTH and volumetric measurements. Using an extreme learning machine classifier, they achieved an ACC of 77.88%. Cuingnet et al. [13] compared 10 widely used methods using the ADNI dataset. They used three different techniques to extract features from the brain: VBM, CTH, and HV. They reported an SEN of 81% and an SPE of 95% as the best performance measures when using the Voxel-Direct-D-gm method. Wolz et al. [29] used a multi-method technique for the early detection of AD. They considered four types of features, namely, HV, CTH, TBM, and manifold-based learning, which they combined together to achieve an ACC of 89% using linear discriminant analysis and 86% using an SVM classifier. Jha et al. [71] proposed a technique using complex dual-tree wavelet PC features and an extreme learning machine as a classifier. For the ADNI dataset, their method achieved an ACC of 90.26%. Beheshti et al. [72] proposed a CAD system composed of four systematic stages for analyzing global and local differences in the GM of AD patients compared to that of HC individuals using a VBM technique. They used seven different feature ranking methods, and with their Fisher's criterion as a stopping criterion method they achieved an ACC of 92.48% for the AD vs. HC classification. 
Comparison results of the proposed technique with other published classification methods (using different biomarkers) on the ADNI dataset are provided in Table 14. The table shows that the performance of the proposed method using combined features from sMRI data (VBM+CSC+HV) is superior or comparable to that of other methods reported in the literature.

Discussion
In this experiment, we evaluated the automatic diagnostic capabilities of three structural MRI features (VBM, CSC, and HV), both individually and in combination, for 326 subjects from the NRCD dataset. To the best of our knowledge, this is the first study in which VBM outputs have been combined with CSC and HV for the classification of AD or MCI. From Tables 7-9, it can be seen that the VBM individual feature achieved better results than the other individual features. However, combining all features into one improved the classification performance for all classification groups. These results show that using a combination of various sMRI-based features can improve classification accuracy compared to using only a single feature and can produce a more powerful and stable classifier. To assess and compare the performance of each technique, we performed three classification tests: AD vs. HC, aAD vs. mAD, and HC vs. mAD. Here, we applied three different types of classifier: KNN, SVM, and RF. To obtain unbiased estimates of the performance, we randomly split the set of participants into two groups, a training dataset and a testing dataset, at a ratio of 70:30, respectively. On the training set, a five-fold stratified CV method was applied to obtain the optimal hyperparameter values: the cost parameter C and γ for SVM; Max_features, Criterion, and Max_depth for RF; and n_neighbors for KNN. These optimal hyperparameter values were found by using the grid-search CV library function from Scikit-learn 0.19.2 together with the five-fold stratified cross-validation on the training set. For each method, the optimized hyperparameter values were then used to train the classifier on the training dataset, and the performance of each resulting classifier was evaluated on the remaining 30% held out as the testing dataset. 
In this way, we obtained unbiased estimates for each classification group.
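The evaluation protocol described above (a 70:30 split, a five-fold stratified grid search on the training portion, and a final evaluation on the held-out portion) can be sketched with Scikit-learn as follows. The synthetic data, the RBF kernel, the grid values, the stratified split, and the AUC scoring choice are all assumptions made for illustration; the text does not report the actual grids or random seeds.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the fused feature matrix (not real subject data)
X, y = make_classification(n_samples=120, n_features=15, n_informative=5,
                           random_state=0)

# 70:30 split; stratification is an assumption, the text only says "randomly"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Hypothetical grid over the SVM cost parameter C and kernel width gamma
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hyperparameters are tuned on the training portion only
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# The refitted best model is evaluated once on the held-out 30%
y_pred = search.predict(X_test)
scores = search.decision_function(X_test)
acc = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
auc = roc_auc_score(y_test, scores)
```

Keeping the test subjects out of the grid search is what makes the final ACC, kappa, and AUC estimates independent of the hyperparameter selection.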

AD vs. HC
The classification results for AD vs. HC are summarized in Table 7 and are presented in Fig 8. For each case, the dataset was separated into two subsets with a 70:30 ratio. All methods obtained a significantly better p-value than the conventional p-value of 0.05. With combined features, the SVM classifier achieved an AUC of 93.93% and a Cohen's kappa value of 0.9056, which is close to 1. This shows that our proposed model using combined features achieved a high level of agreement between these two groups.

aAD vs. mAD
The classification results for aAD vs. mAD are summarized in Table 8 and are presented in Fig  9. The SVM classifier with combined features obtained a significantly better p-value (0.0009) than the conventional p-value of 0.05. It also achieved an AUC of 87.08%, ACC of 86.95%, SEN of 77.77%, SPE of 92.85%, PRE of 87.5%, and F1 of 77.77% for this classification group. Moreover, the obtained Cohen's kappa value for this group was 0.7606, which is close to 1. This shows that our proposed model achieved a high level of agreement between these two groups.
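The threshold-based metrics reported for each classification group all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name binary_metrics and the counts are hypothetical and are not taken from Tables 7-9.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts (positive = patient class)."""
    sen = tp / (tp + fn)                   # sensitivity (recall)
    spe = tn / (tn + fp)                   # specificity
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    f1 = 2 * pre * sen / (pre + sen)       # harmonic mean of PRE and SEN
    return {"ACC": acc, "SEN": sen, "SPE": spe, "PRE": pre, "F1": f1}

# Hypothetical counts for a 20-subject test set
metrics = binary_metrics(tp=8, fn=2, tn=9, fp=1)
```

Because F1 is the harmonic mean of precision and sensitivity, it always lies between those two values, which provides a quick consistency check on a reported set of metrics.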

HC vs. mAD
The classification results for HC vs. mAD are summarized in Table 9 and are presented in Fig 10. The SVM classifier with combined features obtained a significantly better p-value than the conventional p-value of 0.05. From Tables 7-9 and Figs 8-10, it can be seen that the AUC and ACC values improved significantly when applying multiple features, compared to using individual features. All the classification methods in this study achieved good AUC and ACC values, and the SVM classifier performed significantly better than the KNN and RF classifiers for distinguishing AD or mAD patients from HC individuals. Still, our study should be considered a preliminary proof-of-concept study. It proposes a promising approach for the future translation of neuroimaging into patient benefit. This approach requires replication and validation in larger samples; however, it provides initial evidence of a rapid and accessible methodology that could potentially aid clinical decisions.

Conclusions
In this paper, a novel feature fusion technique was proposed to improve the classification accuracy for the AD, aAD, mAD, and HC groups. First, we preprocessed sMR images and used CAT12, which is integrated with the SPM12 software, for the extraction of specific ROIs. We then used Freesurfer to extract CSC and HV features from 326 subjects. Finally, we merged these three types of features linearly and used the combined features for the early prediction of AD. We found that the combination of morphometric features with cortical and HV features performed better than any individual feature. Additionally, the proposed method achieved an AUC and ACC of 87.08% and 86.95%, respectively, for the aAD vs. mAD classification group (NRCD dataset) and 96.89% and 97.36%, respectively, for the MCIc vs. MCIs classification (ADNI dataset). The p-value obtained by McNemar's chi-squared test for the aAD vs. mAD and MCIs vs. MCIc groups was less than the conventional p-value threshold of 0.05. The obtained Cohen's kappa value was 0.7606 for aAD vs. mAD on the NRCD dataset and 0.9345 for MCIs vs. MCIc on the ADNI dataset, which is close to 1. We can say that these classification groups achieved a high level of agreement. The main advantages of the proposed method are as follows. First, it uses combined features that can be extracted from a single sMRI modality. Second, it uses three different classifiers to achieve the best possible AUC and ACC values. We performed several evaluation experiments on the private NRCD dataset using optimized hyperparameters. The test results were analyzed and presented in this paper, and they showed the efficiency of the proposed model for improving classification performance.
The method proposed in this study performed better in every classification group. However, it still has some drawbacks, as it was only tested on a relatively small dataset. In the future, we will apply the proposed model to large publicly available datasets. We also plan to examine different imaging modalities, such as PET and functional MRI, for the early detection of AD.
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ ADNI_Acknowledgment_List.pdf. ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery

Author Contributions
Conceptualization: Kun Ho Lee.