Gaussian Mixture Models and Model Selection for [18F] Fluorodeoxyglucose Positron Emission Tomography Classification in Alzheimer’s Disease

We present a method to discover discriminative brain metabolism patterns in [18F] fluorodeoxyglucose positron emission tomography (PET) scans, facilitating the clinical diagnosis of Alzheimer’s disease. In the work, the term “pattern” stands for a certain brain region that characterizes a target group of patients and can be used for a classification as well as interpretation purposes. Thus, it can be understood as a so-called “region of interest (ROI)”. In the literature, an ROI is often found by a given brain atlas that defines a number of brain regions, which corresponds to an anatomical approach. The present work introduces a semi-data-driven approach that is based on learning the characteristics of the given data, given some prior anatomical knowledge. A Gaussian Mixture Model (GMM) and model selection are combined to return a clustering of voxels that may serve for the definition of ROIs. Experiments on both an in-house dataset and data of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) suggest that the proposed approach arrives at a better diagnosis than a merely anatomical approach or conventional statistical hypothesis testing.


Introduction
Alzheimer's disease (AD) is a progressive, degenerative and incurable disease of the brain and the main cause of dementia. The number of people suffering from dementia is expected to The goal of the present research is to explore the diagnostic usefulness of a novel CADbased approach for FDG-PET data. From a technical point of view, the neuroimaging community frequently employs Statistical Parametric Mapping (SPM) [19] in a univariate statistical testing approach to localizing voxels that are discriminative for different groups. This approach suffers from considering voxels independently, which is not taking into account correlations between neighbouring voxels. In contrast to such a univariate approach, multivariate methods were proposed to learn the patterns in data from a higher perspective. For example, principal component analysis (PCA) is used to extract features, which are then fed into a classifier [20]. Another work [21] used PCA to analyze FDG-PET in amnestic MCI and illustrated some interesting findings using the principal components. It is worth mentioning that PCA essentially transforms the original features into another feature space, which is different from the method proposed in the following.
A Region of Interest (ROI) based method was developed to derive features from PET images [22,23]. A Gaussian Mixture Model (GMM) models the difference between controls and patients with AD, where the number of Gaussians (K) was fixed to 64, which can be a drawback since 64 may not be the optimal value. In this paper, the proposed approach is able to choose the optimal K by a model selection technique. Because the correct diagnosis of dementia is also of practical interest, we therefore propose a novel CAD approach that discriminates between NC, MCI and full-blown AD.

Ethics Statement
Each patient in the in-house dataset, gave written informed consent to participate in the study. Patients' names were anonymized in the database. The study is allowed to be conducted by Technische Universität München (TUM). The protocols were submitted to appropriate Boards and their written unconditional approval obtained and submitted to Regulatory Affairs at the Alzheimer's Disease Neuroimaging Initiative Coordinating Center (ADNI-CC) prior to commencement of the study. Further information about ADNI can be obtained from www.adniinfo.org.

Data Acquisition
Experiments were performed on two independent datasets. One is from the publicly available ADNI database (http://adni.loni.usc.edu/) and the other is an in-house dataset of patients and controls recruited at the memory outpatient unit of the Department of Psychiatry at Technische Universität München. The above datasets are in the following referred to as ADNI and TUM, respectively. ADNI has a large pool of PET (co-registered, averaged) images, which have been acquired on various scanners using different imaging parameters. To eliminate the bias of these factors, we selected images that have been obtained using the same scanner as well as the same parameters, such as the number of slices. The patient information and the PET scanner type are summarized in Table 1. Further details about the ADNI recruitment procedures are provided in the acknowledgments.
Prior to their use for image analysis, PET images had to undergo two pre-processing steps: spatial normalization and smoothing (kernel size [8 8 8] mm), which were achieved by SPM5. The spatial normalization ensures that the processed image is of the size 91×109×91, which is in accordance with Anatomical Automatic Labeling (AAL) [24]. The final step is the intensity normalization that was done by dividing each voxel by the mean intensity value averaged over all brain voxels (grand mean normalization, the non-brain voxels surrounding the brain were excluded). The second intensity normalization method is called pSMC (primary sensorimotor cortex) and was reported to be advantageous in a study [25]. Anatomically, the "Precentral_L, Precentral_R, Postcentral_L and Postcentral_R" regions in the AAL brain template can be used as the primary sensorimotor cortex.

Gaussian Mixture Model and Model Selection (GMM+MS)
GMM [26] is a parametric density estimation approach that assumes the data is generated by more than one Gaussian distribution. It can cluster a point by assigning the class label to the Gaussian that contributes the largest probability. Given a vector (data point) x,a GMM is defined as: where μ k, S k and and π k are the mean, covariance and mixing proportion respectively. In addition, X K k¼1 p k ¼ 1; π k ! 0 and θ = {μ k ,S k ,π k }. @ denotes the D-dimensional Gaussian distribution: Expectation Maximization (EM) [27] is usually used to solve a GMM by the following steps: Estep: estimate the posterior probability p t ik at t iteration as: M-step: update the parameters μ k, S k and π k at t+1 iteration based on the probabilities from the Estep: EM is guaranteed to converge, so that a locally optimal solution is always assured. However, one big concern in GMMs is the number of components/clusters (K) that must be defined in advance, which can be a hard task. We employ the Bayesian Information Criterion (BIC) [28] to determine this parameter automatically. The BIC is frequently used for model selection, since it considers a trade-off between model fitting and model complexity. By adding a penalty term for the number of parameters in the model, the BIC can alleviate the problem of overfitting, which can be caused by increasing the likelihood by just adding parameters to the model. The BIC has the form: where P is the number of free parameters to be estimated in the model and N is the total number of data points. In a multivariate Gaussian, the number of free parameter is (KD(D+3))/2+K-1,due to K-1 mixing proportions to decide, KD mean values, and (KD(D+1))/2 free parameters in the covariance matrix. The model with the lowest value of BIC is selected as the desired model. Aside from BIC, the Akaike Information Criterion (AIC) [29] is also a common method for model selection.
The log-likelihood (L) of the GMM can be inferred as: ; log À sum À exp trick where the "log-sum-exp" trick [30] is a method for avoiding numeric underflow and thus can improve the numeric stability when computing the BIC in a GMM scenario. Continue writing out the Eq 9, we can compute the log-likelihood of the GMM, which is needed for computing Eq 7 and Eq 8. It can be seen that AIC (the lower, the better) has a lower penalty for model complexity, because log(N)P is usually much larger than 2P. Because both AIC and BIC consist of two terms, the final scores weight the relevance of these two terms. In AIC, the term 2P does not contribute much to the final score as opposed to -2log(likelihood). However, log(N)P increases much faster when the model becomes more complex (more clusters), thus the resulting BIC score stops growing as the number of clusters increases. From Eqs 7 and 8, we see that AIC and BIC suggest different quantities. Thus, they may yield different model selection results. In the present study, we conduct experiments using both AIC and BIC, and report more results on BIC. In the following, we review some reported studies comparing BIC with AIC. Our experiments show that AIC and BIC yield comparable results.
A recent study [31] reveals a similar observation that illustrates model selection discrepancy between AIC and BIC, preferring BIC to AIC for the sake of favoring simpler models. In addition, another work [32] explicitly states that "AIC has been shown to perform well at selecting the true number of factors when it exists, but only at small sample size N. BIC has been found to outperform AIC in recovering the true K (number of clusters)". Particularly, we observe that AIC and BIC tend to agree with each other when the number of voxels (sample size) is small (i.e., 586, cf. Fig 1). As the number of voxels increases to 12,414, BIC reaches the lowest value much earlier than AIC, which is illustrated in Fig 1. Thus, BIC is more appropriate when the number of voxels is large (greater than 3500 may mean "large", refer to work [33] to avoid a too complex model. As for the plot (c) in Fig 1, it can be seen that the AIC and BIC continues dropping as the number of clusters increases. Thus, one may conclude that the more clusters, the better in the small sample case. However, we can, at least, observe a dip if we test a larger number of clusters. Because we intentionally impose that a cluster should at least contain 30 voxels to avoid trivial clusters, more clusters are not necessary to be tested (refer to Section Parameters in Model).
Another study shows that BIC performs better than AIC in both small and large sample size cases, claiming that AIC lacks the appropriate penalty to prevent overfitting (see Table 2 of the original reference [34]). The difference between our BIC models is greater than 10 (see plot (c) and (f) in Fig 1), thus it suggests strong evidence (cf. Table 2) that the model difference is meaningful according to work on Bayes factors [35].
To empirically show the difference between AIC and BIC in this study, we show the generalization error (GE, equivalently, 1-accuracy) demonstrated by Fig 2. From the 50 bins to 150 bins, we observe that BIC outperforms AIC on 8 cases (50, 60, 70, 80, 90, 110, 120, 140 bins) in terms of mean generalization error. BIC often yields a certain bin that has the lowest GE, for example, in 60 bins, 70 bins, 90 bins, 110 bins, 120 bins, 140 bins and 150 bins. When we perform the classification using all features extracted on all bins together (details are explained in Section GMM+MS on 3D PET images), we see that BIC shows lower GE than AIC demonstrated by the bottom plot of Fig 2. The reason for this is that selected features (on all bins) by BIC contain more discriminative information than the features by AIC, which complies with the results of MCI against AD using pSMC normalization using ADNI dataset. However, AIC shows better performance than BIC in some other cases, such as MCI against AD using TUM dataset. In summary, AIC and BIC suggest generally comparable results across the two datasets in different classification cases.

GMM+MS on 3D PET Images
We employ a clustering method (GMM) to group brain voxels into small regions that exhibit both high similar intensity and geometric affinity. A PET image can be viewed as three dimensional (3D) spatial data along with one extra dimension that represents the intensity of each voxel. Thus, a voxel is denoted by a 4-tuple (x,y,z,I)2< 4 , where x,y,z are the spatial coordinates and I is the intensity value. We used NC PET images as reference images to obtain the clustering results and used these clusters to extract the features from the NC, MCI and AD groups. Note that the method is applied on the AAL (gray matter voxels of MNI space) defined voxels, which constitute the gray matter in the brain. The mean intensity and standard deviation of each cluster are subsequently defined as features. To ensure the clusters have similar intensity values and are geometrically connected, we first group the original voxels into a certain number (e.g., 100) of bins of equally sized intensity ranges, and then cluster each bin by the introduced methods based only on the spatial information, i.e., the x,y,z coordinates. A bin is a data interval that is described by a statistical histogram. The data falling into the same bin are from a certain value interval, such that the data within the same bin are similar in their values. Theoretically, it is hard to find the most appropriate number of bins in advance, thus we tested different numbers from 50 to 150 with step size 10, i.e., 50, 60, . . ., 150. The best one can be chosen by a cross-validation on the training data. In practice, cross-validation is repeated in a way that the division into sets is identical for the various bin sizes. Given the training data, we can further split them into sub-training and sub-test data, which are used to train and evaluate the model respectively. Evaluating the model using the sub-test data gives a predictive accuracy. The yielded highest accuracy of a certain bin corresponds to the most appropriate number of bins. Once the number of bins is determined in this way, the same training procedure can be applied to the whole training data to maximize the use of present training data. Therefore only the training data is used to set the optimal parameters in the experiments. The workflow of the proposed method is summarized in Table 3.
If there are 1000 clusters formed using GMM+MS, then the image can be represented as a 2000 (1000μ and 1000σ) dimensional vector. Generally, not all of these features are Generalization error (GE) by BIC and AIC using pSMC normalization of MCI versus AD on the ADNI dataset. The generalization error (equivalently, 1-accuracy) is computed based on features (mean and standard deviation value) extracted from the clusters in the individual bin (one bin among 50 bins, for example) via 10 times 10-fold cross-validation. For example, a green point can represent the classification GE using n-th bin when we divide the whole brain voxels into 50 bins. The plotted points are the bins that yield GE < 0.25, and these bins are highly predictive, thus contribute to the final classification results. The x-axis represents only the number of points, implying no ordering. The horizontal green line and red line are the mean values computed from green circles and red crosses, respectively. The bottom plot illustrates the generalization error from dividing brain voxels into 50 bins till 150 bins after collecting top informative features from the individual bins. informative, thus we applied a feature selection technique [36] to choose the most discriminative ones for building the model. In this study, we empirically used the top-150 most informative features for learning the model. From Fig 3 of BIC, we observe that the classification accuracy increases in the beginning, but drops after selecting too many features. 150 features appear to provide sufficient classification relevant information, and more features may hamper the classifier's performance due to the well-known curse of dimensionality [26]. As for AIC, top-400 features were selected to perform the experimental comparison. AIC needs more features than BIC, which may be the reason why AIC divides voxels into more clusters so that the discriminative information are spread over many clusters. It should be pointed out that the feature selection [37] and the model building steps only used information from the training set: no information from the test set is used at any point in time (in other words, no information leakage from the test set to the training set has occurred).
A support vector machine (SVM) was used to build the final classification model, which is trained on the training data. The SVM has been shown to perform well in a variety of applications, thus it was chosen to be the classifier in this study. A tutorial [38] offers a good introduction to the SVM. Apart from the SVM, other classification methods could be used, such as Random Forests, Naïve Bayes, and others. We do not attempt to compare the proposed method with SVMs, we rather use a SVM as a classifier in the method. The suggested method aims at extracting useful features from brain voxels, whereas SVMs are a classification method based on input features (voxels).
In terms of running time, it costs roughly 15 hours to cluster the mean NC image from 50 bins to 150 bins. It takes only a few seconds to extract the features of a given PET scan after having the clusters. Therefore, the proposed approach is very efficient once the clusters are derived, since extracting features from new images is fast. The code is implemented in MATLAB and runs on a machine with Intel(R) Core i7-3632QM CPU @2.20 GHz, 8GB of memory. ii. For each bin, run GMM+MS method to yield the clusters.
iii. Collect all the resulting clusters from all the bins. iv. Given a NC, MCI or AD image from the pool of sub-training data, compute the mean (μ) and standard deviation (σ) of the voxels in each cluster using the provided spatial information, i.e., x,y,z of the cluster.
v. The image can then be represented as a feature vector of the means (μ) and standard deviations (σ).
vi. Build SVM model using only the sub-training data, and the predictive accuracy is computed for the sub-test data using the model. (The model is trained on MCI and AD sub-training data, if the classification is MCI against AD.) Collect the computed accuracy from all bins.
3. The resulting clusters correspond to the bin with the highest accuracy are used as the most appropriate clusters for NC, MCI and AD images.

Test phase (remaining 1 fold)
i. Construct the features for both training and test images described as the steps of "iv and v".
ii. Build SVM model using the training data (9 folds), and obtain the results using the remaining test data (1 fold). doi:10.1371/journal.pone.0122731.t003 In addition, the LIBSVM [40] package provides a fast classification, once the features are constructed.

Compared Methods
AAL approach: AAL [24] defines 116 brain regions, and we extracted the mean and standard deviation from each of these regions as features. Thus, in total each image is represented by a 232-dimensional feature vector. T-test: This hypothesis testing based method uses voxel-wise t-test and is widely applied in neuroscience studies. A t-test based method, for example, was tested as a method for feature selection with respect to predictive accuracy recently [39]. If the p-value is lower than a predefined threshold, e.g., 0.001, then this voxel is regarded as an indicator voxel for two groups of individuals. The null hypothesis is that the voxels in the two groups come from a population where the means of the two groups are the same. Therefore, a p-value lower than 0.001 rejects the null hypothesis, i.e., the two groups have different means. Hence, this voxel can be an indicator voxel representing group difference. We performed a two-tailed t-test on each of the voxels defined in an AAL region without multiple comparison correction, with a threshold set to 0.001. Finally, the top-150 voxels with the lowest p-value were chosen for learning, to be in line with our proposed method.

Test Protocol
The experiments were performed on two datasets (ADNI and TUM, cf. Table 1) independently. For each of the datasets, we trained the model based on the training data, evaluating the model using the held-out test data. The training and test data were divided using 10-fold cross- validation. In statistics, 10-fold cross-validation is commonly applied as an approach to testing a predictive model. In 10-fold cross-validation, technically, the original dataset is roughly divided into ten subsets, and each time we select nine subsets for training the model and the remaining one for testing. This procedure is repeated ten times, assuring that each subset is tested exactly once and employed nine times for training. The ten results are then averaged to produce a single estimate. (In case of very small data sets, results are aggregated by instance and not by test set.) In the two-class classification case, each subset contains roughly the same proportion of samples from the two classes, which is called stratification. For MCI versus AD in the TUM dataset (30 MCI, 30 AD), for instance, each subset consists of six samples with three from the MCI and three from the AD group. Each time 54 samples are employed for training and six for testing. After repeating the procedure ten times, we compute the mean value from all ten runs as the final result. We used the implementation of "crossvalind" function in MATLAB 2010 (R2010a) to achieve the stratified 10-fold cross-validation.

GMM+MS in Summary
To sum up, we tested different numbers of bins (50, 60 to 150 with a step size of 10) to decide how many bins the whole brain voxels should be divided into. For example, if the brain voxels are divided into 50 bins, we run the GMM algorithm on each of the 50 bins. The BIC suggests the optimal number of clusters in each bin. The yielded clusters from the 50 bins are then collected as the final clusters in the whole brain. The mean and standard deviation of each cluster (given x, y, and z coordinates) are extracted as the feature values representing a PET scan. Finally, every PET scan can be represented by a vector. A 10-fold cross-validation is then used to train an SVM and test the classification performance based on the vectors. Table 3 further explains the procedure of the method applied to PET scans.

Parameters in Model
The proposed method is able to find the optimal number of clusters by comparing the BIC score computed from different models. Thus, a number of different model selections must be performed. The simplest way to determine the number of clusters in a model is to let the cluster number be equal to the number of voxels in each studied bin (a voxel value interval). However, this is not only computationally expensive, but we may also end up with clusters that include too many (too rough) or too few (too trivial) voxels. To this end, we intentionally defined a priori that the resulting clusters should have 1000 voxels at most and 30 voxels at least. Then the maximal number of clusters in a bin can be calculated as Cmax = #(voxels in a bin)/1000, and the minimal is Cmin = #(voxels in a bin)/30.The numbers between the Cmax and Cmin are used in these model selections. Fig 4 depicts that the number of derived clusters, in general, decreases as the number of bins increases. However, the sharp drop appears in the beginning and the curve then gradually reaches a nearly stable state until 150. Table 4 shows that the GMM+MS method performs much better than the AAL and t-test methods on the task of discriminating MCI from AD. In particular, GMM+MS is statistically significantly better than the t-test method. Specifically, regarding pSMC, the accuracy gain is 12.9% (80.2%-67.3%) with a p-value of 0.017 (calculated using the test suggested by Bouckaert et al., [41]), and the specificity gain is 0.19 (0.80-0.61) with a borderline p-value of 0.066. Regarding the grand mean normalization method, the accuracy, AUC, sensitivity and specificity all show better results than the AAL and t-test approaches. As for NC versus AD, the three methods perform equally well, which may due to the fact that the most essential discriminative information can be easily identified by all of them. As a result, further improvement hardly can be achieved. The t-test approach reveals a slightly better result on NC versus MCI using pSMC, whereas it shows similar performance using grand mean normalization. However, the opposite is true on the TUM dataset (cf. Table 5), which suggests that GMM+MS is much better than the other two methods. Again, a comparable performance is shown for NC versus AD. Regarding MCI versus AD, the grand mean still reveals improved results, and pSMC shows comparable performance. As for the comparison between AIC and BIC, there is no significant difference revealed by the results, cf. Tables 4 and 5. A possible explanation is that both AIC and BIC can discover discriminative clusters sufficiently well, despite the fact that AIC favors a more complex model and BIC tends to choose a simpler one.

Performance Comparison
The ROC (receiver operating characteristic) curves shown in Figs 5 and 6 reveal the different performance of various methods, depicting the true positive rate against the false positive rate. The BIC (green) and AIC (red) curves cover a large portion of the t-test and AAL curves regarding MCI versus AD on both ADNI and TUM datasets. This observation complies with the results shown in Tables 4 and 5, i.e., the proposed method GMM+MS performs better than the compared ones in terms of MCI versus AD.
To sum up, the proposed method performs substantially better than the compared methods, in particular for MCI versus AD. Specifically, three comparisons out of four (TUM and ADNI datasets, grand mean and pSMC methods) demonstrate improved performance. A statistically significant result is also confirmed on the ADNI dataset. The limited NC sample size of 16 in the TUM dataset, as opposed to 30 in ADNI, may be one reason for less accurate results.

Results Analysis
In general, the proposed method achieved either salient performance improvement or comparable results compared to more established methods using two different normalization methods in two independent datasets. However, the results are not perfectly consistent across the two datasets, which might be related to different types of scanners and different image acquisition methods, such as the amount of tracer used, whether an eye mask was used during the scan, and so forth.
The information in Tables 6 and 7 highlights the detailed brain region information regarding the contribution to the classification. These regions include areas which are typically involved in AD, such as the cingulum, precuneus and temporal regions. The red points in Figs 7 and 8 highlight these informative voxels (brain regions), which correspond to the information in Tables 6 and 7.
The use of two independent datasets is a major strength of our study, which keeps the research findings more objective as contrast to the other studies conducted on one dataset. In addition, two different intensity normalization methods, namely primary sensorimotor cortex "*" denotes the GMM+MS is significantly better than t-test approach at a statistical level of 0.05. The p-value is calculated by the corrected paired t-test tailored for comparing learning algorithms [41]). LIBSVM [40] is used to build the SVM models. A linear kernel is used, with a grid search for parameter optimization. Grid search considers only the optimization of the penalty parameter C in the linear SVM, selecting the value of C yielding the best classification result based on the training data. After the best value of C is found, we apply it to the test data. AUC: area under ROC curve. Each experiment was repeated 10 times with a 10-fold cross-validation. P: results using "primary sensorimotor cortex" region for intensity normalization. G: results using "grand mean" method for intensity normalization. Results from AIC are noted in brackets, following the BIC results. and grand mean normalization, were applied to establish the experimental results. Tables 4  and 5 show that the proposed method is better compared to two other baseline methods.

Error Analysis
To gain some insights into the classification difference between GMM+MS, AAL and the t-test methods, we exert to investigating the errors committed by the classifier. To be concise and illustrative, we take MCI vs. AD (grand mean normalization) in the ADNI dataset as a running example. Table 4 shows that the proposed method leads to an accuracy gain of 7%, which is a salient improvement. Looking at the misclassified images, we observe that a certain image is misclassified by AAL 4 times and 10 times by the t-test approach, but without any misclassification by GMM +MS. Therefore, this image, in fact it is an AD image, is selected for a more detailed study.
Recall that the SVM needs to compute the sign of y = wx+b to make a decision. Hence, knowing w and b is essential. Since b is only a constant, we omit it from further analysis. The studied AD image is regarded as the test data and all the rest is treated as training data. After training the model, the SVM returns the support vectors and the weights w (feature importance), such that the classification can be made upon y = wx+b. To stay illustrative and straightforward, we take a closer look at the Euclidean distance, since it gives a direct impression of dissimilarity. To this end, all the data, training and test, are multiplied by the weight vector w, such that each instance is re-weighted by their importance in terms of the SVM. Subsequently, the Euclidean distance is calculated between the test image and each training image (MCI and AD group). Finally, the mean weighted Euclidean distance is computed for the MCI and AD group, respectively, which represents the dissimilarity between the test image and the group. We compute the relative ratio to denote the dissimilarity: where D MCI is the mean distance between the test image and the MCI group, and the same as for D AD . The greater the ρ, the more similar to AD. As a result, ρ(GMM+MS) = 0.26, ρ(AAL) = 0.07, ρ(t-test) = 0.03. Hence, GMM+MS indicates the greatest value and thus it classifies this image correctly as AD. Therefore, the features derived from GMM+MS enable the SVM to make the correct decision in this case, in contrast to the AAL and t-test based methods.

Discussion
In this paper, a machine learning approach, GMM+MS, is used to derive clusters based on an averaged NC PET image. The proposed method has the advantage of determining the number of clusters automatically, using a widely accepted model selection criterion. The model selection procedure assures that the derived model has a good trade-off between model fitting and model complexity. In such a way, a too complex model can be excluded, although it may have a good degree of model fitting. On the other hand, if a model is too concise, it may not have a satisfying level of model fitting. Therefore, the model selection procedure aims to keep the model complexity in a good balance. The two-phased algorithm first divides the voxel values into different bins and then applies the GMM+MS on the coordinate information at each of the bins to yield the final clusters. The resulting clusters have similar intensity values and are also geometrically connected. The top-ten clusters in GMM are recorded in each cross-validation, with different scores assigned to these clusters. The most informative cluster has the highest score. After 10 times 10-fold cross-validation, we rank these clusters according the overall score and select the top-ten, which are marked by the red points in Fig 7. Within these ten clusters, their corresponding AAL brain regions are identified and ranked according to the proportion denoted by the numbers in the After 10 times 10-fold cross-validation, we rank these clusters according the overall score and select the top-ten, which are marked by the red points in Fig   8. Within these ten clusters, their corresponding AAL brain regions are identified and ranked according to the proportion denoted by the numbers in the Table. doi:10.1371/journal.pone.0122731.t007 The experimental results suggest that the proposed method can outperform the compared methods, in particular for discriminating MCI from AD. The underlying reason can be that the proposed algorithm is able to discover finer (smaller) clusters that are helpful in discriminating MCI from AD, while the AAL and t-test approach may fail to reveal such critical information. However, a little inconsistency is seen by the different intensity normalization methods, which also suggests that the intensity normalization procedure can be an important factor.
In the previous section, we also try to shed some light on the performance difference between these methods. Although the SVM is usually applied as a black-box classifier, we can still employ the support vectors and the weights to gain important insights. Since there are 150 features for the GMM+MS and t-test methods, and 232 for AAL, they are high-dimensional datasets, which makes it hard to analyze which features contribute to the correct classification in the end. However, by introducing the relative ratio computed from the Euclidean distance, it is possible to quantitatively show the difference between these approaches.
In terms of time complexity, the AAL method is the fastest because it is based on predefined brain regions. GMM+MS needs to work further based on defined AAL regions. In addition, the number of bins tested can also influence the running time. As for the t-test method, it is simple, but requires more memory to store the images for a group comparison, which can sometimes become a problem if there are too many images.
The proposed algorithm can be widely applied as a feature extraction method on medical imaging data, which can assist the medical imaging community to discover interesting discriminative brain voxels pattern. The applicability of the algorithm may reach broader application scenarios than merely AD classification, as long as imaging feature extraction is concerned. In particular, we also provide a thorough study on the comparison between AIC and BIC, which offers a clear guidance for the model selection issue.
One limitation of the work is the open problem of discriminating patients with MCI who progress to clinically diagnosable AD from those who remain clinically stable: this remains an important and challenging task. To deal with it, one may need a clearly defined dataset (MCI follow-up) and a reasonably sound algorithm, which is left for future work.

Conclusions
The present work proposes a new clustering method, i.e., GMM+MS, for FDG-PET images. It has the advantage of determining the number of clusters automatically. This method is applied only on an NC image to define the clusters, and then the resulting clusters can be used to extract features from the MCI and AD images for automatic diagnosis. Throughout the experiments on two independent datasets, we not only demonstrate the merits of our method, but also show that the intensity normalization and different datasets (scanners) do play some role in the results. In conclusion, our results suggest that the data-driven extraction of informative brain regions may have a role to play in discriminating MCI from AD images for computeraided diagnosis.