Identification of a 5-Protein Biomarker Molecular Signature for Predicting Alzheimer's Disease

Background Alzheimer's disease (AD) is a progressive brain disease with a huge cost to human lives. The impact of the disease is also a growing concern for the governments of developing countries, in particular due to the increasingly high number of elderly citizens at risk. Alzheimer's is the most common form of dementia, a common term for memory loss and other cognitive impairments. There is no current cure for AD, but there are drug and non-drug based approaches for its treatment. In general the drug-treatments are directed at slowing the progression of symptoms. They have proved to be effective in a large group of patients but success is directly correlated with identifying the disease carriers at its early stages. This justifies the need for timely and accurate forms of diagnosis via molecular means. We report here a 5-protein biomarker molecular signature that achieves, on average, a 96% total accuracy in predicting clinical AD. The signature is composed of the abundances of IL-1α, IL-3, EGF, TNF-α and G-CSF.
Methodology/Principal Findings Our results are based on a recent molecular dataset that has attracted worldwide attention. Our paper illustrates that improved results can be obtained with the abundance of only five proteins. Our methodology consisted of the application of an integrative data analysis method. This four step process included: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any sample of the test datasets. For the first two steps, we used the application of Fayyad and Irani's discretization algorithm for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem; a numerical solution of this problem led to the selection of only 10 proteins.
Conclusions/Significance The previous study has provided an extremely useful dataset for the identification of AD biomarkers. However, our subsequent analysis also revealed several important facts worth reporting: 1. A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance (when using the same classifier). 2. Using more than 20 different classifiers available in the widely-used Weka software package, our 5-protein signature has, on average, a smaller prediction error, indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control). 3. Using very simple classifiers, like Simple Logistic or Logistic Model Trees, we have achieved the following results on 92 samples: 100 percent success in predicting Alzheimer's Disease and 92 percent in predicting Non Demented Control on the AD dataset.


Introduction
Recently, Ray et al. [1] made a significant contribution to the quest of finding a superior molecular test for an earlier diagnosis of Alzheimer's disease (AD). The method appears to have significantly improved on the state-of-the-art and, as a consequence, their results attracted immediate worldwide attention. Using the abundance of 120 signalling proteins on a training set of 83 archived plasma samples, they produced an 18-protein signature. On two separate test sets of 92 (''AD'' Alzheimer's samples against control) and 47 (''MCI'' mild cognitive impairment samples) the signature was able to show an overall effectiveness of 81% and 91% for AD predictability.
We started this project by analysing the dataset made available and we are glad to report that we have been able to perfectly reproduce their mathematical methods and results from the available datasets. However, our subsequent analysis also produced several important facts worth reporting. Using an integrative bioinformatics approach, we identified a 6-protein signature that halves the number of prediction errors of the previously proposed signature (on the ''AD'' dataset) when using the same classifier (PAM). A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance. Finally, using more than 20 different classifiers available in the widely-used Weka software package [2], our 5-protein signature has, on average, a smaller prediction error, indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control).
The 6-protein signature is composed of the abundances of IL-1α, IL-3, IL-6, EGF, TNF-α and G-CSF. We remark that IL-6 was not selected by Ray et al. in the preliminary gene selection, and as a consequence it is not part of their 18-protein signature. Recognising that the importance of IL-6 as a biomarker for AD is debatable and that many classifiers do not make use of its abundance to inform decisions, we also present our results of a 5-protein signature that ignores IL-6.

Results
Base case-analysis of the performance of randomly selected signatures
Before reporting our experimental results, it was important to understand the worst possible performance that a set of k proteins can have when they are selected at random (from the available 120 proteins under study). We present the results of two experiments that aim to quantify this. First, we report the classification performance of 20 signatures with 18 proteins selected at random with a uniform distribution (we selected 18 because it is the same number of proteins as in the signature proposed by Ray et al.). Analogously, we performed the same experiment constrained to select only six proteins chosen at random (as we will later present comparative results using signatures that employ only 6 and 5 proteins).
The two different collections of 20 sets of randomly generated signatures were chosen using an equal probability for each of the 120 proteins in the set (obviously, not allowing repetitions and constrained to have either 18 or 6 different proteins in total). For this experiment, we decided to use a random forests algorithm (RF) as a base classifier (we are using the algorithm implemented in [3] for reproducibility purposes), generating 150 trees. As the chosen classifier also has a stochastic nature, for each signature we ran 10 experiments with different seeds, and the results we found are quite interesting.
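The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: proteins are represented by hypothetical integer indices 0 to 119, the seeds are arbitrary, and the random forest itself (150 trees, run in Weka in our experiments) is not reproduced here.

```python
import random

N_PROTEINS = 120    # size of the protein panel under study
N_SIGNATURES = 20   # random signatures per collection
N_RUNS = 10         # classifier runs (different seeds) per signature


def random_signatures(size, n_signatures=N_SIGNATURES,
                      n_proteins=N_PROTEINS, seed=0):
    """Draw signatures of `size` distinct protein indices, uniformly at random.

    The integer indices are hypothetical stand-ins for protein names.
    """
    rng = random.Random(seed)
    # Random.sample draws without replacement, so no protein is repeated
    # inside a signature.
    return [sorted(rng.sample(range(n_proteins), size))
            for _ in range(n_signatures)]


signatures_18 = random_signatures(18, seed=1)
signatures_6 = random_signatures(6, seed=2)

# One (signature index, classifier seed) pair per stochastic classifier run.
runs_18 = [(sig_id, run_seed)
           for sig_id in range(len(signatures_18))
           for run_seed in range(N_RUNS)]
```

Each collection therefore gives 20 x 10 = 200 classifier runs, whose error counts are then averaged.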
For these twenty 18-protein signatures the average number of errors over the 92 samples considered on the ''AD'' test set is 15.13, corresponding to an 84% effectiveness, see Table 1. For the 6-protein case, an average of 30.5 errors was observed, meaning that an expected lower value of 67% effectiveness was found, see Table 2. With these results we can infer that the original selection of the 120 genes is quite remarkable for revealing biomarkers for prediction of clinical AD. Since a random selection with a simple, yet robust,
Table 1. Number of errors from the 18-genes randomly selected signatures on the ''AD'' validation test set.
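The effectiveness figures quoted for the random signatures can be checked with a short calculation. The sketch below assumes that both 15.13 and 30.5 are mean error counts over the 92 samples of the ''AD'' test set; the function name is ours and purely illustrative.

```python
def effectiveness(mean_errors, n_samples=92):
    """Percentage of correctly classified samples for a mean error count."""
    return 100.0 * (1.0 - mean_errors / n_samples)


eff_18 = effectiveness(15.13)  # random 18-protein signatures
eff_6 = effectiveness(30.5)    # random 6-protein signatures

print(f"18 proteins: {eff_18:.1f}%")  # 18 proteins: 83.6%
print(f" 6 proteins: {eff_6:.1f}%")   #  6 proteins: 66.8%
```

Rounded to whole percentages, these are the 84% and 67% reported in the text.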

Computational studies: Results obtained with four different signatures
We report all the results obtained using a set of 24 classifiers which have been selected from the Weka software suite [3], aiming at sampling different algorithmic methodologies in current practice. These classifiers are applied having as input the four different signatures with the same training set. To ensure reproducibility of our reported methods, no parameter was modified from the classifier's default setting from Weka's downloaded code. In this way we were not biasing the experiment with ad hoc parameter selection and we ensure the complete reproducibility of our claims. We are also aware that better results are possible when adjusting the parameters of each classifier considering only the samples of the training set. Nevertheless, with these tests our objective is to show the robustness of our methods to discover biomarkers, by showing the independence of the signature performance from the selected classifier.
Table 6. Report of the results of the 24 classifiers when using the 10-Protein biomarker.

10-Protein Signature
It is interesting to note that the mathematical model and algorithms we have used have pointed at Interleukin-6 and included it in the 10-protein signature. It is well known that IL-6, together with other cytokines, has been the subject of many studies of biomarkers for Alzheimer's disease [4][5][6]. Using an integrative bioinformatic approach, described in the next sections, we turned our attention to a smaller signature. The 6-protein signature was obtained by the analysis of the protein-relation graph and, interestingly enough, IL-6 is also included in this new core signature. Finally, in the 5-protein signature, IL-6 is excluded to provide another comparison and the five proteins now become a proper subset of the 18 original proteins uncovered by Ray et al. Table 4 presents the genes included in each signature, indicating the protein name, Entrez GeneID and official name. Tables 5, 6, 7 and 8 show the results of the 24 classifiers for all the signatures considered. The classifiers marked with a star have a random component; therefore the average of ten runs with different seeds is reported. Finally, Tables 9 and 10 summarize the results.
Removing IL-6 from the biomarker set we have a small gain in predicting AD in both data sets, if compared to the 6-protein signature. In this case, the prediction of AD on the ''AD'' test set achieves an average of 96% without dropping the accuracy of the prediction of NonAD. doi:10.1371/journal.pone.0003111.t008
The results of our 5-protein signature are reported in Table 8. When considering the ''AD'' test set, the average accuracy of the 5-protein signature (over 24 classifiers) is 96% when predicting AD and 90% when predicting non-demented control. It is also worth mentioning that there are four different classifiers achieving almost 100% accuracy (i.e. having a number of errors smaller than or equal to 1) for predicting AD on the ''AD'' test set. These results are achieved without losing accuracy when predicting non-demented controls on the same dataset. One feature of the experiments in Table 9 is worth commenting on: all the signatures drop at least 30% in accuracy when considering the ''MCI'' dataset. This is understandable since the classifiers have no sample labelled ''MCI'' in the training set.
The best overall result, considering both test sets, is obtained by the 6-protein and 5-protein signatures. Each presents 18 errors, and for both signatures this result is obtained twice, when using the LMT and Simple Logistic classifiers (Tables 7 and 8).
In Table 10, the standard deviations of the number of errors are almost constant for all signatures, in all datasets. This reinforces our previous claim: the poor performance of the signatures on the ''MCI'' dataset is related to the fact that the signatures were not trained to distinguish between AD and MCI.
To present the experiment results in another form, we compared the performance of each signature in each test. Table 11 presents the comparison between the signatures when considering all the test sets (''AD''+''MCI'') totalling 139 samples. It is remarkable that the 5-protein signature not only has a better average performance, but also presents the best result on 16 of the 24 algorithms used for classification (the number of errors highlighted in bold text indicates the best performance for this particular classifier).
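The per-classifier comparison described above amounts to finding, for each classifier, the signature with the fewest errors and counting how often each signature comes out best. The sketch below illustrates that bookkeeping. The error counts in `toy_errors` are made up for illustration and are not the values of Table 11, and counting a tie as a win for every tied signature is our assumption.

```python
def count_best(errors_by_signature):
    """Count, per signature, the classifiers on which it has the fewest errors.

    `errors_by_signature` maps a signature name to a list of error counts,
    one per classifier, in the same classifier order for every signature.
    Ties count as a win for every tied signature (an assumption).
    """
    names = list(errors_by_signature)
    n_classifiers = len(errors_by_signature[names[0]])
    wins = {name: 0 for name in names}
    for i in range(n_classifiers):
        best = min(errors_by_signature[name][i] for name in names)
        for name in names:
            if errors_by_signature[name][i] == best:
                wins[name] += 1
    return wins


# Hypothetical error counts for three classifiers (NOT taken from Table 11).
toy_errors = {
    "18-protein": [25, 30, 22],
    "10-protein": [24, 28, 22],
    "6-protein": [20, 27, 23],
    "5-protein": [18, 27, 21],
}
wins = count_best(toy_errors)
# wins == {"18-protein": 0, "10-protein": 0, "6-protein": 1, "5-protein": 3}
```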
In Table 12, the same comparison is made but only considering the ''AD'' test set. Once again, it is possible to visualize the performance of the 5-protein signature, obtaining not only the best average result but also the best individual results, presenting 3 errors on 3 occasions.
Finally, Table 13 presents the same analysis for the ''MCI'' test set. In this case the most remarkable observation is the poor quality of the MCI-AD predictions. The better performance of the largest signatures is related to the fact that they have more genes: because the signatures were not trained to distinguish MCI patients, the use of more proteins allows a slightly better performance. Nevertheless, even the best signature for this case (a 10-protein signature) presents a poor performance when compared with the previous results.

Discussion
In conclusion, it is clear that the experiment performed by Ray et al. provided an extremely useful dataset for the identification of Alzheimer's disease biomarkers. We have uncovered a robust 5-protein signature with near 97% accuracy in predicting AD against non-demented controls using their data. Our signature has fewer than one third of the proteins of the one proposed in the original paper, and at least the same level of prediction performance.
The next step in this important quest is to set up an independent experimental procedure that now considers samples with mild cognitive impairment (but without AD) in the training set. We do not agree with the methodology of using a training set without MCI to select biomarkers to differentiate between AD and MCI [1]. This has not been done and warrants further investigation. Only in this way can we uncover useful biomarkers to discriminate between AD and MCI.
On the positive side, our methods reveal the true predictive potential of testing for Alzheimer's disease using this panel of signalling proteins. We also believe that our methods show promise and warrant their application in other settings. It is clear that Alzheimer researchers can benefit directly from our identification of more robust biomarkers. The method is revealed to be useful, simple yet very powerful, and warrants its application in other multifactorial diseases.

Methods
Our methodology consisted of the application of an integrative data analysis method. We used four steps: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any of the test datasets. For the first two steps, we used the application of Fayyad and Irani's discretization algorithm [7] for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem [8][9][10]. Fayyad and Irani's method retained only 14 out of the 120 proteins of the training set (i.e. those proteins for which no threshold was selected were filtered out). After quantization, samples 7, 43 (AD, ''Alzheimer's Disease'') and 48 (NDC, ''Nondemented Control'') of the training set were ''in conflict'', which means that they have the same quantized values (for all 14 proteins selected) although they belong to different classes. These conflicts are then removed, i.e. the three samples of the training set are eliminated and we apply our algorithms to the remaining 80 samples of the training set. Numerical solution of the (alpha-beta)-k-Feature Set problem led to the selection of only 10 proteins, Table 4. For a detailed explanation of the methods and other applications, readers can check our referenced publications and references therein [11][12][13].
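The notion of samples ''in conflict'' can be made concrete with a small sketch. It assumes the quantized abundances are available as one tuple of discrete values per sample; the function name and the toy data are hypothetical, and the discretization step itself is not shown.

```python
from collections import defaultdict


def find_conflicts(quantized, labels):
    """Return sorted indices of samples whose quantized profile is shared
    with at least one sample of a different class."""
    groups = defaultdict(list)
    for idx, row in enumerate(quantized):
        groups[tuple(row)].append(idx)  # group samples by identical profile
    conflicts = []
    for members in groups.values():
        if len({labels[i] for i in members}) > 1:  # more than one class
            conflicts.extend(members)
    return sorted(conflicts)


# Toy data: three retained proteins, five samples (hypothetical values).
toy_quantized = [(0, 1, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1)]
toy_labels = ["AD", "NDC", "AD", "NDC", "AD"]

conflicts = find_conflicts(toy_quantized, toy_labels)  # [0, 1, 4]
kept = [i for i in range(len(toy_labels)) if i not in conflicts]  # [2, 3]
```

In the toy data, two AD samples and one NDC sample share a profile, mirroring the situation of training samples 7, 43 and 48; all three are dropped before feature selection.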
To guarantee the reproduction of all our experiments, we use algorithms from the Weka Package [3] as classifiers. All the classifiers were used with the default parameters; we are convinced that better results could be found if adjustments are made in each classifier (considering only its result over the training set).
The first signature we uncovered contains 10 proteins, see Table 4. Using the Pathway Studio software [3], we generated an undirected graph of the known 'direct relations' of these 10 proteins. Each node in the graph corresponds to a protein and an edge exists if the Pathway Studio software produced a 'direct relation', indicating an important association already observed in the life sciences literature. On this graph we looked for its maximum clique (Fig. 3a). We denote this graph as G = (V,E). Each vertex in V has a one-to-one correspondence with a protein. A pair of vertices is connected by an edge in E if and only if there is a direct relation between the two proteins reported in the literature. A clique in G is a subset X of V such that its induced graph G[X] is complete. In other words, we are looking for the maximum subset of proteins in which all pairs of proteins already have a direct relationship identified between them; we consider this set the core of our 10-protein signature (this core has the 6 proteins listed above, see Fig. 3b).
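For a graph with only 10 vertices, a maximum clique can be found by exhaustive search. The following sketch shows one way to do it; it is not necessarily the procedure we used. The vertex names P1 to P5 and the edges are hypothetical and do not correspond to the Pathway Studio relations.

```python
from itertools import combinations


def maximum_clique(vertices, edges):
    """Exhaustive search for a maximum clique of an undirected graph.

    Feasible for very small graphs only: a 10-vertex graph has at most
    2**10 candidate subsets.
    """
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Try the largest subsets first: the first subset whose vertices are
    # all pairwise adjacent is a maximum clique.
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(b in adjacency[a] for a, b in combinations(subset, 2)):
                return set(subset)
    return set()


# Hypothetical toy graph (NOT the protein-relation graph of Fig. 3a).
toy_vertices = ["P1", "P2", "P3", "P4", "P5"]
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
             ("P3", "P4"), ("P4", "P5")]

core = maximum_clique(toy_vertices, toy_edges)  # {"P1", "P2", "P3"}
```

If several cliques share the maximum size, this sketch returns the first one in the enumeration order of `combinations`.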
Our first benchmark test for this 6-protein signature was done using Simple Logistic (SL) [14], perhaps the simplest classifier from the Weka software suite. With our 6-protein signature, SL had a performance of 86% after applying 10 times 10-fold cross-validation over the training set (Fig. 3c). When considering the ''AD'' test set, our 6-protein signature with SL was able to make a classification with 97% accuracy. For AD samples we achieved 100% positive agreement and for NDC samples a 92% negative agreement (Fig. 3d).
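The ''10 times 10-fold'' protocol means that the training set is shuffled and split into ten folds, each fold is held out once, and the whole procedure is repeated ten times with different shuffles. The sketch below only generates the index splits and makes several assumptions: the seed is arbitrary, the folds are not stratified by class (Weka's own cross-validation may differ in this respect), and the sample count of 83 is the full training set, since the text does not state whether the three conflicting samples were included.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Unstratified sketch: every repeat reshuffles the sample indices and
    deals them into k folds of nearly equal size.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = sorted(i for i in order if i not in held_out)
            yield r, f, train_idx, sorted(test_idx)


splits = list(repeated_kfold(83))  # 10 repeats x 10 folds = 100 splits
```

A classifier would be trained on `train_idx` and scored on `test_idx` for each of the 100 splits, and the scores aggregated into a single performance figure.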
When using the second test set (labelled ''MCI''), which includes samples that had an initial diagnosis of mild cognitive impairment, the number of errors increases for all signatures. It is reasonable to expect that our very trimmed classifiers are going to have some degradation of performance, as they have not been trained to distinguish confirmed AD samples from those that have MCI. When using the same signature to differentiate between AD and other samples of MCI patients, the occurrence of more errors is an expected outcome (Table 9). In spite of this fact, the overall performance of all signatures seems very robust.