Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer’s Disease Progression

Background One primary goal of transcriptomic studies is identifying gene expression patterns correlating with disease progression. This is usually achieved by considering transcripts that independently pass an arbitrary threshold (e.g. p<0.05). In diseases involving severe perturbations of multiple molecular systems, such as Alzheimer’s disease (AD), this univariate approach often results in a large list of seemingly unrelated transcripts. We utilised a powerful multivariate clustering approach to identify clusters of RNA biomarkers strongly associated with markers of AD progression. We discuss the value of considering pairs of transcripts which, in contrast to individual transcripts, helps avoid natural human transcriptome variation that can overshadow disease-related changes. Methodology/Principal Findings We re-analysed a dataset of hippocampal transcript levels in nine controls and 22 patients with varying degrees of AD. A large-scale clustering approach determined groups of transcript probe sets that correlate strongly with measures of AD progression, including both clinical and neuropathological measures and quantifiers of the characteristic transcriptome shift from control to severe AD. This enabled identification of restricted groups of highly correlated probe sets from an initial list of 1,372 previously published by our group. We repeated this analysis on an expanded dataset that included all pair-wise combinations of the 1,372 probe sets. As clustering of this massive dataset is unfeasible using standard computational tools, we adapted and re-implemented a clustering algorithm that uses external memory algorithmic approach. This identified various pairs that strongly correlated with markers of AD progression and highlighted important biological pathways potentially involved in AD pathogenesis. Conclusions/Significance Our analyses demonstrate that, although there exists a relatively large molecular signature of AD progression, only a small number of transcripts recurrently cluster with different markers of AD progression. Furthermore, considering the relationship between two transcripts can highlight important biological relationships that are missed when considering either transcript in isolation.


Introduction
Alzheimer's disease (AD) is an irreversible brain disease that begins with mild memory impairment but eventually progresses to severe brain dysfunction and dementia. The prevalence of AD is rising dramatically due to an increasingly ageing population [1]. Early diagnosis is challenging as it can be difficult to discriminate the initial manifestations of the disease from cognitive decline that occurs as a function of normal aging [2]. Intensive efforts are being made to better understand AD and identify appropriate treatments, however the molecular mechanisms underlying the disease are still far from being understood.
In 2004, Blalock and colleagues [3] made an important contribution towards finding a set of molecular biomarkers that correlate with the progression of AD in one region of the brain. Using microarray technology, they assessed RNA transcript levels in post-mortem hippocampal tissue from 9 controls and 22 patients with varying degrees of AD severity. Participants were categorized, based primarily on Mini-Mental State Examination (MMSE) score [4], into one of four clinical groups: Control, Incipient AD, Moderate AD or Severe AD (see Materials and Methods for details of classification).
A total of 22,286 probe sets for protein coding RNA (i.e. mRNA) and non-coding RNA (ncRNA) were used to interrogate the transcriptome. After excluding probe sets with signals below detection thresholds and probe sets targeting unidentified transcripts (e.g. expressed sequence tags), Blalock and colleagues assessed the Pearson's correlation of each of the remaining 9,921 probe set values with MMSE score and neurofibrillary tangle (NFT) count. This analysis revealed 3,413 genes that are significantly correlated (at p,0.05) with MMSE score, NFT count or both [3].
In 2010, Gomez Ravetti and colleagues (including two authors of this contribution) re-analyzed the publicly available dataset of Blalock using a different method, uncovering a 1,372-probe set signature that presents a remarkably high consensus with established phenotypic markers of AD progression [5]. Instead of just assessing correlations between gene expression and clinical measures of AD (such as MMSE score or NFT count), they employed an integrated approach based on combinatorial optimization techniques [6,7] and Information Theory. Briefly, divergence of the gene expression profile of each individual sample from the average characteristic profile of the ''Control'' group was computed using the Jensen-Shannon divergence [8]. Similarly, this approach was used to compute the convergence of the gene expression profile of each individual sample to the average characteristic profile of the ''Severe AD'' group. This allowed the authors to identify genes with expression levels that correlate with the characteristic molecular progression from normal cognition to severe AD. In the current report, we use these quantifiers (which we will simply term as JSD control and JSD severe , respectively) as measures of AD progression, in addition to three common phenotypic markers (MMSE score, NFT count and Braak staging; obtained from [3]). We will refer collectively to these measures of AD progression as 'progression markers'.
The main objective of this work is to identify a reduced set of pairs of RNA transcripts that strongly cluster with markers of AD progression. The analysis of these clusters will also help to identify individual transcripts that recurrently appear in many pairs, and may thus guide the selection of candidate molecules for further research. Towards this end, we apply a large-scale graph-based clustering approach to datasets derived from the 1,372-probe set signature identified by Gomez Ravetti et al. [5] to identify molecular features that correlate with the different 'progression markers'. These datasets include the original 1,372-probe set signature, all pair-wise ratios that can be computed from the 1,372-probe set signature and an expanded dataset containing all pair-wise differences, summations, ratios and products that can be computed from the original signature (giving a total of 3,762,024 probe set combinations). We will refer to these probe set pairs as 'metafeatures' [9].
To cluster such a large number of metafeatures, we reimplemented and enhanced the MSTkNN algorithm in [10] by using the external memory (EM) approaches in [11,12,13]. External memory algorithms are known for their efficiency in handling large-scale data sets [14,15,16] and the MSTkNN algorithm is a graph based data clustering algorithm that has successfully been applied in several applications including the analysis of stock market time series [17], a gene expression dataset [18], a prostate cancer trial dataset [19] and has been integrated with a combinatorial optimization based graph visualization layout [20]. Our previous work indicates that the MSTkNN algorithm produces meaningful clusters (see [17,18,20]) and our proposed modification to this algorithm is still capable of producing reasonable clustering structure in terms of homogeneity and separation (Table 1).
After clustering all the metafeatures together with the different progression markers, we attempt to uncover pairs of probe sets that jointly cluster with each progression marker. Additionally, we identify some probe sets in these pairs that not only cluster with different progression markers but also relate to genes that share common biological pathways. We annotate these pairs of markers with the most recent information available. We also look at the expression of some of these markers in a different transcriptomic study that involves several regions of the brain in search for consensus among different studies.

Results
Results for each of the datasets are presented in the following order: 1. composition of the clusters containing the aforementioned progression markers, 2. results of functional analyses carried out using publicly available tools (iHop, Gather, GIM) and 3. a bibliographic (Pubmed) characterization of certain highlighted The implementations of the k-Means, SOM, CLICK algorithms are obtained from the Expander microarray data cluster tool in [124]. The homogeneity and separation are computed using the definition in [124]. The AD ratio metafeatures data set is generated by taking pair-wise ratios between the features in 1,372-probe AD signatures [5] and including MMSE score, NFT count, Braak staging, JSD control and JSD severe as five progression markers. The other data set contains four different types of metafeatures (ratios, summations, differences and products) and the aforementioned progression markers. doi:10.1371/journal.pone.0045535.t001 probe sets. Probe sets are highlighted for several reasons, including having a strong correlation with one or more progression markers or appearing in a metafeature with a probe set for another gene which has a potential role in AD or in other brain disease. The clustering outcomes are given in File S1, File S2 and File S3. Where a particular probe set has been highlighted and discussed in our previous publication on this dataset [5], we refer the reader to this paper rather than re-iterating discussion points in the present report.
1,372-probe set signature (File S1) First, we analysed the 1,372-probe set signature from [5] using the proposed clustering algorithm and found a total of 228 clusters. We then identified the clusters that contain the proposed progression markers (Figure 1 and Table 2).
Cluster containing MMSE score. A total of 11 probe sets clustered with MMSE score. Among the gene transcripts targeted by these probe sets, TTN (Titin/Connectin) may have a role in AD progression through its ability to form amyloid aggregates [21]. The genes ATP5C1, COX4I1, KLHL20, ITGB8 and C10orf76 were Figure 1. Visualization of the clustering outcome of the 1372-probe set signature. The figure shows only the clusters that contain the progression markers (hexagonal nodes). We note that the probe set for PTEN, whose product has been recently observed to localize with intracellular NFTs [36], has values that correlate strongly with the Jensen-Shannon divergence of the severe profile (JSD severe ). doi:10.1371/journal.pone.0045535.g001 highlighted by the analysis in Gomez Ravetti et al. [5] and we refer the reader to this paper for extensive discussion of these genes in the context of AD. We identified several probe sets that uniquely cluster with MMSE score and target transcripts for genes not previously associated with AD. These included PTMS (parathymosin), a chromatin remodelling protein essential for cell cycle progression and proliferation of normal and malignant cells, MKS1 (Meckel syndrome, type 1), mutations in which are associated with a malformation of central nervous system known as Meckel syndrome, and SPDEF and a probe set for C19orf50.
Cluster containing NFT count. Two probe sets clustered with NFT count. These probe sets targeted transcripts for COX6B1 and MMP11, both of which were highlighted in Gomez Ravetti et al. [5] and will not be discussed in detail here.
Clusters containing JSD control and JSD severe . Two probe sets clustered with JSD control -these related to the genes ALDOB (aldolase B, fructose-bisphosphate) and MLLT4 (myeloid/lymphoid or mixed-lineage leukemia, translocated to, 4), which is reportedly involved in the formation and remodelling of synapses in hippocampus [33]. Three probe sets clustered with JSD severe . These related to the genes BTG3 (BTG family, member 3), which is related to prognosis of neuropathic symptoms in paraneoplastic neuropathy [34], TREX1 (three prime repair exonuclease 1), mutations in which can cause a neurovascular disorder involving progressive cognitive decline due to brain degeneration [35], and PTEN (phosphatase and tensin homolog). Altered distribution of PTEN has been reported in degenerating neurons in AD; specifically, a delocalization from the nucleus to the cytoplasm and accumulation in intracellular neurofibrillary tangles [36].
Comparing clustering outcomes with statistics-based outcomes. An alternative method for selecting probe sets highly correlated with progression markers is to simply perform regression analysis and identify the probe sets with the highest correlation coefficient or lowest p-value. In essence this is a univariate approach, where the selection of a particular probe set is independent of which other probe sets are selected. In contrast, the MSTkNN-based clustering method used here is a multivariate approach which considers the interrelationships of different probe sets in its selection. As such, we would expect the MSTkNN-based method to identify features of potential biological relevance that are missed by univariate approaches.
To investigate the similarities and differences in outcomes of multivariate and univariate approaches, we compared the probe sets identified by our clustering method with those having the highest, most significant correlations by regression analysis (see Figure S1). For example, we compared the 11 probe sets clustered with MMSE to the 11 probe sets most significantly correlated with MMSE by regression analysis. While there were some similarities in the probe sets identified, there were also important differences. Notably, some genes of particular relevance to AD were selected by the clustering method but not by filtering based on statistical For each progression marker, probe sets have been ordered according to their Spearman's rank correlation with the progression marker. Gene symbols in boldface indicate that they were previously discussed in [5] and gene symbols with underlined boldface represent the cases for which a putative relationship exists in the published literature between the gene and AD. doi:10.1371/journal.pone.0045535.t002 significance. For example, clustering with Braak staging identified a probe set targeting CR1 which, as mentioned above, is widely proposed as a genetic risk factor for AD, however this probe set was not identified by the statistical method.

941,885 Ratio Metafeatures Data Set (File S2)
Next, we conducted a cluster analysis using the proposed external memory clustering algorithm on the 941,885 metafeatures generated by calculating the ratio of each pair of probe sets from the 1,372-probe set signature [5]. There were a total of 40,139 clusters in the data set, from which we again identified the clusters that contain the previously mentioned progression markers.
Cluster containing braak staging. A total of 27 ratio metafeatures clustered with Braak staging values ( Table 5). Three of these were comprised of probe sets targeting genes that can be mapped to common KEGG pathways: CR1/SERPINA5 (complement and coagulation cascades), MDH2/ACO2 (TCA cycle) and ITGB5/TNXB (ECM-receptor interaction). Of the various probe sets identified in this cluster, only one targets a gene previously implicated in AD -CR1 (discussed above). Metafeatures in the cluster also contained probe sets targeting six other genes previously highlighted by the analysis of Gomez Ravetti et al. [5]: PTN, PPT1, RIMS2, ASTN1, ITGB5 and PAX6. Many of the metafeatures that showed a positive correlation with Braak staging were dominated by a probe set for LRRC48 (leucine rich repeat containing 48), a gene which is currently poorly characterized.
Cluster containing JSD control . A total of 93 ratio metafeatures (containing a total of 94 different probe sets) clustered with JSD control (Table 6), see also File S2 for complete list). Three of these metafeatures comprised probe sets targeting genes in common KEGG metabolic pathways: LDHA/GOT2 (cysteine metabolism), Most of the metafeatures in this cluster that showed a positive correlation with the divergence from control to severe AD were dominated by KLK3 (kallikrein 3), also known as prostate specific antigen, a well-known blood biomarker of prostate cancer [73]. To determine whether correlations involving KLK3 levels were influenced by gender, we stratified our dataset by gender. We performed the clustering again in both gender-specific datasets and found that KLK3 was completely absent in the same cluster of the dataset comprising only females. We therefore suggest that We have selected 20 metafeatures (10 most positively correlated and 10 most negatively correlated) clustered with JSD control and ordered them by Spearman's rank correlation with JSD control . Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature. Metafeatures are ordered by Spearman's rank correlation with JSD severe . Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature. doi:10.1371/journal.pone.0045535.t007  (Table 7). For four of these metafeatures, both probe sets comprising the metafeature targeted genes that can be mapped to a common KEGG pathway. These metafeatures were MLLT4/PTEN (tight junction), PRKCB1/ATP2B2 (calcium signaling pathway), TGFB2/PPP2CA (TGF-b signaling pathway), CYP3A4/CPT2 (fatty acid metabolism). Metafeatures in this cluster contained probe sets for three genes previously investigated in the context of AD (CYP3A4 [74,75,76], ATP2B2 [77,78] and PPP2CA) and four other genes previously highlighted by the analysis of Gomez Ravetti et al. [5] (PTEN, PRKCB1, FCAR and CPT2).

Estimation of False Discovery Rate
Investigating a very large data space, such as that occupied by the many possible metafeatures, will inevitably lead to a number of false positive findings. In order to estimate the false discovery rate at different correlation coefficient thresholds, and therefore demonstrate the validity of our approach in identifying more than just random events, we performed a simple a Monte-Carlo permutation test by randomly permuting the MMSE scores of the 17 samples and computing the correlation of each metafeatures with the permuted MMSE labels. The results after 1,000 permutations reveal that, at all thresholds tested, there is clearly a higher number of strongly correlated metafeatures among our ratio metafeatures dataset than would be expected by chance alone (see Figure S2, Figure S3, Figure S4 and Figure S5).

3,763,403 Ratio-sum-difference-product Metafeatures Data Set (File S3)
We next applied our clustering algorithm to a data set of 3,763,403 metafeatures. This dataset was produced by calculating all pair-wise differences, summations, ratios and products of the 1,372probes identified in Gomez Ravetti et al. [5].
The algorithm created a total of 121,611 clusters for the data set. We identified one larger cluster containing all of the progression markers. Due to the large number of metafeatures in the cluster, we focused only on the metafeatures with the strongest positive and negative correlations with each of the progression markers. We refer to File S3 for the details of this cluster.
Metafeatures correlating with braak staging. From the list of 50 metafeatures most strongly correlated with Braak staging, five comprised probe sets targeting genes in common pathways (See Table 10). Only one gene previously proposed to be involved in AD, CR1 (see above), was targeted by a probe set within the 50 metafeatures. Probe sets targeting some important genes highlighted by the analysis of Gomez Ravetti et al. [5] (COX4I1, CASP9, ITGB1, RHOQ, DLGAP2, GSTA3, BCL2, COX6A1 and ATP5C1) were also found in this set of metafeatures.

Comparison of Metafeature Correlations and Single Probe Set Correlations
The observation of particular probe sets recurring in multiple clustered metafeatures raises the question of whether the clustering of certain metafeatures is driven by a strong correlation between a progression marker and only one of the two individual probe sets comprising a metafeature. To investigate this possibility, we separately assessed the correlation of the two probe sets comprising a metafeature with the progression marker in question and compared this to the correlation between the metafeature and the progression marker.
Deeper analysis of the ''ratio metafeatures'' that clustered with MMSE scores reveals a number of metafeatures where individual We have selected 50 metafeatures (25 most positively correlated and 25 most negatively correlated) and ordered them by Spearman's rank correlation with MMSE score. Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature (see File S3 for details). doi:10.1371/journal.pone.0045535.t008 Table 9. Ratio-sum-difference-product metafeatures clustered with NFT count.  Table S1). This suggests that the high correlation of certain metafeatures does not simply reflect an additive effect of its two component probe sets (see Figure S6) but instead is driven by the dynamics of the interrelationship (possibly a biological interaction) between the two transcripts that are targeted.
For specific examples we refer to Figure 2 where we demonstrate an example scenario with the three probe sets targeting TTN, CASK and TUG1 and the metafeatures TTN/ PKRCB1, CASK/PTEN and TUG1/SCFD1. In this example, the metafeatures show a better correlation with MMSE score than the relevant individual probe sets ( Figure 2). In general, we found that if both the probe sets in a metafeature target genes in a common pathway, then the metafeature shows better correlation with the progression marker than either of the two individual probe sets. For example, TTN and PKRCB1 both appear in the 'focal adhesion' KEGG pathway and CASK and PTEN in both appear in 'tight junction' pathway.
There were various instances in which the individual components of highly correlated metafeatures did not map to a common KEGG pathway. For example, a metafeature containing probe sets for TUG1 and SCFD1 shows a better correlation with MMSE score than either probe set individually. However it is not surprising that these transcripts do not map to a common pathway, as long ncRNAs such as TUG1 are relative newcomers to functional annotation.
It should be noted, however, that a high proportion of the metafeatures (.90%) showed comparatively better correlations with the progression markers than the individual probe sets comprising these metafeatures. While this is not a universal phenomenon (see Figure S6 and Table S1 for examples, where two probe sets that are highly correlated with a progression marker combine to create a poorly correlated metafeature), in view of the larger data space occupied by the metafeatures, it is logical that the metafeature analysis may yield an increased proportion of spurious, false positive results. This is supported by the observed difference in the estimated false discovery rate of two datasets of different sizes (see Table S2 and Table S3). In an attempt to avoid such results, we subsequently focus on 'robust' findings -probe sets that recurrently cluster with different markers of AD progression.

Robust Markers of AD Progression
We next attempted to identify probe sets that appeared recurrently in the metafeatures and also clustered with different markers of AD progression. We depict this group of probe sets in a 5-way Venn diagram (Figure 3 and Figure 4). In these figures, a null (Q) symbol means that even if an overlap is shown in the figure, there is no common transcript. From the 941,885 ratio metafeatures data set, we identified 11 probe sets that, as part of metafeatures, clustered with more than one progression marker.

Validation of Robust Markers in an Alternative Dataset
In order to gain insights into whether the robust markers highlighted above are restricted to the hippocampus or show changes in other AD-affected brain regions, we utilized an independent dataset contributed by Liang and colleagues [95,96]. This dataset contains microarray data on gene expression in neurons isolated from four different regions of control and AD brain: entorhinal cortex (EC), hippocampus (HIP), middle temporal gyrus (MTG) and posterior cingulate cortex (PC). Molecular signatures of each different region were generated as described in Materials and Methods. Several of the 'robust' probe set markers highlighted by our current analysis of hippocampal tissue were also selected in the molecular signatures of two or more regional neuronal populations. For example, PPIA and ATP5C1 showed expression changes in neurons isolated from the MTG and PC of AD brain relative to control brain, PTEN showed expression changes in the HIP and PC and ICA1 showed expression changes in the HIP, MTG and PC ( Figure 5). In addition, Visinin-like 1 (VSNL1), highlighted in the analysis in Gomez Ravetti et al. [5] as one of the best markers of AD progression and recently proposed as one of the four best CSF biomarkers of early AD [97], showed expression changes in neurons isolated from EC, MTG and PC ( Figure 5) and has been shown in an additional dataset to have altered expression in various brain regions in AD [98].

Discussion
The present study has extended on our previous analysis in [5] by (i) considering variables that represent the interrelationship between two RNA transcripts (i.e. metafeatures) and (ii) applying a novel and powerful graph-based clustering approach to identify a reduced set of transcripts or transcript pairs that correlate strongly with markers of AD progression. This clustering approach is facilitated by the implementation of an external memory algorithm on a graphical processing unit, which allows clustering of massive datasets (in this case involving four million elements) that is not feasible using standard computational methods. We have selected 50 metafeatures (25 most positively correlated and 25 most negatively correlated) and ordered them by Spearman's rank correlation with NFT count. Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature (see File S3 for details). doi:10.1371/journal.pone.0045535.t009 In addition to identifying transcripts and metafeatures that correlate with established phenotypic markers of AD severity (i.e. MMSE score, NFT count, Braak staging), we utilized two addition quantifiers of AD progression, based on Jensen-Shannon divergence, to identify transcripts and metafeatures that correlate with a putative molecular trend from control to severe AD. It remains to be seen whether these quantifiers provide a more accurate assessment of AD severity, however there are several promising attributes that suggest this is the case. Firstly, the Jensen-Shannon divergence values are based on a set of 1,372 different transcript markers, in contrast to the univariate markers of neuropathology or cognitive function. Secondly, some of the changes observed at the transcriptional level may underlie AD pathogenesis, whereas phenotypic consequences are more likely to simply reflect the molecular perturbations that drive disease pathogenesis. The results presented here indicate that a number of probe sets that are highly correlated with Jensen-Shannon divergence are also highly correlated with more traditional phenotypic markers of AD progression. Our previous studies using these quantifiers in the context of AD [5] and cancer [99] have also yielded high consensus with established markers of disease progression. Together, these findings give us confidence that metrics based on multivariate transcriptional changes can act as reliable markers of disease stage.
The analyses reported here demonstrate that, although there exists a relatively large molecular signature associated with AD progression, a relatively small number of transcripts appear recurrently in metafeatures clustered with the progression markers. This allowed a focussed investigation of a reduced set of biomarkers that have been previously studied in the context of cognitive decline and AD. Furthermore, our approach also put emphasis on a few novel markers that have not been discussed previously in relation to AD progression and warrant further investigation.
While it is outside the scope of this paper to discuss in detail all of the genes identified in the analyses, focussed discussion of some of the most robust findings is warranted. As mentioned at the end of the Results, as set of six genes (PPIA, ATP5C1, LDHA, PTEN, ICA1, CPT2) recurrently appeared in metafeatures and clustered with more than one progression marker in both metafeature datasets. Furthermore, changes in expression of several of these genes were validated in an alternative microarray dataset of neurons from different AD-affected brain regions, lending further support to the proposal that altered expression of these genes may be involved in AD pathogenesis.
Peptidylprolyl isomerase A (PPIA), also known as cyclophilin A, is believed to accelerate protein folding. Evidence from neural cell lines suggests PPIA can protect against Ab-induced oxidative stress, possibly by acting as a ROS scavenger [100]. Proteomics studies have revealed decreased expression of PPIA in brains of patients with non-Alzheimer's disease tauopathies [41], suggesting alterations in PPIA may be associated with the general process of neurodegeneration rather than AD specifically. Curiously, PPIA has been proposed as a suitable reference gene for PCR studies of AD brain due to its stable expression [101,102]. The results of our analysis strongly argue against this and instead indicate that PPIA expression, particularly when considered as part of a metafeature, is strongly correlated with AD progression.
ATP synthase subunit gamma (ATP5C1) encodes a subunit of mitochondrial ATP synthase, important for catalyzing ATP synthesis in oxidative phosphorylation. While this gene has not previously been implicated in AD, its transcriptional correlation with AD progression may reflect disturbances in energy production as a result of cellular loss.
Lactate dehydrogenase A (LDHA), another metabolic gene, is responsible for catalysing the conversion of lactate to pyruvate, the final step in anaerobic glycolysis. A recent study has demonstrated that increased LDHA activity is a feature of nerve cell lines that are resistant to Ab-induced cell death and that the phenomenon of aerobic glycolysis might contribute to the mechanisms by which certain neurons in the AD brain survive apoptosis [103].
Phosphatase and tensin homolog (PTEN) has generally been studied in the context of cancer, as it is a tumor suppressor with phosphatase activity that negatively regulates the AKT/PKB signalling pathway. However PTEN has also been shown to be necessary for proper migration of neurons and glia [104]. There is decreased expression and altered distribution of PTEN in AD brain [105,106], where it localizes with neuritic pathology such as neurofibrillary tangles in damaged neurons [36]. PTEN affects phosphorylation and aggregation of tau [106,107] and appears to be regulated by presenilin, as presenilin deficient neurons show a substantial reduction in PTEN [108]. Furthermore, mutations in the PTEN induced putative kinase 1 (PINK1) gene have been linked to early-onset familial Parkinson's disease [109,110], while ablation of PTEN in dopaminergic neurons is neuroprotective in mouse models of Parkinson's disease [111].
Islet cell autoantigen 1 (ICA1) is believed to be an autoantigen in insulin-dependent diabetes mellitus. ICA1 is the major binding partner of PICK1 and together these proteins regulate trafficking of AMPA glutamate receptors to the synapse [112]. It has also been proposed that ICA1 participates in the process of neuroendocrine secretion through association with certain secretory vesicles [113].
Carnitine palmitoyltransferase 2 (CPT2) is involved in the oxidation of long-chain fatty acids in the mitochondria. This gene has not previously been associated with AD.
In addition, we highlight some relevant genes that correlated with more than one progression marker in one of the metafeature We have selected 50 metafeatures (25 most positively correlated and 25 most negatively correlated) and ordered them by Spearman's rank correlation with Braak staging. Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature (see File S3 for details). doi:10.1371/journal.pone.0045535.t010 Table 11. Ratio-sum-difference-product metafeatures clustered with JSD control . datasets. Protein kinase C beta, encoded by PRKCB1, is involved in a wide range of signalling pathways. Increased expression of protein kinase C beta has been observed in membrane fractions of aged Tg2576 mice, a model of AD [114]. Furthermore, one of the best biomarkers in [5], visinin-like 1 (VSNL1), is only expressed in neurons and shows decreasing expression as AD progresses. Levels of VSNL1 in the CSF have recently been proposed as an effective biomarker of early AD [97].
It is interesting to remark that the individual markers identified in this study are bringing new insights to the pathological mechanisms involved in AD but that an integrative approach is required to understand them. For example, increased VSNL1 in the CSF observed in [97] may be a consequence of increasing neuronal death rather than an marker of early AD. On the other hand, as LDHA expression is currently being considered as a possible marker of aerobic glycolysis in Ab-resistant neurons [103], the correlation between LDHA expression and AD progression makes sense if we think that Ab-resistant neurons will be proportionally more abundant in samples with greater disease severity.
One limitation of the present study is that the low number of samples (17) available for these analyses may have resulted in a large number of highly correlated probe sets or metafeatures that are false positives (see Figure S2, Figure S3, Figure S4, Figure S5, for a validation of our correlations). The selection of just a few features out of this large data space has been a critical task that we attempted to solve by focusing on those which appear most recurrently. Unfortunately, there are strong possibilities that even with a set of random data and a very large search space, a set of false positive markers may recurrently appear. However, since a relatively higher number of published AD studies can already be found that implicate these markers, we feel comfortable in making the claim that they warrant further investigation in future AD research.
The results presented here support the hypothesis that systematically considering relationships between two or more features (''metafeatures'') can improve biomarker discovery, particularly when explored within a multivariate framework. While univariate approaches may still provide important and complementary insights to those obtained using multivariate methods, we believe that utilizing both approaches in conjunction is likely to produce the most robust and relevant findings. Computational advances such as the external memory implementation of our clustering algorithm will hopefully make investigations of this type more commonplace, and we are currently working towards more sophisticated parallel applications that would enable the study even of larger datasets across a range of diseases.

Datasets
This analysis draws on the data set contributed by Blalock et al. [3] which can be accessed from the NCBI Gene Expression Omnibus under the accession number GSE1297. The Blalock study used Affymetrix HG-U133A microarrays to generate data on 22,286 probe sets. From this dataset, we focused our analysis on the 1,372-probe set signature identified by Gomez Ravetti et al. [5].
In the present study, instead of using all 31 samples, we excluded 14 samples that had gene expression profiles similar to the representative profile of the ''Control'' group (n = 7) or similar to the representative profile of the ''Severe AD'' group (n = 7). The 17 remaining samples correspond to the central 17 columns of the supplementary material 'File S2 (sheet:1372-probe)' of [5] and the data set containing the signature of Gomez Ravetti et al. [5] for these 17 samples is termed 1,372-probe set signature throughout the paper. Our rationale for excluding these 14 samples is two-fold. Firstly, little information about disease progression is likely to be gained by considering participants at either extreme of a disease spectrum. Control participants will not have developed any molecular characteristics of early AD and participants with severe AD may have already progressed to the disease endpoint. Secondly, as the ''Control'' and ''Severe AD'' groups were used to generate the 1,372-probe set signature, inclusion of these samples would likely influence correlations in a biased way. By excluding these samples, we can assess correlation with AD progression in a truly independent 'test' set of samples.
The advantages of using pair-wise relational features (i.e. metafeatures) have recently been demonstrated by Rocha de Paula et al. [116] in the context of plasma protein biomarkers for the early detection of AD. We therefore expanded the 1,372-probe set signature by applying different operators between each possible pair of probe sets. This led to the creation of two ''artificial'' data sets: N The first data set includes all the pair-wise ratios of the gene expression values in 1,372-probe set data. It contains a total of 941,885 probe sets, metafeatures and progression markers. We refer to this data set as the 941,885 ratio metafeatures data set.
N The second data set includes all the pair-wise differences, summations, ratios and products of the gene expression values in We have selected 50 metafeatures (25 most positively correlated and 25 most negatively correlated) and ordered them by Spearman's rank correlation with JSD control . Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature (see File S3 for details). doi:10.1371/journal.pone.0045535.t011 1,372-probe set data. It contains a total of 3,763,403 probe sets, metafeatures and progression markers. We refer to this data set as the 3,763,403 ratio-sum-difference-product metafeatures data set.
In each of these two data sets, we have included the original 1,372-probes gene expression signature and five measures of AD progression: MMSE score, NFT count and Braak staging from Blalock et al. [3] and the Jensen-Shannon divergences, JSD control and JSD severe , from Ravetti et al. [5]. We assume here that correlation between the values associated with these measures and microarray probe set expression values can highlight important biomarkers of AD progression.

Data Clustering
We have adapted, enhanced and re-implemented the MSTkNN graph partitioning algorithm in [10] to cluster the features in the large data sets. The proposed method utilizes a graph partitioning approach that optimizes both the local minimum (by using the kNN graph) and global minimum (by using the MST). Therefore, the clusters presented are not necessarily a sorted a list of probes by their correlation to the phenotypes.
The algorithm first constructs an undirected and complete graph from the data set where each node is a biological feature and each edge represents a correlation between two features. Then, the algorithm starts the clustering process by computing two proximity graphs: a minimum spanning tree (G MST ) and a knearest neighbour graph (G kNN ); where the value for k is adaptively selected from the following equation: Subsequently, the algorithm inspects all edges in G MST . If for a given edge (x,y) neither x is one of the k nearest neighbors of y, nor y is one of the k nearest neighbors of x, the edge is eliminated from G MST . This results in a new graph G' = G MST -{(x,y)}. Since G MST is a tree, after the first edge is deleted G' is now a forest, as it is a graph that composed of two subtrees. The algorithm continues applying the same procedure to each subtree in G' thus generated (with a value of k re-adjusted by eq. (1) above where n is now the number of nodes in each subtree), until no further partition is possible. The final partition of the nodes of G' induced by the forest is the result of the clustering algorithm. Figure 6. Demonstration of the modified MSTkNN algorithm. (a) An MSTp created from a data set with n = 10 features/probe sets. Each edge is labeled with an integer value p, where the value of p is determined using a sorted list of nearest neighbors for each feature (see eq. (2)). The edge between F9 and F10 is a candidate for elimination, since it has a value of p. tln (10)s = 2 (b) Two connected components are identified and we apply the same procedure with the component that has more than three elements. (c) The final outcome of the clustering.
The original algorithm requires(n|(n{1)=2) distance values (between all pairs of the n elements) as the input. For a large data set, this may be too large to fit in the computer's in-memory and, for even larger values of n, it may not even fit in external memory. Even if we can store the distance matrix in the external memory, the computational speed will slow down dramatically because of the increased number of I/O operations. Therefore, we modified this step and instead of creating the complete graph from the distance matrix, we create a q-nearest neighbor graph (G qNN ), where q = tln (n)s+1. This procedure reduces the input graph size, but still creates a reasonable clustering structure of the data set. The value of the q is determined from the inclusion relationship [117] of the G MST and the family of the nested sequence of graphs (G kNN , where k . ln(n)).
Next, we compute the MST of the G qNN graph. We term it as, G MSTp . We annotate each edge in G MSTp according to the following procedure: for each edge (a,b) in E(G MSTp ) we assign an integer value p such that if f(a,b) is the index of b in the sorted list of nearest neighbors of a in G qNN , the value of p is given by, We define the maximum value of p in the MSTp (or any of its components) as p max and then, we partition the G MSTp with the following criteria: C 1 . If p. tln (n)s; remove the edge, C 2 . If p max , tln (n)s; remove the edges with weight p max -1, and; C 3 . If p max = 1 or p max = tln (n)s; do not remove any edge and the result is a ''cluster''.
The final output of our algorithm is a set of partitions or clusters of the input data (See Figure 6). The algorithm does not require any pre-determined value for q but it is possible to change the threshold from tln (n)s to any other user-defined parameter. The complete algorithm can be found in [13]. To accelerate the data preprocessing we employed General Purpose Graphics Processing Unit (GPGPU) computing and implemented a fast and scalable approach to compute the distance metrics and the q-nearest neighbor graph (G qNN ). An illustrated example of our GPU-based nearest neighbor search algorithm is given in File S4.
To create the MST from large data set we adapted the EM MST algorithm in [11] and modified it to annotate the edges according eq. (4). The I/O complexity of this algorithm is O(sort(m)Nlog(n/M)), where n is the number of nodes of the original We have selected 50 metafeatures (25 most positively correlated and 25 most negatively correlated) and ordered them by Spearman's rank correlation with JSD severe . Genes in boldface indicate that they were previously discussed in [5] and genes with underlined boldface represent the cases for which the gene has been discussed in the context of AD in the published literature (see File S3 for details). doi:10.1371/journal.pone.0045535.t012 graph, m is number of edges and M number of nodes that fit into computer's internal memory, respectively, and the sort(m) is the time required for sorting the m edges. After partitioning the MST, we identify the connected components using the EM connected component algorithm in [11,12,118]. The I/O complexity of this algorithm is O(mNlog(log(n)). Unlike other clustering tools, we store the connected components/clusters in external memory and only keep the list of the components in computer's in-memory. This eliminates the excessive use of the in-memory even when there are a large number of components or clusters. Additionally, we tuned the implementations of the adapted algorithms [11,12,118] for better performance with denser graphs. Since our algorithm has been implemented in external memory approach, we term our algorithm as EM MSTkNN algorithm. Please note here that our proposed method can be implemented either in-memory or external memory paradigm. To make this method further scalable, we have taken the advantages of external memory algorithms and environments.
The computational tests were performed on a Xenon Nitro T5 Supermicro server (16 CPU cores, 32 GB RAM, 4x NVIDIA Tesla C2050 ''Fermi'' GPU cards (1792 GPU Cores and 12 Gb RAM total) and 800GB Hard-disk) and the programs were written in C/C++ with the support of CUDA [119], STL, STXXL [120] and BOOST [121] library and compiled using the g++ and nvcc compiler on a Linux operating system with kernel version 2.6.9.  Table S4., for further details of correlation of these markers to the phenotypes. doi:10.1371/journal.pone.0045535.g003

Monte-Carlo Random Permutation Test
Assessing correlations involving a large number of metafeatures and a small number of samples has the potential to lead to spurious, false positive results. To estimate the false discovery rate when assessing correlations involving metafeatures, we performed a simple a Monte-Carlo permutation test, a useful resampling test when there are many possible orderings of the samples. In this test, we randomly permuted (rearranged) the values of the progression marker in question and computed the correlation of each metafeature against it. A total of 1,000 iterations of the test were performed and the average number of metafeatures passing various correlation coefficient thresholds determined.

Functional Annotation
After performing the clustering on the expanded data sets, we identified the specific clusters that contained the progression markers  Table S5., for further details of correlation of these markers to the phenotypes. doi:10.1371/journal.pone.0045535.g004 (MMSE score, NFT count, Braak staging, JSD control and JSD severe ) and analysed the correlation (using Spearman's rank computation) of the clustering probe sets or metafeatures with these progression markers. If the size of the cluster was very big, we noted the top most positively and negatively correlated probe sets or metafeatures.
GATHER [122], a popular online tool for interpreting genomic signatures, was used to assess possible biological relationships between the two transcripts targeted by the probe sets comprising a metafeature. We checked each of the clustering metafeatures to determine if the relevant transcripts share any common biological pathway (KEGG pathways). Our objective here is to detect the pair of transcripts that not only appear in the same pathway but also jointly activate the progression of AD.

Validation Using an Alternative Dataset
For validation of changes in select genes, we analyzed the dataset contributed by Liang and colleagues [95,96], which can be accessed from NCBI Gene Expression Omnibus under the accession number GSE5281. This microarray dataset was generated using Affymetrix Human Genome U133 Plus 2.0 Arrays and assessed gene expression in healthy neurons isolated by laser capture microdis- Figure 5. Validation of robust markers of AD progression in an alternative dataset. Transcript levels for selected genes of interest were investigated in the microarray dataset of Liang and colleagues [95,96], which assessed gene expression in healthy neurons isolated from four different regions of control and AD brain: entorhinal cortex (EC), hippocampus (HIP), middle temporal gyrus (MTG) and posterior cingulate cortex (PC). Data presented in this figure were normalized using Robust Multichip Average (RMA). In the box and whisker plots, the bottom and top of the box represent the lower and upper quartiles, respectively, and the band within the boxes represents the median, while the ends of the whiskers represent the minimum and maximum values. doi:10.1371/journal.pone.0045535.g005 section from different regions of post-mortem control and AD brain. We refer the reader to [95,96] for full experimental details.
In the analyses presented here, we investigated gene expression in the entorhinal cortex (13 controls, 10 AD), hippocampus (13 controls, 10 AD), middle temporal gyrus (12 controls, 16 AD) and posterior cingulate cortex (13 controls, 9 AD). Microarray data were normalised with RMA in the Affymetrix Expression Console (v1.1). For each region, genetic signatures that discriminate control and AD samples were generated as described in [5]. Briefly, data were first preprocessed by discretization of the expression values using an implementation of Fayyad and Irani's algorithm [6], an entropybased heuristic. This was followed by a filtering step to discard probe sets that do not provide sufficient information to discriminate between the control and AD classes, based on the Minimum Description Length principle (reviewed in [123]). The matrix of discrete values returned after entropy filtering was then used to create an instance of the (a,b)-k-Feature Set problem (for details see [123]). The optimal solution to this problem was used as the genetic signature.     File S1 Clusters containing the progression markers (from the 1,372-probe AD signature data set).

(XLSX)
File S2 Clusters containing the progression markers (from the 941,885 ratio metafeatures data set). (XLSX) Figure 6. Demonstration of the modified MSTkNN algorithm. (a) An MSTp created from a data set with n=10 features/probe sets. Each edge is labeled with an integer value p, where the value of p is determined using a sorted list of nearest neighbors for each feature (see eq. (2)). The edge between F9 and F10 is a candidate for elimination, since it has a value of p > = 2 (b) Two connected components are identified and we apply the same procedure with the component that has more than three elements. (c) The final outcome of the clustering. doi:10.1371/journal.pone.0045535.g006 File S3 Clusters containing the progression markers (from the 3,763,403 ratio-sum-difference-product metafeatures data set).

(XLSX)
File S4 Implementation of the proposed clustering method. (PDF) Author Contributions