Visualising the Cross-Level Relationships between Pathological and Physiological Processes and Gene Expression: Analyses of Haematological Diseases

The understanding of pathological processes is based on the comparison between physiological and pathological conditions, and transcriptomic analysis has been extensively applied to various diseases for this purpose. However, the way in which the transcriptomic data of pathological cells relate to the transcriptomes of normal cellular counterparts has not been fully explored, and may provide new and unbiased insights into the mechanisms of these diseases. To achieve this, it is necessary to develop a method to simultaneously analyse components across different levels, namely genes, normal cells, and diseases. Here we propose a multidimensional method that visualises the cross-level relationships between these components at three different levels based on transcriptomic data of physiological and pathological processes, by adapting Canonical Correspondence Analysis, which was developed in ecology and sociology, to microarray data (CCA on Microarray data, CCAM). Using CCAM, we have analysed transcriptomes of haematological disorders and those of normal haematopoietic cell differentiation. First, by analysing leukaemia data, CCAM successfully visualised known relationships between leukaemia subtypes and cellular differentiation, and their characteristic genes, which confirmed the relevance of CCAM. Next, by analysing transcriptomes of myelodysplastic syndromes (MDS), we have shown that CCAM was effective in both generating and testing hypotheses. CCAM showed that among MDS patients, high-risk patients had transcriptomes that were more similar to those of both haematopoietic stem cells (HSC) and megakaryocyte-erythroid progenitors (MEP) than low-risk patients, and provided a prognostic model. Collectively, CCAM reveals hidden relationships between pathological and physiological processes and gene expression, providing meaningful clinical insights into haematological diseases, and these could not be revealed by other univariate and multivariate methods. Furthermore, CCAM was effective in identifying candidate genes that are correlated with cellular phenotypes of interest. We expect that CCAM will benefit a wide range of medical fields.


Introduction
In order to fully understand pathological processes in clinical settings at the genomic level, it is necessary to compare the transcriptomes of pathological processes in individual patients with those of the physiological processes of normal counterparts. Although transcriptomic analysis has been extensively applied to various diseases, the way in which the transcriptomic data of pathological cells relate to the transcriptomes of normal cellular counterparts has not been fully explored, and may provide new and unbiased insights into the mechanisms of these diseases. To achieve this, it is necessary to develop a method to simultaneously analyse components across different levels, namely genes and physiological and pathological processes (e.g. normal and abnormal cellular phenotypes). It is anticipated, if successful, this approach will reveal hidden relationships between pathogenesis, developmental mechanisms, and gene regulation.
Gene signature has been a most commonly employed approach to address this type of problem. A number of methods have been proposed to measure the degree of inclination towards a certain signature in individual disease samples: correlation coefficient to the average gene expression of the signature genes [1,2]; the median fold change [3]; the (weighted) sum of the expression levels of signature genes [4,5]. Sandberg et al developed a method using Singular Value Decomposition (SVD) to measure the similarities between cancer subtypes and cell lines [6], providing univariate scores for individual cell lines. Although these approaches are easy to deal with and can be understood intuitively, there is a pitfall when they are applied to compare disease and normal cellular phenotypes: it cannot be assumed that the two cell signatures to be analysed are independent from each other. For example lymphoid and myeloid signatures cannot be equally compared and therefore the analysed results of these gene signature scores should not be plotted on the same plot, as the relationship between these two signatures is unknown. This fundamental problem complicates comparisons between multiple gene expression signatures of different haematopoietic cell populations. Considering that haematopoietic cells are classified into tens of different populations by cell lineage and developmental stage, and that each cell population is closely related to others [7], those existing methods are apparently insufficient for obtaining an integral view. Therefore, multidimensional analysis is required to effectively address this problem.
Among multidimensional analysis methods, principal component analysis (PCA) is most commonly used to analyse the relationships between samples, although PCA is vulnerable to the addition of subtle phenotypes and aberrant samples, and is more suitable for visualising data structure [8,9]. In addition, although it may be more straightforward to estimate the identity of cells by directly comparing their transcriptomes with other microarray datasets using multivariate analysis, for example using Multidimensional Scaling [10,11], this is often not successful because large between-experimental variations can easily dominate relatively small differences in gene modification between experimental and control groups even with meta-analysis methods. Considering that the variations between malignant cells and normal cells are generally much bigger than between-group variations of different normal cell phenotypes, as often seen in the analysis using hierarchical clustering [12], a new multidimensional approach is required to make a direct comparison of malignant cell phenotypes and their corresponding normal counterparts.
Thus, in order to reveal the cross-level relationships between diseases, genes, and normal cells, we have adapted a multidimensional approach, Canonical Correspondence Analysis (CCA), to microarray data. Currently, CCA is widely used in ecology and social science, as it can simultaneously analyse two totally different types of data -one as response data and another as explanatory data, revealing the relationships between these two data [9]. CCA is a variant of Correspondence Analysis (CA), which has previously been employed to analyse a single microarray dataset, visualising the associations between samples (arrays) and genes in single datasets [13,14]. Baty et al reported a method using a variant of CCA for the analysis of microarray expression data with respect to binary response data [15]. As far as we know, the present study is the first to adapt CCA so as to simultaneously analyse two microarray data, which is designated as CCA on Microarray data (CCAM).
In order to examine the validity and efficiency of our method, we have analysed two haematological disorders: leukaemias and myelodysplastic syndromes (MDS). Table 1 is the summary of microarray datasets used in this study.
Haematological disorders are classified and understood by referring to normal haematopoietic cell differentiation. Leukaemias are classified on the basis of the cell type involved and the state of maturity of the leukaemic cells, and categorized into major four groups: acute lymphoblastic leukaemia (ALL), acute myeloid leukaemia (AML), chronic lymphocytic leukaemia (CLL), and chronic myelogenous leukaemia (CML) [16]. The classification of leukaemias has been further developed by assigning leukaemic cells to normal haematopoietic cell counterparts based on morphology, cytochemistry, immunophenotype, genetics and clinical features, so as to define clinically significant disease entities [17,18]. This framework is based on the well-known hypothesis that the genetic lesions of leukaemia result in a block of differentiation (maturation arrest) that allows leukaemic cells to continue to proliferate and/or prevents the terminal differentiation and apoptosis seen in normal white blood cells [19].
MDS are a group of clonal haematopoetic disorders marked by ineffective haematopoiesis, peripheral cytopenias, and an increased risk of transformation to AML [20]. MDS have been classified into subgroups, and individual patients are scored, in order to predict prognosis, especially for assessing the risk of leukaemic transformation. The International Prognostic Score System (IPSS) for MDS is composed of three factors: blasts in bone marrow (BM), karyotype, and cytopenia, and higher scores are associated with poorer prognosis [21]. The World Health Organization (WHO) classification of MDS is based on morphologic evaluation of bone marrow cells and genetic abnormalities, and classifies MDS into 6 major subtypes: refractory anaemia (RA, or Refractory cytopenia with unilineage dysplasia (RCUD)), refractory anaemia with ring sideroblasts (RARS), and refractory cytopenia with multi-lineage dysplasia (RCMD), and 5q-syndrome (MDS associated with isolated del(5q)), and refractory anaemia with excess blasts (RAEB-1 [blasts v5%] and RAEB-2 [5{19% blasts]) [22,23]. Blast percentage of more than 20% is defined as AML, and reasonably, RCMD and RA show better prognosis with longer leukaemia-free survival than RAEB-1 and RAEB-2 [24,25].
Genome-wide gene expression analysis (transcriptomic analysis) has been extensively used for improved understanding of the diagnosis, prognosis, and pathogenesis of these haematological diseases [22,26]. In these transcriptomic studies, gene expression signature (or, gene expression profiles [GEP]) has been most commonly used to classify haematologic diseases and predict prognosis [26]. Gene expression signature is typically composed of tens to hundreds of genes, so that all these genes stably contribute to classify samples in cross-institutional settings [4,27]. Hierarchical clustering is most often employed in analyses using gene expression signatures to classify samples into disease subtypes [28].

Analysis 1: Leukaemia
Based on the assumption that leukaemia is classified by referring to normal haematopoietic cell differentiation, we attempted to analyse using transcriptomic data the relationships between leukaemia disease samples and normal haematopoietic lineage cells. We aimed in this analysis to determine the transcriptomic identities of individual leukaemia patients by analysing a transcriptomic dataset of leukaemia (GSE13159 [29]) and that of haematopoietic cell differentiation (GSE24759 [30]). As this is the first exemplary analysis using a univariate approach by gene signature, we also show why we need to introduce a multidimensional method. In the subsequent section, we demonstrate how CCAM is applied to microarray datasets, and examine the validity of the method by addressing haematologically well-known relationships between pathological and physiological processes.
A univariate approach using gene expression signature. First, we employed a univariate approach to address this problem, using provisional gene signatures of haematopoietic cell populations (see Methods). Here, we aim to score individual disease samples by the degree of maturation into each cell population. Given that some haematopoietic cell populations may be too similar to each other to provide meaningful results that discriminate disease samples, hierarchical clustering was used to cluster haematopoietic cell gene signatures based on their correlations to individual disease samples (Fig. S1). Based on this clustering, we chose four distinct (classified in different groups) gene signatures from relatively immature cells (proxy to haematopoietic stem cells [30]. We analysed the distribution of gene signature scores of disease samples for these cell populations. As shown in Fig. S2, CLL showed high correlations with the signature of Mat-B (Fig. S2a), while CML and AML showed higher correlations with those of NM and GMP ( Fig. S2b and S2c). Generally, ALL showed higher correlations with the signature of Pro-B (Fig. S2d). These results seemed haematologically appropriate, considering the immunophenotype of these leukaemia subtypes [31][32][33][34]. It was, however, unclear how these four results in Fig. S2 were related.
Cross-level relationships between leukaemia subtypes, haematopoietic cell differentiation, and genes by CCAM. The results above indicated that it was necessary to simultaneously analyse components at three different levels: genes, normal haematopoietic cells, and individual disease samples. To achieve this, we have developed a new multidimensional and canonical analysis of two microarray datasets by adapting Canonical Correspondence Analysis, which was developed in ecology and sociology, to microarray analysis (we designate the method as CCA on Microarray data, CCAM) (see Methods, Fig. 1a). Briefly, in our application of CCAM, pathological data (disease data) are treated as response data, and physiological data (normal haematopoietic cell differentiation) are used as explanatory variables (environmental variables), and thereby we aim to reveal the relationships between gene expression and pathological and physiological processes. Assuming that leukaemias are classified by referring to normal haematopoietic cell differentiation, CCAM is expected to assign individual disease samples to most correlated normal counterparts. We used the four representative haematopoietic lineage cells that were analysed in the gene signature approach in Fig. S2 (GMP, NM, Pro-B, and Mat-B).
We have employed a map approach in our method in order to avoid the pitfalls of simultaneously analysing the complex relationships between components at three different levels [9]. CCAM provides a map that shows the correlations between genes, normal haematopoietic cells, and disease samples. In other words, the more correlated, the nearer components are positioned on the map [9]. Fig. 1b shows all the components that were analysed at all the levels (gene, normal cell, and disease). On the map, CLL showed high correlations with Mat-B, and not with Pro-B and myeloid cells, compatible with the fact that the phenotype of CLL is similar to antigen-experienced B cells rather than immature B cells [31,34] (Fig. 1b and 1c). Although the number of samples is small, mature B cell-ALL (mature B-ALL) showed a clear correlation with CLL and Mat-B [33] (Fig. 1d), which is also a reasonable result. On the other hand, ALL (excluding T cell-ALL [T-ALL] and mature B-ALL) showed higher correlations with the signature of Pro-B, which is consistent with the immunophenotype of non T cell, immature B cell ALL [32,33]. CML and AML showed higher correlations with the signatures of NM and GMP, and comparing with AML, CML was more distinct from lymphocytic leukaemias (ALL and CLL) and deviated more to the direction to which NM and GMP were correlated ( Fig. 1b-1e), confirming a more differentiated granulocytic phenotype of CML than AML. T-ALL was distinct from other ALL, and positioned between B-ALL and AML ( Fig. 1d-1e).
By analysing CCA triplot at the gene level, B cell-and Bleukaemia-related genes have high (positive) scores in axis 1, while genes related to myeloid cell differentiation and myeloid leukaemia have low (negative) scores (Fig. 1b). Myeloid genes such as MMP8 and CD33 are in quadrant I (Axis1 lo Axis2 hi ), which is correlated with myeloid lineage NM and GMP. Genes related to nave or immature B cell (e.g. POU2AF1, CD19, ID3, VPREB3, RAG1) are apparently enriched in quadrant III (Axis1 hi Axis2 lo ), which is correlated with Pro-B. Genes related to mature, antigen-experienced B cells (e.g. CD40, CD86) are found in quadrant II (Axis1 hi Axis2 hi ), which is correlated with Mat-B. The associations of components at the three different levels could be observed in this analysis. For example, in quadrant II, CCL and mature B cells are correlated with FCER2 (CD23, FcReII), CD180 (RP105), and CXCR5 (Fig. 1b). In fact, increased expression of these genes is characteristic in CLL and also associated with maturation of B cells [35][36][37][38][39]. Interestingly, quadrant IV (Axis1 lo Axis2 lo ), which is not annotated by haematopoietic cells but correlated with AML and T-ALL, includes RUNX1, ERG, and MYB, which have wellestablished roles in AML and early haematopoietic differentiation including myeloid and T-lymphocyte lineages [40][41][42][43] (Fig. 1b).
Thus, the map analysis in Fig. 1 can be summarised as follows: Axis 1 represents''myeloid cells vs. B lymphocyte'', while Axis 2 represents ''immature vs. mature cells''. Individual leukaemia samples and gene expression were successfully characterised on this map. The analysis of variation (precisely, inertia [9]; see Methods) showed that Axis 1, 2, and 3 comprised 68%, 17%, 11% of variations, respectively. This means that the leukaemia data that was interpretable by the haematopoietic cell data was mostly visualised (85% and 97% of the information in the constrained data in Fig. 1d and 1e, respectively), and that the difference between myeloid and lymphocytic lineages dominated that of the maturity of cells in this dataset.
Next, we further analysed the phenotypes of AML subtypes by CCAM (Fig. S3). We included in the analysis the cell populations of the myeloid lineage that are relevant in AML, namely, Common myeloid progenitor (CMP), Colony forming unitmonocyte (CFU-M), Neutrophilic metamyelocyte, and mature Neutrophils. CCAM classified AML subtypes with the features of the granulocytic and monocytic lineages (Fig. S3). CCAM showed that the subtypes AML with 11q23/MLL and AML with inv(16)/ t(16;16) were more associated with CFU-Monocyte than other granulocyte lineage cells, which is consistent with the facts that these subtypes are morphologically more correlated with the   [44,45]: a study showed that a majority (81%) of AML with 11q23/MLL showed an involvement of the monocytic lineage [44]; AML with inv(16)/t(16;16) has the fusion gene CBFb/MYH11, and is morphologically associated with the French-American-British (FAB) AML-M4 subtype (acute myelomonocytic leukaemia with an abnormal eosinophil component) [45]. In addition, CCAM showed that the subtypes AML with t(15;17) and AML with t(8;21) were more related to Neutrophil, which is consistent with their morphological associations with the granulocytic lineage: AML with t(15;17) has the PML-RARA fusion gene and corresponds to the FAB M3 subtype (acute promyeolocytic leukaemia) [46,47]; AML with t(8;21) has the fusion gene AML1(RUNX1)/ETO and corresponds to the AML-M2 (acute myeloid leukaemia with maturation) [48]. See File S1 and Fig. S4 for the further analysis of these two datasets.

Analysis 2: Myelodysplastic Syndromes (MDS)
The analyses above showed that CCAM successfully revealed known relationships between leukaemia, haematopoietic cell differentiation, and genes in a concise and transparent way. In this section, we examined whether the approach was effective in generating and testing hypotheses, and questioned whether this method could provide meaningful insights into clinical problems. Accordingly, we analysed two independent MDS datasets. The first analysis was carried out in order to generate a new hypothesis. We then tested the hypothesis by analysis of another independent dataset.
Analysis for hypothesis-generation: comparison of MDS and normal bone marrow (BM). We analysed the transcriptomic data by Sternberg et al (GSE2779 [49]) along with that of haematopoietic cell differentiation (GSE24759). Sternberg et al showed that CD34 z progenitor cells from normal-karyotype, lowblast-count MDS patients consistently showed decreased expression of B-cell lineage-affiliated genes [49]. We attempted to confirm this result, while obtaining a bigger picture using not only the data of immature B cells (Pro-B) but also those of other progenitors including MEP (megakaryocyte-erythroid progenitor), GMP, and CMP, and haematopoietic stem cells (HSC). In order to find the unique features of MDS BM using the relatively small number of disease samples analysed in this dataset, we filtered genes using the disease data.
CCA triplot showed that axis 1, which was composed of the largest variation in the dataset, is primarily represented by the difference between MDS and normal BM as well as that between HSC/MEP and CMP/Pro-B. As Sternberg et al reported, MDS samples had negative correlations with Pro-B (Fig. 2a). In addition, CCA triplot showed that MDS samples had positive correlations with HSC and MEP. Although the number of samples is small, non-MDS anaemia samples were in the middle of MDS and normal BM in axis1 (Fig. 2b). Interestingly, KIT and NPM1, the mutations in which are suggested to play roles in leukaemic transformation [50,51], were correlated with MDS and HSC/ MEP. B cell-related genes including POU2AF1, PAX5, and CD19, were associated with normal BM as reported [49] (Fig. 2a).
Examination of hypothesis: BMs from high-risk MDS patients showed the deviation towards HSC/MEP at transcriptomic level. Based on these findings, we generated a hypothesis that MDS patients with higher correlations with both HSC and MEP (and negative correlation with Pro-B) had a higher risk for leukaemic transformation. To test this hypothesis, we have applied CCAM to another independent dataset of MDS that is composed of only MDS patients (without normal), and analysed the results in conjunction with the clinical data (GSE15061 [22]).
CCAM created a new scoring system that has a prognostic value and biological relevance in haematopoietic cell development. The results in Fig. 3 suggest that a positive association with HSC/MEP and a negative association with CMP has a prognostic value. Thus, using top ranked genes (top 100 and bottom 100 genes by the wa score of CCAM, see Method), we analysed the relationships between individual patients and their associations with the four haematopoietic cell populations, and thereby established a scoring system for MDS patients (designated as the HSC-CMP score). As expected, genes with high HSC-CMP scores were specific to HSC/MEP, while those with low scores were specific to CMP/GMP (Fig. S6). MDS patients were stratified into two or three groups by the HSC-CMP score, and two to four groups by well-established prognostic scores and the disease categories in the WHO classification. Kaplan-Meier survival analysis showed that the HSC-CMP score had prognostic values for overall survival: patients with scores above the 50th percentile showed worse survival ( Fig. 4a pv0:02) and those above the 95th percentile had the worst prognosis by log-rank test (Fig. 4b pv2|10 {4 ). While the IPSS and cytopenia scores showed p-values just above the significant level(p~0:054 and p~0:055, respectively), disease categories showed significant difference between patient groups (Fig. 4g,  pv0:002). Similarly, Kaplan-Meier survival analysis for time to AML transformation showed that patients who scored above the 50th percentile had worse prognosis (Fig. 5a, pv0:02), while the stratification of patients into three groups by the HSC-CMP score was less significant (Fig. 5b, p~0:054). Regarding the AML transformation, reasonably, blast score and disease categories showed the lowest p-values ( Fig. 5d and 5g, pv1|10 {5 and pv1|10 {4 , respectively).
To further address the significance of the HSC-CMP score, we employed a Cox proportional hazard regression analysis for overall survival and time to AML transformation. The univariate Cox analysis showed that the HSC-CMP score and the IPSS score were significant both for time to AML transformation (pv0:02 and pv0:002, respectively) and for overall survival (pv0:001 and pv0:05, respectively, Table 2). Cytopenia and Blast scores were also significant for time to AML transformation (pv0:05 and pv0:0005, respectively). Next, we performed a multivariate Cox regression analysis with the IPSS score and the HSC-CMP score (three stratified groups), which showed that the HSC-CMP score remained significant for overall survival but the IPSS score did not (pv0:01, hazard ratio [HR]~2:11; and p~0:90, HR~1:02; respectively), while the IPSS score remained significant for time to AML transformation but the HSC-CMP score did not (pv0:05, hazard ratio [HR]~1:43; and p~0:50, HR~1:78; respectively). Lastly, we performed a multivariate Cox regression analysis with the HSC-CMP score, Cytopenia score, Blast score, and Karyotype score. The analysis for time to AML transformation did not show any statistically significant results except that Blast score was significant (pv0:01,HR~2:52). The analysis for overall survival showed that the HSC-CMP score had the largest impact (pv0:005, HR~2:29), and other scores failed to show significance ( Table 2). These results suggest that although the HSC-CMP score was not independent from the IPSS and other scores, it was a dominant prognostic factor for overall survival.

Discussion
CCAM has provided novel insights into the cross-level relationships between gene expression and pathological and physiological processes, which could not be obtained by the analysis at each single level. Importantly, many medical problems require the analysis of disease samples in the context of some particular biological processes (e.g. cell differentiation), which are most often multidimensional in nature and are often not straightforward. CCAM has effectively solved this type of problems by analysing two independent transcriptomic data. Visualisation of the analysed results has made it transparent which cell populations are being compared for the relationship with disease samples. In addition, CCAM allows the exploration of novel molecular mechanisms that are highly associated with particular cell and/or disease. For example, in Fig. 1, CCAM has identified known genes that had roles in haematopoietic cell differentiation and leukaemia. This result suggested that other genes that are associated with (in juxtaposition to) these known genes and with particular diseases and cells are reasonably good candidates for undefined molecular mechanisms of these diseases and cells, considering the nature of the underlying algorithm, Correspondence Analysis [9,14]. Thus, CCAM with the map approach is useful for generating hypotheses with least assumptions, and is expected to lead to hypothesis-driven studies. In addition, we used CCAM effectively to test a hypothesis on the transcriptomic characteristics of MDS patients. Furthermore, the proposed approach can identify individuals with worse clinical outcomes, and infer the mechanisms underlying poor prognosis, as shown by the analysis of MDS. The clinical utility of this approach is thus demonstrated.
New biological insights from CCAM: the positive correlation of transcriptomes of high risk MDS patients and those of HSC/MEP (and the negative correlation of MDS with CMP) Consistent with a previous report [49], MDS samples showed decreased or negative correlations with the process of early B cell differentiation compared with healthy controls. In adddition, our analysis has revealed that MDS samples are more correlated with the processes of both HSC maintenance and MEP differentiation. Interestingly, while the process of early B cell differentiation did not show correlation with the severe phenotype of MDS, the transcriptomic statuses of HSC and MEP showed remarkable correlations with it (Fig. 3). The leukaemic transition of MDS is widely known to be associated with the immaturity (the acquisition of stem-ness [HSC]) [22,23], while the association between MDS and MEP (in comparison with myeloid/lymphoid differentiation) has not been recognised. This is the power of the new approach: to simultaneously and fairly consider multiple phenotypes, providing an integral view on the system. As RNA for hybridization was extracted from unsorted, mononuclear BM cells from MDS patients in this dataset (GSE15061) [22], the result should be interpreted considering possible compensatory mechanisms in BM by non-MDS cells. It is noticed, however, that another dataset, GSE2779, analysed purified CD34 z progenitor cells, resulting in a similar conclusion. In addition, our results showed that RARS BM, which characteristically show hyperplastic ineffective erythropoiesis [52], did not have positive correlations with MEP and HSC, and that 5qsyndrome, which typically shows normal to increased megakaryocytes [23], did not show positive correlations with MEP. Rather, 5q-syndrome and RARS showed lower values in Axis 1, and were more correlated with CMP (Fig. 3f). Given that the largest variation can be observed in axis 1 of CCAM results, these results  indicate that the distinct feature of BM from severe MDS patients dominated the variations in MDS including those compensatory mechanism and secondary responses. Interestingly, a recent report showed that decreased expression of erythroid-specific genes was correlated with the responsiveness to the thalidomide derivative lenalidomide in patients with 5q-syndrome, which is the most homogenous subtype of MDS [53]. Our results, along with the results of this study, suggest that MDS with higher correlations  with the erythroid lineage (and HSC) are more difficult to treat with lenalidomide. Importantly, the HSC-CMP score showed a prognostic value for overall survival (Fig. 4-5 and Table 2). Given that the HSC-CMP score was correlated with the characteristic gene expression in HSC/MEP and CMP/GMP (Table S1 and Fig. S6), those genes can be investigated to elucidate the molecular mechanisms of MDS pathogenesis in relation to the cellular and differentiation processes of these haematopoietic cell populations. These findings above suggests that disease progression in MDS is accelerated by the aberrant utilisation of the molecular mechanisms of erythroid/ megakaryocyte (MEP) differentiation and stemness (HSC) and the loss of the mechanisms of CMP and/or GMP. The investigation of these processes may provide clues to identify new therapeutic targets that improve overall survival. Thus, CCAM provides biologically effective and meaningful solutions because it analyses simultaneously and cohesively two different phenotypic levels -in this case, MDS pathology and normal haematopoietic cell development.
Our analyses using transcriptomic data obtained from batch samples have revealed the transcriptomic identities of the phenotype of dominant cells or 'average' cells in the context of normal haematopoietic cell differentiation. It is known that variations can occur within individual cancers, in which the cancer cells often have a range of functional properties and diverse expression of markers [54]. In addition, it is thought that leukaemia has a hierarchical organization similar to that of normal haematopoiesis in which there is a rare subpopulation with limitless self-renewal potential (leukaemic stem cells) that gives rise to progeny that lack such potential [55]. Considering this, in conjunction with the use of our method, gene expression analysis at the single cell level will be the key to reveal further relationships between normal cells, cells of origin, and leukaemia stem cells. CCAM provides the framework to analyse this type of data.

Technical considerations on the use of CCAM
CCAM does not produce ready-to-go results, but provides a platform where the existing hypotheses are examined and new hypotheses are formed and generated. Depending on how explanatory variables are set and used, CCAM can be used for exploratory purposes in a data-oriented way (c.f. Fig. 1) or for examining the original hypothesis (c.f. Fig. 3). The map approach enables the comparisons of more than two variables, while the regression process in CCAM allows the analysis across two different experiments. These are advantages of CCAM but can mislead the analysis if inappropriately used. The users should be aware of the following two points. First, because the underlying algorythm, CCA, uses multiple regression, one needs to avoid the pitfalls of multiple regression when choosing explanatory variables: the number of explanatory variables should be less that that of samples, in order not to overfit the data to explanatory variables; and the interpretability of the results is directly dependent on the choice and quality of the explanatory variables [56]. Second, when the final result of CCAM has only a very small part of the original ''information'' (i.e. %Explained is very low [e.g. v5%], see Methods), interpretation should be cautious. Although the absolute value of %Explained does not reflect the biological relevance, if %Explained is comparable between analyses (see Methods), the results with larger %Explained values may be biologically more reasonable and straightforward.
Importantly, the effective use of CCAM requires deep knowledge of both biology/medical science and multidimensional analysis. Because CCAM can help the process of hypothesis generation and testing, this method is best performed when biologists/medical scientists actively participate in the analysis. Readers with biological backgrounds are encouraged to understand the procedures of CCAM in Fig. 6, and those with bioinformatics/statistical knowledge can refer Fig. S7 for the theoretical background of CCAM, and use an R script of CCAM in File S1. If successfully applied, CCAM will allow one to address more complex questions which could not be done by conventional methodologies. The use of CCAM and related multidimensional methods will extend the power of experimental technologies, as it has benefited the field of ecology [56,57]. Table 3 summarises the features of CCAM in comparison with univariate approaches including gene signature and multivariate/multidimensional approaches including PCA and CA.
Notably, CCAM has clarified in the analysis exactly what is compared and analysed, with the power of the map approach. Although it might be obvious to haematologists that HSC is correlated with the progression of MDS, CCAM revealed this without prior knowledge, and more importantly, showed that this statement was true only when patient samples were compared with not only HSC but also some more differentiated cells. CCAM identified that CMP was an appropriate cell population to be compared with HSC (c.f. Fig. 3). In addition, CCAM showed that MEP was correlated with MDS patients with worse prognosis at an equivalent level to HSC (Fig. 2-3). These led to the establishment of the HSC-CMP score, which had a prognostic value and biological relevance in haematopoietic cell development ( Fig. 4-5, Table 2, Table S1). Such a complicated comparison of multiple phenotypes is important for a deep understanding of the system, but this is generally difficult by conventional approaches. The power of CCAM lies in its ability to deal with multiple phenotypic data based on quantitative measurements (i.e. gene expression).

Conclusions
CCAM provides a practical solution for comparing multiple groups of samples with multiple cellular phenotypes. In addition, CCAM reveals hidden relationships between pathological and physiological processes and gene expression, providing novel clinical insights into haematological diseases. In fact, CCAM provided new insights such as the correlation of the severe phenotype of MDS and MEP, in addition to the known correlation with HSC. Furthermore, CCAM can be effectively used for exploring the genes that are correlated with cellular phenotypes of interest using the map approach (c.f. Fig. 1).
The strength of CCAM is its ability to explore datasets for hypotheses-generation using the map approach. Once a hypothesis is generated, we can test the hypothesis by this method using other datasets (c.f. Fig. 3) or by other methods such as statistical comparison of a few selected groups. The map-based approach is popular in social science and ecology [9], because in these research fields it is mandatory to analyse the relationships between many different, but closely associated qualitative data (e.g. different human attitudes [9,58], species, and the analysis without Dataset to be explained (e.g. disease), "Main data" Dataset to explain (e.g. normal cells) "Constrained data" CCA decomposes constrained data into "uncorrelated" components

Explanatory variables
The intrepretable part of main data  visualisation can be misleading [9,59]). In fact, qualitative data (e.g. disease classification, cell subtype) are becoming more and more complex in modern medicine with the emergence of genomics, and we believe that this type of approach is essential and should be incorporated into medical genomics. In addition, considering the generality of the method, the proposed approach can be applied to common problems in broad fields of molecular biology.

Canonical Correspondence Analysis on Microarray data (CCAM)
Conceptual definition of CCAM. CCA is a variant of Correspondence Analysis (CA), which is a key method in sociology, and is developed in ecology for identifying the relationships between ecological data at three different levels [9,59]. We have chosen CCA because it can analyse the rows and columns of two transcriptomic data that are derived from two independent experiments with totally different experimental designs and materials. When these data are analysed by conventional approaches, the distance between different experiments (between-experiment variations) dominates important biological effects in the datasets, and compromises the direct comparison of leukaemic cells and normal haematopoietic cells. On the other hand, CCA looks at the intersection of the two completely different datasets (but obtained for the same genes) and thus avoids this problem. CCA in our method analyses the relationships between genes and disease samples in the context of haematopoietic cell differentiation (corresponding to geological locations, species, and environmental variables, respectively, by ter Braak [59]; gene expression levels correspond to frequencies). Note that in CCAM, CCA is applied to a matrix of data with genes in rows (observations, equivalent to geological locations by ter Braak) and cellular phenotypes in columns (variables, equivalent to species and environmental variables by ter Braak).
Briefly, first, CCAM projects the dataset of disease samples onto the data of haematopoietic cell differentiation, which are averaged for each cell population using scaled data with an average of zero and standard deviation of one for each gene (Fig. S7). Importantly, the gene expression data of haematopoietic cells represent the environmental variables that define the phenotypes of each cell population. Thus, CCAM analyses the interpretable part of the original disease dataset by haematopoietic cell data [9,59]. Mostly the interpretable part (%Explained, see below) was 10-20% of the original data in the presented analyses [9]. Next, CCAM finds new axes by assigning numerical values to samples and genes so that the dispersion of samples is maximized [59].
Instructions on the practical usage of CCAM. Fig. 6a depicts how to use CCAM. (1) Prepare datasets. ''Dataset to be explained'' will be the data that analysed more ambiguous materials (e.g. disease samples) and are of most interest in the analysis. ''Dataset to explain'' will provide explanatory (environmental) variables, and be the one that analysed well-characterised materials (e.g. normal cells). (2) Choose genes using ''dataset to explain'' by some statistical method (e.g. a moderated t-test). Besides, prepare explanatory variables from ''Dataset to explain'' by taking the average for each group. Carefully choose explanatory variables by both biologically thinking and statistically examining: identify and exclude the variables that have high correlations with others. (3) Perform CCA using ''Dataset to be explained'' and explanatory variables. CCA regresses the former on the latter and thereby identifies the interpretable part of ''Dataset to be explained'' (designated as ''Constrained data''). Subsequently CCA perfomes singular value decomposition of Constrained data and obtain ''uncorrelated'' axes (components). The axes are ordered by variation (eigenvalue), and first axes have largest eigenvalues. (4) Visualise the result of CCA using the first axes, and perform map analysis. Use two-dimensional map analysis before using three-dimensional plot (c.f. Fig. 1). First, interpret axes using a deep knowledge on the biological system and the experimental settings, although axes are not always interpretable [9]. It is generally helpful to analyse the relationships between axes and explanatory variables. Second, identify key sample clusters. Note that correlated elements gather towards the same direction from the origin. Strictly speaking, the size of the sample space is different from that of the gene space [9], therefore, do not use the Euclidean distance to measure the similarities between elements across different levels (e.g. cell samples and genes), but use angular distance instead to analyse the relationships between them. Third, explore key genes that are correlated with key clusters of samples and explanatory variables. (5) Lastly, design the next analysis. Optimise the combination of explanatory variables by performing CCAM using the same genes, and comparing %Explained of the analyses with different combinations of variables. It is also important to consider to remove unnecessary samples and/or genes, which can blunt the important difference between elements. The outputs of CCAM can be used as a scoring system. Fig. 6b summarises how the variation in ''Dataset to be explained'' is decomposed and retained in the result of CCAM. In the analysis using CA and CCA, the variation in data is measured by inertia, which plays the same role as the total variance in PCA. Technically, inertia is the sum of total Pearson x 2 divided by the

The analysis of variations in CCAM: decomposition of inertia
Although the methods can be used for the application without ''ticks'' in some limited situations, the maximal productivity may be obtained by those with ''ticks''. 1) Includes various methods for the signature approach. Commonly analysed using clustering methods. doi:10.1371/journal.pone.0053544.t003 total sum [9]. Total inertia, I T , is decomposed into two parts, constrained inertia, I C , and unconstrained inertia (I T {I C ). %Explained~I C =I T defines how much of the information in the original data is retained at this stage, and is useful for addressing the relevance of the findings by CCAM in terms of variation. Next, CCAM performs singular value decomposition, and constrained inertia is decomposed and distributed into new axes, I C~a1 ,a 2 ,a 3 .... %Visualised is defined as a ratio of (The sum of eigenvalues in the visualised space)=(Constrained inertia), and is useful to determine how many dimensions should be used. %Explained is comparable between the results from different combinations of explanatory variables, when the same genes are used and the main data matrix is the same. Algebraic definition of CCAM. Suppose that the normalised gene expression of reference cell subsets (environmental variables) is F~½d 1 ,:::,d q [< l|q , and the microarray data Z[< l|p . We assume that F is standardized using r as weights for rows in the calculation of means and variances (see the constrained conditions of g in the previous subsection). CA standardizes Z in the x 2metric, and subsequently performs singular value decomposition (SVD) as in the following steps ( Fig. S7): (1) Using n, D r~d iag(r) and D c~d iag(c) (see above), Z can be standardized in the x 2metric, S~D  Suppose thatd d n (n~1,::,q) is standardized, the standardized regression coefficient b Ã~c ov(d d n ,u j )(n~1,::,q; j~1,:::,K), which is used for displaying environmental variables in triplots. (7) Using r~(r 1 ,:::,r l ) and c~(c 1 ,:::,c p ), total inertia of the original gene %Explained~P k a 2 k w 2 is an estimate of how much of the original information is retained in the solution.

Microarray data and processing
We have used GSE24759 for haematopoietic cell differentiation [30], GSE2779 and GSE15061 for MDS data (Table 1) [22,49], GSE13159 for leukaemia data [29]. Microarray data were normalised by rma of the Bioconductor package, affy. Crossplatform comparisons were made by a commonly employed algorithm [60]. Batch-effects within each dataset, if any, were adjusted by an established approach using empirical Bayes methods [61]. Data were further normalised across genes. PCA was done by the function dudi.pca of the CRAN package ade4.
We used the CRAN package, vegan, for the computation of CCA [62]. An R script for CCAM is available as File S1. Microarray data of diseases and those of haematopoietic cell populations (environmental variables) were seprately normalised, as CCA analyses the data that were projected onto environmental variables, and normalisation of data of diseases and environmental variables did not have impacts on the results of CCA (data not shown). For sample scores, we employed wa score, and biplot arrows are based on weighted correlation of sample scores and haematopoietic cell differentiation [62,63]. Genes were filtered by the same method used for gene signature, using haematopoietic cell data only, unless indicated. Thus, these analysis results are unsupervised for disease samples and do not use clinical information and sample identities. Sample scores of CCA results were first analysed by Kruskal-Wallis test for comparisons of multiple groups. Mann-Whitney U-test was used for the comparison of two groups. P-value was adjusted by the Bonferroni correction. Three-dimensional plots in Fig. 1e and Fig. S6 were produced by the R packages, rgl and lattice, respectively.
Provisional gene signatures were identified by a moderated, empirical Bayes t-statistic implemented in the Bioconductor package limma [64]. The top ranked genes (n~500) in comparison with all other cell populations by topTable of limma were designated as the provisional signature genes of each cellular population. Gene signature score was derived by analysing the averaged expression of signature genes and the gene expression of each disease sample by Pearson's correlation coefficients. Obtained scores were further scaled so as to have an average of zero and standard deviation of one, because the primary interest of the analysis is to compare between disease samples.

Statistical analysis of survival data
Log-rank test was used for analysing the survival data of the stratified groups of patients. Cox proportional hazard models were used for analysis of survival data. The CRAN package, survival, was used for these computations. Figure S1 Hierarchical clustering of gene expression signatures of haematopoietic cells by correlation to a leukaemia disease dataset. All the haematopoietic cell populations in GSE24759 were analysed. A gene signature was identified for each haematopoietic cell population, and gene signature score was calculated as a correlation between each signature and individual leukaemia sample, and shown by a heatmap. Yellow indicates positive correlation, and red indicates negative ones. Identified clusters of haematopoietic cell signatures are shown by colours of cell population names and dendrograms. Rows are individual patients. The disease categories are shown by colours on the left side of the heatmap. This analysis did not aim to classify disease samples, therefore reasonably, hierarchical clustering of rows did not classify disease samples. Abbreviations: CFU, colony forming unit; HSC, haematopoietic stem cell; CMP, common myeloid progenitor; GMP, granulocyte monocyte progenitor; MEP, megakaryocyte-erythroid progenitor. It is not sound to use too many explanatory variables in CCAM, as CCAM regresses main data on explanatory variables, which is one of the major limitations of this method. Thus, in order to analyse the differentiation pathways of many normal haematopoietic cell populations in terms of the features of leukaemias, haematopoietic cell data has to be treated as main data, using leukaemic data as explanatory variables (note that this is the opposite way of Fig. 1). Map analysis indicates that the largest difference was observed between myeloid and lymphoid lineages (axis 1, a), and the second largest difference was between ALL and CML (axis 2, a). CCAM shows that (1) acute leukaemias (AML and ALL) are more associated with immature haematopoietic stem cells; (2) AML is more associated with immature myeloid lineage cells including CMP than differentiated myeloid cells; (3) CML is correlated with relatively differentiated monocytic and granulocytic lineage cells (especially, neutrophilic metamyelocyte and CFU-monocyte); (4) CLL is correlated with naive and mature B cells [31,34]. (b-c). The results were compatible with and confirm at the transcriptomic level the common understandings on leukaemias and haematopoietic cell differentiation.  Figure S5 Comparison of the proposed method using CCA and a conventional multivariate analysis, principal component analysis (PCA). The CCA results are from the analysis in Fig. 3. (EPS) Figure S6 Three-dimensional plot of the HSC-CMP score, gene expression (average expression in each cell population), and four haematopoietic populations used for making the HSC-CMP score. CCAM assigned scores to both MDS disease samples and these genes. High scores were associated with MDS patients with poor prognosis (c.f. Fig. 4 and 5), while, at the gene level, genes with high scores showed higher expressions in HSC and MEP than GMP and CMP. On the other hand, genes with low scores are more highly expressed in CMP and GMP than HSC and MEP. See Table S1 for the relative contributions of those four haematopoietic populations to the HSC-CMP score. (EPS) Figure S7 Graphical representation of the proposed method. Microarray data of undefined cells are standardized and designated as S, and those of well-characterized cell subsets are preanalysed by PCA and Canonical Variate Criterion to produce F. Generally, the number of environmental variable q is much smaller than that of samples p and genes l, thus the number of new axes by singular value decomposition (SVD) is q.

(EPS)
Table S1 Excel file of the list of genes that were used for constructing the HSC-CMP score, the gene scores of the CCAM result, and the biological/haematological features of these genes.