Meta-Analysis of Microarray Studies Reveals a Novel Hematopoietic Progenitor Cell Signature and Demonstrates Feasibility of Inter-Platform Data Integration

Microarray-based studies of global gene expression (GE) have resulted in a large amount of data that can be mined for further insights into disease and physiology. Meta-analysis of these data is hampered by technical limitations due to many different platforms, gene annotations and probes used in different studies. We tested the feasibility of conducting a meta-analysis of GE studies to determine a transcriptional signature of hematopoietic progenitor and stem cells. Data from studies that used normal bone marrow-derived hematopoietic progenitors was integrated using both RefSeq and UniGene identifiers. We observed that in spite of variability introduced by experimental conditions and different microarray platforms, our meta-analytical approach can distinguish biologically distinct normal tissues by clustering them based on their cell of origin. When studied in terms of disease states, GE studies of leukemias and myelodysplasia progenitors tend to cluster with normal progenitors and remain distinct from other normal tissues, further validating the discriminatory power of this meta-analysis. Furthermore, analysis of 57 normal hematopoietic stem and progenitor cell GE samples was used to determine a gene expression signature characteristic of these cells. Genes that were most uniformly expressed in progenitors and at the same time differentially expressed when compared to other normal tissues were found to be involved in important biological processes such as cell cycle regulation and hematopoiesis. Validation studies using a different microarray platform demonstrated the enrichment of several genes such as SMARCE, Septin 6 and others not previously implicated in hematopoiesis. Most interestingly, alpha-integrin, the only common stemness gene discovered in a recent comparative murine analysis (Science 302(5644):393) was also enriched in our dataset, demonstrating the usefulness of this analytical approach.


Introduction
Microarray-based studies of global gene expression have led to dramatic advances in our understanding of various biological processes. This technology has become one of the most rapidly growing investigational methods in medical research and numerous studies have been completed using this method. There are many available platforms [1] for microarray analysis, and newer technologies and better gene annotations have led to constant refinement of these platforms. This has resulted in a large amount of data in public repositories, like the Gene Expression Omnibus [2]. Meta-analysis of these data has the potential to yield important biological information, but is hampered by technical issues. Cross-platform comparability has been a major hindrance to this approach. This problem arises because matching probe-sets across platforms is a difficult task. Different platforms use different probe lengths and sequences, and mapping them to one common gene or set of genes is beset with problems. Another limitation is different gene annotations used by different platforms. The nucleic acid sequences for various species are submitted to and maintained in the GenBankH database by the National Center of Biotechnology Information (NCBI) [3]. There are different annotation methods in use to parse these sequences into genes or gene clusters. UniGene is one method for partitioning GenBank nucleic acid sequences into unique gene-oriented clusters, each of which represents a unique gene. These UniGene identifiers (IDs) are created by finding transcript sequences that match distinct transcription areas or genes. UniGene IDs have been used as the matching criterion to merge data across various platforms, but this has led to a substantial portion of the data remaining unmatched in previous studies [4,5,6,7,8]. Recent approaches have tried using Reference Sequence (RefSeq) IDs as the matching criterion [9]. RefSeq is a public access database, also maintained by NCBI. This database is built by using sequence data from GenBank, EMBL Data Library (UK) and DNA Data Bank (Japan) [10]. This set is also constantly updated, and input from various investigators is also used to maintain this set. Since both UniGene and RefSeq are billed as non-redundant sets of transcript IDs, and have been used in prior studies with mixed results, it is still unclear as to which approach is better.
We attempted to conduct a meta-analysis of all gene expression studies using hematopoietic progenitor cells to determine a gene expression signature characteristic of these cells. Our aim was to integrate data from all studies that used normal hematopoietic progenitors and stem cells into a unified normalized database. This was done using both UniGene as well as RefSeq gene IDs to assess which identifier provides the best yield. Our results show that experimental conditions, laboratory where the experiments were performed and different microarray platforms can result in significant variability in gene expression patterns from similar sources of cells. In spite of experimental variability, meta-analytical studies do have the power to discriminate biologically distinct tissues on the basis of their normalized gene expression patterns. Gene expression datasets from similar cells of origin cluster together despite diseased phenotypes and genetic alterations. The similarity seen among gene expression profiles of leukemias, myelodysplasia and normal hematopoietic progenitors, when compared to non-hematopoietic tissues, validates the functional discriminatory power of this meta-analysis. Finally, analysis of merged normal hematopoietic progenitor cell gene expression datasets led to the discovery of a common gene expression signature characteristic of these cells. Genes that are most uniformly expressed in normal hematopoietic tissues and at the same time being differentially expressed as compared to other normal tissues were found to be involved in important biological processes such as hematopoiesis and development. The expression patterns of these genes were validated in a different microarray platform using material from three different hematopoietic progenitor and stem cell experiments.

Data collection
Normal hematopoietic cell gene expression data were collected from the NCBI's Gene Expression Omnibus (GEO) database ( Figure 1). Bone marrow, hematopoietic, CD34 and stem cells were used as search terms to locate datasets containing gene expression profiles of normal human hematopoietic cells. Normal bone marrow/peripheral blood CD34 profiles used as controls in studies of leukemia and other hematological diseases were also included. Most studies used the Affymetrix U95, the U133A/B and the U133 Plus 2.0 Array Platforms. A handful of studies using older Affymetrix platforms, like the HG-Focus Target Array and the Full Length HuG Array, were discarded because combining data from these yielded a lot of non-matching probe-sets.
Gene expression data for other normal tissues assayed on the same platforms were also obtained from GEO. In a study where multiple sets were available, we picked one set each for every tissue, again to minimize correlation within individual datasets. These were picked using computer-generated random numbers.
To obtain diseased stem cell data, we identified a few studies with multiple datasets on myelodysplasia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) samples. Where several datasets were available per study, we picked a subset, again using random numbers, to obtain about 10 samples per each study. Table 1 shows the details of these datasets [11,12,13,14,15,16,17,18,19].

Integration of datasets
Initially, we used the comparison spreadsheets provided by the chip manufacturer, Affymetrix [20]. These files link the probe-set IDs of various platforms. However, the yield therein was poor. For example, using the link file between the U133 and the U133 Plus platforms, 44635 U133 IDs matched to only 9908 U133 Plus IDs. Therefore, UniGene and RefSeq IDs were evaluated as variables to cross link data from various platforms. Individual probe-set IDs for each platform were linked to the corresponding UniGene IDs using annotation files, again provided by Affymetrix [20]. These UniGene IDs were then used to combine data across the three platforms. Once probe-set IDs and their expression values were combined, the expression value for each UniGene ID was obtained. In many instances, more than one probe-set matched to the same UniGene ID, resulting in multiple expression values for each such ID. In such cases, the expression value was calculated as the mean of the various values for each UniGene ID.
Probe-set IDs which did not match to any UniGene ID were dropped. Also, if any UniGene ID had data for only one platform, it was dropped, as it was considered to not match across at least two of the three platforms. Moreover, in many cases, one probe-set ID matched to more than one UniGene ID. In such cases, each UniGene ID was considered to have the same gene expression value and the data were expanded accordingly.
An identical process was used to merge data across platforms using RefSeq ID as the match identifier, instead of UniGene ID. Of interest, RefSeq IDs can be either protein IDs or transcript IDs. Protein IDs provided slightly better results (as detailed further) than transcript IDs, and were therefore used in this study.

Data analysis
Once expression values for each UniGene or RefSeq ID were obtained, these were used to do the analysis. First, the datasets were normalized using quantile normalization to ensure that inherent large-scale expression differences in the datasets based on different sources and laboratories were minimized. Unsupervised hierarchical clustering using average linkage with (1 -Pearson correlation coefficient) as the distance measure was done for each of the three 'types' of tissue -normal hematopoietic cells, normal non-hematopoietic tissues and diseased hematopoietic cells. This allowed us to look at how the datasets cluster -whether by platform, laboratory, experiment or otherwise.
To determine a gene signature for hematopoietic progenitor and stem cells, we used the datasets derived from 57 CD34+ sets, as whole bone marrow sets may not be a true reflection of these progenitors, being as they are a mixture of various cell types. To find out which genes were most consistently expressed across these samples, we used the coefficient of variation -defined as the standard deviation divided by the mean -of the expression values for each ID, calculated across all stem cell samples. The coefficient of variation was used to incorporate consistency in gene expression as well as ''enrichment'' of genes in the hematopoietic progenitor cells. Prior studies have used similar reasoning [21].
We then used this set of consistently expressed genes and compared their expression in normal hematopoietic progenitors versus that in non-hematopoietic tissues, to identify which genes could differentiate these two tissue sources. This was done using significance analysis of microarrays (SAM) [22,23]. Similarly, normal hematopoietic progenitor gene expression was compared to diseased hematopoietic data, to identify a subset of genes that may be most relevant to hematological stem cell disorders. All IDs with missing values for any of the samples were deleted.
All data analyses were done using SAS (SAS Institute, Cary, NC), the R language and ArrayAssist Expression software package (Stratagene Corporation, La Jolla, CA).

Integration of data using protein identifiers
A total of 66 individual normal hematopoietic cell expression profiles were identified in NCBI's GEO database (Table 1). Nine were derived from whole bone marrow samples and 57 were from selected CD34-positive cells. These studies were performed on 3 different microarray platforms (Table 2). Since the probe-set identifiers and complementary oligos were different on these platforms, we integrated the data using both UniGene and RefSeq protein IDs (Figure 1, showing schema).
The Affymetrix annotation files yielded 12,626 unique probesets in the U95 platform, 44,761 in the U133 A/B platform and 54,676 unique probe-sets in the U133 Plus 2.0 platform. Using UniGene IDs as the matching criterion, 11,635, 40,787 and 45,867 probe-sets matched to at least one other platform, respectively. After combining data from all the three platforms, we ended up with a total of 20,717 UniGene IDs. Since one probe-set can match to more than one UniGene ID and vice versa, a relatively small number of U95 probe-sets matched to 20,717 UniGene IDs. Using RefSeq protein IDs as the matching parameter, 11,722, 37,395 and 42,462 probe-sets matched to at least one other platform, respectively. As many of the probe-sets from the two newer platforms were coding for the same protein, we ended up with a total of 28,497 unique RefSeq protein IDs that were common to all three platforms. After removing the probe-sets where expression values were missing for any dataset, a total of 8,598 unique UniGene IDs and 8,345 unique RefSeq IDs were obtained that were common to all platforms. These were quantilenormalized using ArrayAssist (Strategene Corporation, California, USA) to adjust for hybridization intensities and used for the metaanalysis.

Experimental conditions, microarray platforms and source of cells can influence gene expression patterns
Sixty-six hematopoietic gene expression profiles from either whole bone marrow or selected CD34 cells were grouped using unsupervised clustering based on Pearson correlation coefficient. In spite of similar cell types, the studies grouped primarily based on the laboratory where the data was obtained from. The next level of clustering was defined by the microarray platform used for the studies. Barring two bone marrow samples from the Plus 2.0 platform that were similar to one bone marrow sample from the 133A/B platform, all the samples clustered depending on which platform they were from. The samples from the U95 platform stayed as a separate group (Figure 2). The last level of similarity was based on the exact source of the cells used for the RNA.
The correlation coefficients between various datasets validated the clustering order of laboratory, platform and source ( Table 3). The correlation was strongest between samples obtained from the same laboratory/study, with a mean (median) absolute correlation coefficient of 0.87 (0.95). When the platform was the same, a slightly lesser though still strong correlation of 0.83 was obtained. These results illustrate that the cause of variability in gene expression studies can be due to experimental conditions/ protocols used in individual laboratories, platforms used as well as sources of cells in that order.

Gene expression studies from biologically distinct tissue types can be compared despite varying platforms and experimental conditions
We next wanted to determine the degree of dissimilarity of hematopoietic datasets to gene expression (GE) datasets obtained from other biologically distinct tissues. GE profiles from human adrenal, appendix, brain, breast, colon, heart, kidney, liver, lung, ovary, pancreas, pituitary, prostate, salivary gland, skin, small intestine, smooth muscle, spleen, stomach, testis, thyroid, urinary bladder and uterus samples were obtained from the GEO database and used for this analysis. Unsupervised clustering showed that samples from the same tissue of origin clustered tightly together in spite of different platforms/laboratories used for the analysis   ( Figure 3). Clustering of triplicate sets of liver, heart, brain, salivary gland, testis, kidney and thyroid tissues from different laboratories and platforms clearly indicates that our analysis can detect the similarity of expression at the source tissue level. The mean (and median) correlation coefficients were also not very dependent on the laboratory/study or the platform ( Table 4). The highest correlation was observed between similar tissues. These results demonstrate that despite inter-platform and inter-study variability, meta-analysis of gene expression profiles has the potential of revealing differences between tissues with a high degree of dissimilarity (Table 4).

Gene expression datasets from similar cells of origin can cluster together despite diseased phenotypes and genetic alterations
To further test the discriminatory ability of the meta-analysis, we next grouped datasets from hematologic malignancies with the normal hematopoietic and non-hematopoietic tissues analyzed within the same microarray platform (U133 A/B). We wanted to determine whether biological variability seen in hematopoietic stem cell disorders such as acute leukemias and myelodysplastic syndromes would be distinguishable in our analysis. Unsupervised clustering showed that even though diseased hematopoietic cells were separated from the normal cells, they were significantly more dissimilar to non-hematopoietic tissues ( Figure 4A). In fact, some individual GE profiles from bone marrow CD34+ samples from myelodysplastic syndromes were very similar to normal CD34+ cells and clustered within their groups. We believe that this was a strong validation of our analytical approach as myelodysplasia is a preleukemic disorder with varying levels of pathology and can have cases that are genetically very similar to normal hematopoietic stem cells [24]. We did a similar analysis using RefSeq IDs as the matching criterion between different datasets. Interestingly, clustering using RefSeq IDs provided more heterogeneous results ( Figure 4B) and grouped non-hematopoietic tissues along with hematopoietic tissues, thus demonstrating that UniGene IDs are better at discriminating biological subsets.

Hematopoietic progenitor and stem cell signature
After validating the strength of the meta-analysis, we wanted to determine a gene expression signature of hematopoietic progenitors. Using the lowest 20 th percentile, to obtain the best possible initial yield, a total of 1,719 genes were obtained with a low coefficient of variation among the 57 CD34+ GE profiles (range 0.15-0.39). These were the genes deemed to be most characteristic of the stem and progenitor cells as their expression was most consistently enriched among all the samples.
Using this list of genes, we next determined the genes that were able to discriminate normal hematopoietic and non-hematopoietic cells by using significance analysis of microarrays (SAM). We used 100 permutations to compute the expected significance 'score', and a false discovery rate (FDR) of 0.29% was achieved by using the lower-and upper-most 10% of genes. A total of 349 genes were called as significant ( Figure 5).
To better understand how differentially expressed genes were integrated into specific regulatory and signaling pathway networks, we used Ingenuity Pathway Analysis (Ingenuity Systems, Redwood City, USA). Functional analysis of overexpressed genes indicated that this list is highly enriched for proteins involved in hematopoiesis and cell cycle, further validating our approach ( Figure 5, Table 5). Several of these genes have already been described to have important roles in development of the hematologic system. In addition, our analysis revealed a variety of novel functional genes like SWI/SNF family member 4, SMARCE1 and Septin 6. Many of the genes identified in our database were also found to be enriched in 3 independent HSC studies performed in our laboratory using a different Nimblegen platform (Supplementary Table S1). Cross validation suggests that   these genes need to be tested as potential markers of HSCs and may have functionally important roles in hematopoiesis. We also found 171 genes to be differentially underexpressed in hemato-poietic progenitors (Table 6). Our database and integration files will be online in a searchable format to aid other hematology and stem cell researchers (http://greallylab.aecom.yu.edu/).

Discussion
Microarray analysis of global gene expression has led to rapid advances in our understanding of various physiological and pathological processes. Although many hundreds of studies have been done, doubts have been raised about the reproducibility and applicability of this data [25,26,27,28]. Inter-study variability can be attributed to differing probes on the arrays, different protocols for RNA extraction, labeling and hybridization, and differences in the quality of cells. In spite of these factors, a number of studies have also demonstrated reproducibility of microarray studies performed at different platforms and laboratories, though most used the same source of RNA for these analyses [29,30,31]. The MicroArray Quality Control consortium (MAQC) was formed to address these questions and recently reported that reproducibility can be enhanced by better matching of microarray probes between platforms [32]. They concluded that matching probe-sets within the same exons and using similar experimental protocols can lead to more reproducible results when performed on major commercial microarray platforms. Our results take these findings a step further and demonstrate that GE studies done using different platforms and distinct sources of material have the power to discriminate between biologically distinct tissues and thus can also be used to analyze various scientific questions. Earlier attempts to address study specific biases have used statistical algorithms including ANOVA based correction models [33,34]. We did not use these algorithms as we found adequate discrimination between biologically distinct tissues, demonstrating that the degree of differential gene expression is so large that it is found even in presence of possible study-specific biases. It is possible that some of the more subtle results seen in our analysis, however, may prove artificial once these biases have been removed by appropriate methods.
Furthermore, this meta-analysis can be accomplished simply by using UniGene and RefSeq identifiers as common variables between array platforms, though UniGene is shown to be slightly better at achieving this discrimination in our dataset. This difference between UniGene and RefSeq results, albeit small, is likely due to the different methods of identifying and assigning transcripts used in the process, and has been observed in prior studies also [4,5,6,7,8,9,10]. Even though we did observe variability due to different laboratory protocols as seen by previous studies, a superior correlation between tissues with similar sources of cells was able to surpass this limitation and make the metaanalysis scientifically useful.
Our study demonstrated that results obtained through this approach can be reconciled with the biology of hematopoietic cells and malignancies thereof. For example, samples from acute myeloid leukemia and myelodysplasia were found to be transcriptionally closer to normal hematopoietic cells than non-hematopoietic cells, even though these studies are done in many different laboratories. MDS is a preleukemic disorder of varying grades of pathology and can have an indolent course in most patients [15,24]. The fact that MDS samples clustered with normal hematopoietic samples in some cases shows that our analysis can interpret biological relationships even between studies performed by different experimental protocols and laboratories.
After demonstrating that our approach can be used to biologically characterize sources of cells, we attempted to use this database to discover gene signatures characteristic of hematopoietic progenitor and stem cells. Due to the heterogeneity of our source dataset, we imposed very stringent criteria to discover genes characteristic of hematopoietic progenitors. Out of the 349 genes that were differentially expressed in normal progenitors, 124 are differentially expressed in diseased hematopoietic cells, demon-   [21,26,35]. Two recent seminal studies searched for gene signatures of stem cells by comparing genes enriched in hematopoietic, neural and embryonic stem cells and arrived at a total of 283 and 230 common 'stemness' genes respectively [21,35]. Even though the experimental techniques and cell types in these two papers were similar, an initial comparative analysis showed that only 7 'stemness' genes were common between these two studies. Comparison to a subsequent third analysis [26] showed even less overlap, with only one gene being consistently enriched between these three independent similar studies. Repeat analysis done using different statistical methods did lead to more gene overlap, but the final conclusion was that gene array studies of stem cells are influenced by cell purity and can be contaminated by a high level of nonspecific observations in the data. Consequently, the authors determined that commonly expressed genes among different studies may be better representatives of functionally important stemness genes. Thus, meta-analytical approaches may be a way to separate functionally important information from experimental noise. As the genes discovered by our analysis are common in an extremely variable dataset, they may have a high chance of being characteristic of human HSCs. Most importantly, alpha-6 integrin, the one gene that was found be enriched in all three murine stem cell studies, is similar to alpha-4 integrin that was found to be enriched in our human dataset. Both of these integrins are known to be expressed on the surface of HSCs and are implicated in cell migration and homing to the bone marrow. The functional similarities between these two integrins and the concurrence of our findings with three landmark stemness gene studies published in the literature validate our analytical approach. Our analysis also yielded a set of genes not previously implicated in hematopoiesis. Some of these genes have interesting functions and can be potential regulators of HSC function. SMARCE (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily e/BAF57) is a key member of the mammalian SWI/SNF chromatin remodeling complex that is involved in transcriptional regulation [36]. SMARCE has been shown to mediate the interaction between the chromatin remodeling complex and transcription factors and thus could be partly responsible for the unique chromatin associated with stem cells [37]. Lyn kinase is a member of the src family of kinases and has been implicated in granulopoiesis and erythropoiesis and needs further exploration as a stem cell marker [38,39]. Septin 6 is a member of a class of proteins involved in cell division, membrane trafficking and cytoskeletal organization. The roles of septins in hematopoietic stem cells remain unexplored [40]. Amyloid beta precursor protein is a cell surface protein with signal-transducing properties, and it is thought to play a role in the pathogenesis of Alzheimer's disease [41]. This protein can activate NEDD8, a ubiquitin-like protein required for cell cycle progression through the S/M checkpoint and thus can be potentially involved in cell cycle control of hematopoietic stem cells. The protein Dp-2 (E2F dimerization partner 2) belongs to a family of transcription factors that play an essential role in regulating cell cycle progression [42]. These transcription factors regulate the expression of numerous critical genes (e.g. cyclin E, CDC2, cyclin A, B-Myb, E2F1, and p107) involved in cell cycle progression as well as several enzymes (DNA polymerase a, thymidine kinase, and dihydrofolate reductase) required for DNA replication [42]. Thus Dp-2 could certainly be involved in stem cell regulation. In summary, our analytical approach provides a list of interesting genes for further scientific and functional validation. Additionally, this dataset can be used as an online resource for stem cell and hematology researchers as a control database for comparisons with disease state GE profiles done in their laboratories.