Key Issues in Conducting a Meta-Analysis of Gene Expression Microarray Datasets

Adaikalavan Ramasamy and colleagues outline seven key issues and suggest a stepwise approach in conducting a meta-analysis of microarray datasets.

curation. We discuss the sixth issue, choosing a meta-analysis technique, using the two-class comparison as an example. The seventh issue of analyzing, presenting, and interpreting data is discussed briefly using an illustrative meta-analysis of 25 datasets. We provide a practical checklist, shown in Table 1, that should enable the reader to make informed decisions on how to conduct a meta-analysis, and to understand better the underlying concepts that make this approach so attractive for analysis of microarray data.
Having a detailed review protocol can further help to clarify the research objectives and methods and to minimize bias from unplanned data-driven analysis. We suggest developing the review protocol by outlining the solutions to the steps in the checklist shown in Table 1. For example, Step 7 (Check the selected study against inclusion-exclusion criteria) might be expanded in the review protocol as follows: "Two reviewers will check the eligibility of the identified studies, with disagreements resolved by a third reviewer. A log of excluded studies, with reasons for exclusions, will be maintained." The protocol can be turned into a useful project management tool by incorporating timelines and division of labor.
The inclusion-exclusion criteria (Step 2) are eligibility criteria for studies that will help achieve the stated objectives. These criteria could be biological (e.g., specific disease, type of outcome, type of tissues) or technical (e.g., density of array, minimum number of arrays). The retrieved articles must be evaluated as to whether they meet the inclusion criteria.
Table 1. A Checklist for Conducting Meta-Analysis of Microarray Datasets

Step  Action

Identify suitable microarray studies (Issue 1)
1   Formulate objectives and a review protocol.
2   Define inclusion-exclusion criteria and suitable keywords.
3
4   Search public microarray repositories listed in Table 2.
5   Contact collaborators and experts in the field to help find published and unpublished data.
6   Search the reference section of retrieved studies for other relevant studies.
7   Check the selected study against inclusion-exclusion criteria.

Extract the data from studies (Issue 2)
8   Scan the literature to identify FLEO data (e.g., CEL, GPR files).
9
10  If multiple publications use overlapping data, identify the most comprehensive one. Combine any training and validation dataset together.

Prepare datasets from different platforms (Issue 3)
11  Identify and remove any arrays with poor quality.
12  Preprocess the FLEO data into a GEDM.
13  Check for batch effects among arrays, especially in large studies.
14  Filter out any probes with poor spot quality in the arrays (optional).
15  Aggregate any technical replicates.
16  Check that the processed expression values from multiple platforms are compatible.

Annotate the individual datasets (Issue 4)
17  Identify either (a) the probe sequence or (b) the most sequence-specific probe annotation information.
18  Either (a) cluster the probe sequences or (b) map the most sequence-specific probe annotation to a gene-level identifier. Use the same mapping build for all datasets.

Resolve the many-to-many relationship between probes and genes (Issue 5)
19  Discard any probe that does not map to any GeneID.
20  For every GeneID within a study, calculate the study-specific estimate(s).
21  If a probe maps to multiple GeneIDs within a study, "expand" it by replacing it with a new record for each GeneID with the same study-specific estimate(s) or expression profile.
22  For GeneIDs with multiple records within a study, "summarize" them by either selecting one of the records or by aggregating them.

Combine the study-specific estimates (Issue 6)
23  For every GeneID, identify the studies that provide usable information. Optionally, discard any GeneID that is not found in at least a prespecified number of studies.
24  For every GeneID, combine the study-specific estimates across the studies using a meta-analytic technique. Record the resulting summary statistic(s).
25  Calculate the nominal p-value of the summary statistic(s) for every GeneID and adjust for multiple testing.

Analyze, present, and interpret results (Issue 7)
26  Examine the sensitivity of results to individual studies with a leave-one-out analysis and by varying the selections made (e.g., type of data available).
27  Present the summary statistics graphically (e.g., forest plot) for genes of interest.
28  Analyze findings using computational tools (e.g., gene set enrichment analysis).
29  If possible, validate using an alternative technology and/or different samples.
30  Consider strength of evidence, limitations, and generalizability of current findings.

Once the inclusion-exclusion criteria have been defined, one needs to perform a comprehensive literature search (Step 3) to identify suitable studies, usually based on appropriate keywords for automated queries. We recommend searching all the major online repositories of abstracts listed in Table 2 to maximize data acquisition. Reading the latest review articles and directly contacting researchers in relevant fields (Step 5) may help to identify both work potentially missed by automated search, and ongoing research efforts with possibly unpublished data.
In the case of microarrays, one should also search public microarray data repositories [44][45][46] recommended by the Minimum Information About a Microarray Experiment (MIAME) requirements [47,48], as well as a few more specialized repositories [49,50], listed in Table 2 (Step 4).
Having identified potentially eligible studies from abstracts, one needs to retrieve the articles, where available, and confirm eligibility (Step 7). This process may best be done by at least two people.

Issue 2: Extract Data from Studies
Before we consider how to extract the data, we need to first decide what type of data to extract. This partially depends on the choice of meta-analysis technique (Issue 6), but the underlying principles will be discussed here. Figure 1 shows the four types of data arising from microarray analysis.
A published gene list (PGL) represents the genes that are declared as differentially expressed in a given study. PGLs are often presented in the main or supplementary text of microarray-based studies and are thus easy to obtain. Unfortunately, such PGLs are of limited use for meta-analysis since they represent only a subset of the genes actually studied, and information from many genes will be completely absent. Furthermore, PGLs depend heavily on the preprocessing algorithm, the analysis method, the significance threshold, and the annotation builds used in the original study, all of which usually differ between studies [51]. Thus individual patient-level data (IPD), which for microarrays represents the measurement for every probe in every hybridization, are far more useful. Ioannidis et al. [52] discuss further the advantages of a meta-analysis using IPD versus PGLs.
The gene expression data matrix (GEDM) represents the gene expression summary for every probe and sample and is thus ideally suited as input for meta-analysis. Published GEDMs, however, are unsuitable for meta-analysis because they depend on the choice of the preprocessing algorithms used, which may produce non-combinable results. At present, image files are neither routinely deposited in public microarray repositories nor technologically uniform enough to be used as input for meta-analysis.
In order to eliminate bias due to specific algorithms used in the original studies, and to allow consistent handling of all datasets, we recommend obtaining the feature-level extraction output (FLEO) files (Step 8), such as CEL and GPR files, and converting them to GEDMs in a consistent manner (see Issue 3). FLEO files are likely to be available, especially for newer studies, because the widely supported MIAME requirements [48] now ask authors to make the FLEO data available in public microarray repositories.
If the main text and supplementary information do not state the location of the FLEO data, then one should try searching public microarray repositories or the research group's Web page before contacting the authors (Step 9). If multiple publications use overlapping sets of data, one should identify and use the most comprehensive dataset available (Step 10), and combine any datasets that were split for algorithm training and validation purposes.

Issue 3: Prepare Datasets from Different Platforms
FLEO data have to be converted into GEDMs, which can then be used as input for the meta-analysis. The same preprocessing algorithm should be used for multiple studies conducted on the same platform. To combine studies from different platforms, which may have different designs and thus have different options of preprocessing algorithms, it is desirable to try to identify comparable preprocessing algorithms. There are many microarray platforms, but we focus on the most popular: the Affymetrix platform and a set of platforms that could be generically classified as "two-color technology" platforms.
Before the preprocessing step, one may wish to first identify and remove any arrays that are of poor quality (Step 11). There are many comprehensive, free, and open-source packages in BioConductor [53] for quality assessment, including arrayMagic [54] for the two-color technology platform, and Simpleaffy [55] and affyPLM [56] for the Affymetrix platform.
Next, all good quality arrays should be preprocessed consistently to remove any systematic differences (Step 12). This is an important stage, since preprocessing directly affects the gene expression measurements, and thus all subsequent steps. In practice, researchers are likely to combine datasets from multiple platforms, and there are very few preprocessing algorithms that can be applied universally; one example is the algorithm of [57], which accounts for the dependence between variance and mean of the output expression measure. By contrast, it is more common to use different preprocessing algorithms for each platform [58][59][60][61]. Unfortunately, there is currently no consensus on which preprocessing algorithm(s) produce comparable expression measurements across different platforms.
Third, one may also want to check and correct for any batch effects (Step 13), especially in large studies. Unsupervised visualization [62] can help to identify any grouping caused by experimental factors.
Fourth, one needs to decide whether to use all available probes on the array, or a filtered set of probes (Step 14). In single-study analysis, it is common to filter out probes on the basis of visible defects (e.g., using quality flags), probe-set calls (e.g., absent/present calls from the MAS 5.0 preprocessing algorithm), or low variation (e.g., using a minimum coefficient of variation). However, it is unclear if such filtering is beneficial from a meta-analysis perspective.
Fifth, one needs to deal with multiple technical replicates (i.e., multiple measurements from the same biological subject) if relevant (Step 15). These should not be treated as independent observations. One approach is to select one of the replicates at random. Alternatively, one can average the replicates. If we assume that all technical replicates have similar array quality, then a simple average or median can be used.
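The averaging option for technical replicates (Step 15) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `aggregate_replicates`, the dictionary layout, and the toy values are all assumptions made for the example, and it presumes that every array lists the same probes in the same order and that replicates are of similar quality.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_replicates(arrays, subject_of, how="mean"):
    """Collapse technical replicates into one profile per biological subject.

    arrays     -- dict: array name -> list of log2 expression values
                  (hypothetical layout; same probe order in every array)
    subject_of -- dict: array name -> biological subject identifier
    how        -- "mean" or "median"
    """
    combine = mean if how == "mean" else median
    by_subject = defaultdict(list)
    for name, profile in arrays.items():
        by_subject[subject_of[name]].append(profile)
    # zip(*profiles) walks through the replicates probe by probe
    return {
        subject: [combine(values) for values in zip(*profiles)]
        for subject, profiles in by_subject.items()
    }


# Toy data: a1 and a2 are technical replicates of the same subject
arrays = {
    "a1": [7.0, 5.0, 9.0],
    "a2": [7.4, 5.2, 8.6],
    "b1": [6.1, 4.9, 9.3],
}
subject_of = {"a1": "patient_A", "a2": "patient_A", "b1": "patient_B"}
collapsed = aggregate_replicates(arrays, subject_of)
```

After this step each biological subject contributes exactly one expression profile, so replicates are no longer counted as independent observations.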
Finally, one could check that the processed expression values from multiple platforms are comparable (Step 16). Microarray platform manufacturers typically include housekeeping genes or negative controls, which are genes expected to be transcribed at a constant level, and may be used for this purpose. Additionally, one may use a visualization technique such as multidimensional scaling [63,64] to inspect for any clustering of arrays by studies.

Issue 4: Annotate the Individual Datasets
Microarray probe designers use short, highly specific regions in genes of interest because using the full-length gene sequence can lead to non-specific binding or noise. Different design criteria lead to the creation of multiple probes for the same gene. Therefore, one needs to identify which probes represent a given gene within and across the datasets.
One option is to cluster the probes based on the sequence data (Step 17a) using the BLAST algorithm [65], for example, by using the Ensembl browser [66] (Step 18a). It has been shown that sequence-matched datasets can increase cross-platform concordance [67]. Such methods can also accommodate Affymetrix probe-set redefinitions [68], which better address the problem of alternative splicing. However, the probe sequence may not be available for all platforms, and the clustering of probe sequences could be computationally intensive for very large numbers of probes.
Alternatively, one can map probe-level identifiers such as I.M.A.G.E. CloneID, Affymetrix ID, or GenBank accession numbers to a gene-level identifier such as UniGene, RefSeq, or Entrez Gene ID. UniGene [69], which is an experimental system for automatically partitioning sequences into nonredundant gene-oriented clusters, is a popular choice to unify the different datasets. For example, UniGene Build #211 (released March 12, 2008) reduces nearly 7 million human sequences to 124,181 clusters. To translate probe-level identifiers to gene-level identifiers, one can use either the annotation packages in BioConductor [53] or Web tools such as SOURCE [70] and RESOURCERER [71] (Step 18b). We suggest using I.M.A.G.E. CloneID [72] or Affymetrix ID first, if available, as they are more sequence-specific (Step 17b). The same mapping build, ideally the most recent, should be used for all datasets to avoid inconsistencies between releases [73,74].

Issue 5: Resolve the Many-to-Many Relationships between Probes and Genes
In this section, we will refer to either the sequence cluster ID or the gene-level identifier (such as UniGene ID or RefSeq ID) used to annotate the datasets, simply as the GeneID.
Many probes can map to the same GeneID because of the clustering nature of the UniGene, RefSeq, and BLAST systems involved, or because the microarray chips used contain duplicate spotted probes. On the other hand, a probe may map to more than one GeneID if the probe sequence is not specific enough. Sometimes, a probe has insufficient information to be mapped to any GeneID, and we recommend omitting these from further analysis (Step 19). Inconsistencies between annotation databases or releases and software [73][74][75] complicate the matter further. The illustrative example of a meta-analysis of 25 datasets presented later in this paper contains 537,686 probes. Of these probes, 47,154 (or 8.7%) could not be mapped to any UniGene ID, while 29,774 (or 6.1%) of the remaining probes mapped to more than one UniGene ID.
This "many-to-many" relationship can fragment the available information for meta-analysis. For example, a probe could map to GeneID X in half of the datasets but to both GeneIDs X and Y in the remaining datasets. Software that performs automated meta-analysis on several thousand genes will treat such probes as two separate gene entities, failing to fully combine the information for GeneID X from all studies.
A simple approach is to use only the probes with one-to-one mapping for further analysis, but this means losing information, and so is not recommended. In the example above, potentially half of the information for GeneID X (i.e., from probes mapping to both X and Y) will be ignored. Therefore, when relevant, we recommend replacing probes with multiple GeneIDs by a new record for each GeneID (Step 21). This greedy approach of "expanding" the probes with multiple GeneIDs ensures the software uses all possible information.
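The "expand" operation can be sketched in a few lines. The record layout (probe ID, list of GeneIDs, study-specific estimate) and the name `expand_probes` are assumptions made for illustration; probes with an empty GeneID list simply produce no record, which also covers Step 19.

```python
def expand_probes(records):
    """Replace each probe by one record per GeneID it maps to.

    records -- list of (probe_id, gene_ids, estimate) tuples, where
               gene_ids is a (possibly empty) list of GeneIDs
               (hypothetical layout chosen for this sketch)
    Returns a list of (gene_id, probe_id, estimate) tuples.
    """
    expanded = []
    for probe_id, gene_ids, estimate in records:
        # A probe with no GeneID yields nothing; a probe with several
        # GeneIDs is copied once per GeneID with the same estimate.
        for gene_id in gene_ids:
            expanded.append((gene_id, probe_id, estimate))
    return expanded


records = [
    ("p1", ["X"], 0.8),        # one-to-one mapping
    ("p2", ["X", "Y"], -0.3),  # maps to two GeneIDs
    ("p3", [], 1.1),           # unmapped probe, dropped
]
expanded = expand_probes(records)
```

In this toy input, GeneID X ends up with two records (from p1 and p2), which is exactly the situation the "summarize" step then has to resolve.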
On the other hand, how should one deal with multiple probes that map to the same GeneID within a given study? Grützmann et al. [24] treated these as independent observations in the meta-analysis, but we recommend summarizing them (Step 22) into a single representative value per GeneID within a study.
Several options are available to summarize information in this situation. First, one could select a probe at random, but this means losing information. Simply averaging the expression profiles before proceeding is not desirable either, as different probe sequences have different binding affinity, giving rise to the problem of different measurement scales. Thus, it is preferable to work with standardized measures such as the p-value or effect size. When working with standardized measures, one could select the most extreme value, since it is least likely to occur by chance. For example, Rhodes et al. [19] used the smallest p-value of the probes that corresponded to each GeneID. A more sophisticated approach, when working with effect size, is to meta-analyze the probes.
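As one concrete instance of the "most extreme value" option, the sketch below keeps the probe with the smallest p-value for each GeneID within a study. The tuple layout and the name `summarize_by_min_p` are hypothetical; other summaries (for example, meta-analyzing the probes) would replace the selection rule.

```python
def summarize_by_min_p(records):
    """Keep, for every GeneID, the probe with the smallest p-value.

    records -- list of (gene_id, probe_id, p_value) tuples from one study
               (hypothetical layout chosen for this sketch)
    Returns a dict: gene_id -> (probe_id, p_value).
    """
    best = {}
    for gene_id, probe_id, p_value in records:
        if gene_id not in best or p_value < best[gene_id][1]:
            best[gene_id] = (probe_id, p_value)
    return best


records = [
    ("X", "p1", 0.040),
    ("X", "p2", 0.002),  # most extreme record for GeneID X
    ("Y", "p2", 0.002),
    ("Y", "p7", 0.300),
]
summary = summarize_by_min_p(records)
```

Selecting the minimum is simple, but it is a selection across several tests, so the retained p-value is optimistic if taken at face value.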
Recently, the MicroArray Quality Control (MAQC) project [61] described another alternative to resolve the many-to-many mapping. For a probe that mapped to multiple RefSeq IDs, the authors selected the RefSeq ID that was annotated by TaqMan assays and, secondarily, one that was present in the majority of platforms. Next, if many probes mapped to a given RefSeq ID, they chose the one closest to the 3' end of the gene.
After resolving the many-to-many relationship by expanding and summarizing probes, we are left with one summary statistic per GeneID per study. In the next step, we proceed with meta-analyzing the summary statistic for each GeneID in turn across the studies.

Issue 6: Choosing a Meta-Analysis Technique
The choice of meta-analysis technique depends on the type of response (e.g., binary, continuous, survival) and objective. In this article, we focus on a fundamental application of microarrays: the two-class comparison where the objective is to identify genes expressed differentially between two well-known conditions. There are four generic ways of combining information in such a situation. (For clarity of presentation, we indicate the steps only for the inverse-variance technique.)
Vote counting. Here, one counts the number of studies in which a gene was declared significant [76]. For very small numbers of studies, the results can be visualized using a Venn diagram [77]. Vote counting in the context of microarrays is perhaps best described by Rhodes et al. [22], who also suggest calculating the null distribution of votes using permutation testing. Alternatively, one could calculate the significance of the overlaps using the normal approximation to the binomial as described in Smid et al. [30]. Yang et al. [35] extend both of these techniques into the concept of meta-analysis pattern matches.
Combining ranks. Unlike vote counting, this technique accounts for the order of genes declared significant. DeConde et al. [37] use three different approaches to aggregate the rankings of, say, the top 100 lists (the 100 most significantly up-regulated or down-regulated genes) from different studies. Two of the algorithms use Markov chains to convert the pair-wise preference between the gene lists to a stationary distribution; the third algorithm is based on an order-statistics model. Zintzaras and Ioannidis [40] proposed METa-analysis of RAnked DISCovery datasets (METRADISC), which is based on the average of the standardized rank and has the advantage of incorporating the between-study heterogeneity (sum of squared deviations from the average). The null distributions for the average rank and heterogeneity are then estimated using non-parametric Monte Carlo permutation testing and matched for pattern of occurrence in studies. Hong et al. [38] proposed the RankProd [78], which calculates the product of the rank of pair-wise differences between every biological sample in one group versus another group across the studies.
Combining p-values. Rhodes et al. [19] use Fisher's sum of logs method [79], which sums the logarithm of the (one-sided hypothesis testing) p-values across k studies for a given gene. The test statistic (minus twice this sum) can be compared against a chi-square distribution with 2k degrees of freedom.
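Fisher's method for a single gene can be sketched as follows. The statistic is minus twice the sum of the natural logarithms of the p-values; because the reference chi-square distribution has an even number of degrees of freedom (2k), its upper tail has a closed form and no statistics library is needed. The function name and the example p-values are illustrative assumptions.

```python
import math


def fisher_combined_p(p_values):
    """Combine k one-sided p-values for one gene with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p)
    and combined_p is the upper tail of a chi-square distribution
    with 2k degrees of freedom.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2k the chi-square tail is
    #   exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Three studies reporting one-sided p-values for the same gene
statistic, combined_p = fisher_combined_p([0.04, 0.10, 0.02])
```

With a single study the combined p-value reduces to the original p-value, which is a convenient sanity check on the implementation.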
Combining effect sizes. Choi et al. [29] and others [24,32,80] used the inverse-variance technique [81,82] in the context of microarrays. The first step is to calculate the effect size and the variance associated with the effect size for every gene in every study (Step 20). Effect size can be calculated as Cohen's d [83], which is the difference in two group means standardized by the pooled standard deviation [84]. Hedges and Olkin (1985) showed that this standardized difference overestimates the effect size for studies with small sample sizes. They proposed a small correction factor to calculate the unbiased estimate of the effect size, which is known as Hedges' adjusted g. The study-specific effect sizes for every gene are then combined across studies into a weighted average (Step 24). As the name suggests, the study weights are inversely proportional to the variance of the study-specific estimates.
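The two steps just described can be sketched for a single gene. The small-sample correction factor 1 − 3/(4(n1 + n2) − 9) and the large-sample variance formula used below are the commonly quoted approximations and are assumptions of this sketch; the text does not state which exact expressions were used. Function names and the toy expression values are likewise hypothetical.

```python
import math
from statistics import mean, variance


def hedges_g(group1, group2):
    """Return (Hedges' adjusted g, approximate variance of g) for one gene."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = math.sqrt(
        ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2))
        / (n1 + n2 - 2)
    )
    d = (mean(group1) - mean(group2)) / pooled_sd  # Cohen's d
    j = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)        # small-sample correction
    g = j * d
    # Large-sample approximation to the variance (assumed form)
    var_g = j ** 2 * ((n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2)))
    return g, var_g


def fixed_effect_pool(estimates):
    """Inverse-variance weighted average of (effect, variance) pairs."""
    weights = [1.0 / var for _, var in estimates]
    pooled = sum(w * eff for w, (eff, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se


# Toy log2 expression values for one gene in two studies (tumor vs. normal)
study_a = hedges_g([8.1, 8.4, 7.9, 8.6], [7.2, 7.5, 7.0, 7.4])
study_b = hedges_g([8.0, 8.3, 8.2], [7.6, 7.9, 7.5])
pooled_g, pooled_se = fixed_effect_pool([study_a, study_b])
```

Because the weights are the reciprocals of the variances, the larger and more precise study pulls the pooled estimate toward its own effect size.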
Additionally, the integrative correlation technique proposed by Parmigiani et al. [33] could be first used to select only the "reproducible" genes for meta-analysis. First, the correlation profile of gene G is calculated as the correlation between gene G and every other gene in a study. Next, the correlation of correlation profiles of gene G in every pair of studies is computed, and if the average exceeds a certain threshold, the gene is called reproducible.
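The two-stage calculation described above can be sketched as follows. This is a simplified illustration rather than the published implementation: it assumes Pearson correlation, assumes every study measures the same set of genes, and leaves the reproducibility threshold to the analyst. All names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_profile(study, gene):
    """Correlation of `gene` with every other gene in one study.

    study -- dict: gene -> list of expression values across samples
    """
    return [pearson(study[gene], study[other])
            for other in sorted(study) if other != gene]


def integrative_correlation(studies, gene):
    """Average, over all pairs of studies, of the correlation between
    the gene's correlation profiles in the two studies."""
    profiles = [correlation_profile(study, gene) for study in studies]
    scores = [pearson(p, q) for p, q in combinations(profiles, 2)]
    return sum(scores) / len(scores)


# Two toy studies measuring the same four genes on different samples
study1 = {"g1": [1, 2, 3, 4], "g2": [2, 4, 5, 8],
          "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
study2 = {"g1": [2, 3, 5, 6, 8], "g2": [1, 2, 4, 4, 7],
          "g3": [9, 7, 6, 4, 2], "g4": [3, 2, 5, 4, 6]}
score = integrative_correlation([study1, study2], "g1")
```

A gene would then be called reproducible if `score` exceeds the chosen threshold; note that the studies may have different numbers of samples, since only the gene-by-gene correlation profiles are compared.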
Given the various statistical options for meta-analysis, how should one choose the most suitable technique? We present a series of questions that could help a meta-analyst make an informed choice.
First, what are the minimum data required for each technique? The inverse-variance technique, METRADISC, and the RankProd all require IPD, which are less readily available than PGLs. Vote counting, DeConde and colleagues' algorithms, and combining p-values are techniques that in theory could use the PGLs, but may not be able to do so in practice. For example, most publications report the significant genes or their rankings based on two-sided p-values, while vote counting and rank aggregation techniques require a one-sided p-value. Using p-values from two-sided testing means ignoring the directionality of the significance and may lead one to select genes that are discordant in direction of gene regulation between the studies. As noted earlier in Issue 2, we strongly prefer to use the IPD to minimize the influence of differing methods across datasets.
Second, which set of genes does each technique use? Vote counting and rank aggregation techniques (using PGLs) only consider the genes declared significant in the original studies. Thus, these techniques depend on an arbitrary threshold, and completely ignore genes that fall below this selected threshold. By contrast, the rank aggregation technique (using IPD), Fisher's method, and the inverse-variance technique consider information from all available genes. However, it is also important to note that the ranking of genes in an individual study depends on which other genes are included in the chip, and thus can influence the rank aggregation techniques. Since microarrays are often used as a hypothesis generating tool, we would prefer a technique that captures information from as many genes as possible.
The third question, related to the previous question, is how does each technique treat frequently studied and rarely studied genes? Newer microarray chips have more comprehensive sets of genes compared to older chips. Thus some genes will be studied more frequently across the studies than others. For example, Affymetrix version HGU-133 plus 2.0 (released in 2003) contains almost all of the 6,065 UniGene IDs available in Affymetrix version HU-6800 (released in 1998), plus a further 13,624 UniGene IDs. Ideally, we would prefer a technique that treats a frequently studied and a rarely studied gene equally.
Since vote counting and rank aggregation use the genes declared significant in the original studies, they do not account for the frequency of the genes. For example, a gene found significant in four studies and not significant in 16 studies will be favored over a gene found significant in three studies but absent in the other 17 studies. METRADISC accounts for this by matching each gene to the null distribution of genes that have the same absent/present patterns. Although the test statistic for Fisher's method is based on an unstandardized sum, it can address this problem by comparing it to a chi-square distribution where the degrees of freedom are determined by the number of studies, or by permutation. The inverse-variance technique addresses this problem directly as it calculates a weighted average of the effect sizes.
Fourth, what is the ability of each technique to rank the genes, especially if only a small number of studies, say three to five, are available? A ranked list can help researchers to prioritize genes for further testing and validation. The vote counting technique produces coarse results with many ties (a gene can only receive a whole number of votes), while other techniques produce results on a much finer scale.
Fifth, what is the computational complexity involved for each technique once the datasets have been prepared and annotated? The computing time for meta-analyzing the prepared and annotated GEDM for the 25 datasets in the illustrative example that follows, using vote counting, Fisher's method, and the inverse-variance technique, is approximately two minutes, two minutes, and eight minutes, respectively. We used R version 2.5.1 [85] on a Windows-based personal computer with a 1.86 GHz Intel Pentium M processor and 1 GB of RAM. Further, any technique that uses PGLs has to extract the information and annotation in a standardized format. The question of computational complexity becomes important, especially when one wants to estimate the null distribution using permutation techniques.
We believe that combining the effect sizes using an inverse-variance model is the most comprehensive approach for meta-analysis of two-class gene expression microarrays. In addition to the characteristics discussed above, this method has several other decisive advantages. First, it yields a biologically interpretable discrimination measure: the pooled effect size of differential expression and its standard error. Second, it is the only technique that weights the contribution of each study by its precision, which is related to the study sample size. Third, one is able to use a forest plot [86] to visually investigate the contributions of individual studies and the amount of heterogeneity across datasets. The use of effect size, a unitless measure not dependent on sample size, facilitates the combining of signals from one-color and expression ratios from two-color technology platforms.

Illustrative Example: Differential Gene Expression in Cancer Tissues
We demonstrate one exemplary meta-analysis using a subset of an ongoing meta-analysis where we look at the differences between cancerous tissues relative to normal tissues across various cancer types. This example stops short of discussing the biological significance of the findings, which is beyond the scope of this article.
We concisely describe the meta-analysis protocol in Table 3, using the same ordering as in Table 1. Figure 2 shows the data acquisition process, and Table 4 lists the characteristics of the 21 studies included. Arrays from the Affymetrix-based studies were preprocessed using the robust multichip average [108], and arrays from two-color technology were LOESS (local regression) normalized [109,110]. All analysis (unless stated otherwise) was carried out in R version 2.5.1 [85] and BioConductor release 2.0 [53]. The R code is available upon request.
We chose to combine the effect sizes using the inverse-variance model for the reasons described previously. Note that there are two variants of the inverse-variance technique. The random effects model differs from the fixed effects model in that it incorporates the between-study heterogeneity into the study weights. We use the random effects model in Step 24, where we can expect significant between-study heterogeneity since the studies combined are both biologically (e.g., different tumors) and technically diverse (e.g., different platforms, laboratories). We used the fixed effects model in Step 22 to summarize probes within a study, as we can expect a reasonable level of homogeneity within a study.
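The difference between the two variants can be made concrete with a short sketch. The text does not name the estimator used for the between-study variance; the DerSimonian-Laird moment estimator below is a widely used choice and is an assumption of this example, as are the function name and the toy numbers.

```python
import math


def random_effects_pool(effects, variances):
    """Random effects inverse-variance pooling for one gene.

    Uses the DerSimonian-Laird estimate of the between-study variance
    tau^2 (assumed estimator). Requires at least two studies.
    Returns (pooled effect, standard error, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q measures dispersion of the effects around the fixed estimate
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Between-study variance is added to every study's own variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Two precise but discordant studies: heterogeneity widens the standard error
pooled, se, tau2 = random_effects_pool([0.0, 1.0], [0.01, 0.01])
```

When the estimated tau^2 is zero the weights reduce to the fixed effects weights, so the two models agree for homogeneous studies; here the fixed effects standard error would be about 0.07, whereas the random effects standard error is 0.5.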
The pooled effect size and its 95% confidence interval for all 16,803 genes can be visualized simultaneously as in Figure 3.
The z-statistic (ratio of the pooled effect size to its standard error) for every UniGene ID was compared to a standard normal distribution to obtain the p-value and adjusted for false discovery rate (FDR) [111] (Step 25). Table 5 shows the output from the inverse-variance technique for the top five statistically significant up-regulated and down-regulated genes.

Table 3. Outline of the Illustrative Example of Meta-Analysis

Step     Action
1        Objective: To identify genes that are consistently up- or down-regulated in cancers globally.
2        Inclusion-exclusion criteria: Any human studies investigating at least 7 patients with primary cancer and at least 7 patients with corresponding normal samples using high-density arrays. Any patients with metastatic tumors, cell lines, or benign tumors, or studies using specialized arrays, were excluded.
3-10     Data identification and acquisition: See Figure 2 for data retrieval of 21 studies. Ramaswamy et al. [104] had 5 sets of cancer-normal tissues that satisfied our criteria, and thus we have 25 datasets (see Table 4).
11       Array quality check: Not performed as this information was not available for all retrieved studies.
12       Preprocessing FLEO files: Arrays from the Affymetrix platform were RMA-preprocessed [108], and arrays from two-color technology were LOESS-normalized; the expression values are on log base 2 scale.
13 & 14  Batch effect and spot quality check: Not performed as this information was not available for all studies.
15       Aggregate technical replicates: Only Bhattacharjee et al. (2001) [90] had technical replicates, which we averaged using a simple mean.
16       Compatibility check: Not performed.
17 & 18  Annotation matching: For the Affymetrix studies, we mapped the probe-sets to UniGene using the annotation packages in BioConductor 1.8.0 (built on March 26, 2006). For the two-color technology arrays, we mapped the clone IDs, and if not available the GenBank Accession number, to UniGene using the web tool SOURCE [70] in March 2006.
19       Discard non-identifiable probes:
20       Calculate study-specific estimates: For every probe and for every study, we calculated effect size as the Hedges' adjusted g.
21       Expanding probes with multiple UniGene IDs: As described in text of Issue 5.
22       Summarizing multiple probes per UniGene ID within a study:
23       Discard poorly represented probes: The probes map to 28,365 unique UniGene IDs, but we restricted the analysis to the 16,803 that were identified in at least 5 of the 25 sets.
24       Combine study-specific estimates: its standard error.
25       The z-statistic (ratio of the pooled effect size to its standard error) for every UniGene ID was compared to a standard normal distribution to obtain the nominal p-value

In total, 21 studies (6 + 3 + 8 + 4; see Figure 2) are included in the meta-analysis. The characteristics of the included studies are given in Table 4.
At an FDR of 1%, we found 168 significantly down-regulated and 325 significantly up-regulated genes. At this rate, we should expect 1% of each list of significant genes, in this case 1.68 and 3.25 genes respectively, to be false positives.
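Step 25 can be sketched as follows: convert each gene's z-statistic to a p-value under the standard normal distribution, then adjust for the false discovery rate. Two assumptions are made here that the text does not confirm: the p-value is taken as two-sided, and the FDR adjustment is the Benjamini-Hochberg step-up procedure. Function names and the z-statistics are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z-statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Pooled z-statistics for five hypothetical UniGene IDs
z_stats = [4.2, -3.1, 0.4, 2.0, -5.0]
nominal = [two_sided_p(z) for z in z_stats]
fdr = benjamini_hochberg(nominal)
significant = [i for i, q in enumerate(fdr) if q <= 0.01]
```

The sign of the z-statistic can then be used to split the significant genes into up-regulated and down-regulated lists; under an FDR threshold of 1%, roughly 1% of each list is expected to consist of false positives, as in the 1.68 and 3.25 figures quoted above.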
After having identified the genes of most interest, we can proceed as in a traditional meta-analysis and visualize the contribution of individual studies using forest plots (Step 27). Figure 4 shows the forest plot for the most significantly up-regulated (Hs.478481) and down-regulated (Hs.117835) genes.
We can also proceed as in a typical single-study analysis. For example, using significant genes identified from the meta-analysis, we can use computational tools such as pathway enrichment (Step 28), conduct a literature search, and/or validate them on an alternative technology or on different patient sets (Step 29).
In this illustrative example of a meta-analysis, we have shown how the inverse-variance technique can identify consistently up- or down-regulated genes, information that suggests further lines of investigation.
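As a rough illustration of the inverse-variance combination and the z-statistic used in this example, the sketch below pools study-specific effect sizes for one gene. It assumes a fixed-effect model for simplicity (a random-effects variant would add a between-study variance term to each weight), and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist


def pool_inverse_variance(estimates):
    """Pool (effect_size, variance) pairs for one gene across studies.

    Fixed-effect sketch: each study is weighted by 1 / variance.
    Returns the pooled effect, its standard error, the z-statistic,
    the two-sided nominal p-value, and a 95% confidence interval.
    """
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    pooled = sum(w * g for w, (g, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    z = pooled / se
    # Compare z to a standard normal distribution (two-sided test).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    return pooled, se, z, p, ci


# Hypothetical (effect size, variance) pairs from three studies.
pooled, se, z, p, ci = pool_inverse_variance([(0.9, 0.20), (1.4, 0.35), (0.6, 0.10)])
print(f"pooled = {pooled:.3f}, SE = {se:.3f}, z = {z:.2f}, p = {p:.4g}")
```

The per-study estimates and the pooled confidence interval are the quantities a forest plot displays.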

Discussion
Meta-analysis of microarray datasets shares many features with meta-analysis in other areas of health care research. Perhaps the main differences are the large numbers of variables involved and technical complexities of integrating data across multiple platforms. Furthermore, most microarray studies are not prospectively planned and often do not have detailed protocols, but rather tend to make use of existing samples. Table 6 gives an overview of the advantages and disadvantages of various aspects of meta-analysis of microarray datasets. We discuss some of these points below.
Working with FLEO files allows for better standardization of information and the incorporation of data from unpublished studies, but it also requires significant effort to acquire and manage the datasets due to increased data complexity. This is further hampered by data sharing issues ([112-115] and Ramasamy et al., unpublished data).
Sample matching between "cases" and "controls" may be a problem in meta-analysis as much as in single studies. Leaving aside the choice of biological equivalency of cases and controls, the numerical problem is highlighted by the imbalance of samples between the two groups in the illustrative example (see Table 4). For example, while the proportion of normal to total biological samples in prostate and lung cancer (the two tissues with the greatest number of biological samples in the illustrative example) is far less than half, the proportions do vary (105 out of 452 or 23.2% in prostate cancer versus 60 out of 356 or 16.9% in lung cancer).
Another major concern associated with meta-analysis in many clinical and epidemiological studies is publication bias, which is a consequence of selectively publishing statistically significant and favorable results [116,117]. On the surface, we do not expect to find publication bias at the gene level in a given study because of the discovery-driven and high-density nature of microarrays. However, anecdotal evidence based on sales figures (J. P. Ioannidis, personal communication) suggests that data from only 10% of all the Affymetrix chips sold are published. The possibility of publication bias in microarray research needs further investigation. Furthermore, within a single-study microarray analysis, the particular choice of downstream analysis may lead to different results depending on the objective of the study [118,119]. It is unclear to what extent this problem affects meta-analysis of microarrays, even with coherently preprocessed datasets.
Finally, the sensitivity of the results from a meta-analysis, as with any other research study, should be tested before a final conclusion is reached (Step 26). We did not present any sensitivity analysis for the illustrative example presented here, but there are several possibilities. First, we could investigate the sensitivity of the results to the choices we made here (e.g., using probes present in at least five studies). Second, we could test whether any particular study is particularly influential by repeating the meta-analysis without each study in turn and comparing the change. Third, we could test whether including studies that provide only the GEDM in the meta-analysis, along with the studies that provide FLEO data, changes the results.

Table 6. Advantages and disadvantages of various aspects of meta-analysis of microarray datasets

Advantages: Financially inexpensive as it uses existing studies.
Disadvantages: Study selection issues such as study diversity and data quality need to be addressed. Time and effort to acquire and manage data can be significant. Potential publication bias might limit meta-analysis.

IPD versus PGLs
Advantages: IPD avoids selective reporting of genes. IPD permits re-analysis of individual studies for reproducibility or to carry out other analyses.
Disadvantages: Harder to acquire IPD compared to gene lists.

FLEO data versus GEDM
Advantages: FLEO data allows us to standardize preprocessing algorithms and analysis methods.
Disadvantages: More time and effort required to acquire and manage FLEO relative to GEDM. In practice, some researchers may withhold access to FLEO and thereby introduce a possible bias.

Inverse-variance technique versus other techniques for combining data
Advantages: Interpretable results with standard error to construct confidence intervals. Good ability to rank results when applied to a small number of studies.
Disadvantages: A large study in the collection may influence the overall results of the meta-analysis. Treats rarely studied and frequently studied genes equally. Ignores correlation between genes (as do most techniques). IPD may not be available for all studies.

In this paper, we have formulated and explored key issues encountered in conducting a meta-analysis of microarray datasets. We considered the available solutions and made some practical recommendations. First, we showed how to obtain suitable datasets by searching the published literature and public microarray repositories. Second, we proposed that using FLEO files allows for better standardization of information. Third, we outlined the issues involved in preparing datasets from multiple platforms. Fourth, we discussed how to match the different datasets using gene-level identifiers. Fifth, we explained how to resolve the problems caused by the many-to-many relationship between the probes and genes by "expanding" probes with multiple GeneIDs and then "summarizing" the multiple probes that correspond to a GeneID within a study. Sixth, we argued that the inverse-variance technique, initially proposed in the microarray context by Choi et al. [29], has many desirable properties over other techniques used for two-class comparison of gene expression microarray studies. Finally, we presented an illustrative meta-analysis of 25 datasets to briefly demonstrate how to present, analyze, and interpret a meta-analysis of microarray datasets. All of this information is summarized in a practical checklist, shown in Table 1.
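The leave-one-study-out sensitivity check suggested for Step 26 could be sketched as follows. This is only an illustration under stated assumptions: it uses a fixed-effect inverse-variance pooled estimate for a single gene, it needs at least two studies, and the study names and values are hypothetical.

```python
def pooled_effect(estimates):
    """Fixed-effect inverse-variance pooled effect for (g, variance) pairs."""
    weights = [1.0 / var for _, var in estimates]
    return sum(w * g for w, (g, _) in zip(weights, estimates)) / sum(weights)


def leave_one_out(studies):
    """Map each study name to the shift in the pooled effect when it is dropped.

    `studies` maps a study name to its (effect_size, variance) pair for
    one gene. A large absolute shift flags an influential study.
    """
    full = pooled_effect(list(studies.values()))
    shifts = {}
    for name in studies:
        rest = [pair for other, pair in studies.items() if other != name]
        shifts[name] = pooled_effect(rest) - full
    return shifts


# Hypothetical per-study estimates for one gene.
example = {"study_A": (1.0, 1.0), "study_B": (1.0, 1.0), "study_C": (4.0, 1.0)}
for name, shift in leave_one_out(example).items():
    print(f"without {name}: pooled effect changes by {shift:+.2f}")
```

In practice the same loop would be run over every gene, and the resulting ranked gene lists compared, rather than inspecting a single pooled effect.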