Non-Gaussian Distributions Affect Identification of Expression Patterns, Functional Annotation, and Prospective Classification in Human Cancer Genomes

Introduction Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. Methods We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. Results Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. Conclusions Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that “small” departures from normality in the expression data distributions are analytically-insignificant and that “robust” gene-calling algorithms can fully compensate for these effects.


Background
Microarray-based assays of gene expression have become a mainstay of basic and translational cancer research. A significant number of modern investigations rely on these tools to inform hypothesis generation [1], for pathway analysis [2,3], for pharmacogenomics and drug discovery [4], and for developing molecular-based disease classification strategies [5,6]. Additionally, gene expression data are becoming progressively more important for informing clinical diagnosis and patient management [7,8], and microarray-based genomic profiles are now being used to guide patient enrollment and stratification in large-scale clinical trials [9,10].
Against this backdrop, the importance of accurate interpretation of microarray results and the significant consequences of systematic analytic errors becomes apparent. In the early days of microarray analysis, high experimental costs and significant technical variability limited the available information with which comprehensive analyses of the practical effects of subtle biases in microarray data or in its interpretation could be studied [11]. This, in turn, necessitated that certain mathematical and biological assumptions be made [12,13], and the lack of adequate data precluded in-depth investigation of the validity of these assumptions.

The Assumption of Normality in Two Related Types of Expression Datasets
One common assumption is that data from microarray-based genome expression analyses conform to a standard Gaussian (normal) distribution. This assumption is rarely explicit but rather is most commonly made implicitly when investigators apply analytic algorithms predicated upon the Gaussian assumption. Distribution-related assumptions are relevant to at least two, distinct sets of expression data generated in microarray analyses, and the normality assumption has been variably (often implicitly) applied to both [12][13][14][15].
The first dataset to which distribution is relevant comprises the complete set of individual expression values across all genes and all samples in a given experiment. For example, in a study examining the expression of 25,000 genes in 100 tumors, this is the set of all 2.5 million gene expression values. The distribution of this composite dataset may be particularly relevant to downstream clustering and class discrimination analyses, as many of these algorithms are typically applied to the entire dataset as a whole. When algorithms predicated upon a standard Gaussian distribution are used, the normal assumption is implicitly introduced.
The second dataset to which distribution is relevant is the dataset comprising the individual expression values for a single gene across the entire range of experimental samples. Continuing the previous example, this experiment would generate 25,000 such datasets, each with 100 data points. The distribution of these 100 data points may be particularly relevant to studies that examine the consistency of behavior of a specific gene in a specific tumor type or analyze the pattern of its change across a range of ''classes'' or ''grades'' of a specific tumor. Here the distribution may provide a useful description of the behavior of this single gene across multiple independent samples, but the normal assumption may be implicitly introduced if algorithms used to analyze the behavior of this gene are predicated upon a standard Gaussian distribution.

Significance
The possibility that gene expression data violate the normality assumption may be of considerable significance to clinical and translational investigators. Most current and proposed medical applications of microarray expression data are derived from analyses predicated upon this assumption, many of which have relied upon parametric statistics for gene calling and class discovery [6][7][8]. Translational oncologists are among the most avid consumers of microarray data and the most likely to propose its clinical application, so a logical place to begin an investigation of the magnitude, extent, and clinical implications of non-Gaussian distributions in gene expression data is with large, publicly-available cancer genome databases [22,23]. Notwithstanding, this issue is fundamental to the current analytic paradigm for gene expression data in general, and we anticipate that the findings of this investigation will have significance beyond the sphere of translational molecular oncology.
The present investigation has two objectives and has been structured in two parts: the first is theoretical -to study the distributions of cancer gene expression data -both at the individual gene and at the complete dataset level -and to assess the extent to which these deviate from normality. This provides the foundation for the second, translational objective: to study the implications of non-Gaussian gene expression distributions on clinically-oriented genomic analyses. The experimental model has been deliberately designed to recapitulate faithfully the workflow of a typical, translational pipeline for gene expression analysis ( Figure 1).

Distribution Analysis -Complete Datasets
We first examined the distributions of the complete set of individual expression values across all genes and all samples in each of five experiments (the first type of data set described in the introduction). Table 1 summarizes the results of the central moments analysis of five, large-scale (n = 180, each) human cancer genomes, which was performed after normalization with either the robust multichip average (RMA) [24] or the DChip [25] methods. These data demonstrate that, while the means and standard deviations suggest approximate normality (m range: 20.18-0.10; s range: 0.84-1.58), the third and fourth central moments depart from normality in a statistically-significant manner. Fisher's indices of skewness and kurtosis, which are considered significant at a,0.05 when they exceed 61.96, are .100 for all samples. Additionally, the F-test of the variance demonstrates statisticallysignificant departures from normality for all samples (Tables 1, S1). All five cancer gene expression distributions therefore depart significantly from the normal distribution. This is further supported by the results of the one-way and two-way KS tests, which demonstrate significant departures from normality for all of the datasets. Moreover, the findings of the central moments analysis suggest that these distributions have slight but significant skewness, are markedly kurtotic, and are heavy-tailed ( Figure 2). Similar results from data normalized using both the RMA [24] and the DChip method [25] suggest that this departure from normality is unlikely to be a function of the normalization algorithm, and analysis of both Log 2 -transformed and Log 2subtracted data suggests that it is not related to Log subtraction (Tables 1, S1; Figures S1, S2).
These findings are not necessarily surprising, as neither of the normalization methods nor the process of log-transformation are specifically intended to produce normality; however, this analysis demonstrates using multiple expression datasets that none of these transformations are sufficient to produce Gaussian data. Accordingly, it cannot be safely assumed that data that have been ''normalized'' using any of these methods actually conform to a ''normal'' (standard Gaussian) distribution.

Distribution Analysis -Individual Genes
We also examined the data distributions of individual genes across the 180 samples of each of the 5 cancer data sets. Many investigators examining data from an experiment containing microarrays of multiple, similar tumors may assume that an ''overexpressed'' gene would exhibit a Gaussian distribution centered around a positive mean value, an ''underexpressed'' gene will have a similar distribution around a negative value, and a gene whose expression is unchanged will have a Gaussian distribution centered around zero. Our analysis, however, demonstrates that variable degrees of skewness and kurtosis as well as marked deviations from unity among the standard deviations are characteristic of the expression distributions for individual genes. Table 2 summarizes the results of this analysis, and Figure 3 gives an illustrative example of this effect by plotting the distributions selected genes from the brain tumor (glioblastoma) data set.

Curve Fitting
Empiric curve fitting was used to further investigate the actual morphology of the cancer gene expression distributions (Table 3;  Figures 4, S3, S4, S5, S6). This analysis suggests that complex, multi-parameter distributions are required to more accurately model the expression data distributions. In general, the best-fit distributions were those that are parameterized to model skewness, kurtosis, and heavy tails. These include multi-parameter distributions related to the b-prime (Pearson VI, capable of modeling skewness) (e.g. Log-logistic, Dagum, Burr), kurtotic distributions (e.g. hyperbolic-secant), and the versatile, 4-parameter Johnson SU [26].   While these distributions fit the data more accurately than the normal distribution, KS testing indicates that they are imperfect fits (Table 3). Moreover, there is no single distribution that is clearly superior for modeling all sets of expression data. Overall, this analysis confirms the significant departures from normality associated with the cancer genome expression data and demonstrates the complex nature of the underlying expression distributions.

Gene Calling & Functional Annotation
Up to this point the analysis has been focused on investigating the actual distributions of gene expression datasets and comparing these to a theoretical, normal distribution. This analysis has demonstrated that human cancer gene expression data is not normally-distributed, either on the experiment or on the singlegene level. An appropriate next question would be whether these deviations from normality affect commonly-performed gene expression analytics, including molecular classification, gene calling, and functional annotation.
To investigate this question, we performed an analysis of a gene expression dataset from 23 low-grade gliomas (LGG), including a unique subset of eleven tumors with intact chromosomes 1p and 19q (arbitrarily designated Class 1) and another subset of eight oligodendrogliomas with chromosome 1p/19q codeletions [5,27] (arbitrarily designated Class 2), was used to study the effects of the data distribution on identification genes that are differentiallyexpressed between known tumor subsets. This was accomplished by applying a uniform transform (Box-Cox [28]) to the expression dataset to improve the normality of the data distribution and then comparing the results of gene calling algorithms applied to the parent and transformed datasets ( Figure 5). In this way only the shape of the distribution has changed, and the null hypothesis is that this transformation should have no effect on gene calling if the methods are sufficiently ''robust'' to distribution morphology or are truly ''distribution-independent.'' The two-sided student's t-test with a standard Bonferroni correction (p,0.01), identified 50 differentially-expressed genes between Class 1 and Class 2 using the parent distribution and 55 using the transformed distribution (9.1% difference). Forty-nine (49) of 56 total differentially-expressed genes were common to both lists (87.5%), while 7 were uniquely identified in only one of the two lists (12.5%) (Tables 4A, S3).
Even with the stringent Bonferroni correction, the t-test is a parametric test that makes assumptions regarding the shape of the underlying distribution. To eliminate this effect, we applied two, nonparametric methods for gene calling. A two-class, unpaired significance analysis of microarrays (SAM) [29] identified 759 differentially-expressed genes in the parent and 478 in the transformed distribution (37.2% difference). Of 760 total genes, 477 (62.8%) were common to both lists while 283 (37.2%) were unique to only one of the two lists (Tables 4A, S4). A two-class, unpaired Kruskal-Wallis (KW) test identified 1,801 differentiallyexpressed genes in the parent distribution and 1800 in the transformed distribution. There was 99.9% overlap in these gene lists (Tables 4A, S5).
An alternate strategy for gene calling uses linear modeling for microarrays (LIMMA) [30] a Bayesian approach to linear modeling to calculate a moderated t-test. While this method assumes normality of the underlying data, it is viewed by many to be superior to standard and corrected t-tests and is considered robust to a variety of confounding mathematical and statistical effects [31]. LIMMA identified 2,866 differentially-expressed Figure 3. Single-Gene Expression Distributions are not Gaussian. These graphs illustrate the wide range of potential skewness (A) and kurtosis (B) that exist in the expression distributions of individual genes comprising the cancer expression datasets. This refutes the assumption that the expression data for individual genes follow an approximately Gaussian distribution around the gene's mean expression level. Data for these graphs was taken from the log 2 -subtracted, RMA-normalized glioblastoma expression data. For the skewness comparison, five genes with comparable means, standard deviations, and kurtosis were selected from subsets of genes representing approximately the 10 th , 25 th , 50 th , 75 th and 90 th percentiles for per-gene skewness contained in the dataset. Similarly, for the kurtosis comparison, five genes with comparable means, standard deviations, and skewness were selected from subsets of genes representing approximately the 10 th , 25 th , 50 th , 75 th and 90 th percentiles for per-gene kurtosis contained in the dataset. The identities of the genes are not germane for comparative purposes. doi:10.1371/journal.pone.0046935.g003 genes in the parent and 2,981 in the transformed distribution. Of 3,047 total genes, 2,710 (88.9%) were common to both lists while 337 (11.1%) were unique to only one of the two lists (Tables 4A,  S6). The effects of the distribution on functional annotation were studied first by using DAVID [32,33] to annotate for gene ontology (GO) [34,35] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [36] terms in the gene lists previously generated by the SAM and KW analyses and then by performing a statistical enrichment analysis for the annotated terms. This identified 46 unique terms in the SAM lists, with 60.9% overlap between the enriched terms in the parent and transformed lists. Conversely, analysis of the lists generated by the KW analysis identified 49 enriched terms, all of which were identical in the lists from the parent and transformed datasets (100.0% overlap) (Tables 4B, S7, S8). Distribution fitting for the brain cancer dataset for RMA (top) and DChip (bottom) normalized data. The three bestfit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis better characterize the data than the standard Gaussian (normal) distribution. Similar graphs for additional tumor types are given in figures S2, S3, S4, S5. doi:10.1371/journal.pone.0046935.g004 Classification Gene expression data are frequently used as the basis for attempts at molecular-based subclassification of tumors with similar histology but different clinical phenotypes. We exploited the a priori knowledge [5] of two such groups within the low-grade glioma dataset (Class 1 and Class 2) to simulate the classification process and to study the relationship of the results to the shape of the underlying data distribution. Discriminant analysis (DA) and knearest neighbors (KNN) classifiers were trained on a subset of the tumors with representatives from each class and were then used to classify ten, novel tumors into one of the two classes. Identical analyses were performed on data from the parent and transformed distributions. The results of these analyses demonstrate a 20% difference in class assignment (2/10 samples) for the DA and 30% (3/10 samples) for the KNN classifier when used with the parent data but identical classifications for both models when used with the transformed dataset ( Figure 6). This effect is independent of the initial method of data reduction (SAM or t-test) ( Figure S7).

Gene Expression Data are not Normally-Distributed
The distribution of gene expression data is typically assumed to conform to a standard Gaussian (normal) distribution [11,17]. This assumption may be attributable to a combination of three factors. First, this behavior may be (arguably) predicted by the central limit theorem [16]. Second, basic analyses of gene expression datasets, which generally include calculations of the mean and standard deviation as well as visual inspection of the data distribution, usually reveal bell-shaped curves with means (m) centered near zero and standard deviations (s) approximately equal to one. Third, in the early days of gene expression analysis when these assumptions were codified, datasets were small and observed differences from these theoretical values may not have achieved statistical significance.
The modern era of expression analysis, characterized by decreased cost and increased sample availability, now affords the luxury of working with datasets that include several times more samples and exponentially-more features than those of the past. These datasets, like the ones examined herein, allow more precise analysis of the distributions of expression data. In this analysis we have gone beyond calculating m and s (which, in fact, appear at first glance to be consistent with normality in these data) and have performed a comprehensive analysis of the higher-order central moments for these distributions. This analysis exploits the availability of nearly 10 8 features per dataset to allow statistical significance assessments of seemingly-minor deviations from normality. In so doing, it reveals that these deviations achieve a high degree of statistical significance for all of the first four central moments. This provides convincing evidence that these cancer gene expression data do not conform to a standard Gaussian distribution ( Figure 2, Table 1) and that categorical assumptions of normality for these types of datasets may be invalid.

Gene Expression Data Exhibits Complex Distribution Characteristics
Empiric curve fitting identifies, in an unbiased fashion, distributions that most accurately model the observed distributions of the expression data. Analysis of the empirically-fit distributions provides additional information regarding data distribution and can be used to draw general conclusions regarding the types of downstream analyses that may be applicable to these datasets. This analysis demonstrates that the expression distributions are not well modeled by simplified, two-parameter distributions (such as the normal distribution) but instead require distributions with multiple (3-4) shape parameters to model the data accurately. Several derivatives of the b-prime distribution (e.g. Log-logistic, Dagum, Burr [37,38]) were empirically identified as useful models for this data. This is logical given that the b-prime is related to the Pearson type VI distribution, which is one of a family of distributions originally used to model skewed data [38]. The hyperbolic secant distribution was also commonly identified among these empiric models. This is a more straightforward, 2parameter distribution with an exaggerated kurtosis [39], and its identification as a useful model for these data underscores the kurtotic nature of the datasets. Finally, the 4-parameter Johnson SU [26] is a versatile distribution to model skewed and kurtotic data. Together the Johnson family of distributions covers the entire skewness-kurtosis spectrum, and the SU distribution is particularly useful with logarithmic data [38]. In aggregate, the identification of these particular families (b-prime/Pearson, hyperbolic-secant, Johnson) highlights the skewness and kurtosis of these datasets and emphasizes the inadequacy of the normal distribution to model accurately cancer gene expression data.
The goal was to use the process of distribution fitting to learn as much as possible about the underlying data structure of the cancer transcriptome, not to identify a single, ''best-fit'' distribution for cancer gene expression data. In fact, the KS analysis (Table 3) demonstrates that none of the 57 distributions (Table S2) against which these data were tested provided an ideal model for the underlying data. It remains unclear if a single distribution can describe the cancer transcriptome faithfully, and it is likely that no two cancer gene expression datasets will have the same, ''best-fit'' distribution. We hypothesize that the complex shape of the aggregate distributions may reflect their composition of various, unique distributions of the component genes. Further investigating this mixture-model hypothesis and its implications for gene calling is outside the scope of this report but merits further investigation.
Notwithstanding, identifying such a theoretical model for the aggregate distribution is not necessarily required to conduct highquality analysis of expression data. Instead, investigators who work with gene expression data may wish to perform similar analyses to those described in order to understand the nature of the distribution of their unique datasets. This will then allow them to verify that their downstream analyses are not confounded by inaccurate assumptions regarding the shape of the data distributions.

Non-Gaussian Distributions Affect Gene Calling and Functional Annotation
Having demonstrated that cancer gene expression data are not normally-distributed, a critical question is the degree to which these deviations from normality affect downstream, translational analyses. Considerable effort in translational oncology has been applied to identifying unique, genotypic subsets of tumors with clinically-significant phenotypic correlations, so we focused our Figure 5. Distribution Transformation. A Box-Cox transformation applied to the low-grade glioma dataset (left) results in a distribution that more closely approximates a normal distribution (right). Note that the parent distribution was recentered to a zero mean to compensate for the default mean of the Robust Multichip Normalization output of 7. This transformed distribution was then used to analyze distribution-dependent effects on identification of differentially-expressed genes, functional annotation, and prospective molecular classification. doi:10.1371/journal.pone.0046935.g005 Table 4. Gene Calling and Functional Annotation. analysis of the analytic effects of non-Gaussian distributions in this domain.
One common goal of translational investigation is to identify a set of genes with differential expression between two, known or suspected tumor subsets. We investigated this question by applying a normal-transformation to the LGG dataset, using three different algorithms to identify differentially-expressed genes between Class1 and Class 2 in both the parent and in the transformed data, and then performing a semi-quantitative analysis of the resulting gene lists.
The Bonferroni-corrected t-test identified 50 differentiallyexpressed genes in the parent and 55 in the transformed distribution and resulted in a distribution-dependent variability of 12.5% (see Text S1, for additional discussion of this calculation) ( Table 4A). The extent to which this variability reflects the parametric assumptions of the classifier is difficult to determine, because the stringency of the Bonferroni correction results in a small list of differentially-expressed genes. LIMMA [30], which is considered more robust than basic and corrected t-tests despite its fundamental assumption of normality, was also sensitive to changes in the underlying data distribution, with an 11.1% difference in gene calling noted between the parent and transformed distributions (Table 4A, S6). Conversely, the nonparametric KW test identified 1,801 differentially-expressed genes, of which 1,800 (99.9%) were common to both lists (Table 4A, S5). Although this simulation cannot definitively demonstrate the superiority of nonparametric methods in gene calling, the result suggests that the output of nonparametric methods for identification of differential expression may be less sensitive departures from normality in the underlying distribution than their parametric counterparts. This may be logical, given the nature of the assumptions (or the lack thereof) made by parametric and nonparametric algorithms (respectively), but additional, theoretical and applied investigations will provide clarity.
Nonparametric methods may not be sufficient, however, to completely offset the effect of non-Gaussian distributions on gene calling. SAM, a nonparametric algorithm [29], identified 759 differentially-expressed genes in the parent distribution and 478 in the transformed distribution, with a distribution-dependent variability of 37.2% (Table 4A, S4). Despite SAM's classification as a nonparametric test, these data suggest that the shape of the underlying distribution has a measurable effect on the result of its analyses. This degree of variability is noteworthy, particularly from an algorithm that is considered to be among the most robust of the nonparametric gene calling tools [29,40].
This simulation illustrates the potential effects of non-Gaussian distributions on gene calling. More importantly, it provides direct evidence against two common beliefs regarding gene expression analysis: that ''small'' departures from normality are insignificant to practical gene expression analyses and that ''robust'' gene calling algorithms can compensate completely for these effects.

Non-Gaussian Distributions Affect Molecular Classification
An important objective of translational oncology research is to identify patterns of gene expression that distinguish phenotypically-significant subclasses of malignant disease. A common strategy is to use gene expression profiles in conjunction with a priori knowledge of the clinical phenotype of interest (e.g. prognosis, response to therapy) to train molecular classifiers that are subsequently capable of making phenotypic predictions for novel tumors using gene expression data. This process is mathematicallycomplex, and we performed a series of simulations to analyze the extent to which these training and prospective classification strategies may be affected by non-Gaussian data distributions.
We used a representative subset of 13 randomly-selected LGGs as the training set for two ''robust'' molecular classifiers, the parametric DA classifier and the nonparametric KNN classifier. Each classification strategy was used in conjunction with both the parent and the transformed data distributions, allowing the effects of the shape of the distribution on the classification algorithm to be examined. The results of this analysis ( Figure 6) demonstrate a 20% difference in the resulting classification for DA and a 30% difference for KNN when using the parent versus the transformed data. These results suggest that the shape of the distribution has measurable effects on molecular classifiers that are considered ''robust'' and that both parametric and nonparametric classifiers may be sensitive to these effects.
It is important to note that the goal of this analysis was to perform a semi-quantitative analysis of the differences in the classifications attributable to the shape of the data distribution rather than to compare the classification accuracy. While both classifiers resulted in classifications that better fit our current disease models when used in conjunction with the transformed data, it is impossible to know if this truly represents an improvement in accuracy without knowing with certainty that our current disease models accurately reflect the true nature of the disease. At present the classification scheme applied in this analysis [5] remains only one potential model for subclassification of lowgrade gliomas, the ultimate accuracy of which remains to be definitively determined. Notwithstanding, the primary purpose of these simulations was to investigate the possibility of distributiondepended differences in classification (regardless of the ultimate accuracy), and we have illustrated that the potential for such differences exists for both parametric and nonparametric, prospective, molecular classifiers.

Conclusions
Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. They exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of these data affects identification of differentiallyexpressed genes, functional annotation, and prospective molecular classification. These effects are not fully corrected by multipletesting and other ''robust'' modifications to t-test strategies, including Bonferroni corrections or linear modeling (LIMMA). The effects may be mitigated in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis therefore provides direct evidence refuting two, common beliefs in translational cancer gene expression analysis: that, ''small,'' departures from normality in the expression data distributions are insignificant in practice and that ''robust'' gene calling algorithms can fully compensate for these effects.

Definitions
The term normal distribution will refer in this manuscript to the special instance of the Gaussian distribution with mean (m) equal to zero and standard deviation (s) equal to unity (also known as the ''standard normal'' or ''standard Gaussian'' distribution). The term first central moment will be taken to mean the first moment, which describes the mean of a distribution. All other references to higherorder central moments will be used literally.

Analytic Model
We examined the distributions of cancer gene expression data and investigated the translational implications of deviations from normality. To increase the practical applicability of these results, the experimental model has been structured to simulate the flow of expression data through a prototypic analytic pipeline, similar to those used in modern basic and translational gene expression investigations. An overview of the pipeline, the statistical methods used at each step, and the corresponding analyses of the data distributions and their implications are presented in Figure 1.

Data Acquisition
Five publicly-available gene expression datasets derived from five unique, primary human cancer types were analyzed, including brain (glioblastoma), breast (adenocarcinoma), colon (adenocarcinoma), gastric (adenocarcinoma), and ovarian (adenocarcinoma) cancers, each representing the largest available dataset meeting inclusion criteria (Table S9). These five datasets were downloaded from the Gene Expression Omnibus [23].
To control for potential effects of cross-platform and cross-assay variability, all included expression data were assayed using the AffymetrixH Human Genome U133A-Plus 2.0 array. Because the five datasets contained a variable number of arrays, sample sizerelated bias was controlled by setting the experimental sample size to 180 for each group (corresponding to the smallest of the 5 datasets). A total of 180 samples for each of the five cancer types were therefore selected at random from the parent datasets. These five sets of 180 samples (a total of 900 samples) comprised the experimental datasets. A sixth, smaller dataset containing expression data for 23 low-grade gliomas was also used for analyses of gene calling, functional annotation, and tumor classification. This dataset has been previously generated by our group and comprises at least two molecular tumor subclasses, thereby facilitating this analysis [5].
Similarly, gene expression data from normal corresponding tissues, assayed with the U133A-Plus 2.0 array, were analyzed and used as the normal controls in the Log-subtraction-based analyses.

Data Processing
To simulate typical analytic workflow of a translational molecular oncology investigation, raw expression data were normalized using the RMA method [24] with the RMA Express software package [41]. This method consists of three steps: background adjustment, quantile normalization, and summarization. To control for the possibility of bias introduced by the RMA method, the same source data was also independently normalized using the DChip algorithm [25], with default parameters, and the effects of these two normalization methods were compared (see Results). All normalized data were represented in Log 2 converted form.
Remaining consistent with a prototypic translational analysis workflow, we next calculated expression ratios for each feature (relative to normal expression) using Log-subtraction. The gene expression profiles for normal tissue corresponding to each tumor type were calculated by averaging three, comparable gene expression profiles from the appropriate normal tissue, and the average value was Log 2 -converted and then subtracted from the Log 2 -converted expression of the corresponding feature in each tumor sample. This is standard practice in many semi-quantitative gene expression analyses; see the Text S1 for further details.

Distribution Analysis
A mathematical analysis of the first four central moments was performed for each of the sample distributions. Differences between the observed values of the central moments and those of the theoretical normal distribution were calculated. Testing for statistical significance of these differences was accomplished by comparing the values from the expression data sets to those of a simulated normal distribution of 10 7 elements (m = 0.000, s = 1.001, skew = 20.025, kurtosis = 0.272, range = 29.54-5.36). Means were compared using both parametric and nonparametric tests (student's t-test and KW), variances were compared using the F-test, and skew and kurtosis were compared using standard error and Fisher's coefficients of skewness and kurtosis [42]. Additionally, both the one-sample and two-sample Kolmogorov-Smirnov (KS) test [43] (a = 0.05) were used to compare the sample distributions to the theoretical or simulated normal distribution, and the KS statistic was tested for significance using the p-value of the KS calculation (Figures 2, S1, S2; Tables 1, S1).

Empiric Curve Fitting
Curve fitting was performed on all ten Log 2 -subtracted datasets (five normalized with RMA and the identical five normalized with D-Chip) by first constructing expression density histograms for each of the datasets and then by empiric fitting and parameter estimation for each of 57 potential distributions (Table S2). Optimal bin widths were calculated using the Freedman-Diaconis method [44]. Goodness-of-fit was assessed using the KS test (a = 0.05) and by examining the probability-probability (PP) plot of the empirical versus the theoretical cumulative density functions (CDF) and the quantile-quantile (QQ) plot of the observed versus expected distribution quantiles. The three best-fit distributions for each dataset were primarily determined based upon rank of the KS coefficient. When this index was not adequately representative or when approximate equivalence existed, the best-fit PP and QQ plots were used as tie-breakers (Figures 4, S3, S4, S5, S6; Table 3).

Gene Calling and Functional Annotation Analysis
A semi-quantitative analysis of the effects of the non-Gaussian distributions of the dataset on gene calling and sub-classification within a single tumor type was performed using previouslypublished gene expression data from a set of low-grade gliomas [5]. This dataset comprises 23 LGGs, and previous investigations have suggested the presence of at least two distinct, molecular subgroups within this dataset, corresponding to a set of chromosome 1p/19q codeleted oligodendrogliomas and a set of 1p/19q intact low-grade gliomas [5].
A central moments analysis was conducted (as above) on this dataset, again suggesting that it was not normally-distributed (data not shown). A systematic, Box-Cox transformation [28] was then applied to the Log 2 -subtracted dataset in order to better fit the data to a normal distribution, and a central moments analysis was again performed to verify that the transform had significantly improved the normality of the distribution ( Figure 5).
Next, a series of identical classification and gene calling procedures was performed on the pre-and post-transformation datasets and the results were compared to estimate, in a semiquantitative fashion, the effects of the shape of the underlying distribution on identification of differentially-expressed genes, functional annotation, and molecular classification. We used the a priori definition of the two, genotypically-and phenotypicallydistinct tumor subsets contained within the LGG dataset to define two classes [5], arbitrarily designated as Class 1 for the intact gliomas and Class 2 for the codeleted oligodendrogliomas. We applied four different statistical tests to identify genes with statistically-significant differences between the two subsets: one nonparameric method, the two-class, unpaired student's t-test with a standard Bonferroni correction for multiple testing (p,0.01) [45]; two nonparametric methods, the two-class, unpaired SAM [40] test (with false discovery rate [FDR = 0]) and the two-class KW [46] test (with a Benjamini-Hochberg correction [FDR,0.05]); and a linear modeling strategy (LIMMA) [30], a = 0.05), a Bayesian approach based on a moderated t-test that assumes normality but is considered robust [31]. These tests were applied to both the parent (untransformed) dataset and the (Box-Cox)-transformed dataset, and the results were compared to identify potential differences in identification of differentiallyexpressed genes affected solely by the shape of the distribution. A summary of the results of this analysis is presented in Table 4, and the detailed results are given in Tables S3, S4, S5, S6. Functional annotation of the gene sets derived from one parametric method (SAM) and one nonparametric method (KW) for gene calling was performed using DAVID [32,33] to identify statistically-overrepresented (Bonferroni-corrected p,0.01) GO [34,35] and KEGG [36] terms in each of these lists. The functional annotation results were compared semi-quantitatively (Tables 4B, S7, S8).

Molecular Classification Analysis
The effect of the distribution on the accuracy of prospective, molecular classification algorithms was tested using a DA classifier and a KNN classifier. The DA [47] classifier is a multi-step algorithm that first applies an ANOVA analysis to select genes that should be near optimal for partitioning the unknown samples based upon permutations of gene expression in the training set. Next, a multivariate partial least squares method is used for gene dimensional reduction. This is followed by a polychotomous discriminant analysis. The KNN model [48] was applied using a 2class, 4-neighbors model. Both strategies are considered robust tools for classification.
For both approaches, the Bonferroni-corrected t-test (p,0.01) was first used to eliminate genes whose expression did not differ significantly from control. Next, a training set consisting of 8 samples randomly selected from Class 1 and 5 from Class 2 (13 total) was used to train the classifier, which was then tested against a novel set known to contain 7 Class 1 and 3 Class 2 tumors (10 total). These classification strategies were applied to both the parent and to the transformed distributions, and the classification results were compared in a semi-quantitative fashion ( Figure 5). To eliminate potential bias associated with the initial, t-test-based gene selection, the process was repeated for the KNN clustering with SAM replacing the t-test in the analytic model ( Figure S7). normalized data. The three best-fit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis better characterize the data than the standard Gaussian (normal) distribution. (TIF) Figure S4 Distribution Fitting. Distribution fitting for the colon cancer dataset for RMA (top) and DChip (bottom) normalized data. The three best-fit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis better characterize the data than the standard Gaussian (normal) distribution. (TIF) Figure S5 Distribution Fitting. Distribution fitting for the gastric cancer dataset for RMA (top) and DChip (bottom) normalized data. The three best-fit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis better characterize the data than the standard Gaussian (normal) distribution. (TIF) Figure S6 Distribution Fitting. Distribution fitting for the ovarian cancer dataset for RMA (top) and DChip (bottom) normalized data. The three best-fit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis better characterize the data than the standard Gaussian (normal) distribution. (TIF) Figure S7 Distribution-Dependent Effects on Molecular Tumor Subclassification Based on Two Different Methods of Gene Calling. Two different gene calling methods, the student's t-test (top) and the Significance Analysis for Microarrays (SAM, bottom) were used to verify that the gene calling method used in conjunction with the K-Nearest Neighbors classifier (KNN). Both analyses use the parent (untransformed) low-grade glioma expression datasets. Class 1 represents low-grade astrocytomas, and Class 2 represents chromosome 1p/19q codeleted, low-grade oligodendrogliomas. The topmost color bars represent the known class of each sample (black boxes; red = Class 1, blue = Class 2). The area below the color bars is a portion of the gene expression profile (red = underexpressed, green = overexpressed). Both gene calling algorithms lead to identical results, indicating that this is not the primary reason for the observed differences in molecular classification (see Figure 6).