Non-Gaussian Distributions Affect Identification of Expression Patterns, Functional Annotation, and Prospective Classification in Human Cancer Genomes

doi:10.1371/journal.pone.0046935

Figure 1.

Overview of Analytic Workflow.

The flow diagram depicts a typical microarray analysis workflow (top section), the statistical methods used at each step (middle section), and the corresponding tables and figures in this manuscript that present analyses at each level (bottom section).

Figure 2.

Cancer gene expression datasets are not normally distributed.

The source data for these graphs are the log₂-subtracted datasets. All bin widths have been set to 200 to improve visualization. Red curves represent the best-fit normal distribution. The main image shows the histogram with the theoretical normal curve superimposed. The inset presents the quantile-quantile (QQ) plot, in which deviation from the line (y = x, black) illustrates deviation of the empirical distribution from the theoretical normal distribution. The left panel shows data normalized with the RMA method; the right panel shows data normalized with the DChip method. A: Brain; B: Breast; C: Colon; D: Gastric; E: Ovarian.
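
The QQ comparison described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the sample values are hypothetical, the normal curve is fitted by sample mean and standard deviation, and the plotting positions (i + 0.5) / n are an assumed convention.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical, empirical) quantile pairs for a normal QQ plot."""
    xs = sorted(values)
    n = len(xs)
    # Normal distribution fitted by the sample mean and standard deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i + 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, xs))

# Hypothetical right-skewed sample, for illustration only.
sample = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.8, 4.0]
points = normal_qq_points(sample)

# Largest vertical distance from the y = x reference line.
largest_gap = max(abs(empirical - theory) for theory, empirical in points)
```

Points that fall far from y = x, as the upper tail does for this skewed sample, indicate departure from the fitted normal distribution.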

Table 1.

Central Moments Analysis.

Figure 3.

Single-Gene Expression Distributions are not Gaussian.

These graphs illustrate the wide range of skewness (A) and kurtosis (B) that exists in the expression distributions of the individual genes comprising the cancer expression datasets. This refutes the assumption that the expression data for individual genes follow an approximately Gaussian distribution around the gene's mean expression level. Data for these graphs were taken from the log₂-subtracted, RMA-normalized glioblastoma expression data. For the skewness comparison, five genes with comparable means, standard deviations, and kurtosis were selected from subsets of genes representing approximately the 10th, 25th, 50th, 75th, and 90th percentiles for per-gene skewness in the dataset. Similarly, for the kurtosis comparison, five genes with comparable means, standard deviations, and skewness were selected from subsets of genes representing approximately the 10th, 25th, 50th, 75th, and 90th percentiles for per-gene kurtosis in the dataset. The identities of the genes are not germane for comparative purposes.
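
Per-gene skewness and kurtosis of the kind compared here can be computed from central moments. The sketch below assumes the simple population (biased) moment estimators and reports excess kurtosis (0 for a normal distribution); the caption does not state which estimators were used, and the gene names and values are hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment using the population (divide-by-n) convention."""
    m = fmean(values)
    return fmean([(v - m) ** k for v in values])

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical expression values for two genes across samples.
expression = {
    "gene_symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_right_skewed": [1.0, 1.1, 1.2, 1.3, 6.0],
}
per_gene = {
    gene: (skewness(vals), excess_kurtosis(vals))
    for gene, vals in expression.items()
}
```

Ranking genes by either statistic would then allow picking representatives near chosen percentiles, as the caption describes.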

Table 2.

Single Gene Distribution Variability.

Figure 4.

Distribution Fitting.

Distribution fitting for the brain cancer dataset for RMA (top) and DChip (bottom) normalized data. The three best-fit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis characterize the data better than the standard Gaussian (normal) distribution. Similar graphs for additional tumor types are given in Figures S2, S3, S4, and S5.

Table 3.

Empiric Distribution Fitting.

Figure 5.

Distribution Transformation.

A Box-Cox transformation applied to the low-grade glioma dataset (left) results in a distribution that more closely approximates a normal distribution (right). Note that the parent distribution was recentered to a zero mean to compensate for the default mean of 7 in the Robust Multichip Normalization output. This transformed distribution was then used to analyze distribution-dependent effects on identification of differentially expressed genes, functional annotation, and prospective molecular classification.
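
For reference, a Box-Cox transformation can be sketched as below. The caption does not give the transformation parameter or how it was chosen, so the grid search over the profile log-likelihood is an assumption, and the sample values are hypothetical. Box-Cox is defined only for strictly positive inputs.

```python
import math
from statistics import fmean

def box_cox(values, lam):
    """Box-Cox transform: ln(x) if lam == 0, else (x**lam - 1) / lam."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def profile_log_likelihood(values, lam):
    """Normal profile log-likelihood of lam, up to an additive constant."""
    y = box_cox(values, lam)
    m = fmean(y)
    variance = fmean([(t - m) ** 2 for t in y])
    n = len(values)
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + log_jacobian

def best_lambda(values, grid=None):
    """Pick lam from a grid (assumed range -2.0 to 2.0 in steps of 0.1)."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))

# Hypothetical sample that is exactly symmetric on the log scale,
# so the log transform (lam = 0) should be selected.
sample = [math.exp(z) for z in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
lam = best_lambda(sample)
transformed = box_cox(sample, lam)
```

Because the transform needs positive values, any recentering to a zero mean would have to be applied after the transformation rather than before it in a pipeline built on this sketch.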

Table 4.

Gene Calling and Functional Annotation.

Figure 6.

Distribution-Dependent Effects on Molecular Tumor Subclassification.

Two methods of prospective molecular classification, the parametric Discriminant Analysis (DA, top) and the nonparametric K-Nearest Neighbors classifier (KNN, bottom), were used in conjunction with the parent and transformed low-grade glioma expression datasets to study distribution-dependent effects on molecular tumor subclassification. Class 1 represents low-grade, 1p/19q-intact gliomas, and Class 2 represents chromosome 1p/19q codeleted, low-grade oligodendrogliomas. The topmost color bars represent the known class of each sample (black boxes; red = Class 1, blue = Class 2). The area below the color bars is a portion of the gene expression profile (red = underexpressed, green = overexpressed). DA used with the parent (non-normal) distribution produces two misclassifications and KNN produces three, while both methods used with the transformed dataset result in accurate molecular subclassification.
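
A K-Nearest Neighbors classifier of the kind named here can be sketched in a few lines. The value of k, the Euclidean distance metric, and the two-gene expression profiles below are all assumptions made for illustration; the caption does not specify the settings used.

```python
import math
from collections import Counter

def knn_predict(train_profiles, train_labels, query, k=3):
    """Assign the majority class among the k nearest training profiles."""
    # Pair each training sample's distance to the query with its label.
    distances = sorted(
        (math.dist(profile, query), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression profiles; labels follow the caption's
# Class 1 / Class 2 naming.
train_profiles = [
    (1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
    (-1.0, -0.8), (-1.2, -1.1), (-0.9, -1.0),
]
train_labels = [1, 1, 1, 2, 2, 2]

predicted_a = knn_predict(train_profiles, train_labels, (0.9, 1.0))
predicted_b = knn_predict(train_profiles, train_labels, (-1.0, -1.0))
```

KNN makes no distributional assumption about the classes, yet its distances still depend on the scale and shape of the input values, so its results can change when the data are transformed.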
