Figure 1.
Overview of Analytic Workflow.
The flow diagram depicts typical microarray analysis workflow (top section), the statistical methods used at each step (middle section), and the corresponding tables and figures in this manuscript that present analyses at each level (bottom section).
Figure 2.
Cancer gene expression datasets are not normally distributed.
The source data for these graphs are the log2-subtracted datasets. All bin widths have been set to 200 to improve visualization. Red curves represent the best-fit normal distribution. The primary image gives the histogram with the superimposed theoretical normal curve. The inset presents the quantile-quantile (QQ) plot, where deviation from the line (y = x, black) illustrates deviation of the empirical distribution from the theoretical normal distribution. The left panel shows data normalized with the RMA method. The right panel shows data normalized with the DChip method. A: Brain; B: Breast; C: Colon; D: Gastric; E: Ovarian.
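The QQ comparison described in this caption can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' plotting code; the function names `qq_points` and `max_qq_deviation` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each ordered observation with the matching quantile of the
    normal distribution fitted to the sample mean and standard deviation."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def max_qq_deviation(values):
    """Largest vertical distance of any QQ point from the line y = x."""
    return max(abs(empirical - theory) for theory, empirical in qq_points(values))

# Hypothetical samples: one close to normal, one right-skewed (log-normal shape).
normal_like = [NormalDist(0.0, 1.0).inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(z) for z in normal_like]

print("near-normal sample:", round(max_qq_deviation(normal_like), 3))
print("skewed sample:     ", round(max_qq_deviation(skewed), 3))
```

Points from a sample that is close to normal stay near the line y = x, while a skewed sample drifts away from it, most visibly in the tails.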
Table 1.
Central Moments Analysis.
Figure 3.
Single-Gene Expression Distributions are not Gaussian.
These graphs illustrate the wide range of skewness (A) and kurtosis (B) found in the expression distributions of the individual genes that make up the cancer expression datasets. This refutes the assumption that the expression data for individual genes follow an approximately Gaussian distribution around the gene's mean expression level. Data for these graphs were taken from the log2-subtracted, RMA-normalized glioblastoma expression data. For the skewness comparison, five genes with comparable means, standard deviations, and kurtosis were selected from subsets of genes representing approximately the 10th, 25th, 50th, 75th, and 90th percentiles of per-gene skewness in the dataset. Similarly, for the kurtosis comparison, five genes with comparable means, standard deviations, and skewness were selected from subsets of genes representing approximately the 10th, 25th, 50th, 75th, and 90th percentiles of per-gene kurtosis in the dataset. The identities of the genes are not germane for comparative purposes.
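Per-gene skewness and kurtosis of the kind summarized here can be computed from central moments. The sketch below is an assumption-laden illustration rather than the authors' procedure: the expression values are simulated, the gene names are placeholders, the moments use the population (divide by n) form, and the percentiles use a simple nearest-rank rule.

```python
import math
import random

def central_moment(values, order):
    """Central moment of the given order (population form, divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a Gaussian."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def nearest_rank_percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q in 0..100)."""
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical expression matrix: 50 placeholder genes, 60 samples each.
random.seed(0)
expression = {
    f"gene_{g}": [random.lognormvariate(0.0, 0.1 + 0.02 * g) for _ in range(60)]
    for g in range(50)
}

per_gene_skew = {gene: skewness(vals) for gene, vals in expression.items()}
ranked = sorted(per_gene_skew.values())
for q in (10, 25, 50, 75, 90):
    print(f"{q}th percentile of per-gene skewness:",
          round(nearest_rank_percentile(ranked, q), 3))
```

The same loop with `excess_kurtosis` in place of `skewness` gives the per-gene kurtosis percentiles.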
Table 2.
Single Gene Distribution Variability.
Figure 4.
Distribution fitting for the brain cancer dataset for RMA (top) and DChip (bottom) normalized data. The three best-fit curves are superimposed on the histogram, and the normal distribution curve is included for comparison. The specific parameters for the best-fit distributions are given. The inset displays the quantile-quantile (QQ) plot for the best-fit and normal distributions. These charts demonstrate that multiparameter distributions capable of modeling skewness and kurtosis characterize the data better than the standard Gaussian (normal) distribution. Similar graphs for additional tumor types are given in Figures S2, S3, S4, and S5.
Table 3.
Empiric Distribution Fitting.
Figure 5.
A Box-Cox transformation applied to the low-grade glioma dataset (left) results in a distribution that more closely approximates a normal distribution (right). Note that the parent distribution was recentered to a zero mean to compensate for the default mean of 7 in the Robust Multichip Normalization output. This transformed distribution was then used to analyze distribution-dependent effects on identification of differentially expressed genes, functional annotation, and prospective molecular classification.
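For readers unfamiliar with the Box-Cox transformation, a minimal standard-library sketch follows. It assumes strictly positive input values and picks the exponent by a grid search over the profile log-likelihood; the grid, the function names, and the simulated right-skewed sample are illustrative assumptions, and the recentering step mentioned in the caption is not reproduced.

```python
import math
from statistics import NormalDist

def box_cox(values, lam):
    """Box-Cox transform: log(x) when lam is 0, else (x**lam - 1) / lam."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    n = len(values)
    transformed = box_cox(values, lam)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    return -0.5 * n * math.log(variance) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid=None):
    """Grid search for the lam that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 20.0 for i in range(-40, 41)]  # -2.0 to 2.0 in steps of 0.05
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))

def skewness(values):
    """Third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed, strictly positive sample (log-normal shape).
z = [NormalDist(2.0, 0.5).inv_cdf((i + 0.5) / 200) for i in range(200)]
parent = [math.exp(v) for v in z]

lam = best_lambda(parent)
transformed = box_cox(parent, lam)
print("chosen lambda:", lam)
print("skewness before:", round(skewness(parent), 3))
print("skewness after: ", round(skewness(transformed), 3))
```

Because the transform is only defined for positive values, any shift to a zero mean would have to be handled separately from the transform itself; how the two steps were ordered in the original analysis is not stated in the caption.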
Table 4.
Gene Calling and Functional Annotation.
Figure 6.
Distribution-Dependent Effects on Molecular Tumor Subclassification.
Two methods of prospective molecular classification, the parametric Discriminant Analysis (DA, top) and the nonparametric K-Nearest Neighbors classifier (KNN, bottom), were used in conjunction with the parent and transformed low-grade glioma expression datasets to study distribution-dependent effects on molecular tumor subclassification. Class 1 represents low-grade, 1p/19q-intact gliomas, and Class 2 represents chromosome 1p/19q codeleted, low-grade oligodendrogliomas. The topmost color bars represent the known class of each sample (black boxes; red = Class 1, blue = Class 2). The area below the color bars is a portion of the gene expression profile (red = underexpressed, green = overexpressed). DA used with the parent (non-normal) distribution produces two misclassifications and KNN produces three, whereas both methods used with the transformed dataset result in accurate molecular subclassification.
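The nonparametric KNN step can be illustrated with a short standard-library sketch. It is not the classifier configuration used in the study: the two-feature toy samples, the choice of k = 3, the Euclidean distance, and the leave-one-out error count are all assumptions made for illustration, and the parametric DA counterpart is not shown.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest
    training samples (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_errors(samples, labels, k=3):
    """Count misclassifications when each sample is predicted from all others."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, samples[i], k) != labels[i]:
            errors += 1
    return errors

# Hypothetical two-gene expression profiles for two well-separated classes.
samples = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.3), (0.4, 0.2),   # Class 1
    (5.0, 5.1), (4.8, 5.3), (5.2, 4.9), (5.1, 5.2),     # Class 2
]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

print("leave-one-out misclassifications:", leave_one_out_errors(samples, labels))
print("predicted class for (0.1, 0.1):", knn_predict(samples, labels, (0.1, 0.1)))
```

Running the same error count on a parent dataset and on its transformed counterpart is one way to compare misclassification totals of the kind reported in this caption.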