The authors have declared that no competing interests exist.

Conceived and designed the experiments: MS DV SC. Performed the experiments: BR MM DV SC. Analyzed the data: SC DV MS. Contributed reagents/materials/analysis tools: MS. Wrote the paper: MS SC DV. Edited the manuscript: MS SC DV MM BR.

Current Address: Department of Bioengineering, California Institute of Technology, Pasadena, California, United States of America

Current Address: College of Medicine, University of Cincinnati, Cincinnati, Ohio, United States of America

Current Address: Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America

Global gene expression analysis using microarrays and, more recently, RNA-seq, has allowed investigators to understand biological processes at a system level. However, the identification of differentially expressed genes in experiments with small sample size, high dimensionality, and high variance remains challenging, limiting the usability of these tens of thousands of publicly available, and possibly many more unpublished, gene expression datasets. We propose a novel variable selection algorithm for ultra-low-

Gene expression analysis has led to profound advances in our understanding of a wide array of biological processes ranging from ecology and evolution to molecular genetics and disease therapeutics (reviewed in [

However, while the technology required to conduct microarray experiments has become relatively straightforward, data analysis remains challenging. Virtually every aspect of data analysis, from normalization to analysis of differential expression, remains a topic of ongoing discussion and often controversy in the literature [

There is a widespread and intense interest in developing new analytical strategies to address the “

Existing algorithms for differential expression detection in cases of ultra-low

In the last decade, penalized regression techniques (reviewed by Ma and Huang [

One important feature of penalized regression methods is that they are variable selectors as well as classifiers. Building a classifier with penalized regression involves assigning a weight to each gene, which determines how strongly that gene contributes to the classifier. Differentially expressed genes receive high weight, while genes that do not vary much between conditions are assigned low weights. By separating genes with low weight from those with high weight, penalized regression can identify differentially expressed genes. Unfortunately, although differentially expressed genes are expected to have high weight and insignificant genes are expected to have low weight after penalized regression, there is no

Clearly novel approaches for analyzing

A colony of

Embryos were unilaterally injected into one blastomere at the two cell stage with 1.5 ng of one of the following capped RNA constructs synthesized

To obtain total RNA, 10 embryos from each stage and condition were homogenized in Tri Reagent (Molecular Research Center) and extracted with 1-bromo-3-chloropropane phase separation reagent according to the manufacturer’s protocol. RNA from the aqueous phase was purified using the Qiagen RNeasy Mini kit. Total RNA for each of the nine samples (embryos injected with the three constructs NICD, DBM, GFP with each harvested at three different stages) was sent to the Clemson University Genomics Institute for microarray analysis using the Affymetrix Xenopus laevis 2.0 GeneChip. Affymetrix protocols were followed with the exception that the

Raw microarray data was normalized and summarized using Robust Microarray Average (RMA) [

An overview of our selection method is shown in

To determine a cutoff for significance, we generate simulations based on the experimental data (Step III). Starting from the most highly-ranked genes, we consider increasingly more genes to be provisionally differentially expressed, then use our simulations to estimate the false discovery rate of that selection. We increase the number of differentially expressed genes until the false discovery rate rises above a user-set threshold, at which point we stop and the selection is reported (Step IV). Finally, permutations of the original data, which contain the same data but with experimental labels scrambled, are analyzed as a null-signal control to test for overall presence of differential expression in the dataset (Step V).

Descriptions of each step of the method and several important implementation details are presented below.

The average expression level of two different genes can easily differ by several orders of magnitude. Differences in the scales of gene expression can bias the results of penalized regression, which we use to rank the importance of genes. To prevent this bias, raw expression data are first centered and normalized by converting them into z-scores (so that each gene has average expression 0 and standard deviation of expression 1).
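The standardization step can be illustrated with a minimal sketch. The samples-by-genes layout, the function name `zscore_genes`, the use of the sample standard deviation (`ddof=1`), and the handling of zero-variance genes are assumptions made for illustration and are not taken from our implementation.

```python
import numpy as np

def zscore_genes(expr):
    """Standardize each gene (column) to mean 0 and standard deviation 1.

    expr: array of shape (n_samples, n_genes).
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    sd = expr.std(axis=0, ddof=1)
    # Assumption: genes with zero variance are mapped to 0 instead of dividing by zero.
    safe_sd = np.where(sd > 0, sd, 1.0)
    return (expr - mean) / safe_sd

# Toy data: 4 samples x 3 genes on very different scales (third gene is constant).
toy = np.array([[10.0, 1000.0, 5.0],
                [12.0, 3000.0, 5.0],
                [14.0, 2000.0, 5.0],
                [16.0, 4000.0, 5.0]])
z = zscore_genes(toy)
```

After this transformation the first two toy genes are on a common scale even though their raw intensities differ by two orders of magnitude.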

Our algorithm ranks the estimated importance of genes using PED, with a generalized linear model-based method. Generalized linear models are powerful and flexible tools for binary classification that have been adapted for variable selection. A generalized linear model is broadly defined by
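$$\mu_i = E[Y_i] = g^{-1}(\mathbf{x}_i^{T}\boldsymbol{\beta}),$$
where $\mu_i$ is the expected value of the random univariate variable $Y_i$, $\mathbf{x}_i$ is a vector of regressor variables for the $i$th observation, $\boldsymbol{\beta}$ is a vector of coefficients, and $g^{-1}(\cdot)$ is the inverse of the link function. For microarray data, $\mathbf{x}_i$ is a vector of gene expression values for the $i$th microarray and $y_i$ is a numeric value corresponding to the experimental condition of the microarray (for example, control condition microarrays might be labeled with $y_i = 0$, and treatment condition microarrays with $y_i = 1$). The variable $\boldsymbol{\beta}$ is a vector of coefficients $\beta_j$, each of which 'weights' how much gene $j$ of $\mathbf{x}_i$ contributes to the overall sum $\mathbf{x}_i^{T}\boldsymbol{\beta}$. Its length $p$ equals the length of $\mathbf{x}_i$ (for microarrays, the number of genes on a chip).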

Combining all of the samples in an experiment yields the expression
_{ij} such that _{ij} is the expression value for the

To satisfy the above constraints, our method uses an efficient signal recovery strategy based on a pseudo-likelihood function shown to yield low false discovery rates and high signal recovery relative to other penalized regression methods (for example, Lasso or elastic net) when the number of replications is very small [^{p}) of the data matrix _{ij}, then the penalized Euclidean distance regression method produces a vector of weights (rankings) _{1}, _{2}…_{p}) such that

^{5}). To simplify computation, our algorithm performs PED regression in two passes. In the first pass, _{i}∣ < 10^{−6}) are removed.

Once weights (the vector

Simulations were designed with the following constraints:

Simulated data should mimic as closely as possible the intensity and differential expression patterns of the real data.

Simulated data should share, as much as possible, the correlation structure of the real data.

It must be known which genes are differentially expressed in simulation and which are not.

Simulations are based on an _{ij} is the intensity of _{cond} consisting of only those replicates is used to generate a simulated data matrix

To preserve as much correlational structure as possible, the first _{j} and standard deviation _{j} of each gene _{cond}, and use those estimates to generate Gaussian-distributed data with the same parameters for the second simulation condition. That is,

Differential expression is simulated by multiplying the second condition simulation data by a fold-difference if the fold-difference in the original data is large enough. First, the fold difference _{j} in the original data is measured. The fold-difference for a gene _{j}∣ is greater than or equal to some threshold _{j} (or by _{j} < 0) and that simulated gene is labeled as differentially expressed. If ∣_{j}∣ <

In summary, each simulation data matrix

Once several simulations are generated from the user’s data, these simulations are used to estimate the largest number of genes that can be considered differentially expressed while maintaining the FDR below a user-supplied threshold. This is achieved by iteratively increasing the selection size and checking the estimated FDR of the new selection until the FDR rises above that threshold.

Specifically, PED regression is first performed to rank the genes in each simulation. The FDR is then calculated for a very small selection threshold _{s0} by taking the top _{s0} genes in each simulation and calculating an empirical FDR, which is simply the number of genes incorrectly called as differentially expressed in the simulation divided by the selection size. Because the simulations are generated to have similar distributions, levels of signal, and correlation structure to the experimenter’s data, the FDR of selections in simulation is taken as an estimate of the FDR of our real data using the same selection size threshold _{s0}. The algorithm then iteratively increases the selection size _{s} by some Δ_{s} until the FDR of any one simulation grows beyond the user-specified threshold value. The last tested _{s} before the FDR rises above the FDR threshold becomes the selection size used on the actual data set.
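The selection-size search can be sketched as follows, taking the empirical FDR to be the fraction of selected genes that are not truly differentially expressed in a simulation. The function names, the representation of a ranking as a list of gene indices, and the toy rankings are hypothetical. In practice the rankings would come from PED regression on each simulation, which is not reproduced here.

```python
def empirical_fdr(ranked_genes, true_de, n_select):
    """Fraction of the top n_select genes that are not truly differentially expressed."""
    selected = ranked_genes[:n_select]
    false_pos = sum(1 for g in selected if g not in true_de)
    return false_pos / n_select

def max_selection_size(sim_rankings, sim_truth, fdr_threshold, n_start=1, step=1):
    """Largest tested size at which no simulation exceeds the FDR threshold."""
    n_genes = min(len(r) for r in sim_rankings)
    best = 0
    n = n_start
    while n <= n_genes:
        fdrs = [empirical_fdr(r, t, n) for r, t in zip(sim_rankings, sim_truth)]
        if max(fdrs) > fdr_threshold:
            break  # stop as soon as any one simulation exceeds the threshold
        best = n
        n += step
    return best

# Two toy simulations: genes listed from highest to lowest weight,
# with the set of genes known to be differentially expressed in each.
rankings = [[0, 1, 2, 3, 4, 5], [1, 0, 3, 5, 2, 4]]
truth = [{0, 1, 2, 3}, {0, 1, 3}]
size = max_selection_size(rankings, truth, fdr_threshold=0.2)
```

In this toy case the search stops at a selection size of 3, because selecting a fourth gene would give an FDR of 0.25 in the second simulation.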

Because the selection of _{s} is based on the

To additionally guard against false discovery of differential expression when none is actually present, our method employs sample permutation to generate an estimate of the number of selections our method would make in the case of data similar to the user’s, but with no true differential expression. For each data set, the classification vector

The result of the differential expression validation is a list of selection sizes made by the algorithm for different permutations of the original data. If there is true differential expression in the dataset, then there should be a strong difference between the number of genes selected by our method in the real data and the number selected in null datasets. In practice, because of the discreteness and limited number of permutations possible at small sample sizes, permutations do not completely destroy correlation between sample label and signal, so that significant numbers of genes can be selected even for permutations. We suggest that if more selections are made in the real data than in any of the permuted data cases, then there is a strong case for true differential expression in the experimenter’s dataset. The farther apart the selection sizes on the real data and permuted data, the greater the strength of evidence for differential expression in the dataset.
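To illustrate how few distinct permutations exist at small sample sizes, the sketch below enumerates relabelings for a hypothetical balanced two-condition design. The function name is hypothetical, and treating a partition and its label-swapped mirror image as the same relabeling is an assumption of this sketch.

```python
from itertools import combinations

def distinct_relabelings(n_per_group):
    """List relabelings of a balanced two-group design, excluding the real labelling."""
    samples = tuple(range(2 * n_per_group))
    original = tuple(range(n_per_group))
    partitions = []
    for group_a in combinations(samples, n_per_group):
        # Keep sample 0 in group A so swapping the two labels is not counted twice.
        if 0 not in group_a:
            continue
        if group_a == original:
            continue  # skip the true assignment of samples to conditions
        group_b = tuple(s for s in samples if s not in group_a)
        partitions.append((group_a, group_b))
    return partitions

perms = distinct_relabelings(3)
print(len(perms))  # 9 relabelings for three replicates per condition
```

With three replicates per condition only 9 such relabelings exist, and each keeps at least one sample per group under its original label, which is one way to see why permuted data can retain some association between label and signal.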

An experimenter can quantify the significance of the differential expression validation using Chebyshev’s theorem. Chebyshev’s theorem states that no more than

The following is an algorithmic summary of our selection method.

Input: The user provides a matrix of expression data, as described under “PED regression.” The user also sets an FDR threshold

1. Convert expression data for each gene to z-scores (such that each gene’s expression vector has mean 0 and standard deviation 1) (Step I).
2. Real data first pass: using approximate PED regression according to
3. Sort genes by the magnitude of their weights.
4. Generate simulations with known signal based on the real data (Step III).
5. Find a maximum selection size _{s} that maintains FDR <
   a. For each simulation, set a selection size _{s} = _{s0}.
   b. Simulation first pass: optimize weights of differentially expressed genes using PED regression according to _{s} variables in each simulation as differentially expressed (Step II).
   c. Simulation second pass: optimize weights of differentially expressed genes using PED regression according to _{i}∣ < 10^{−6} (Step II).
   d. Measure the FDR in the selection made in each simulation.
   e. If the FDR of any simulation’s selection is greater than
   f. Otherwise, increment _{s} by Δ_{s} and go back to 5b.
6. Take the top _{s} genes in the real data, sorted by weight according to PED regression.
7. Real data second pass: optimize weights of differentially expressed genes using PED regression according to _{i}∣ < 10^{−6} (Step II).
8. Generate permuted versions of the real data as “null signal” cases (9 permutations for
9. For each permuted version of the data, perform steps 2–6. Report the number of selections in each permutation and compare to the number of selections in the real data to assess the presence of differential expression (Step V).

Code and documentation for PED-based selection are available at

To simplify computation of the objective function and achieve several theoretical properties during PED regression, we employ the first-pass approximation shown in ^{−6}). The results of this second pass are reported as the final selections.

We observed that the weighting of genes is somewhat sensitive to the choice of classification vector

In our implementation, size optimization is performed using 10 simulations per dataset and permutation tests are performed using 9 distinct permutations. To optimize the selection size _{s}, we first used _{s}, then iterated again from the first stopping point with Δ_{s} = 1 to more precisely determine optimal selection size.

As a negative control experiment, we generated null-signal simulations using the same simulation strategy used in the selection method, but with the fold-difference threshold for differential expression set to +∞ so that no differential expression was introduced. We generated null-signal simulations based on the structure of our Notch-experiment microarray data for each comparison used in that experiment, then applied our selection method to these simulations. This experiment tested the behavior of our method when no differential expression is present in a dataset.
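The role of the fold-difference threshold in producing null-signal simulations can be illustrated with a simplified sketch. It follows the general recipe of reusing real replicates for the first simulated condition and drawing Gaussian data for the second, but several details are assumptions made only for illustration: the function name, defining the fold difference as a ratio of condition means on positive intensities, and applying the threshold symmetrically to up- and down-regulation.

```python
import numpy as np

def simulate_from_condition(x_cond, x_other, fold_threshold, rng):
    """Build one simulation from real replicates of two conditions.

    x_cond, x_other: arrays of shape (n_reps, n_genes) with positive intensities.
    Returns the simulated matrix, sample labels, and the known DE mask.
    """
    x_cond = np.asarray(x_cond, dtype=float)
    x_other = np.asarray(x_other, dtype=float)
    mu = x_cond.mean(axis=0)
    sigma = x_cond.std(axis=0, ddof=1)
    # Second simulated condition: Gaussian data with the same per-gene parameters.
    sim_second = rng.normal(loc=mu, scale=sigma, size=x_cond.shape)
    # Assumption: fold difference is the ratio of condition means in the real data.
    fold = x_other.mean(axis=0) / mu
    # Assumption: a gene counts as DE if it changes by at least the threshold
    # in either direction.
    magnitude = np.maximum(fold, 1.0 / fold)
    is_de = magnitude >= fold_threshold
    sim_second[:, is_de] *= fold[is_de]
    sim = np.vstack([x_cond, sim_second])
    labels = np.array([0] * len(x_cond) + [1] * len(x_cond))
    return sim, labels, is_de

rng = np.random.default_rng(0)
real_a = np.array([[8.0, 100.0, 50.0], [9.0, 110.0, 52.0], [10.0, 90.0, 48.0]])
real_b = np.array([[8.5, 300.0, 51.0], [9.5, 320.0, 49.0], [9.0, 280.0, 50.0]])
sim, labels, is_de = simulate_from_condition(real_a, real_b, 2.0, rng)
# An infinite threshold introduces no differential expression (null-signal case).
null_sim, _, null_de = simulate_from_condition(real_a, real_b, np.inf, rng)
```

With a threshold of 2 only the second toy gene (threefold change) is labeled as differentially expressed, whereas an infinite threshold labels none, so any selections made on such a simulation are false discoveries by construction.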

Whole mount in situ hybridization was employed for empirical validation of selected genes. In situ hybridization experiments were carried out using standard published protocols with minor modifications as previously described [

Since developmental expression profiles are already known and publicly available for most annotated Xenopus genes on xenbase.org, validation was also performed bioinformatically. Expression information for genes selected as differentially expressed by PED between RNA from GFP injected embryos extracted at st. 18 and RNA extracted from st. 38 was compared with expression profiles for the closely related species

For comparison, we applied several common penalized regression algorithms to our Notch perturbation dataset. Specifically, we used two implementations of Lasso and Iterative Sure Independence Screening (ISIS) [

Microarray data was initially analyzed by the Clemson University Genomics Institute using the limma package in Bioconductor R. We also performed this analysis to confirm the results. Testing for differential expression with limma yielded very few differentially expressed genes (See

Stage | DBM v GFP | GFP v NICD |
---|---|---|
18 | 1 | 8 |
28 | 0 | 2 |
38 | 0 | 0 |

However, an examination of the list of genes with particularly low

We applied the PED-regression-based method to our microarray data with an FDR threshold of 0.01 in order to recover a more complete list of differentially expressed genes (

Comparison | Real Data | Permuted 1 | Permuted 2 | Permuted 3 | Permuted 4 | Permuted 5 | Permuted 6 | Permuted 7 | Permuted 8 | Permuted 9 | z-score | Chebyshev p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|
18_DBM_18_GFP | 781 | 326 | 31 | 36 | 229 | 33 | 34 | 197 | 199 | 322 | 4.99 | 0.04 |
18_GFP_18_NICD | 2438 | 135 | 29 | 15 | 163 | 27 | 21 | 128 | 118 | 2149 | 3.07 | 0.11 |
28_DBM_28_GFP | 1155 | 131 | 40 | 397 | 128 | 163 | 161 | 44 | 70 | 381 | 7.40 | 0.02 |
28_GFP_28_NICD | 1595 | 56 | 10 | 57 | 60 | 97 | 60 | 17 | 54 | 95 | 52.49 | 3.6E-4 |
38_DBM_38_GFP | 238 | 84 | 99 | 34 | 76 | 54 | 87 | 106 | 83 | 68 | 7.24 | 0.02 |
38_GFP_38_NICD | 752 | 64 | 1 | 0 | 448 | 4 | 3 | 514 | 1 | 0 | 3.05 | 0.11 |

Notably, in every case, our selection method labeled many more genes as differentially expressed in the data than in permuted controls, indicating that these selections are unlikely to be the product of spurious selection of truly random data. All genes that were labeled as differentially expressed by limma (after BHY adjustment) were also selected as differentially expressed by PED.
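The z-score and Chebyshev p-value for a comparison can be computed from the selection sizes as sketched below. The function name is hypothetical, and the use of the sample standard deviation and the cap of the bound at 1 are assumptions. With these choices the sketch gives approximately 4.99 and 0.04 for the 18_DBM_18_GFP comparison, matching the values reported for that row.

```python
import statistics

def permutation_evidence(real_size, permuted_sizes):
    """Return (z, p_bound) comparing the real selection size with permuted ones."""
    mean = statistics.mean(permuted_sizes)
    sd = statistics.stdev(permuted_sizes)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("permuted selection sizes have zero variance")
    z = abs(real_size - mean) / sd
    # Chebyshev: at most 1/k^2 of any distribution lies k or more
    # standard deviations from its mean, so 1/z^2 bounds the p-value.
    p_bound = min(1.0, 1.0 / z**2) if z > 0 else 1.0
    return z, p_bound

z, p = permutation_evidence(781, [326, 31, 36, 229, 33, 34, 197, 199, 322])
```

Because Chebyshev's inequality makes no distributional assumption, this bound is conservative: with only nine permutations, a normal approximation would not be well justified.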

Selection sizes for our data were consistently greater than selection sizes for null-permuted data. Using Chebyshev’s theorem, we obtained

As a negative control, we generated one simulation with no differential expression for each contrast in our experiment, then applied our selection method to those simulations. The results are summarized in

Comparison | Null Signal Data | Permuted Null 1 | Permuted Null 2 | Permuted Null 3 | Permuted Null 4 | Permuted Null 5 | Permuted Null 6 | Permuted Null 7 | Permuted Null 8 | Permuted Null 9 | z-score | Chebyshev p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|
18_DBM_18_GFP | 7 | 16 | 14 | 17 | 25 | 30 | 19 | 242 | 198 | 257 | 0.782 | 1 |
18_GFP_18_NICD | 7 | 11 | 25 | 10 | 26 | 21 | 26 | 196 | 240 | 202 | 0.792 | 1 |
28_DBM_28_GFP | 0 | 0 | 45 | 14 | 0 | 0 | 0 | 4 | 0 | 0 | 0.467 | 1 |
28_GFP_28_NICD | 0 | 87 | 23 | 47 | 93 | 55 | 8 | 21 | 27 | 23 | 1.41 | 0.50 |
38_DBM_38_GFP | 8 | 96 | 70 | 69 | 52 | 34 | 49 | 129 | 128 | 118 | 2.07 | 0.23 |
38_GFP_38_NICD | 2 | 39 | 84 | 54 | 48 | 47 | 42 | 43 | 54 | 65 | 3.62 | 0.08 |

As a positive control of differential gene expression discovery, we applied our method (again with an FDR threshold set to 0.01) to the comparison: GFP-injected stage 18 versus stage 38. Differential expression in that contrast is driven by transcriptional differences between stages, which are large relative to perturbations induced by DBM or NICD injection. Under these conditions, 20,544 genes were detected as differentially expressed. We obtained similar results by applying limma to the same contrasts with BHY correction at

Several different approaches were employed to validate our selection procedure. Firstly, we validated a number of selections empirically. Because the fold differences in our experiments were virtually all significantly less than 2, qRT-PCR was not an appropriate technique: it reliably detects only differences that are more than twofold in magnitude. We therefore conducted in situ hybridization on selected genes and assayed for differences in expression. Of the five genes tested (several of which were not previously known to be regulated by Notch signaling), all five validated the PED selections (data not shown).

Secondly, we validated the selection procedure bioinformatically using existing expression information from multiple databases available on xenbase.org. To do so we compared genes selected as differentially expressed by PED for GFP injected embryos at stage 18 and stage 38 with known expression profiles. GFP was used as an injection control, and GFP embryos display normal development. Of the genes selected as differentially expressed, 200 genes were randomly sampled. Of these 182 (91%) were validated by known expression data from

Finally, our selection procedure includes a simulation step designed to both validate and tune the procedure for the user’s data set. These simulations use a fold-difference criterion to estimate the level of signal present in the user’s data, then pair the data from one of the user’s conditions with a random, normally distributed simulated condition. Our procedure uses these simulations to tune the selection size to maintain an estimated false discovery rate below a user-set threshold.

A number of analysis methods exist for variable selection using penalized regression techniques. For comparison with our method, we applied lasso, Bayesian lasso, and ISIS, to our dataset. Selection sizes by each method are shown in

Comparison | lasso | Bayesian lasso | ISIS |
---|---|---|---|
18_DBM_18_GFP | 0 | 5 | 1 |
18_GFP_18_NICD | 3 | 5 | 1 |
28_DBM_28_GFP | 0 | 5 | 1 |
28_GFP_28_NICD | 0 | 5 | 1 |
38_DBM_38_GFP | 0 | 5 | 1 |
38_GFP_38_NICD | 0 | 5 | 1 |
18_GFP_38_GFP | 31 | 5 | 1 |

Although many methods exist for analysis of microarray data, none are known to reliably function for single-channel microarray data with ultra-low sample size, for instance with

Another approach to the analysis of microarray data comes from microarray classification research, which considers the problem of automatically creating a set of rules that can identify the sample type of a previously uncategorized microarray (see Ma and Huang [

One solution to the “

We present a GLM-based, penalized binomial regression approach for analyzing microarray data that uses data-based simulations to tune selections, thus avoiding the need for cross-validation and maximizing the number of differentially expressed genes detected by the algorithm. Because it does not require cross-validation, this method can be applied to experiments with extremely low sample size (

We also provide a permutation-based differential expression test, which can verify the presence of differential expression in an otherwise ambiguous dataset. The differential expression test produces selection sizes for sample permutations of the data, which together represent a null distribution of selection size. Datasets with differential expression will produce much larger selection sizes in the actual data than in the permuted data, while datasets with no differential expression will produce similar selection sizes for all tests. We recommend either 1) considering the data differentially expressed if the data show a larger selection size than any permutation or 2) using Chebyshev’s theorem to estimate a highly conservative

There is potential for expansion of our algorithm. With few modifications, it could be applied to RNA-seq expression data. Our algorithm’s performance is currently quite slow, despite optimization—analysis of a single data set with

Our method meets an important need for analysis tools capable of analyzing ultra-low sample-size datasets with extremely high dimensionality with enough power to apply pathway analysis and other forms of global expression analysis. Many such datasets exist, and we believe that applying our PED-based approach could yield a plethora of new insights from experiments that have already been performed.


We thank Caroline Golino, Matthew Wong, and Andrew Halleran for their suggestions to the manuscript.
