
Conceived and designed the experiments: GG DA. Performed the experiments: QX. Analyzed the data: LY. Wrote the paper: GG SB GP DA.

The authors have declared that no competing interests exist.

Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.

Plasmode is a term used to describe a data set that has been derived from real data but for which some truth is known. Statistical methods that analyze data from high dimensional experiments (HDEs) seek to estimate quantities that are of interest to scientists, such as mean differences in gene expression levels and false discovery rates. The ability of statistical methods to accurately estimate these quantities depends on theoretical derivations or computer simulations. In computer simulations, data for which the true value of a quantity is known are often simulated from statistical models, and the ability of a statistical method to estimate this quantity is evaluated on the simulated data. However, in HDEs there are many possible statistical models to use, and which models appropriately produce data that reflect properties of real data is an open question. We propose the use of plasmodes as one answer to this question. If done carefully, plasmodes can produce data that reflect reality while maintaining the benefits of simulated data. We show one method of generating plasmodes and illustrate their use by comparing the performance of 15 statistical methods for estimating the false discovery rate in data from an HDE.

“Omic” technologies (genomic, proteomic, etc.) have led to high dimensional experiments (HDEs) that simultaneously test thousands of hypotheses. Often these omic experiments are exploratory, and promising discoveries demand follow-up laboratory research. Data from such experiments require new ways of thinking about statistical inference and present new challenges. For example, in microarray experiments an investigator may test thousands of genes aiming to produce a list of promising candidates for differential genetic expression across two or more treatment conditions. The larger the list, the more likely some genes will prove to be false discoveries, i.e. genes not actually affected by the treatment.

Statistical methods often estimate both the proportion of tested genes that are differentially expressed due to a treatment condition and the proportion of false discoveries in a list of genes selected for follow-up research. Because keeping the proportion of false discoveries small ensures that costly follow-on research will yield more fruitful results, investigators should use some statistical method to estimate or control this proportion. However, there is no consensus on which of the many available methods to use.

Although the performance of some statistical methods for analyzing HDE data has been evaluated analytically, many methods are commonly evaluated using computer simulations. An analytical evaluation (i.e., one using mathematical derivations to assess the accuracy of estimates) may require either difficult-to-verify assumptions about a statistical model that generated the data or a resort to asymptotic properties of a method. Moreover, for some methods an analytical evaluation may be mathematically intractable. Although evaluations using computer simulations may overcome the challenge of intractability, most simulation methods still rely on the assumptions inherent in the statistical models that generated the data. Whether these models accurately reflect reality is an open question, as is how to determine appropriate parameters for the model, what realistic "effect sizes" to incorporate in selected tests, and whether and how to incorporate correlation structure among the many thousands of observations per unit.

Plasmode data sets may help overcome the methodological challenges inherent in generating realistic simulated data sets. Cattell and Jaspers coined the term plasmode to describe data sets that are derived from real data but for which some truth is known.

A plasmode data set can be constructed by spiking specific mRNAs into a real microarray data set.

In this paper, we propose a technique to simulate plasmode data sets from previously produced data. The source-data experiment was conducted at the Center for Nutrient–Gene Interaction (CNGI). We use the resulting plasmode data sets to compare estimates of the proportion of true null hypotheses (π_{0}), the false discovery rate (FDR), and a local version of FDR (LFDR) obtained from 15 statistical methods.

The steps for plasmode creation described herein are relatively straightforward. First, an HDE data set is obtained that reflects the type of experiment for which statistical methods will be used to estimate quantities of interest. Data from a rat microarray experiment at CNGI were used here. Other organisms might produce data with different structural characteristics, and methods may perform differently on such data. The CNGI data were obtained from an experiment that used rats to test the pathways and mechanisms of action of certain phytoestrogens.

Second, an HDE data set that compares the effect of one or more treatments is analyzed and the vector of effect sizes is saved. The effect size used here was a simple standardized mean difference (i.e., a two-sample t-statistic), but any meaningful metric could be used. Plasmodes, in fact, could be used to compare the performance of statistical methods when different statistical tests were used to produce the P-values. We chose two sets of HDE data as templates to represent two distributions of effect sizes and two different null distributions. We refer to the 21-day experiment using the control group (8 arrays) and the treatment group (EGCG supplementation, 10 arrays) as data set 1, and the 50-day experiment using the control group (10 arrays) and the treatment group (resveratrol supplementation, 10 arrays) as data set 2. There were 31,042 genes on each array, and two-sample pooled-variance t-tests for differential expression were used to create a distribution of P-values. Histograms of the distributions for both data sets are shown in the accompanying figure.

P-values were computed from the original data using two sample pooled variance t-tests.

The distribution of P-values for data set 1 shows a stronger signal (i.e., a larger collection of very small P-values) than that for data set 2, suggesting either that more genes are differentially expressed or that the differentially expressed genes have a larger-magnitude treatment effect. This second step provided a distribution of effect sizes from each data set.
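As a concrete illustration of this testing step, the per-gene two-sample pooled-variance t-tests can be vectorized over all genes at once. The sketch below is our own illustrative implementation on synthetic all-null data, not the code used for the CNGI analysis.

```python
import numpy as np
from scipy import stats

def pooled_t_pvalues(trt, ctrl):
    """Two-sided P-values from two-sample pooled-variance t-tests,
    computed for every gene (row) at once.

    trt, ctrl: (n_genes, n_arrays) matrices of expression levels.
    """
    n1, n2 = trt.shape[1], ctrl.shape[1]
    m1, m2 = trt.mean(axis=1), ctrl.mean(axis=1)
    v1, v2 = trt.var(axis=1, ddof=1), ctrl.var(axis=1, ddof=1)
    # Pooled variance and the standard error of the mean difference.
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return 2.0 * stats.t.sf(np.abs(t), df=n1 + n2 - 2)

# Synthetic all-null example: 1000 genes, 10 treatment and 8 control arrays.
rng = np.random.default_rng(0)
trt = rng.normal(size=(1000, 10))
ctrl = rng.normal(size=(1000, 8))
p = pooled_t_pvalues(trt, ctrl)
```

On all-null data such as this, the resulting P-value histogram is approximately uniform; a template data set with real treatment effects would instead show the spike of small P-values visible for data sets 1 and 2.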

Next, create the plasmode null data set. For each of the HDE data sets, we created a random division of the control group of microarrays into two sets of equal size. One consideration in doing so is that if some arrays in the control group are ‘different’ from others due to some artifact in the experiment, then the null data set can be sensitive to how the arrays are divided into two sets. Such artifacts can be present in data from actual HDEs, so this issue is not a limitation of plasmode use but rather an attribute of it, that is, plasmodes are designed to reflect actual structure (including artifacts) in a real data set. We obtained the plasmode null data set from data set 1 by dividing the day 21 control group of 8 arrays into two sets of 4, and for data set 2 by dividing the control group of 10 arrays into two sets of 5 arrays.
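The random division of the control arrays described above can be sketched as follows. This is an illustrative helper of our own (not the authors' code), using the 8-array control group of data set 1 as the example size.

```python
import numpy as np

def split_null_groups(ctrl, rng):
    """Randomly divide the control arrays (columns) into two equal halves.
    Comparing the halves yields a plasmode null data set: no gene is truly
    differentially expressed, yet the structure of the real data (including
    any artifacts) is retained."""
    n = ctrl.shape[1]
    assert n % 2 == 0, "an even number of control arrays is assumed"
    perm = rng.permutation(n)
    return ctrl[:, perm[: n // 2]], ctrl[:, perm[n // 2 :]]

# Illustrative template: 100 genes, 8 control arrays (as in data set 1).
rng = np.random.default_rng(1)
ctrl = rng.normal(size=(100, 8))
null_a, null_b = split_null_groups(ctrl, rng)
```

Because the split is random, repeating it over simulation replicates averages out the sensitivity to any single unlucky division noted above.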

P-values were computed from two sample pooled variance t-tests.

A proportion 1−π_{0} of effect sizes was then sampled from their respective distributions using a weighted probability sampling technique described in the Methods, and the sampled effects were added to a randomly chosen proportion 1−π_{0} of genes in a manner also described in the Methods.
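A minimal sketch of weighted probability sampling of effect sizes is shown below. Note that the specific weighting used here (probability proportional to |d_i|) is our own illustrative assumption; the paper's Methods section defines the actual scheme.

```python
import numpy as np

def sample_effects(d, n_effects, rng):
    """Sample n_effects observed effect sizes from the vector d without
    replacement, with probability proportional to |d_i|.

    NOTE: weighting by |d_i| is an illustrative assumption made here; the
    paper's Methods section specifies its own weighting scheme."""
    w = np.abs(d)
    return rng.choice(d, size=n_effects, replace=False, p=w / w.sum())

rng = np.random.default_rng(3)
d = rng.normal(size=1000)    # hypothetical vector of observed effect sizes
eff = sample_effects(d, 100, rng)
```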

Finally, the plasmode data set was analyzed using a selected statistical method. We used two sample t-tests to obtain a plasmode distribution of P-values for each plasmode data set because the methods compared herein all analyze a distribution of P-values from an HDE. P-values were declared statistically significant if smaller than a threshold τ.

When comparing the 15 statistical methods, we used three values of π_{0} (0.8, 0.9, and 0.95) and two thresholds (τ = 0.01 and 0.001). For each choice of π_{0} and threshold τ, we ran B = 100 simulations. All 15 methods provided estimates of π_{0}, 14 provided estimates of FDR, and 7 provided estimates of LFDR. Because the true values of π_{0} and FDR are known for each plasmode data set, we can compare the accuracy of estimates from the different methods.

There are two basic strategies for estimating FDR, both predicated on an estimated value for π_{0}: the first uses equation (1) below, the second a mixture model approach. Let K denote the number of genes tested and R(τ) the number of P-values no larger than a chosen threshold τ. Because valid P-values are uniformly distributed under the null hypothesis, the expected number of false discoveries at threshold τ is approximately π_{0}Kτ, motivating the estimator

FDR(τ) = π_{0}Kτ/R(τ). (1)

In the mixture model approach, the density of a P-value p is written

f(p) = π_{0}f_{0}(p) + (1−π_{0})f_{1}(p; θ), (2)

where f_{0} is a density of a P-value under the null hypothesis, f_{1} a density of a P-value under the alternative hypothesis, π_{0} is interpreted as before, and θ is a (possibly vector) parameter of the distribution. Since valid P-values are assumed, f_{0} is a uniform density. LFDR is defined with respect to this mixture model as

LFDR(τ) = π_{0}f_{0}(τ)/f(τ) = π_{0}/f(τ), (3)

and FDR at threshold τ as

FDR(τ) = π_{0}τ/[π_{0}τ + (1−π_{0})F_{1}(τ)], (4)

where F_{1}(τ) is the CDF under the alternative hypothesis, evaluated at a chosen threshold τ. (There are different definitions of FDR, and the definition in (4) is, under some conditions, the definition of a positive false discovery rate.)

| | Genes for which there is not a real effect | Genes for which there is a real effect |
| Genes not declared significant at designated threshold | | |
| Genes declared significant at designated threshold | | |
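The two estimation strategies can be illustrated with a small sketch: `fdr_eq1` follows the equation (1) strategy (a plug-in estimate from a π_{0} estimate and the rejection count), and `fdr_mixture` evaluates the mixture-model expression for FDR at threshold τ. Both are our own illustrative implementations, not code from any of the 15 methods compared.

```python
import numpy as np

def fdr_eq1(pvals, pi0_hat, tau):
    """Plug-in FDR estimate in the style of equation (1): the expected
    number of null P-values below tau (pi0*K*tau) divided by the observed
    number of rejections R(tau)."""
    K = len(pvals)
    R = max(int(np.sum(pvals <= tau)), 1)   # guard against R = 0
    return pi0_hat * K * tau / R

def fdr_mixture(pi0, F1_tau, tau):
    """FDR at threshold tau under the two-component mixture model
    (equation (4)): null mass pi0*tau over total mass of P-values <= tau."""
    return pi0 * tau / (pi0 * tau + (1 - pi0) * F1_tau)

# Idealized all-null case: perfectly uniform P-values, so the equation (1)
# estimate with pi0 = 1 should equal 1 at any threshold.
pvals = np.arange(1, 1001) / 1000.0
est = fdr_eq1(pvals, 1.0, 0.01)
```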

The methods are listed for quick reference in the table below. Methods 1–8 estimate π_{0} and, as implemented herein, proceed to estimate FDR using equation (1). Method 9 uses a unique algorithm to estimate LFDR and does not supply an estimate of FDR. Methods 10–15 are based on a mixture model framework and estimate FDR and LFDR using equations (3) and (4), where the model components are estimated using different techniques. All methods were implemented using tuning parameter settings from the respective paper or ones supplied as default values with the code in cases where the code was published online.

Method | Citation | Source of code |

1 | Benjamini and Hochberg | GeneTS |

2 | Benjamini and Hochberg | GeneTS |

3 | Mosig et al. | Website

4 | Storey & Tibshirani | Qvalue |

5 | Storey, Taylor, Siegmund | Qvalue |

6 | Schweder and Spjøtvoll | Coded by us |

7 | Dalmasso, Broët, and Moreau | Author website |

8 | Langaas, Lindqvist, Ferkingstad | Limma |

9 | Scheid and Spang | Twilight |

10 | Pounds and Morris | Author website |

11 | Pounds and Cheng | Author website |

12 | Liao et al. | Author website

13 | Broberg | SAGx |

14 | Broberg | SAGx |

15 | Allison et al. | From authors

Most software was available as an R library.

First, to compare their differences, we used the 15 methods to analyze the original two data sets, with data set 1 having a "stronger signal" (i.e., lower estimates of π_{0} and FDR). Estimates of π_{0} from methods 3 through 15 ranged from 0.742 to 0.837 for data set 1 and from 0.852 to 0.933 for data set 2. (Methods 1 and 2 are designed to control rather than estimate FDR and are designed to be conservative; hence, their estimates were much closer to 1.) Results of these analyses can be seen in the Supplementary Material.

Next, using the two template data sets we constructed plasmode data sets in order to compare the performance of the 15 methods for estimating π_{0} (all methods), FDR (all methods except method 9), and LFDR (methods 9–15).

Two cases are shown representing A. π_{0} = 0.8 and B. π_{0} = 0.9, represented by the horizontal line in the two plots A and B, respectively.

Estimates calculated at two thresholds, τ = 0.01 (A and B) and 0.001 (C and D), are shown.

Shown are estimates of π_{0} using data set 2 when the true value of π_{0} is equal to 0.8 and 0.9. Methods 1 and 2 are designed to be conservative (i.e., true values are overestimated). With a few exceptions, the other methods tend to be conservative when π_{0} = 0.8 and liberal (the true value is underestimated) when π_{0} = 0.9. The variability of estimates for π_{0} is similar across methods, but some plots show slightly larger variability for methods 12 and 15 when π_{0} = 0.9.

Researchers have been evaluating the performance of the burgeoning number of statistical methods for the analysis of high dimensional omic data, relying on a mixture of mathematical derivations, computer simulations, and, sadly, often single-dataset illustrations or mere demonstrations.

As more high dimensional experiments with larger sample sizes become available, researchers can use a new kind of simulation experiment to evaluate the performance of statistical analysis methods, relying on actual data from previous experiments as a template for generating new data sets, referred to herein as plasmodes. In theory, the plasmode method outlined here will enable investigators to choose the statistical methods best suited to data with the characteristics of their own.

Our results also suggest that large, searchable databases of plasmode data sets would help investigators find existing data sets relevant to their planned experiments. (We have already implemented a similar idea for planning sample size requirements in HDEs.)

Other papers have used simulation studies to compare the performance of methods for estimating π_{0} and FDR (e.g., Hsueh et al.).

A key implication and recommendation of our paper is that, as data from the growing number of HDEs is made publicly available, researchers may identify a previous HDE similar to one they are planning or have recently conducted and use data from these experiments to construct plasmode data sets with which to evaluate candidate statistical methods. This will enable investigators to choose the most appropriate method(s) for analyzing their own data and thus to increase the reliability of their research results. In this manner, statistical science (as a discipline that studies the methods of statistics) becomes as much an empirical science as a theoretical one.

In terms of the table above, let M denote the number of genes declared significant at the designated threshold and M_{0} the number of those for which there is not a real effect. Then FDR = E[(M_{0}/M)·1_{{M>0}}], where 1_{{M>0}} is an indicator function equal to 1 if M>0 and 0 otherwise, and the expectation is taken over repeated experiments on the K genes tested.

Suppose we identify a template data set corresponding to a two-treatment comparison for differential gene expression for K genes, with treatment and control expression data Y_{trt} and Y_{ctrl}. Let μ_{i,trt} and μ_{i,ctrl} denote the mean gene expression levels for gene i in the treatment and control groups, and compute an effect size d_{i} from μ_{i,trt}−μ_{i,ctrl} for each gene.

For convenience, assume that the control group contains an even number of arrays, n_{ctrl}, so that it can be divided into two sets of n_{ctrl}/2 arrays each. Choose a value of π_{0} and specify a threshold, τ, such that a P-value ≤ τ is declared evidence of differential expression. Execute the following steps.

Sample without replacement a proportion 1−π_{0} of the K genes; call this set G^{*}. This set will denote those genes that will be differentially expressed.

Sample a proportion 1−π_{0} of effect sizes from the observed vector d using weighted probability sampling; call this set d^{*}. This will be the set of effect sizes used to differentially express genes. The weighted probability sampling allows for the fact that the original vector d contains effect sizes from both truly null and truly differentially expressed genes.

For each expression level in the plasmode treatment group and for each gene j in G^{*}, add the amount d_{j}^{*}s_{j,ctrl}, where s_{j,ctrl} is the standard deviation of expression for gene j in the control group.

Conduct a statistical test for differentially expressed genes on the plasmode data set and record the distribution of P-values. Determine which genes have P-values ≤τ.

Note that π_{0} and the set G^{*} are known, so a true value of FDR for this data set is available. This true value will change with each simulated data set since the set G^{*} and the vector d^{*} will be different in each simulation.

Apply a statistical method that estimates π_{0}, FDR, LFDR, and other quantities of interest. Estimates of FDR and LFDR are computed at a preset threshold τ. Some methods compute these estimates only at the observed P-values, in which case we interpolate between the estimates computed at the two nearest P-values above and below τ.

Repeat steps 1–6 B times. Record summary statistics such as the mean, standard deviation, and range of the true FDR over the B plasmodes, and the summary statistics from the estimates obtained from the statistical method that is being evaluated.

Choose another threshold τ and/or another value of π_{0} and repeat for a new simulation case.

One can then obtain another data set and repeat the entire process to evaluate a method on a different type of data, perhaps from a different organism having a different null distribution, or a different treatment type giving a different distribution of effect sizes.
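The steps above can be sketched end-to-end as follows. This is a simplified sketch under stated assumptions: a synthetic control-only template in place of real microarray data, unweighted effect-size sampling (the paper uses weighted sampling), and two-sample t-tests; all names are our own.

```python
import numpy as np
from scipy import stats

def make_plasmode(ctrl, d, pi0, rng):
    """Build one plasmode data set from a control-only template (steps 1-3).

    ctrl : (K, n) matrix of control expression levels (n even).
    d    : length-K vector of observed standardized effect sizes.
    Returns the two plasmode groups and the set of spiked genes.
    """
    K, n = ctrl.shape
    perm = rng.permutation(n)
    grp0 = ctrl[:, perm[: n // 2]].copy()            # plasmode "control"
    grp1 = ctrl[:, perm[n // 2 :]].copy()            # plasmode "treatment"
    n_alt = int(round((1 - pi0) * K))
    spiked = rng.choice(K, size=n_alt, replace=False)   # step 1: pick genes
    eff = rng.choice(d, size=n_alt, replace=False)      # step 2 (unweighted here)
    s = ctrl.std(axis=1, ddof=1)
    grp1[spiked, :] += (eff * s[spiked])[:, None]       # step 3: spike effects
    return grp0, grp1, set(spiked)

def true_fdr(pvals, spiked, tau):
    """True FDR at threshold tau, computable because the spiked set is known."""
    sig = np.flatnonzero(pvals <= tau)
    if len(sig) == 0:
        return 0.0
    false = sum(1 for g in sig if g not in spiked)
    return false / len(sig)

rng = np.random.default_rng(2)
K, pi0, tau = 2000, 0.9, 0.01
ctrl = rng.normal(size=(K, 10))            # synthetic stand-in for a template
d = rng.normal(2.0, 0.5, size=K)           # hypothetical effect sizes
fdrs = []
for _ in range(20):                        # B plasmode replicates (steps 4-7)
    g0, g1, spiked = make_plasmode(ctrl, d, pi0, rng)
    p = stats.ttest_ind(g1, g0, axis=1).pvalue
    fdrs.append(true_fdr(p, spiked, tau))
mean_fdr = float(np.mean(fdrs))
```

Summary statistics such as `mean_fdr` over the B replicates would then be compared against a candidate method's FDR estimates, exactly as in steps 6 and 7.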

π_{0} = A true proportion of genes for which there is no differential expression. This value is controlled by the experimenter in a simulation study.

1−π_{0} = the proportion of genes that are truly differentially expressed.

π̂_{0} = An estimate of π_{0} obtained using a statistical method on data from an HDE.

τ = A threshold set by the investigator below which P-values are declared statistically significant.

Boxplots plasmode simulations dataset 1.

(0.02 MB PDF)

Boxplots plasmode simulations dataset 2.

(0.02 MB PDF)

Plots of FDR & LFDR dataset 1.

(0.03 MB PDF)

Plots of FDR & LFDR dataset 1 at 0.9.

(0.03 MB PDF)

Plots of FDR & LFDR dataset 2.

(0.02 MB PDF)

Plots of FDR & LFDR dataset 2 at 0.9.

(0.03 MB PDF)

Methods comparison dataset 1.

(0.02 MB PDF)

Methods comparison dataset 2.

(0.02 MB PDF)

The authors thank Deanna Calvert for editing assistance and Vinodh Srinivasasainagendra for graphics assistance.