Stouffer’s Test in a Large Scale Simultaneous Hypothesis Testing

In microarray data analysis, we are often required to combine several dependent partial test results. To overcome this, many suggestions have been made in previous literature; Tippett’s test and Fisher’s omnibus test are most popular. Both tests have known null distributions when the partial tests are independent. However, for dependent tests, their (even, asymptotic) null distributions are unknown and additional numerical procedures are required. In this paper, we revisited Stouffer’s test based on z-scores and showed its advantage over the two aforementioned methods in the analysis of large-scale microarray data. The combined statistic in Stouffer’s test has a normal distribution with mean 0 from the normality of the z-scores. Its variance can be estimated from the scores of genes in the experiment without an additional numerical procedure. We numerically compared the errors of Stouffer’s test and the two p-value based methods, Tippett’s test and Fisher’s omnibus test. We also analyzed our microarray data to find differentially expressed genes by non-genotoxic and genotoxic carcinogen compounds. Both numerical study and the real application showed that Stouffer’s test performed better than Tippett’s method and Fisher’s omnibus method with additional permutation steps.


Introduction
We frequently encounter complex hypothesis testing problems that require the combination of several independent or dependent test results. For example, in our experiment motivating this work, we wanted to assess the pharmacological effects of carcinogens on gene expression levels. As carcinogens can be classified as genotoxic and non-genotoxic, depending on the mechanism of action, we wanted to understand the total mechanism of carcinogen action through the comparison of gene expression patterns for these two groups. The experiments were individually conducted for 3 genotoxic carcinogens (2-AAF, 39MeDAB and DEN) and 3 non-genotoxic carcinogens (clofibrate, DL-ethionine and 1,4-dioxane). Both had their own control group and the experiments were repeated three times. In both genotoxic and non-genotoxic experiments, we tested the expression levels of each compound and the common control. We thereby obtained three p-values, which are dependent on each other by sharing the common control group. In this example, the test for each compound is denoted as a partial test for the hypothesis on expression levels of the compound. The null hypothesis for the pharmacological effects of carcinogens is written as the intersection of three hypotheses for three compounds, which is denoted as a complex hypothesis.
A lot of previous literature has reported the combination of partial tests of a complex hypothesis [1]. Below, we have listed a few commonly used combining functions. Suppose we have K partial tests for the g-th gene and their p-values are p g 1 ,p g 2 ,:::,p g K . Tippett (1931) [2] proposed to use CT g T~m in K k~1 p g k ð1Þ whose null distribution, if the Kpartial tests are independent and continuous, is the minimum of K independent uniform random variables on (0,1): For a dependent partial test, it allows for bounds on the rejection probability according to the Bonferroni inequality. Fisher (1932) [3] proposed to use which is often called Fisher's omnibus test and Fisher's omnibus function. It is well known that if the Kpartial test statistics are independent and continuous, then the null distribution of CT g F follows a central X 2 distribution with 2K degrees of freedom. Stouffer (1949) [4] and Liptak (1958) [5] proposed to use where W(.) is the standard normal cumulative distribution function. Again, if the K partial test statistics are independent and continuous, then its null distribution is normally distributed with mean 0 and variance 1=K: The classical methods are well reviewed by Owen (2009) [6]. Unlike the case of independent partial test statistics, the exact distribution of the combined statistics are unknown if they are dependent to each other. The most common remedy to dependent partial test statistics is to approximate their null distributions using additional resampling-based procedure. The permutation procedure is the most common additional procedure to get the null distribution of the combined tests. However, when applied to the microarray data analysis, it has at least two shortcomings. First, the microarray data analysis commonly tests a huge number of genes simultaneously, in which the test of each gene is based on combining several dependent partial tests; this arises in our motivating example, which will be introduced in the section ''Materials and Model''. Thus, it is computationally heavy. Second, the multiple testing procedure in microarray data analysis often makes a decision among small p-values; the thresholding value to find DEGs is quite small. Thus, the required number of permutations should also be very large. For example, to approximate well the true p-value p -5 , the number of permutation samples should be much larger than p 5 : This, along with the first shortcoming, makes the resampling-based procedure inconvenient for large-scale microarray data analysis (see Westfall and Young (1993) [7] and references therein).
Some theoretical approximations to the null distribution are also reported in the literature. Brown (1975) [8] assumes that the partial statistics are from multivariate normal distribution with known covariance matrix, and develops an approximation to the null distribution of Fisher's statistic CT g F : Kost and McDermott (2002) [9] extend the Brown's approximation to the partial statistics, which are distributed as multivariate t-distribution with common denominator. Recently, Yang (2010) [10] thoroughly compare existing approximations to the null distribution of Fisher's statistic, which include the aforementioned two approximations and that based on permutations.
Despite much interest, all existing approximations on dependent partial test statistics are focused on single hypothesis testing, and their applications to multiple hypothesis testing are not studied much. In multiple testing problem, many replications of partial test statistics are available and their dependent structure (covariance matrix) can be estimated from data. For example, in microarray data to find differentially expressed genes (DEGs), we would compute a set of partial test statistics for each gene, and have many sets of partial test statistics; the number of sets is equal to the number of genes in the data.
In this paper, we propose to estimate dependence structure (covariance matrix) of partial test statistics from this replication over genes. In particular, we are interested in CT g z , Stouffer's test, which, in theory, is normally distributed with mean 0 and unknown variance even for ''dependent partial test statistics''. Thus, the dependence of partial test statistics can simply be estimated by estimating their covariance matrix. Our newly proposed procedure can be interpreted as a data dependent version of Brown (1975) [8] and Kost and McDermott (2002) [9]. They make theoretical derivation (approximation) on null distribution of combined test statistic under strong distributional assumption to the data. On other hand, we assume the covariance matrix of partial test statistics are same over genes, and estimate it from observed statistics. The null distribution of CT g z , Stouffer's test statistic, is evaluated as the normal distribution with mean 0 and the variance from the estimated covariance matrix.
We proposed a two-step procedure to estimate var(CT g z ): In the first step, we conservatively chose the number of true null genes by plotting the histogram of CT g z : We chose genes that satisfied DCT g z Dc for an appropriately chosen c to guarantee the selected genes are surely equally expressed. Details on the choice of c are followed in the section ''Combined test using z-scores''. In the second step, we estimated the common covariance matrices of z g~( z g 1 ,:::,z g K ) from pre-detected null genes in the first step, where z g k~W {1 (1{p g k ): The first step might not be required if the proportion of null gene was close to one.
In this study, we describe (1) the preparation of samples that motivated the experiment, (2) the procedure to find differentially expressed genes (DEGs) using Stouffer's z-score based method, (3) the numerical comparison of the receiver operating characteristic (ROC) curves of the Stouffer's test CT g z and two p-value based procedure, CT T and CT F , and (4) the investigation of differentially expressed genes (DEGs) between (i) control and nongenotoxic compounds, and (ii) control and genotoxic compounds.

Ethics Statement
All animal procedures were approved by the Institutional Animal Care and Use Committee of NIFDS (0901KFDA029).

Microarray Design
We conducted a series of microarray experiments to understand carcinogenicity of compounds in view of toxicogenomics. There were six target compounds: 2-acetaminofluorene (2-AAF), 39methyldimethylaminoazobenzene (39MeDAB), N-nitrosodiethylamin (DEN), clofibrate, DL-ethionine and 1,4-dioxane, which are well-known carcinogens. Among them, 2-AAF, 39MeDAB and DEN have genotoxicity; it was reported that 2-AAF and 39MeDAB bind to DNA and cause hepatocarcinogenesis by DNA adduct formation [11] [12], while DEN has genotoxicity since a mutagenic effect was observed in the comet assay [13]. Clofibrate, DL-ethionine and 1,4-dioxane are classified as nongenotoxic carcinogens. It was reported that clofibrate, a peroxisome proliferator, stimulates the peroxisomal fatty acid betaoxidation system and accordingly causes non-genotoxic cancer [14]. DL-ethionine was related to intrachromosomal recombination [15], and 1,4-dioxane was classified as a non-genotoxic since it had no activity in DNA repair assay [16]. Each compound dissolved in vehicle control was administered into rats. Once toxicity appeared, microarray was performed by using livers taken from the rats. To adjust errors in animal study, each compound was repeatedly administered into three individual rats (table 1). Only vehicle control was administered into three individual rats to produce control group. The structures of the three genotoxic and three non-genotoxic compounds are shown in Figure 1 and 2. And control was corn oil. The data sets are available from GEO, GSE31307.

Statistics Design
The experiments for genotoxic and non-genotoxic agents are designed to have four arms: one arm for the control and three arms for the carcinogenic compounds. Each arm has three replicates that record the expression levels of 30,199 genes. The main goal of the study was to find a list of genes that show differential expression between (A1) control versus genotoxic carcinogen, (A2) control versus non-genotoxic carcinogen. To do this, we tested the hypothesis for the g-th gene with g~1,2,:::,G, H g 0 : the expression level of the g-thgene is not differentially expressed. Thereby, three partial test results for three genotoxic (or non-genotoxic) compounds were obtained and the results were combined to test the hypothesis H g 0 :

Results and Discussion
Combined Test using z-scores We recalled the hypothesis for the g-th gene H g 0 that the g-thgene is not differentially expressed between two comparison groups, the control versus each of the genotoxic (or non-genotoxic) compounds. As stated in the experimental section, we had three partial tests for three compounds, 2-AAF, 39MeDAB, and DEN (or clofibrate, ethionine, and 1,4-dioxane). We let t g k , z g k , and p g k , for, k~1,2,3, g~1,2,:::,G, be their testing statistics, z-scores, and p-values.
First, let us consider the tests for the aims (A1) and (A2) in the experimental section. For these goals, Tippett's test and Fisher's omnibus test would be the two most common combining functions to test H g 0 : The reference distributions for both are well understood when test statistics t g k s are independent of each other. However, in our example, test statistics of partial tests share a common control group, and they are dependent on each other. Thus, null distributions of both CT T and CT F are not available analytically.
The re-sampling-based procedures, or permutation procedure more specifically, are commonly used to approximate the null distributions of CT g T , and CT g F : In this paper, we proposed to use z-score rather than p-value. The z-score for gene g is defined as z g k~W -1 (G 0 (t g k )), for k~1,2,3,g~1,2,:::,G, where G 0 is the cumulative distribution function (CDF) of the test statistics t g under the null (A1) or (A2). Here, G~30199: Under the null, it has the standard normal distribution. We proposed to use Stouffer's statistic and, under the null, it has a normal distribution with mean 0 and variance where s g ij~c ov(z g i ,z g j ): In practice, s g ij s are unknown but can be estimated from the replicates of (z g 1 ,z g 2 ,z g 3 ) over genes under the assumption that they are equal over equally expressed genes. Now we introduce the two-step procedure to estimate the null distribution. In the first step, we select the null genes by plotting the histograms of CT g z : We select the genes with DCT g z Dc for an appropriately chosen c: The thresholding value c is chosen to satisfy ''zero assumption'' by Efron (2004) [17], which is also termed ''purity'' by Genovese and Wasserman (2004) [18]. The details on the choice of c is very same with that in Efron (2004) [17]. We then estimate the covariance matrices of |z g~( z g 1 ,z g 2 ,z g 3 ) using the sample covariance matrixes of the selected genes. To be specific, In summary, the procedure of this paper is: S1) Find the z-scores À z g~( z g 1 ,z g 2 ,z g 3 ) for g~1,2,:::,G: S2) Compute the Stouffer's test statistic CT g z~( z g 1 zz g 2 zz g 3 )=3: S3) Find the set I~fg : DCT g z Dvcg, for c satisfying Efron's zero assumption.
S4) Compute the estimate of variance of CT g z , which is notated asv v g :

Numerical Study
We numerically compared the powers of Stouffer's test, Tippett's test and Fisher's omnibus test. In comparison, we additionally consider Dunnett's test (Dunnett (1955) [19]), which tests the difference between multiple treatment groups to a single control group. The p-values of Tippett's test and Fisher's omnibus test were approximated by additional permutation procedures.
For the study, we generated data sets that had the same structure (or dimension) as that of our motivating example. To be specific, we had one control group and three compound (agent) groups and changed the number replicates in each group to 5 and 20. We fixed the total number of genes at 4,000, and the number of differentially expressed genes (DEGs) at 200. Thus, the number of equally expressed genes (EEGs) was 3,800. Let y g ik be the expression level of the g-th gene for the k-th replicates of the i-th compound. Here, g~1,2,:::,4000, i~1,2,3 and k~0,1,:::,5 (or 20); i~0 indicated the control group. We assumed that, for the control group, y g 0k~e g 0k ; for the compound groups i~1,2,3, where m g is fixed as 1.5, e g ik s are independently and identically distributed (IID) from the normal distribution with mean 0 and variance 1; the t-distribution with 5 degrees of freedom (dfs); and the gamma distribution with parameters 3 and 1. We used 5,000 permuted samples in doing CT T and CT F : Figure 3 compared the receiver operating characteristic curves (ROCs) of three methods, CT T , CT F , and CT z for each error distribution. Figure 3 (a, b) are for the standard normal distribution; Figure 3 (c, d) are for the t-distribution with degrees of freedom 5; and Figure 3 (e, f) are for the gamma distribution with parameters 3 and 1. Here, the number of replicates (k in the model above) in (a), (c), (e) is 5, and the number of replicates in (b), (d), (f) is 20.
The ROC curve plots two accuracy measures, false positive rate (FPR) and true positive rate (TPR), of a test, where the FPR is the probability that the test mistakenly detects the EEGs as DEGs, and the TPR is the probability that it correctly detects the DEGs. In each data set, we estimated their probabilities as portion F b P PR~# of EEGs whose H g 0 is rejected: total # of EEGs: and T b P PR~# of DEGs whose H g 0 is rejected: total # of DEGs: with various critical values for tests. The plot is the average of F b P PRs and T b P PRs over 20 data sets simulated from the model described above. It shows that Stouffer's test has the highest statistical power (specificity) in detecting differentially expressed genes, whereas Tippett's procedure performs worst among three procedures based on p-values or z-scores. In addition, the computational burden of CT z was much lighter than CT F , which relied on an additional re-sampling procedure. Dunnett's test performs worse than all others in overall. Table 2 summarized rank statistics of 200 DEGs among 4000 genes. In the table, we reported the averages of the quantiles of ranks of 200 DEGs (among 4000 genes) over 20 simulated data sets. Lower ranks of DEGs tells the higher detectability of DEGs by a given procedure. Table 3 summarized the true FDR from 20 data sets. For each simulated data set, the FDR was controlled using q-value by Storey [20]) and the true FDR was evaluated (in the simulated data set, we have a list of 200 DEGs). When sample size is small (N = 5), the FDRs are slightly larger than the targetted FDR in all procedures. However, as the sample size increases (N = 20), the FDR is better controlled at the targetted level (0.1 or 0.2 in the table). In particular, the FDR levels of Tippett's test and Fisher's omnibus test are close to the aimed level, whereas the Stouffer's test gives the FDR value lower than the target. This indicates that the proposed procedure using Stouffer's test is rather conservative in detecting DEGs. It is reassured from the data example in next section.

Data Example
We analyzed the microarray data described in the experimental section. The data sets are available from GEO (Gene Expression Omnibus; http://www.ncbi.nlm.nih.gov/geo/query/acc. cgi?acc = GSE31307). The data aimed to understand genotoxicity and carcinogenicity of compounds in view of systems toxicology. In the microarray data, 2-AAF, 39MeDAB and DEN were known    Usage of Z-Scores by Stouffer's Test PLOS ONE | www.plosone.org as genotoxic carcinogens, and clofibrate, ethionine and 1,4dioxane were known as non-genotoxic carcinogens. We preprocessed the data before the analysis using the RMA (Robust Multi-array Average) method. The RMA method was developed by Bolstad et al. (2003) [21] for the normalization of Affymetrix GeneChip array data. The RMA method is a threestep procedure: (1) Background correction and data transformation; (2) Normalization at the probe level using the quantile method; and (3) Model parameter adjustment to get normalized expression levels. The RMA method is freely available from the bioconductor website (http://bioconductor.org).
We found the DEGs from the control in non-genotoxic and genotoxic carcinogens. In each of non-genotoxic and genotoxic carcinogens, we assumed that y g ik for i~1,2,3 had the common mean m g 1 whereas y g 0k had the mean m g 0 , for the g-th gene. We tested H g 0 : m g 1~m g 0 using CT g T , CT g F and CT g z : The p-values of CT g T and CT g F were approximated by the permutation method with 5,000 permutation samples. On the other hand, the p-value of CT g z was computed as and s g ij~c b o ov(z g i ,z g j ),which was estimated as the sample covariance  of z g i and z g j over g~1,2,:::,30199: The DEGs were selected to make Type I error be smaller than 0.05 in all cases. Figures 4 and 5 are Venn diagrams showing how the selected genes were distributed in the three methods. Figure 4 is the plot for the genotoxic carcinogens and Figure 5 is that for non-genotoxic carcinogens. The figures showed that the selected genes by z-score method were more similar to those by Fisher's method than by Tippett's method. This would not be unexpected as it can be seen in both CT z and CT F averaged outcomes of three partial tests whereas CT T took the minimum of them.
The number of selected genes is summarized in Table 4. The genes, which were obtained through three different methods, were interpreted based on DAVID gene ontology database. In case of Tippett's method, between 261 DEGs (a = 0.001) of genotoxic carcinogen, 10 genes such as Opa1, AIMP, YWHAB, etc., were considered as marker candidates among 259 genes whose expression was altered in the presence of genotoxic carcinogen. The other two genes were excluded as they revealed the same response when the non-genotoxic carcinogen administered. Similarly, 12 genes such as Stk3, CTNNBL1, PPP1F15A, etc., were selected among 266 genes as promising marker candidates by Fisher's method and 17 genes such as PPP2R2B, tnfrsf11b, acvr1, etc., were selected among 216 genes by Stouffer's method. Consequently, the proportion of marker candidates in selected genes obtained by Stouffer's method was the largest among those three methods. In particular, 9 genes such as pawr, Opa1, STK17B, AEN, FASTKD3, YWHAB, Stk3, ALS2 and CIDEC demonstrated high significance consistently in more than two different methods.
Through screening by Tippett's method, 88 DEGs (a = 0.001) of non-genotoxic carcinogen were chosen. Among 86 genes, whose expression was transformed in the presence of non-genotoxic carcinogen within 88 genes, 15 genes such as snai1, mgea5, Pcsk6, etc., were further selected as marker candidates according to their mechanisms. In addition, 25 genes such as EDNRA, Pgcp, ABHD2, etc., were chosen as potent maker candidates among 239 genes by Fisher's method and 28 genes such as MTDH, GCH1, ADH4, etc., were selected among 160 genes by Stouffer's method. As a result, the proportion of marker candidates in selected genes obtained by Stouffer's method was the highest among those three methods. Remarkably, 6 genes such as Tsg101, EDNRA1, EDNRA2 EDENRA3, PRKAR1A and HDAC2 are the more valuable markers of non-genotoxic carcinogen as they demon- strated high significance consistently in more than two different methods.
Pathway analysis was conducted on selected genes. It finds relationship between the genes and shows why genes have become differentially expressed genes (DEGs). Figures 6 and 7 are demonstrations of search for all relations registered in the database on the subject, the six DEG lists, which were the search result of pathway analysis, and expressed them as a pathway map. Figure 6 is a result of pathway analysis on the genes whose properties were significantly changed in the control group and genotoxic carcinogen-treated group, and is expressed in the order of (a) Tippett, (b) Fisher, and (c) Stouffer. As each map showed a similar level of networking, we discovered that the three methods were similarly efficient when searching for genes whose properties changed at the time of injecting genotoxic carcinogen. This can be interpreted as the chemical structure of genotoxic carcinogen having high affinity for DNA and damaging the genome in a short time, so that the range of changing gene properties is wide and the sorting process is relatively easy. On the other hand, in Figure 7, which compares non-genotoxic carcinogen and control, the map for (a) was not formed. Fisher's method (b) was also insufficient to find the largest part of a network.
The validity of the selected gene list is verifiable through the contents of the gene. In Figure 6, the center gene of the network was mapk1 in all results upon the injection of genotoxic carcinogen. This result matched with prior findings of existing research, because mapk1 is a major element for the Ras-MAPK network, a representative mechanism for cell proliferation signal transduction. The signal mechanism of Ras-MAPK, a representative path of proliferation stimulation, reveals a network to determine whether the cell should proliferate, stop proliferating, or die; this signaling mechanism is a core issue in the research on cancer. Therefore, the Ras-MAPK mechanism can be an indicator to examine the genotoxicity of chemicals. The results showed that mapk1 was included in the map upon genotoxic carcinogen injection as shown in Figure 6 (a) (b) (c), but it was not included upon non-genotoxic carcinogen injection as shown in Figure 7 (a) (b) (c). Thus, all of the analysis methods showed different efficiencies, but the direction chosen by each was reasonable.

Conclusions
In this paper, we revisited Stouffer's method based on z-scores and showed that it was useful for large-scale microarray data analysis. In particular, when we combined dependent partial test results, unlike that in other popular methods such as Tippett's test and Fisher's omnibus test, the null distribution of the combined statistic in Stouffer's test could be easily estimated from the data and did not require any additional numerical procedure. The numerical study showed that Stouffer's test had higher true positive rates (TPR) than Tippett's method and Fisher's method when we tested a large number of hypotheses simultaneously. In addition, real data analysis also showed the advantage of Stouffer's method over the others.
We conclude the paper with brief discussions on several issues, which are not fully covered in the main text.
First, in both numerical study and data analysis, the false discovery rate (FDR) was estimated using the q-value method by Storey (2002) [20]. We found that the q-value method with the Stouffer's test often overestimated the FDR and provided a conservative list of DEGs; it only detected a smaller number of genes, which were much differentially expressed, than other methods. This phenomenon is from deviations between theoretical (which is used in this paper) and empirical distribution of test statistics. One simple remedy for this would be the empirical Bayes (EB) procedures by Efron (2007) [22], Schwartzman et al. (2008) [23], and many others. They estimate the null distribution of testing statistics from data, and they are robust to any deviation from theoretical assumption.
Second, the reviewer points out difficulty from the dependence of expression levels among genes. As discussed in Efron (2007) [22], it is one source of deviation between theoretical and empirical distribution of test statistics (not only CT g z , but also CT g T and CT g F ). Again, the EB procedures would be a good remedy for this difficulty.
Finally, in recent, Benjamini and Heller (2008) [24] introduce partial conjunction hypothesis on the number of false null hypothesis, which is much general than the complex hypothesis considered in this paper. They propose a procedure to combine pvalues (of partial tests) to test the hypothesis, and it would be interesting to investigate the advantages of using z-scores over pvalues.