Simultaneous Statistical Inference for Epigenetic Data

Epigenetic research leads to complex data structures. Since parametric model assumptions for the distribution of epigenetic data are hard to verify we introduce in the present work a nonparametric statistical framework for two-group comparisons. Furthermore, epigenetic analyses are often performed at various genetic loci simultaneously. Hence, in order to be able to draw valid conclusions for specific loci, an appropriate multiple testing correction is necessary. Finally, with technologies available for the simultaneous assessment of many interrelated biological parameters (such as gene arrays), statistical approaches also need to deal with a possibly unknown dependency structure in the data. Our statistical approach to the nonparametric comparison of two samples with independent multivariate observables is based on recently developed multivariate multiple permutation tests. We adapt their theory in order to cope with families of hypotheses regarding relative effects. Our results indicate that the multivariate multiple permutation test keeps the pre-assigned type I error level for the global null hypothesis. In combination with the closure principle, the family-wise error rate for the simultaneous test of the corresponding locus/parameter-specific null hypotheses can be controlled. In applications we demonstrate that group differences in epigenetic data can be detected reliably with our methodology.


Introduction
Epigenetic mechanisms, such as deoxyribonucleic acid (DNA) methylation, constitute a central principle of gene regulation. In contrast to other forms of regulation, e. g., transcriptional or translational control, DNA methylation occurs without changing the primary DNA sequence, see [1]. It refers to the selective addition of a methyl group to the 5 0 -carbon of the cytosine base and occurs exclusively in the dinucleotide cytosine phosphate guanine (CpG). DNA methylation occurs non-randomly and, if the target CpGs are located in the proximity of coding regions, is often associated with inactive gene expression. Oppositely, demethylation of CpG in regulatory elements is commonly accompanied by activation of expression. Shifts in DNA methylation have been observed in cells for various diseases. These changes reflect the loss of Furthermore, several confounding factors of DNA methylation have been identified in previous work, e. g., methylation is known to be highly correlated with age. A test for case-control methylation data addressing this dependency was developed in [10] and extended in [11].
Several parametric models for the distribution of the β-values have been proposed, see [12]. Their parametric nature limits their applicability in practice. A nonparametric analysis of methylation data was suggested in [13] and [14]. However, a formal notion of multiplicity adjustment is lacking in their work. In the present paper we develop a nonparametric statistical framework for tight FWER control in the context of analyzing epigenetic methylation studies, taking the described dependencies among loci into account, leading to multivariate procedures.

Methods
Throughout the remainder, we label reported results from the literature as propositions. Our major own mathematical contribution is Theorem 1.

Basic Model
Suppose that we have two experimental groups denoted by A and B, for instance given by a disease status. We consider N 2 N observational units with n A observables in group A and n B in group B, such that N = n A +n B . We assume that all N observables are stochastically independent and that the observations in group i 2 {A,B} are realizations of independent and identically distributed (iid.) d-dimensional random vectors X ik ¼ ðX ð1Þ ik ; . . . ; X ðdÞ ik Þ > ; where the index i 2 {A, B} denotes the group and 1 k n i indexes the k-th observational unit within group i, while the superscript denotes the coordinate. The random vectors are assumed to follow the distribution L(X A1 ) = P or L(X B1 ) = Q, respectively. Example 1 (Identifying differentially methylated CpG loci) Consider an epigenetic methylation dataset comprising d CpG loci. For each locus ℓ a methylation ratio (occasionally referred to as β-value) is defined as where M (ℓ) (U (ℓ) ) is an intensity value for the amount of methylated (unmethylated) cells at locus ℓ, where we assume that suitable preprocessing steps have been performed prior to the statistical analysis. In previous literature the family of beta distributions has been considered as a model for the distribution of X (ℓ) , e. g., in [15]. However, often bimodality and skewness are encountered, questioning this parametric assumption. Notice also that numerator and denominator in Eq (1) are highly dependent. As we are not aware of a model capturing the aforementioned distributional characteristics, we propose a nonparametric approach as in [14]. An application of our general methodology to a two-sample problem involving such data is presented in Section "Identification of differentially methylated CpG loci". Example 2 (Group differences for immune relevant parameters) As a second example, consider the comparison of human colorectal tissue for two different stages of cancer as well as healthy controls. In Section "Association of immune cell counts with cancer" we analyze data from a study in which three immune relevant parameters were measured utilizing novel epigenetic markers based on methylation signatures in tissue. Since no prior information about distributional properties of these marker data are at hand, our nonparametric approach is applied, only making use of our basic model assumptions. Three two-group comparisons are made regarding differences of the immune relevant parameters between the disease stages.

Aim of the statistical analysis
We denote by F i the cumulative distribution function (cdf) of X i1 with marginal cdfs F ð'Þ i for each coordinate 1 ℓ d. We are interested in testing two families of marginal hypotheses, say H = (H ℓ :1 ℓ d) and H 0 ¼ ðH 0 ' : 1 ' dÞ. The family H corresponds to marginal homogeneity in the sense of [16]. This means, one is interested in testing which of the coordinate-specific marginal distributions are the same in both groups A and B, i. e., The family H 0 corresponds to finding a particular type of coordinate-specific differences. Namely, one is interested in detecting coordinates in which there are group differences in the central tendencies of the marginal distributions. To this end, recall the definition of the relative effect in the sense of [17].
Definition 1 (Relative effect) Let X A and X B denote two stochastically independent random variables which are defined on a common probability space with probability measure P. Assume that X A and X B have non-degenerate distributions and denote the normalized version of their cdf, as considered in [18], by F A and F B , respectively. Then, the relative effect of F A with respect to F B is defined as For a d-variate distribution the relative effects can be defined coordinate-wise for each 1 ℓ d by Let p AB ¼ ðp ð1Þ AB ; . . . ; p ðdÞ AB Þ > denote the vector of marginal relative effects in the latter case.
The functional p AB is capturing central tendencies, i. e., whether realizations of one of the distributions are tending to larger values than the ones from the other. Hence, we let H 0 ' : In the remainder, we make use of the notation and refer to H 0 as the global hypothesis in H. An analogous notation applies for intersection hypotheses in H 0 .

Test statistics and multiple test procedures
For the univariate nonparametric two-sample problem, i. e., for testing one particular hypothesis H ℓ , Wilcoxon's rank sum test (or, equivalently, the Mann-Whitney U test) is commonly applied. We make use of multivariate generalizations described in [19] (for testing H), and in [20] (for testing H 0 ).
Wilcoxon-Mann-Whitney (WMW) statistic. Definition 2 (Mann-Whitney U-Statistic) For 1 ℓ d, we let Proposition 1 (cf. Theorem 2 (iii Ã ) of [19]) Assume that n i /N ! τ i 2 (0,1) for N ! 1, i 2 {A,B}. Then, under H 0 , it holds that where S = (σ ℓr ) 1 ℓ,r d with entries where 1 j n A and 1 k 6 ¼ k 0 n B . In Eq (2) and throughout,! D denotes convergence in distribution, and N d (μ,S) stands for the d-variate normal distribution with mean μ and covariance matrix S. Corollary 1 (Theorem 9.1 in [19]) LetŜ be a consistent estimator of S. Assuming that det (S) > 0 it holds that Assuming that V N converges to a positive definite covariance matrix V as N ! 1, it holds that Furthermore, in (4.6) of [20] a consistent estimatorV N defined via the ranks of the observations has been provided.
Corollary 2 Making use of Proposition 2 and a Studentization byV N , it follows by Slutsky's lemma in analogy to Corollary 1 that, under H 0 0 : p AB ¼ 1 d =2, the statistic testing H or H 0 , respectively, is a vector of measurable mappings (individual tests) from the sample space into {0,1} d . In this, the event {ϕ ℓ = 1} means rejection of the ℓ-th null hypothesis H ℓ or H 0 ' , respectively. For given distributions P and Q, the FWER of φ is defined as the probability under (P,Q) of at least one type I error, i. e., where I 0 (P,Q) {1,. . .,d} denotes the index set of true null hypotheses in H or H 0 , respectively. The multiple test φ is said to control the FWER strongly at a given level α 2 (0,1), if FWER (P, Q) (φ) α for all possible pairs (P,Q).
One construction principle for FWER-controlling multiple tests is the closed test principle according to [21]. The general idea behind this method is to add to the system of hypotheses of interest all their intersections H S or H 0 S , respectively, where S 2 2 {1,. . .,d} . Even if these intersection hypotheses are not of scientific interest, they are tested auxiliarly in order to provide a multiplicity correction. Namely, a closed test procedure tests every such intersection hypothesis at full level α by an arbitrarily chosen level α test φ S or φ 0 S , respectively. The adjustment for multiplicity is then performed via the decision rule that only those coordinate-specific hypotheses H ℓ or H 0 ' , respectively, are rejected for which all intersection hypotheses H S (H 0 S ) with ℓ 2 S have been rejected by φ S (φ 0 S ). Thus, the price to pay for the multiplicity of the problem is that one has to perform 2 d tests. A concise description of this principle can for instance be found in Section 3.3 of [9].
Remark 1 Application of the closed test principle is particularly convenient in our context by noticing that the assertions of Propositions 1 and 2 and their corollaries remain valid if the respective full d-dimensional vector of test statistics is replaced by a subvector which only contains the indices in the subset S to which φ S or φ 0 S , respectively, refer. In the corollaries, only the degrees of freedom of the asymptotic χ 2 -distributions have to be changed from d to jSj.
Resampling-based approach. The results from the previous sections can also be used to construct asymptotically pivotal statistics for usage in a resampling approach. This strategy is assumed to keep α more accurately for finite N than the asymptotic methods resulting from Corollaries 1 and 2. In [22], multivariate multiple permutation tests have been developed for more restrictive families of hypotheses than H or H 0 , namely, for families where differences of coordinate-specific functionals of P and Q, respectively, are of interest. In contrast, the relative effect depends both on P and on Q. In Theorem 1 we adapt the theory derived in [22] to the case that multivariate relative effects are of interest. Thereby, we obtain an asymptotically FWER-controlling resampling procedure based on the statistic W N or W U N , respectively. To this end, let π denote an arbitrary but fixed permutation of the set {1,. . .,N} and let X p ¼ ðX p A1 ; . . . ; X p An A ; X p B1 ; . . . ; X p Bn B Þ be the matrix containing the permuted observation vectors from X = (X A1 ,. . .,X An A , X B1 ,. . .,X Bn B ). We make the convention that the first n A columns of X and X π correspond to group A and the remaining n B columns to group B. Denote by τ = τ(π, n A ,n B ) the fraction of observations from group B within the first n A columns of X π , and let Analogously, letp p AB ¼p A 0 B 0 ðX p Þ denote the estimator of the vector of relative effects based on the permuted data set X π . A simple calculation yields that Theorem 1 Under the general setup from above, assume that the sample sizes n A and n B fulfill the regularity assumptions given in Lemma 5.3 of [23] as N ! 1. Define the statistic whereV^p N denotes the estimator from (4.6) in [20] applied to X π . Then, the permutation distribution of W p N (i. e., its discrete distribution induced by letting π be uniformly distributed on all N! possible permutations of the set {1,. . .,N}, while keeping the data X fixed), the cdf of which we denote byR^W N , satisfies A result analogous to Theorem 1 can be obtained for the statistic W U N . Based on them, an asymptotic null distribution for W N or W U N , respectively, is given by its permutation distribution. This permutation distribution, in connection with Remark 1, can be used instead of w 2 jSj in order to calibrate each test ϕ S or 0 S , respectively, for type I error control at level α. Proof of Theorem 1. We approximate the conditional distribution (given X) of X π by an asymptotically equivalent unconditional two-groups model. To this end, denote by Z = (Z 1 ,. . ., Z N ) a random matrix, the columns of which are stochastically independent such that the first n A columns are distributed as P 0 and the remaining n B columns are distributed as Q 0 . Following the argument of Theorem 3.5 in [20] the statistic T N (Z) has asymptotically a centered d-variate normal distribution with some covariance matrix which is non-degenerate for eventually all large N under our general assumptions. Also, we note that bothp p AB and p p AB consistently estimate p p AB . Applying the reasoning of Lemma 5.3 in [23], together with the continuous mapping theorem and Slutsky's lemma, completes the proof.

Computer simulations
In this section we consider the performance of the proposed tests in terms of type I error control and power. To this end we present results of computer simulations under the following model. The dependency between the marginals is modeled by the correlation matrix R of a Gaussian copula C S , where S is the covariance matrix originating from R and the marginal variances induced by the shape parameters. The correlation matrix R = (R ℓ,r ) is of AR(1) structure, i. e., R ℓ,r = ρ jℓ−rj , 1 ℓ,r d, where ρ takes values in {0,0.2,0.4,0.6,0.8}. This model is motivated by interpreting coordinates as epigenetic loci and considering a decreasing strength of dependency with increasing epigenetic distance.
First, we assessed the accuracy of the χ 2 approximation (Proposition 2 in connection with Corollary 2) and the permutation-based approximation (Theorem 1) of the null distribution of W N , respectively, under the global null hypothesis. The empirical type I error rate was calculated as the relative frequency of occurrences of type I errors when testing the global null hypothesis (d 1 = 0), i. e., Ifφ ðkÞ ðx ðkÞ Þ ¼ 1g; where φ (k) denotes the test of the global hypothesis H 0 in the k-th of K simulation runs and x (k) the pseudo-sample in simulation run k. The empirical power of the test of the global null hypothesis was calculated as the same frequency for the cases with d 1 > 0.
Second, type I error control of the multiple tests employing the closure principle was assessed by the FWER. Empirical values of the FWER were calculated as the relative frequency of the occurrence of at least one type I error, i. e., If91 j d 0 : φ jðkÞ ðx ðkÞ Þ ¼ 1g; where φ (k) = (φ 1(k) ,. . .,φ d(k) ) > stands for the multiple test in the k-th simulation run. For the sample size N, we considered two different regimes, namely moderate (n A = 20, n B = 30) and large (n A = 100, n B = 150). Tables 1 and 2 display empirical type I error rates for testing the global hypothesis in the moderate and large sample regimes, respectively. The empirical power for testing the global hypothesis is presented in Tables 3 and 4. Finally, Table 5 displays empirical values of the FWER, both in the moderate and in the large sample regime. The nominal significance or FWER level, respectively, was set to 5% in all simulations. The permutation test was carried out as a Monte Carlo permutation test employing 9,999 randomly chosen permutations of {1,. . .,N}, together with the identity permutation.
In both sample size regimes, the empirical type I error rate of the permutation test is below the desired level of 0.05, indicating its applicability even for moderate sample sizes. In contrast, the test depending on critical values from the limiting χ 2 distribution performs liberally in all simulation settings displayed in Tables 1 and 2. With increasing dimension this test even becomes more and more liberal. For example, its empirical type I error rate rises up to 20% for d = 10. On the other hand the stronger the dependency between the coordinates, the less liberal the χ 2 -based test.
Of course, the more stringent type I error control of the permutation test, compared with the asymptotic χ 2 -based test, leads to lower power, see Tables 3 and 4. However, the differences in power become smaller for increasing δ.
Regarding the empirical FWER (Table 5), we again observe that the permutation test keeps the level better than the χ 2 -based multiple test, where level exceedances of the latter occur for large δ and small ρ > 0 in the moderate sample size regime.

Empirical illustration
In this section, we present applications of the proposed methods to two epigenetic studies. We applied the multiple tests based on the statistics defined in Section "Test statistics and multiple test procedures" in combination with the closure principle and Remark 1.
On one hand, we re-analyzed a representative study utilizing a whole genome approach, which aimed at the discovery of novel epigenetic markers to distinguish healthy (or good prognosis) donors from those with disease (or bad prognosis). The primary statistical challenge of such studies is the high number of locus-specific tests based on a sample with a moderate number of observations. On the other hand, we re-analyzed a data set regarding three immune  Table 5. Empirical family-wise error rates. Simultaneous Statistical Inference for Epigenetic Data relevant parameters which were derived from cell type specific real-time PCR markers in previous work (see, e. g., [3]). Identification of differentially methylated CpG loci. The UK Ovarian Cancer Population Study (see [26]) aimed at detecting differentially methylated loci between ovarian cancer cases and healthy controls (GEO accession number GSE19711). To this end, 274 healthy controls were compared with 131 untreated, confirmed ovarian cancer cases. Upon rigid quality control, 264 controls and 124 cases remained in the study. When applying our method, we randomly assigned 176 and 84 controls and cases, respectively, to the screening sub-sample of a twostage selection approach (cf. [27] and references therein). We applied the univariate two-sample Wilcoxon test at each locus on the screening sample and ranked the resulting p-values in ascending order. The remaining 88 and 40 control and case subjects, respectively, were used for the confirmatory analysis (second step). The ten top-ranked loci from the screening stage were tested for a relative effect unequal 1/2 based on asymptotic critical values from the limit distribution (χ 2 ) and permutation-based critical values (Perm) on the confirmatory group. In Table 6, the results are presented as multiplicity-adjusted p-values. For locus 1 ℓ 10, the multiplicity-adjusted p-value denotes the smallest FWER level such that H 0 ' is rejected by the respective multiple test procedure. With both methods, all ten candidate CpG sites have a multiplicity-adjusted p-value below 5%, an FWER level which is often chosen in practice.
Among the ten loci displayed in Table 6, there are two which are associated with the FUT7 gene. In turn, the FUT7 gene encodes the Alpha-(1,3)-fucosyltransferase, see [28]. This enzyme plays a role in connection with the surfaces of granulocytes, monocytes and natural killer cells.
Association of immune cell counts with cancer. As mentioned in Section "Basic Model" the discussed rank-based methods can be applied under almost no assumptions due to their nonparametric nature. Furthermore, our approach implicitly adapts to the dependency structure in the data via the permutation approach. Hence, it is especially well-suited for situations with highly dependent coordinates, for example resulting from the consideration of derived parameters.
Such a situation was present in [29]. In their study, a set of three pre-identified gene regions was considered. These regions have been shown to be associated with particular cell types. Namely, demethylated Foxp3 is associated with regulatory T-cells (Tregs), CD3 with all Tcells, and GAPDH with all leukocytes. From this, three immune relevant parameters were derived: the number of Tregs, the total number of T-cells (tTL) and the cellular ratio of immune tolerance (immunoCRIT). As the Tregs constitute a subclass of the tTL and the immunoCRIT Simultaneous Statistical Inference for Epigenetic Data is the ratio of the two other values, these three parameters are highly dependent. Nonetheless each parameter is immune relevant in its own right.
We assessed the association of the three parameters with a disease indicator for cancer, with cancerogenesis, and with cancer progression. In this context, the evaluation of the individual roles of the parameters had to be investigated. This is because cancer tolerance may be either driven by the immunoCRIT or by its individual components, i. e., the shear amount of Tregs or all T-cells. In addition, it is important to understand, even if the most important part is the immunoCRIT, which of the components drives the change during cancerogenesis. The results are presented in Table 7.
Our data indicate a statistically significant role of all three parameters with respect to all three endpoints, with the exception that the Treg parameter is not significantly associated with cancer progression. Thus, our multiple permutation test confirms the notion that manifestation of cancer is strongly associated with a shift in immune tolerance as monitored by Tregs, overall T-cells and the immunoCRIT. Notably, the change of the overall immunological tolerance from healthy towards cancer tissue is driven by both the number of Tregs and the overall number of T-cells. However, once a tumor is established the continuing increase of immuno-CRIT-mediated tolerance along with higher tumor stages is mainly caused by a diminished overall T-cell number and not by Treg increase. Hence, while there is an undoubted dependency among these parameters, the biological mechanisms of cancer development allow for a detachment of these parameters such that individual changes of one of the parameters can be observed and statistically evaluated.

Discussion
Epigenetic data pose their individual set of issues for their statistical interpretation, since in contrast to DNA and protein studies, they exhibit both linkage disequilibrium-type dependencies and cell type specificity issues. Hence, dependencies have to be taken into account that go Here, we assessed a new method to cope with these statistical issues in a general manner. We demonstrated how group differences in epigenetic data can reliably be detected. To this end, a statistical approach based on hypotheses regarding central tendencies in combination with nonparametric Studentized multivariate multiple permutation tests has been proposed. We adapted the theory of [22] such that it can be applied to the analysis of relative effects. In particular, our methodology addresses the so-called "null dilemma" in the sense of [16], because Studentization leads to asymptotically pivotal test statistics, even if the dependency structure differs between the groups. Our approach features four important characteristics for analyzing epigenetic methylation data: (i) The use of the relative effect as a functional for the definition of differential methylation allows to declare a shift in central tendencies in case of a significant finding. This is particularly important as other studies, see [2], have found that variation in DNA methylation may play an important role in the development of complex diseases like cancer. The restriction to shift alternatives, however, is convenient for the development of certain epigenetic markers; (ii) the permutation-based approach keeps the desired type I error level even for moderate sample sizes; (iii) carrying out the permutation test as a multivariate procedure implicitly adapts to the dependency structure in the data; (iv) as we mentioned in Section "Basic Model" the discussed rank-based methods can be applied under almost no assumptions on the distribution of the data.
Computer simulations revealed that the permutation-based approach keeps the type I error level more accurately than asymptotic χ 2 approximations of the distribution of Wald-type statistics, especially in cases with moderate sample sizes. The latter finding is in line with the observations from [30]. The convergence of Wald-type statistics towards their limiting χ 2 distribution is known to be slow and this problem becomes more severe for increasing dimensionality.
As indicated in the real data examples above, epigenetic studies usually involve several loci simultaneously based on a single sample. In many medical applications, the number of observations is very limited. Each of the given examples represents one extreme-but very common -experimental set-up: Microarray analyses with thousands of mildly dependent CpGs as in Example 1 bear a substantial risk of false positives, even when relatively high sample sizes are at hand. On the other end an unknown or complicated dependency structure in the data poses a statistical challenge. This issue is true for both directly adjacent CpGs, which are usually comethylated as well as when technically independent markers functionally overlap. The latter case was considered in Example 2 with the Foxp3 gene as marker for Tregs, and CD3g/d intergenic region as marker for the overall T-cells.
As usual for resampling procedures, our approach based on permutations in combination with the closure principle is computationally much more demanding than asymptotic approximations based on tabulated χ 2 -quantiles. However, computations can be parallelized with respect to the subsets S in the closed test procedure such that the computing time can be distributed among nodes in a cluster computing system. Furthermore, efficient shortcut versions (step-down variants) of the closed test procedure can be employed; see [31] for details.
Possible extensions of our methodology comprise multi-sample problems with more than two groups, as well as the consideration of other types of limit laws (e. g., coming from extreme value theory). Finally, Edgeworth expansions as in [32] for the Wald-type statistic can prevent the costly resampling steps, at least if some concrete distributional assumptions for the observational units can be justified.
Supporting Information S1