A Comparative Study of Five Association Tests Based on CpG Set for Epigenome-Wide Association Studies

An epigenome-wide association study (EWAS) is a large-scale study of human disease-associated epigenetic variation, specifically variation in DNA methylation. High throughput technologies enable simultaneous epigenetic profiling of DNA methylation at hundreds of thousands of CpGs across the genome. The clustering of correlated DNA methylation at CpGs is reportedly similar to that of linkage-disequilibrium (LD) correlation in genetic single nucleotide polymorphisms (SNP) variation. However, current analysis methods, such as the t-test and rank-sum test, may be underpowered to detect differentially methylated markers. We propose to test the association between the outcome (e.g case or control) and a set of CpG sites jointly. Here, we compared the performance of five CpG set analysis approaches: principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA), sequence kernel association test (SKAT), and sliced inverse regression (SIR) with Hotelling’s T2 test and t-test using Bonferroni correction. The simulation results revealed that the first six methods can control the type I error at the significance level, while the t-test is conservative. SPCA and SKAT performed better than other approaches when the correlation among CpG sites was strong. For illustration, these methods were also applied to a real methylation dataset.


Introduction
DNA polymorphisms explain only a small proportion of inheritance patterns in many complex diseases [1]. Some of the missing heritability might be explained by epigenetic variation, especially DNA methylation [2]. Indeed, the DNA methylation state, rather than DNA sequence, is more determinative of gene expression levels [3]. Further, levels of DNA methylation may "record" an individual's environmental exposures, and thus methylation is a potential biomarker for disease diagnosis and risk stratification [4,5]. Because of the reversibility of DNA methylation, it may provide a potential therapeutic target for complex diseases, especially cancer [6,7].
Global DNA methylation status can now be profiled to determine its involvement with disease, via epigenome-wide association studies (EWASs). As an example, the HumanMethyla-tion450 array from Illumina can assess methylation levels at more than 485,000 CpG markers [8]. The estimated proportion of DNA methylation (β-value) varies between 0 (unmethylated) and 1 (completely methylated). The aim of the analysis of EWAS data from case-control studies is to detect differentially methylated positions (DMPs), namely, CpGs that show a significant change in methylation between cases and controls [9]. Among the existing methods for DMP detection, the t-test and rank-sum test are the most commonly used [10]. Several advanced methods, such as mixture models, logistic M values, and generalized exponential tilt model, have been proposed recently [11][12][13]. Liu et al. have shown that the clustering of correlated DNA methylation at CpGs is similar to that of linkage disequilibrium (LD) correlation in genetic SNP variation but for a much shorter distance-the correlation is reduced by half for CpGs within 500bp, and it is weak for CpGs within 2kb [14]. Some clustering of methylated CpGs appears to be genetically driven, thus, they call these sets of correlated CpGs "GeMes", for genetically controlled methylation clusters. Similar to LD blocks in GWASs, this type of correlated methylation structure can be a useful tool for guiding custom array design, efficient statistical approaches, and interpretation of EWASs [15].
Considering the correlation structure among CpG sites, the above methods for DMP detection, which are based on single-locus analysis, may be underpowered to detect associations. We hypothesized that an association test on a set of biologically related CpG sites may improve the power in EWAS analysis. This improvement may result from two characteristics: First, the number of tests is reduced if CpG sites are tested by set [16]. Second, a joint test can fully utilize information contained among the multiple loci.
In this study, we sought to identify joint testing methods that may offer improved power to detect associations and tested the association between disease outcome and CpG levels using several set-based methods: PCA, SPCA, KPCA, SKAT, and SIR (briefly described below). We then used simulated datasets to compare the performance of these five CpG set analysis approaches with Hotelling's T 2 test and t-test with Bonferroni correction. Additionally, we analyzed publicly available DNA methylation data from a rheumatoid arthritis (RA) dataset [17] for practical application of the methods.

Methods
Let i denote the ith individual. For a CpG set, we used G i1 ,G i2 ,. . .,G ip to denote DNA methylation proportions at the p CpG sites from the ith individual. When the outcome variables are dichotomous (e.g., y = 1/0 for case or control): where X i = (X i1 , X i2 ,. . ., X im ) denotes the covariates.

PCA
Principal component analysis (PCA) is a classical multivariate method for the analysis of nonindependent variables. When the p explanatory variables are correlated, it is possible to use a few (k<<p) top principal components (PCs) to replace the explanatory variables in the regression analysis [18][19][20][21]. In our analysis, we used the first k PCs instead of p CpGs to test the association with the disease outcome, in which k is the number of PCs that explain more than 80% percent of the total variation. A k-df likelihood ratio test can be used to test the significance of the CpG set.
SPCA SPCA (supervised principal component analysis) is a supervised dimension reduction approach [22][23][24]. The SPCA model is: Compared to traditional PCA, which uses all CpGs in a set to extract the PCs, only those CpGs with the strongest correlation with the outcome are used to perform SPCA, and PC 1 is the first principal component. After variable selection, the test statistic T ¼β 1 =s:e:ðβ 1 Þ is no longer approximated well by a t-distribution, so we used the distribution proposed by Chen et al. for the hypothesis testing [25].

KPCA
Kernel principal component analysis (KPCA) is a nonlinear extension of traditional PCA that has been studied intensively recently in the field of machine learning [26][27][28][29]. Given the observations, we first map the data nonlinearly into a higher-dimensional feature space F by where ϕ is a nonlinear function. Then, a kernel matrix K is formed using the inner products of new feature vectors. A standard PCA is performed on the centralized K, which is the estimate of the covariance matrix of the new feature vector in F 1 . Such a nonlinear PCA from the original data may be constructed to a linear PCA from the kernel matrix K.
Commonly used kernel functions include linear kernel, polynomial kernel, radial basis function (RBF) kernel, IBS kernel, and weighted IBS kernel [30]. In particular, KPCA with linear kernel is standard linear PCA. In this study, we chose the RBF kernel due to its flexibility in choosing the associated parameter. The parameter σ is set to 0.01 and the threshold is set to 80%.

SKAT
The sequence kernel association test (SKAT) is a supervised, flexible, computationally efficient regression method. It has been used to test for the association between a set of genetic variants and a continuous or dichotomous trait [31]. Considering the correlation among the CpG markers, we used SKAT to test the association between the trait and a set of CpGs.
To increase power, SKAT tests H 0 by assuming each β j follows an arbitrary distribution with a mean of zero and a variance of w j τ, where τ is a variance component and w j is a prespecified weight for variant j. H 0 : β = 0 is equivalent to testing H 0 : τ = 0, which can be conveniently tested with a variance-component score test in the corresponding mixed model. The variancecomponent score statistic is Q ¼ ðy ÀμÞ 0 Kðy ÀμÞ; where K = GWWG', G is an n×p matrix with the (i, j)-th element being the genotype of variant j of subject i, and W = diag(w 1 , w 2 ,. . .,w p ) contains the weights of the p variants. In this study, the matrix G is quantitative and denotes the methylation values. We set w j = 1; that is, all variants are weighted equally.

SIR
Sliced inverse regression (SIR) is a novel data-analytic tool for reducing the dimension of the input variable G [32]. Instead of regressing y against G directly, SIR regresses G against y (inverse regression) by fitting η(y) = E(G|y).
To perform a SIR analysis, we first standardized the explanatory variable G to where ∑ GG is the sample covariance matrix of G. Second, we sliced the range of the response variable y into H intervals, I 1 ,. . .,I h , and partitioned the whole dataset into several slices according to the y value. Let the proportion of the y i that falls in the slice h be The value of δ h (y i ) is 0 or 1 depending on whether y i falls into the hth slice or not. Third, we calculated the sample mean of Z within each slice, denoted A principal component analysis was then applied tom h , extracting the most important K-dimensional affine subspace for tracking the inverse regression curve E (G|y). Finally, we output SIR after retransforming these components back to the original scale.

Simulations
We performed simulations to evaluate the type I error and power of the five CpG set analysis approaches, in comparison to a t-test using Bonferroni correction and Hotelling's T 2 test. For the t-test, we extracted the minimum P-value as the whole P-value of the CpG sites (the P value of a CpG set). We generated the simulated datasets by using a disease model in which C is the number of causal CpGs. We used the program RandGen, a free program for generating random numbers, to generate the correlated CpGs [33]. Users can specify sample size, the number of variables, distributions, and correlations through the RandGen input file. If we specify the correlations between variables using the Pearson correlation parameter, then RandGen conducts a possibly time-consuming search to find the necessary copula correlation (RhoController) values to produce those desired correlations.

Simulations based on virtual datasets
Each simulated dataset contained 1,000 cases and 1,000 controls. For each individual, we first generated methylation values using RandGen. Correlation coefficients for any pairs of CpGs were set from 0.2 to 0.8 by 0.2 increments. Here, we assumed that the CpG set contained 10 CpGs. Two scenarios were simulated; they differed by whether or not the distributions of each CpG site were the same. The outcome for each individual was determined by the above disease model. We set C = 0 (no causal CpG site in the set) to evaluate type I error, which was defined as the proportion of "falsely" rejected H 0 in the 5,000 replications. To evaluate the power of the seven methods, we assumed C = 1 and C = 2. For each parameter setting, we generated 1,000 simulated datasets to calculate the power at the significance level of 0.05. Parameters of simulations are described in Table 1.

Simulations based on real DNA methylation datasets
We also simulated the CpG sets in a more realistic scenario by using a real DNA methylation dataset as the template. We used data from the Gene Expression Omnibus (GEO) generated from the Illumina HumanMethylation450 array data on whole blood (accession number GSE42861). This study examined methylation differences between RA patients (n = 354) and healthy controls (n = 335). We selected protein tyrosine phosphatase, receptor type, D (PTPRD) and mutL homolog 1 (MLH1) gene regions to generate the simulated methylation data. PTPRD is located on Chr 9 and MLH1 is located on Chr 3. The CpG sites we chose are located within 1Kb of PTPRD and MLH1 genes. Six CpGs on PTPRD (IlmnID: cg08719869, cg09371281, cg09781601, cg13723825, cg14080967, cg14458619) and nine on MLH1 (IlmnID: cg02103401, cg04726821, cg04841293, cg05670953, cg10990993, cg11291081, cg18320188, cg21109167, cg24607398) were considered. Respectively, the correlation coefficient matrices of the two CpG sets are   : We simulated 100 individuals with half as cases and half as controls. We set C = 0 to evaluate type I error and C = 1 to evaluate power, with 5,000 datasets produced to calculate the type I error rate, and 1,000 datasets generated for the evaluation of power. The first CpG site of each gene was defined as the causal CpG. Details of parameter settings are shown in Table 2.

Applications
The seven methods under comparison were applied to an RA DNA methylation dataset (described above). Original analysis of the dataset showed that the effect of genotype on RA risk appears to be mediated by DNA methylation changes in five genes: GSTA2, PBX2, C6orf10, HLA-DQB2, and GPSM3 [17]. We restricted our analysis to GSTA2 and PBX2, which are about 13.4 kb and 5.9 kb in length and include 6 CpGs and 51CpGs, respectively. The analyses were performed using R packages (version 3.0.2). The "superpc", "kernlab", "SKAT", "dr", and "Hotelling" packages were used to perform SPCA, KPCA, SKAT, SIR, and T 2 analyses, respectively.

Results of simulations based on virtual datasets
Type I error. Type I error rates of PCA, SPCA, KPCA, SKAT, SIR, T 2 , and t-test based on 10 CpGs are presented in Table 3 and Tables A-C in S1 File. Whether or not the CpGs were from the same distribution, the five set-based methods, as well as the T 2 test, could control the type I error at the 0.05 significance level, while the t-test became increasingly conservative as the correlation among CpGs increased.
Power. Estimated power from scenarios 1.2 and 2.3 is presented in Fig 1. The powers of PCA, SPCA, KPCA, and SKAT increased as the correlation among CpGs increased, but there were no apparent trends for SIR, T 2 , and t-test. When correlation was strong (r = 0.6, r = 0.8), SKAT and SPCA were more powerful than the t-test. However, if there was weak correlation, there were no apparent differences among the three methods. We also found that the power was lower for SIR and T 2 than for other methods, in general. Results from the simulations in scenarios 1.3 and 2.5 are presented in Fig 2. Power for all seven methods increased when the correlations became stronger. When the distributions of each CpG were the same, both SKAT and SPCA were more powerful than the other methods, independent of correlation strength. In contrast, when the distributions of each CpG were different, the powers of SKAT and SPCA were higher than t-test when the correlation was strong.
Considering that the average number of CpGs per gene on the HM450K array is~17, we simulated a CpG set with 20 CpGs. Figure A in S1 File shows that SKAT and SPCA remained more powerful than other methods when the correlation was strong. Table 4 presents the results based on the PTPRD gene. PCA, SPCA, KPCA, SKAT, SIR, and T 2 could control type I error at the significance level of 0.05, while the t-test was conservative. Power results are presented in Fig 3(a). For both β 1 = 4.0 and 5.0, SKAT and SPCA were more powerful than the other five methods. Among the seven methods, the power was lowest for the t-test. Results based on the MLH1 gene were similar to those from the PTPRD gene [Table 4 and Fig 3(b)].

Application to real data
We applied the seven methods to two CpG sets from an RA methylation dataset from the GEO data repository. The P-values for the CpG sets are presented in Table 5. For the first CpG set, the P-value of SPCA was 4.06E-05, the lowest of the seven methods. SKAT was second to SPCA with a P-value of 5.36E-04. P-values for PCA, t-test, KPCA, T 2 , and SIR were 1.11E-03, 1.88E-03, 2.42E-03, 3.74E-03 and 5.39E-03, respectively. For the PBX2 gene, the result also showed that SPCA had the best performance. The t-test was slightly superior to SKAT. All seven approaches yielded significant results at the significance level of 0.05 and were consistent with the original report of this dataset [17].

Discussion
The correlation structure of the DNA methylation data enables testing of the association between the disease outcome and a set of CpGs simultaneously. Here, we demonstrate that analyzing DNA methylation data using CpG set-based analysis for epigenome-wide association studies offers superior power over individual analysis. The set-based CpG association analysis has several advantages: first, the set-based methods can "borrow" information from the correlated CpG sites; second, the set-based methods decrease the number of multiple comparisons [34][35][36].
In this research, we compared the performance of five CpG set analysis approaches (PCA, SPCA, KPCA, SKAT, and SIR) with Hotelling's T 2 test and t-test using Bonferroni correction. We found that all of these set-based methods can control the type I error at the target significance level. The t-test with Bonferroni correction is conservative, especially when the correlations between CpG sites are strong, and thus can be less powerful to identify the association between the outcome variable and CpG set. When the CpG sites in the set have high correlation with each other, SPCA and SKAT can combine their information and provide better-simulated power among the seven approaches. We suggest that SPCA and SKAT can be used for CpG set analysis and screening the association across the entire epigenome. In the RA methylation dataset, we compared the methylation differences of two CpG sets (GSTA2 and PBX2) between patients and healthy controls. All these simulated studies and applications in real data analysis suggest that set-based methods may be used in DNA methylation data analysis.  SPCA is a supervised dimension reduction approach, which only includes the disease-relevant CpGs before the extraction of the principle components. Thus, this method performs better than PCA under different situations. As the KPCA does not rule out the irrelevant CpGs, the power of KPCA is inferior to SPCA but better than PCA in most occasions. We also find KPCA consumes more computational resources. SKAT uses variance component testing  framework to increase test power. If the CpGs in a region are highly correlated, the reduced degree of freedom improves the statistical power. The power of SIR is almost the lowest throughout the simulations. One possible reason is that the outcome variable is binary and can be divided into only two slices. Although, in theory, Hotelling's T 2 test has the ability to summarize the information from correlated CpGs, it is less powerful than SPCA and PCA in our simulations. This may result from the assumptions of multiple normal distribution being violated for DNA methylation data. Thus, SIR and Hotelling's T 2 test are not recommended for application in CpG set analysis for EWAS studies.
In summary, we propose to use set-based method for DNA methylation data analysis and compare the performance of CpG set-based analysis for DNA methylation data. We suggest using SPCA and SKAT to improve test power. However, there remain some limitations in our study. First, the virtual simulated datasets were generated based only on multivariate beta distribution. Other distributions such as inverse logit transformation of a multivariate normal distribution should be considered. Several recent studies have noticed that measured methylation may exhibit different levels of variability in different groups, possibly due to batch effects [7]. Therefore, some new tests that capture differences in both mean and variance of methylation levels, such as semiparametric tests [13], have been proposed. We will discuss the performance of these methods in the same situation later. Second, for the methods PCA, KPCA, and SIR, further studies should be performed to identify the effect of the number of PCs on the power. Third, more complicated situations, such as the interactions between CpG sets and the methylation-mediated genetic risks in the genome-wide scan, are not covered here but will be considered in future studies.
Supporting Information S1 File. Table A, Empirical Type I error rates at α = 0.05 level based on 10 CpGs from the same distribution (mean level = 0.2). Table B, Empirical Type I error rates at α = 0.05 level based on 10 CpGs from the same distribution (mean level = 0.4). Table C, Empirical Type I error rates at α = 0.05 level based on 10 CpGs from the same distribution (mean level = 0.8).