Methylation-level inferences and detection of differential methylation with MeDIP-seq data

DNA methylation is an essential epigenetic modification involved in regulating the expression of mammalian genomes. A variety of experimental approaches to generate genome-wide or whole-genome DNA methylation data have emerged in recent years. Methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq) is one of the major tools used in whole-genome epigenetic studies. However, analyzing this data in terms of accuracy, sensitivity, and speed still remains an important challenge. Existing methods, such as BATMAN and MEDIPS, analyze MeDIP-seq data by dividing the whole genome into equal length windows and assume that each CpG of the same window has the same methylation level. More precise work is necessary to estimate the methylation level of each CpG site in the whole genome. In this paper, we propose a Statistical Inferences with MeDIP-seq Data (SIMD) to infer the methylation level for each CpG site. In addition, we analyze a real dataset for DNA methylation. The results show that our method displays improved precision in detecting differentially methylated CpG sites compared to the existing method. To meet the demands of the application, we have developed an R package called “SIMD”, which is freely available in https://github.com/FocusPaka/SIMD.


Introduction
Several studies have shown that methylation in DNA is highly related to diverse biological processes and that aberrant methylation results in severe effects, including different types of cancers [1,2]. Therefore, the research on epigenetic modifications throughout the human genome is meaningful. Analyzing DNA methylation profiles is now feasible due to the development of next-generation sequencing techniques, such as MethylC-seq, MeDIP-seq, MBD-seq, and MRE-seq. In particular, bisulfite genomic DNA sequencing is the gold standard to profile genome-wide DNA methylation [3]. Although there are several approaches (such as MethylCseq and whole-genome shotgun bisulfit sequencing (WGSBS)) that are reasonable for wholegenome analysis, it is rather expensive. MeDIP-seq can achieve nearly the same results as some PLOS  more expensive approaches at a lower cost [4]. We therefore propose a method based on MeDIP-seq data to analyze methylation levels. MeDIP [5] involves enrichment of the methylated DNA fractions immunoprecipitated by 5-methylcytosine-specific antibodies. Although MeDIP-seq cannot provide base pair-specific profiles, it reflects methylation levels by the number of immunoprecipitated DNA fragments. Compared to MeDIP-seq, which only enriches the methylated portion of the genome, we combine methylation-sensitive restriction enzyme sequencing (MRE-seq) to identify unmethylated CpGs. It utilizes methyl-sensitive restriction enzymes (MREs), such as HpaII (CĈ GG), Hin6I (GĈ GC), and AciI (CĈ GC) to specifically identify unmethylated CpGs [6]. The integrative method improves accuracy to identify intermediate methylation regions and enables wholegenome identification for epigenetic states [7].
Several computational tools have been developed for analyzing MeDIP-seq data. BATMAN [8] defines a coupling factor to measure the varying densities of methylated CpG sites and then implements a Bayesian deconvolution strategy to infer the methylation status at each CpG site. Mattia Pelizzola [9] believes that the relationship between the MeDIP enrichment estimates and the actual methylation levels are not linear and presented MEDME, which is based on experimental and analytical methods to evaluate the actual relationship and predicted methylation levels. The MEDIPS [10] approach is similar to the former two methods and produces similar results as BATMAN with higher computational efficiency, which significantly reduces running time for processing MeDIP-seq data. To obtain more information for methylome coverage at a lower cost, R. Alan Harris [7] has proposed a strategy to combine MeDIPseq and MRE-seq to calculate the methylation scores, which can be used to infer individual CpG methylation status. The M&M [11] algorithm is another method for analyzing integrative data, and is more accurate than other methods.
In this paper, we build a model of MeDIP-seq data based on [7] to estimate the methylation level for a single CpG site. We attempt to summarize the algorithm into a model, which will enable us to understand how to integrate the MRE-seq data. After MeDIP-seq and MRE-seq experiments, we map two kinds of short reads to the reference genome. It is known that MREseq short reads can be accurately mapped to the CpG site that contributes to it, but MeDIP-seq short reads cannot be. A short read of MeDIP-seq will cover one or more CpG sites (the short reads that cannot cover CpG sites will be discarded). Then, one short read in MeDIP-seq is pulled from only one CpG site or several neighboring CpG sites; however, we do not know which one. In order to identify the actual CpG sites that contribute to the short read and to obtain the actual number of short reads on each CpG site, it is necessary to build a model and provide statistical inferences for the MeDIP-seq data. After obtaining the actual number of short reads on each CpG site, we utilize them to detect differentially methylated CpG sites.
The remainder of the paper is organized as follows. In Section 2, we provide a brief description of the model and two possible cases and then propose theorems for those assumptions. In Section 3, we use an example to illustrate the SIMD method. In Section 4, we apply the proposed method to analyze a real dataset and compare it to the existing method. In Section 5, we end with a discussion and the conclusion.

Model for MeDIP-seq reads
In the MeDIP-seq experiment, genomic DNA is first isolated and sheared by sonication to short fragments of a few hundred basepairs. In this step, the DNA fragments contain both methylated fragments and unmethylated fragments. After immunoprecipitation with an antibody that can specifically bind the DNA methylation sites, the immunoprecipitated DNA fragments will almost only comprise methylated fragments and can be PCR amplified and sequenced. By aligning the MeDIP-seq short reads to the reference genome, the methylation levels of CpG sites in a region can be estimated based on the read counts in the region. This region-based method can provide insightful answers to numerous important biological questions, but it is of low resolution and cannot provide information about the methylation status of single CpG sites. Though difficult, inferring the methylation level of single CpG sites based on MeDIP-seq data is not impossible. For example, provided that the coverage of MeDIP-seq data is sufficiently large, the methylation level of an isolated CpG site (a CpG site that is far away from other CpG sites) can be easily derived. If there are MeDIP-seq reads covering the site, these reads will not be able to cover other CpG sites, which implies that this CpG site is methylated; otherwise, this site is not methylated. When two or more CpG sites are close to each other (the distance between two neighboring CpG sites is less than the read length), an MeDIP-seq read covering one CpG site will also cover its neighboring CpG site, which makes it very difficult to determine which CpG site is methylated. However, as breakage induced by sonication is random, with sufficient sequencing coverage we may still be able to distinguish between the methylated and unmethylated CpG sites. For example, suppose that only two CpG sites are close to each other and are far away from other CpG sites. If only one of the two CpG sites is methylated, the DNA fragments obtained from the sonication in this region (the region that contains the two CpG sites) can be classified into three categories: the fragments overlapping with both CpG sites, the fragments overlapping only with the methylated CpG sites, or the fragments overlapping only with the unmethylated CpG sites. Because the fragments in the first two (the last) categories contain a methylated (unmethylated) CpG site, they can (cannot) be immunoprecipitated and sequenced. Therefore, we would obtain significantly more MeDIP-seq reads covering the methylated CpG site. Similarly, if both CpG sites are methylated, the number of reads covering both CpG sites should be roughly the same. Based on this observation, we can develop a statistical model to estimate the methylation level of single CpG sites.
Considering that a region C consists of G CpG sites, it is supposed that R is a random MeDIP-seq read sequenced from the region overlapping with some of the G CpG sites. From the MeDIP-seq experimental flow, we know that this read is sequenced because it contains at least one methylated CpG site and therefore allows an antibody to bind to its methylated CpG sites, which in turn makes it immunoprecipitated and sequenced. Let X Rj = 1 (j = 1, Á Á Á, G) if the jth CpG site contributes to the sequencing of the short read R, or in other words, if the short read R contains the jth CpG site and an antibody binds to the jth CpG site, thereby allowing the immunoprecipitation and sequencing of short read R. Otherwise, we denote X Rj = 0. Note that because R is a random read, X Rj is a random variable taking values of {0, 1}. Assume that we have n short reads overlapping with the region C. Let X ij (i = 1, Á Á Á, n and j = 1, Á Á Á, G) be the random variable as introduced above for the ith read. We denote We make the following assumptions about the random variables X ij . Assumption 1. Assume that X i1 , X i2 , Á Á ÁX ij , Á Á Á X iG are independent and follow two-point distribution, that is, where j = 1, 2, Á Á ÁG, i = 1, 2, Á Á Án, and q j is the probability of the jth CpG site contributing to the sequencing of a short read. Note that this probability is composed of two parts: one is the probability that a short read contains the jth CpG site and the other is the probability that an antibody actually binds to this CpG site and thus allows immunoprecipitation and sequencing the short read. This assumption essentially tells us that the random vectors X i are independently and identically distributed (i.i.d.) and that their components are independent. The i.i.d. assumption of the random vectors X i is reasonable because the short reads can be safely viewed as independently sampled with the same sampling procedure. The independence assumption of the components of X i is relatively strong because if a read contains two methylated CpG sites, the antibody binding to one CpG site may influence the binding to the other CpG site.
Assumption 2. Assume one short read in MeDIP-seq is pulled from only one CpG site (the case is the same as Ting's algorithm), that is, S G j¼1 X ij ¼ 1. Assumption 3. Assume one short read in MeDIP-seq is pulled from not less than one CpG site (each observed short read must be associated with not less than one CpG site on the genome), that is, S G j¼1 X ij ! 1. Under the above assumptions, there are some theoretical results.
The goal of the model is to compute the number of actual short reads in each CpG site, and to infer the methylation level. The short reads are indeed impacted by the CpG site. In real data, a short read covers some continuous CpG sites of all G CpG sites. Then, the ith short read is The CpG sites of k i , Á Á Á, l i are covered by the ith short read. At least one of x ik i ; Á Á Á ; x il i are equal to 1, but we do not know which one. Therefore, we assume that the x ik i ; Á Á Á ; x il i are missed. Then, we compute the actual short read number of each CpG site using the EM algorithm. The EM steps are presented in Appendix C.

An example of the model
We use a simple example to explain the model. We assume that there are five CpG sites in a region and six short reads are mapped on the region (Fig 1).
Let X ij comes from a two-point distribution. That is, where j = 1, 2, Á Á Á, 5 and i = 1, 2, Á Á Á, 6. Therefore, the combination distribution of all CpG sites is: In the model, there will be two cases.

Only a single CpG contributes to a short read
If one short read covers several CpG sites, it actually only comes from one of them, even though we do not know which one it is. That is, given S 5 j¼1 X ij ¼ 1, the joint distribution of X i1 , X i2 , X i3 , X i4 , and X i5 is a multinomial distribution with probability P = (p 1 , p 2 , p 3 , p 4 , p 5 ), where p j ¼ l j =S 5 j¼1 l j . A short read is an observation that is , where x ij = 0 or 1. A note is that only one element of X i is 1 and the others are 0. The x ij will be 0 when the jth CpG is not covered by the ith short reads. That is, Then, the profile log likelihood is: x ij logðp j Þ: We know the observation from Fig 1 is: In the second short read, we know that one of x 21 and x 22 is 1, but we do not know which one. Therefore, we consider x 21 and x 22 as latent variables and estimate P using the EM algorithm, which is given below:

E-step:
Given the current estimation P − for P, the conditional expectation of the log complete data likelihood is: Given this observation, E-step [12] consists of computing the following quantities:

M-Step:
During the M-step, the goal is to maximize Q(P j P − ) with respect to P, which requires solving @Q(P j P − )/@P = 0 subject to P 5 j¼1 p j ¼ 1. That is, Then, we solve the following equation system to obtain updated parameter estimates: Therefore, the update formula of P changed, as follows: ; ; ; The iteration process is the same as Ting's algorithm when we give equal value to the starting value for p ð0Þ i . In fact, the starting value does not affect the convergence value. Then, [6P] is the number of short reads at each CpG site.

At least a CpG contributes to a short read
If one short read covers several CpG sites, it actually comes from at least one of them, even though we do not know which CpG sites they are. Then, under the condition S 5 j¼1 X ij ! 1, the distribution of X i = (X i1 , X i2 , X i3 , X i4 , X i5 ) is: where q j is defined in formula (1). We know the observation is expressed in the matrix as (2). In the second read, we know that some of x are indeterminate. Therefore, we consider the missed values of x as latent variables and estimate q = (q 1 , Á Á Áq 5 ) using the EM algorithm.

E-step:
Given the current estimation q − for q, the conditional expectation of the log complete data likelihood is given as: ij logðq j Þ À 6logð1 À P 5 j¼1 ð1 À q j ÞÞ; wherex ij is replaced by the condition expectation.

M-Step:
During the M-step, the goal is to maximize Q(q j q − ) with respect to q, which requires solving @Q(q j q − )/@q = 0. That is, Then, we solve the following equation system to obtain updated parameter estimates: Therefore, the update formula of q is changed, as follows: The starting value does not affect the convergence value. Then, ½6q=ð1 À P 5 j¼1 ð1 À q j ÞÞ is the number of short reads at each CpG site.

Real data analysis
To evaluate the performance of the proposed method, we compare it with the existing method (Raw) that directly uses the observation fragments. The data comes from paper [11], which includes 19 human samples. In this paper, we only consider two samples, embryonic stem cells (the ES cell line H1) and human fetal neural stem cells (NSCs) culture (HuFNSC02, neurosphere cultured cells, ganglionic eminence derived, fetal age of 21 weeks). Then, we obtained the MeDIP-seq and MRE-seq data for each sample. Similar to the analysis procedure for the M&M method, we test the performance of SIMD and the existing method by pair-wise comparisons between two H1-ESC replicates and between H1-ESCs and fetal NSCs. The difference between the two tests is that we detect differentially methylated CpG sites in this paper; however, the test of the M&M method is to determine differentially methylated regions (DMRs).
We consider the false positive rates for two methods. We apply SIMD and the existing method to the two H1-ESC replicates and use the hypothesis test to obtain the P-values for each CpG site. Because the H1-ESC samples are biological replicates, we consider the different methylated CpG sites as the false discovery sites at any P-value cutoff. The results are represented in Table 1. It is evident from the table that at the same P-value cutoff, SIMD usually reports fewer differentially methylated CpG sites than the exisiting method; for example, when the P-value cutoff equals 10 −5 , the number of differentially methylated CpG sites for the existing method is seven times more than for SIMD. There are approximately 1751273 CpG sites in chromosome 1 of the human reference sequence (excluding blacklist regions). We can then calculate the false positive rates for two methods at any P-value cutoff. Obviously, from Table 1, we can see that false positive rates of SIMD are significantly less than those of the existing method.
Next, we consider the false discovery rates(FDRs) for two methods. We compare two different cell types, H1-ESC and fetal NSCs, and use the same P-value cutoffs as the first test. We obtain the number of differentially methylated CpG sites for two methods. Combining the results of the two H1-ESC replicates for analysis, we obtain the false discovery rates at any Pvalue cutoff. From Table 2, we can see that the number and FDRs of SIMD are no better than the existing method when the cutoffs are larger than 10 −6 . However, when cutoffs are smaller than 10 −6 , the FDRs of SIMD are obviously less than those of the existing method. Next, we further consider the q-value cutoffs in Table 3, similar to the P-value cutoff, and find that the number of differentially methylated CpG sites of SIMD is far less than in the method at each q-value level (approximately 1/20 of the existing method). However, the FDRs of the existing method are larger than the overall SIMD.

Discussion
Identifying differentially methylated CpG sites across a whole genome is an effective way to study epigenetic modification. In dealing with the data integrated by MeDIP-seq and MREseq, estimating the methylation level is the first choice. In this paper, we proposed a SIMD method that considers the possible structure whereby immunoprecipitated short reads are mapped to the methylated CpG sites. We then proposed two cases based on it, one in which only a single CpG site contributes to a short read and another in which more than one CpG site contributes to a short read. By applying the SIMD method, we can obtain the real number of short reads in each CpG site. Last, we employ the hypothesis test framework to detect the differentially methylated CpG sites.
In real data analysis, we compare the proposed SIMD method with the existing method (Raw). The results demonstrate that although the number of differentially methylated CpG sites detected by the SIMD method is less than those detected by the existing method, the FDRs of the SIMD are much smaller than those of the existing method. The conclusion is that the proposed method performs better than the existing method. There are still some problems, such as the assumption of independence between the short reads. When the independence condition cannot be satisfied, the proposed method may work not very well. Therefore, in our future work, we will take the correlation between the short reads that are mapped to the neighboring CpG sites into account.

Appendix A: Proof of Theorem 1
Under Assumption 1, the joint distribution of (X i1 , X i2 , Á Á Á, X ij , Á Á Á, X iG ) is then, This is the end of the proof.

Appendix C: EM algorithm for Theorems 1 and 2
We know the observation is ; where x ik i ; Á Á Á ; x il i are missed data. However, we know some of x ik i ; Á Á Á ; x il i are 1 and others are 0. Therefore, the observation of each read is x ðobsÞ ¼ ðx ðobsÞ (1) Only a single CpG contributes to a short read: If one short read covers several CpG sites, it actually only comes from one of them, even though we do not know which one it is. That is, we have known that S G j¼1 X ij ¼ 1, the joint distribution of X i1 , Á Á Á, and X iG is a multinomial distribution with probability P = (p 1 , Á Á Á, p G ), where p j ¼ l j =S G j¼1 l j . A short read is an observation that is X i = (x i1 , x i2 , Á Á Á, x iG ), where x ij = 0 or 1. A note is that only one element of X i is 1 and the others are 0. The x ij will be 0 when the jth CpG is not covered by the ith short reads. That is, Then, the profile log likelihood is: x ij logðp j Þ: The EM algorithm is E-step: Given the current estimate P − for P, the conditional expectation of the log complete data likelihood is given as: QðP j P À Þ ¼ EðlðP j x ðobsÞÞ Þ j P À Þ ¼ X G j¼1 X n i¼1x ij logðp j Þ: Given this, the E-step [13] consists of computing the following quantities:

M-Step:
During the M-step, the goal is to maximize Q(P j P − ) with respect to P, which requires solving @Q(P j P − )/@p = 0 subject to P G j¼1 p j ¼ 1. That is, Q Ã ¼ QðP j P À Þ À lð X G j¼1 p j À 1Þ: Then, we solve the following equation system to obtain updated parameter estimates: @Q Ã @P j ¼ 0: Thus, given below is the updated formula of P: ij ðP À Þ=n: The iteration process is the same as Ting's algorithm when we give equal value to the starting value for p ð0Þ i . In fact, the starting value does not affect the convergence value. Then, [nP] is the number of short reads at each CpG site.
(2) The case that at least a CpG contributes to a short read: The process of the proof is the same as (1).