A Reference Methylome Database and Analysis Pipeline to Facilitate Integrative and Comparative Epigenomics

DNA methylation is implicated in a surprising diversity of regulatory, evolutionary processes and diseases in eukaryotes. The introduction of whole-genome bisulfite sequencing has enabled the study of DNA methylation at a single-base resolution, revealing many new aspects of DNA methylation and highlighting the usefulness of methylome data in understanding a variety of genomic phenomena. As the number of publicly available whole-genome bisulfite sequencing studies reaches into the hundreds, reliable and convenient tools for comparing and analyzing methylomes become increasingly important. We present MethPipe, a pipeline for both low and high-level methylome analysis, and MethBase, an accompanying database of annotated methylomes from the public domain. Together these resources enable researchers to extract interesting features from methylomes and compare them with those identified in public methylomes in our database.


S1. Identifying hyper-methylated regions
In plant methylomes and other methylomes showing "mosaic methylation" pattern, localized regions of increased methylation are of interest. The input data are a sequence of read counts at each CpG site ordered by their genomic locations, X = {(m i , u i )}, where m i and u i denote the number of methylated and unmethylated reads at the i th CpG site. Through comparative analysis of multiple Arabidopsis methylomes, we observed that Arabidopsis HyperMRs, especially those located in intragenic regions, have specific locations that are consistently unmethylated across different ecoytpes and cell types (unpublished studies). We therefore model the methylation status of cytosines in the Arabidopsis with three states: the hypo-methylated state in the background, the HYPER-methylated state in HyperMRs and the HYPOmethylated state scattered within HyperMRs. We denote the hidden states with Q = {q i }, where q i is the hidden state of the i th observation. The locations where the methylation states change is denoted with C = {c 0 , . . . , c j , . . . , c m }, where c 0 points to the first observation, and c m the last observation. We use b qi (X i ) to denote the likelihood of the i th observation in hidden state q i (modeled with the beta-binomial distribution), and d q (l) to denote the likelihood of the length l of a sequence in the hidden state q. The complete likelihood of the input data assuming all model parameters known is The question is to estimate the model parameters Θ and obtain the posterior estimates of missing data Q and C from the observed data X . We train the model with a method similar to the Baum-Welch algorithm [1]. The forward score α i (q), the likelihood of the observations X 1:i with the i th observation being the end of a segment in hidden state q, is The backward score β i (q), the likelihood of the observation Y i+1:N given the i th observation being the end of a segment in hidden state q, is We compute the forward scores and backward scores recursively. The posterior probability of the i th observation being in state q is .
With the posterior probabilities above, we update the estimates of parameters of the emission distributions as described in [2]. To estimate the parameters of the duration distributions, we perform a posterior segmentation. A cytosine site is assigned to the state of the largest posterior probability. Adjacent cytosines with the same state are joined into a segment. The lengths of segments are computed in terms of the number of cytosine sites, which are used to estimate the parameters of the duration distributions. We iterate the above steps until the change of the complete likelihood (Eq. 1) is below a given threshold.
After model training, we divide the genome into HyperMRs and non-HyperMR with either posterior decoding or the Viterbi algorithm [3]. Because of the explicit duration distribution, the time complexity is quadratic in the term of the number of CpG sites. In actual implementation, we impose a cutoff on the maximum length of each region.

S2. Identifying differentially methylated regions
MethPipe provides two programs for identifying differentially methylated regions: the HMR-centric dmr algorithm, and the dmr2 program implementing a three-state VDHMM model to infer the boundary of DMRs. The latter method is independent of pre-defined HMRs, and sensitive to DMRs due to partial methylation variation and changes in boundary properties. The method builds up on the differential methylation scores (diff-score) for individual cytosine sites [4]. Given two samples, we compute the diffscore at each cytosine site, and the sequence of diff-scores (denoted with Y = {y i }) is used as the input data for the following VDHMM model. Since the diff-score is a probability, ranging from 0 to 1, the beta distribution is used to model these observations. With two methylomes, there are three possible situations at a cytosine site: (i) no differential methylation, (ii) methylome 1 has lower methylation, and (iii) methylome 2 has lower methylation. We therefore introduce three states, namely same (s), gain (g) and loss ( ) . For convenience, we use the language of gain and loss of methylation between samples, but remark that in many cases there will is no time dependence. Since a region gaining methylation is unlikely adjacent to a region losing methylation, we forbid the transition between g and l states. The transition matrix for this model is The procedures for model training and segmentation are similar to the three-state VDHMM for identifying HyperMRs, except the emission distribution of the diff-scores is the beta distribution. From the threestate VDHMM model, we obtain a set of DMRs. To assess the statistical significance of these DMRs, we assign empirical p-values to them based on a randomization method. For a DMR covering the s th CpG to the t th CpG, we compute the area of methylation difference between those two samples as following: where f i1 and f i2 represent the methylation frequencies of the i th CpG in sample 1 and sample 2 respectively. Next we randomly sample r genomic regions with the same number of cytosines, and compute the area of methylation difference for these r regions (often in practice r = 50, 000). We assign p-values to DMRs based on the empirical distribution from the r random regions, and only DMRs with p-value below a certain threshold are reported

S3. Effect of coverage
We performed down-sampling analysis to evaluate the effect of coverage on the identification of HMRs. We selected a deep-sequencing sample from [5], and obtained a series of down-sampled datasets with coverage ranging between 1 and 40. The hmr program is used to find HMRs in each down-sampled dataset respectively with the default parameters. The set of HMRs from the dataset with the highest coverage (∼40x) are considered as "golden-standard". We used the Jaccard's index to measure the similarity of the set of HRMs in down-sampled data to the "golden-standard". Specifically, the Jaccard's index is defined as the proportion of common CpG sites covered by both sets of HMRs among those CpG sites covered by any HMRs: where A denotes the set of CpG sites covered by the "golden-standard" and B the set of CpGs sites covered by the other set of HMRs. As shown in Figure S1, the Jaccard's index is 0.95 at ∼20x, and 0.90 at ∼10x. The performance drops sharply for coverage below ∼5x, but we are still able to achieve the Jaccard's index at 0.72 when the coverage is as low as ∼1x. Our methods for identifying PMDs and HyperMRs are similar, and the above result also holds. The required coverage therefore is even lower since the PMD-finding method internally works by accumulating CpGs in fixed-length bins. The performance of AMR-finding algorithm depends on both coverage and read length, which was addressed in [6]. Table S1 shows the correlation between depth of coverage and CpG densities in human WGBS samples.