A Bayesian Framework to Identify Methylcytosines from High-Throughput Bisulfite Sequencing Data

High-throughput bisulfite sequencing technologies have provided a comprehensive and well-fitted way to investigate DNA methylation at single-base resolution. However, there are substantial bioinformatic challenges to distinguish precisely methylcytosines from unconverted cytosines based on bisulfite sequencing data. The challenges arise, at least in part, from cell heterozygosis caused by multicellular sequencing and the still limited number of statistical methods that are available for methylcytosine calling based on bisulfite sequencing data. Here, we present an algorithm, termed Bycom, a new Bayesian model that can perform methylcytosine calling with high accuracy. Bycom considers cell heterozygosis along with sequencing errors and bisulfite conversion efficiency to improve calling accuracy. Bycom performance was compared with the performance of Lister, the method most widely used to identify methylcytosines from bisulfite sequencing data. The results showed that the performance of Bycom was better than that of Lister for data with high methylation levels. Bycom also showed higher sensitivity and specificity for low methylation level samples (<1%) than Lister. A validation experiment based on reduced representation bisulfite sequencing data suggested that Bycom had a false positive rate of about 4% while maintaining an accuracy of close to 94%. This study demonstrated that Bycom had a low false calling rate at any methylation level and accurate methylcytosine calling at high methylation levels. Bycom will contribute significantly to studies aimed at recalibrating the methylation level of genomic regions based on the presence of methylcytosines.


Introduction
DNA methylation is an important epigenetic modification involved in the regulation of gene expression and plays critical roles in cellular processes [1][2][3][4][5]. Abnormalities in DNA methylation contribute to the dysregulation of gene expression and have been reported to be associated with tumorigenesis [6] and imprinting disorders [7]. DNA methylation occurs on the cytosine residues in DNA and the accurate identification of methylated cytosines (methylcytosines) is essential for studying variance in methylation [8]. Advances in high-throughput bisulfite sequencing (BS-seq) [9][10][11] such as whole-genome bisulfite sequencing and reduced representation bisulfite sequencing (RRBS), provide comprehensive and well-fitted ways to identify methylcytosines at single-base resolution. However, the large data sets generated by BS-seq pose data processing challenges for methylcytosine calling.
Typically, the first step of methylation analysis with BS-seq data is to map the bisulfite-converted reads to a reference genome using software such as SOAP and BSMAP [12][13][14]. Methylcytosines can then be identified from the reads aligned to the cytosines on the reference genome. However, besides sequencing errors, methylcytosine calling is affected by incomplete bisulfite conversion, which corresponds to the ratio of unmethylated cytosines that were not converted to thymines by the bisulfite treatment. Additionally, cell heterozygosis caused by multicellular sequencing can also influence the precision of methylcytosine detection because the methylation status of the same cytosine site in different cell is probably inconsistent owing to the coexistence of methylation and demethylation [15][16][17][18].
As a consequence of the factors mentioned above, the falsepositive rate for methylcytosines detected by simple filtering using a predefine threshold that the methylation level should be above zero, has been reported to be extremely high [19][20][21]. In recent years, the methylation ascertainment method applied by Lister et al. [9], which is widely used in the methylation analysis software such as Bismark [22] and Bisulfighter [23], has been employed to determine the methylcytosines from BS-seq data [5,24] and has been shown to have a much lower false positive rate than the simple filtering method [25,26]. We termed the method as ''Lister'' hereafter. Although Lister uses binomial distribution to overcome the impacts of false-positive rate (sum of non-conversion rate and sequencing errors) [24], it does not consider the impact of methylation heterozygosis caused by cell heterozygosis. Besides, as amounts of cytosines, whose methylation level is low, are difficult to be distinguished from unmethylated sites, the methylation status determined by the binomial test is not reliable for samples with extremely low genome-wide methylation levels [5].
Here, we present a novel algorithm, Bycom that can take into account sequencing errors, non-conversion rate, and cell heterozygosis which is initially introduced as the factors to identify precisely the methylcytosines from BS-seq data. Bycom is based on the Bayesian inference model, which is not limited to certain data distribution and has been used successfully for single-nucleotide polymorphism (SNP) calling in next-generation sequencing data [27]. Bayesian inference model has also been successfully applied to methylated DNA immunoprecipitation (MeDIP)-based studies [28]. We evaluated the performance of Bycom on simulated BS-seq data, publicly available BS-Seq data, and RRBS data and demonstrated that Bycom can identify methylcytosines more precisely than Lister. This methodology of Bycom has been packaged and available at https://sourceforge.net/projects/bycom, which is written in PERL and designed for Linux 64 platform.

Results
The Bycom procedures used to identify methylcytosines from high-throughput BS-seq data are depicted in Figure 1. First, the raw reads generated by BS-seq were filtered using a Q20 cut-off to exclude low quality reads whose average ASCII quality values were less than the char ''T'' ( Figure 1A). Second, the filtered reads were aligned to a reference genome sequence using mapping software BSMAP ( Figure 1B). In this step, cytosine sites with quality values less than Q20, which indicated sequencing errors, were further excluded. Third, after filtering, the base distribution in a 1000-bp window was used to calculate the average methylation level in the region covered by the window [24,28,29]. Based on the calculated average methylation level, a least square method was used to evaluate the methylation heterozygosity ( Figure 1C). Using the base distribution and methylation heterozygosity information, the posterior probability of the nucleotide at the reference cytosine was calculated using a Bayesian inference formula ( Figure 1D). The cytosines that have the highest posterior probabilities (methylated probability) were selected as candidate methylcytosines ( Figure 1E). Finally, methylcytosines were determined from the candidate methylcytosines using an iterated threshold method ( Figure 1F). Meanwhile, Lister utilizes the mapping results of BSMAP to conduct the binary classification of mCs [9].

Performance of Bycom on simulated data
The performance of Bycom was evaluated and compared with Lister based on simulated data that were generated from chromosome 10 of the human reference genome sequence (hg18, UCSC) using the bisulfite sequencing simulator for next-generation sequencing BSSim (http://122.228.158.106/BSSim). Six factors that may affect the identification of methylcytosines were considered in the simulation process, namely, total methylation level of the sequenced sample, sequencing depth, bisulfite conversion rate, high quality value ratio, mismatch number, and SNP frequency. With all parameters set to their default values, we found that Bycom had a higher overall accuracy than Lister as shown by the receiver operator characteristic (ROC) curves ( Figure 2A).
Next, we investigated the impact of each single factor on the performance of Bycom. For the total methylation level, we assumed that the number of methylcytosines in the simulated data increased steadily as the total methylation level was increased. When the methylation levels were extremely low (0.2%, 0.4%, 0.6%, and 0.8%), the number of methylcytosines called rose slightly and the false negative rate remained stable at 2% for Bycom and at 15% for Lister. At low methylation levels (2%, 4%, 6%, and 8%), the number of methylcytosines called by the two methods increased and the false negative rate for Bycom decreased slightly to 0.3% compared with 14% for Lister. At the high methylation levels (20%, 40%, 60%, and 80%), the number of true positive sites increased, while the false negative rates for both Bycom and Lister remained very low ( Figure 2B).
For the impact of sequencing depth, the total number of methylcytosines called by Bycom increased when the read depth incrementally changed from 56 to 256. The highest number of methylcytosines called was when the depth was 256; thereafter, the number of calls remained relatively constant for depths .256 ( Figure 2C). In contrast, the total number of methylcytosines called by Lister increased when the read depth went up from 56to 556 after which the number of calls steadily decreased. The false negative rates for both Bycom and Lister reduced when the higher read depths were introduced and, in all cases, the false positive rates for Lister were higher than the false positive rates for Bycom.
The bisulfite conversion rate can affect the methylation level of the cytosines in a genome sequence; therefore, we evaluated the performance of Bycom using different values for this parameter. The number of methylcytosines called by Bycom and Lister remained constant when the non-conversion rate was ,1%. When the non-conversion rate was .1%, the total number of methylcytosines called by Bycom increased slightly and the false negative decrease steadily; conversely, the total number of methylcytosines called by Lister increased sharply ( Figure 2D). The other three factors, high quality value ratio (Q20 cut-off), mismatch number, and SNP frequency made only small contributions to either the number of methylcytosines called or the false positive rate, both of which remained nearly constant. Here, for the method Lister, the influence of sequencing errors calculated by quality value were reduced as the quality cut-off went up, while for Bycom, the influence was still kept at a relatively low level ( Figure  S1B & S1D). Overall, compared with the performance of Lister,

Author Summary
High-throughput bisulfite sequencing (BS-seq) has advanced tremendously the study of DNA methylation and the determination of methylcytosines at single-base resolution. In BS-seq data analysis, sequencing errors, incomplete bisulfite conversion, and cell heterozygosis affect the accuracy of methylcytosine detection in quite a major way. Simple filtering methods using predefined thresholds have proved to have extremely low efficiency. The commonly used Lister uses binomial distribution to overcome the impacts of non-conversion rate and sequencing errors, but the impact of the cell heterozygosis is not considered. Here, we present Bycom, a novel algorithm based on the Bayesian inference model. To improve the accuracy of methylcytosine calling, Bycom considers sequencing errors, non-conversion rate, and cell heterozygosis integratively to identify methylcytosines from BS-seq data. We evaluated the performance of Bycom using different kinds of BS-seq data. Our results demonstrated that Bycom identified methylcytosines more accurately than Lister, especially in BS-seq data with extremely low genome-wide methylation levels.
Bycom generated more true positive sites and fewer false positive sites ( Figure S1).

Performance of Bycom on publicly available BS-seq data with known methylcytosines
To further assess the ability of Bycom to identify methylcytosines, whole-genome BS-seq data of YH [24], silkworm [19] and the ascomycetes Aspergillus flavus [5] were used as described in [24]. The YH BS-seq data, generated from the peripheral blood mononuclear cells of an Asian individual, had a high methylation level of 68.4% at CpG sites and ,0.2% at non-CpGs sites with an overall coverage of 12.3-fold per strand. 32 methylated CpGs and 18 non-methylated CGs were selected randomly and validated by bisulfite Sanger sequencing. Because the depth of 11 validated cytosines was lower than the threshold (lowest depth was 4-fold) [24], only 24 methylated CpGs and 15 non-methylated CGs gave eligible validated results. The methylcytosine calling results showed that the accuracy for Bycom was 71.79% compared with the 66.67% accuracy for Lister, and both methods made no false calls ( Figure 3A).
We also evaluated the performance of Bycom on a species that had a low methylation level. The BS-seq data of silkworm had overall methylation levels of 0.67% at CG, 0.21% at CHG, and 0.24% at CHH sites (where H is A, C, or T) with an average depth of 7.4-fold per strand. In the BS-seq silkworm data, 598 methylcytosines and 311 cytosines were validated ($46) among which 24 methylated and 101 non-methylated sites were verified by bisulfite Sanger sequencing while the others were confirmed using bisulfite-PCR followed by 454 sequencing. The distribution of the validated sites called by Bycom and Lister is shown in Figure 3A. For bisulfite Sanger sequencing, Bycom had 93.6% accuracy with a false positive rate of 0.99%, while Lister exhibited 83.2% accuracy with a false positive rate of 15.84%. For 454 sequencing, the Bycom and Lister accuracies were the same at 82.53% with false positive rates of 10.48% for Bycom and 13.81% for Lister ( Figure 3B).
Finally, we evaluated the performance of Bycom on the BS-seq data of Aspergillus flavus, which had been reported to have a negligible level of methylation with 4-fold overall depth per strand. All the methylcytosines called by Lister on this data were regarded as false positives ($46). An overlapping strategy was applied between the cytosines detected by Bycom and by Lister. The results indicated that the false positive rate for Lister was at least three times the false positive rate for Bycom ( Figure 3C). Further, sites that were selected randomly from the methylcytosines called by Lister and verified by Sanger sequencing as non-methylated cytosines were not called by Bycom.
Meanwhile, we splited the validated sites of YH, silkworm and Aspergillus flavus into CpG and CpH (CpA, CpC, and CpT dinucleotides) context. The strategy was also performed on the validated results of Bycom and Lister in these samples. We found that true positive sites (mC) were all in CpG context and the sites in CpH context were all false positive. The results also indicated that Bycom was more precise on the detection of mCpH than Lister (Table S1).

Performance of Bycom on RRBS data from human ovarian epithelial T29 cells
We evaluated the performance of Bycom on RRBS data from T29 cells (GSE55568). A total of 2,110,089 methylcytosines were identified in the genome, making up nearly 0.18% of the total cytosines; 98.58% of the identified methylcytosines were in CG sites with a mean sequencing depth of 15.5-fold per strand (Table S2).
To verify the precision of Bycom, seven batches of sequences from the T29 RRBS data were selected randomly for validation by bisulfite-PCR followed by sequencing on the Illumina MiSeq platform. As a result, 68 methylcytosines and 158 cytosines were validated. Bycom identified a high percentage (75%) of the CpG methylcytosines and a low percentage (5.56%) of the CpH methylcytosines. Lister identified a relatively low percentage (71.88%) of mCpG and the equal proportion of mCpH as Bycom. Besides, a similar low proportion of the validation sites were identified as cytosines by both Bycom and Lister (Table S3). Bycom and Lister had an accuracy of 93.81% and 93.36%, respectively, with the same false positive rate of 4.12%. Notably, methylcytosines in the CpH context, which were nearly complete absence in the T29 cells [20], were regarded as background noise (Table S3).

Comparison of Bycom to Bismark and Bisulfighter
We compared the accuracy of mC calling between Bycom and methylation analysis tools, including Bismark and Bisulfighter, which employed Lister as the algorithm to detect methylcytosines. With all parameters set to default in simulator BSSim, we found that the software based on Bycom run fastest (Table S4). Besides, Bycom had a highest overall accuracy than others as shown by the ROC curves ( Figure 4A). Then, we compared the software based on Bycom with Bismark and Bisulfighter using the previous public data. Firstly, Bycom had a highest accuracy than other two in YH Sangersequencing data without false positive results. Secondly, in silkworm data, for 454-sequencing validation, the accuracy of Bycom was lower than that of Bismark and higher than that of Bisulfighter. For Sangersequencing validation, Bycom kept highest precision on the detection of methylcytosines than others ( Figure 4B). Finally, the validation sites of Aspergillus flavus Sanger sequencing data all proven to be false positive were not called by Bycom and Bismark, and 16.67% of them were selected by Bisulfighter as methylcytosines.

Discussion
Detecting methylcytosines accurately from high-throughput sequencing data is an essential step to calculate the methylation level of genomes at single-base resolution. In this study, we presented a novel computational strategy, Bycom, for identifying precisely methylcytosines from BS-seq data using the Bayes inference model. Bycom not only considers the impacts of sequencing errors and non-conversion rate, both of which are treated as the false-positive rate in the Lister method, but also introduces cell heterozygosis to identify methylcytosines in an unbiased manner.
The results on the simulated data showed that the nonconversion rate and total methylation level significantly affected the accuracy of the Bycom and Lister methods, while sequencing errors had limited influence after the relatively stringent quality filtration of the data. Additionally, the accuracy of methylcytosine calling was better with Bycom than with Lister. The results on the public data indicated that the false positive rate of Lister was higher than the false positive rate of Bycom in the genome-wide samples with low methylation levels. Although The validation sites of public data in CpH context were all detected as false positive sites, Bycom had a better performance on the classification of mCpH than Lister. Furthermore, the validation results on the T29 RRBS data showed that Bycom detected the vast majority of methylcytosines and maintained the false calling rate at a low level. The performance of Bycom on the public data with different genome-wide methylation levels revealed that Bycom had no bias on data with high or low methylation levels. Moreover, to further verify the accuracy of our method, we compared the accuracy of Bycom with the methylation analysis tools, Bismark and Bisulfighter, which could also be used to call methylcytosines.
The results of simulated data and public data both showed that Bycom behaved better than Bismark and Bisulfighter.
However, we noticed that variances of the methylation level on single cytosines between the BS-seq data and the Sanger/454 sequencing data likely led to false/missing calls by both Bycom and Lister. Although the methylcytosines were still detected by Bycom, this finding suggested that the accuracy might improve further if the methylation level of the base was recalibrated depending on mass validated sequencing data. In future work, we aim to refine the model with well-recalibrated sequencing data. Besides, other functions, such as differentially methylated regions (DMRs) detection will be added in future, which will be great facilitated by the precise methylated cytosines calling. In conclusion, the results presented here demonstrate that Bycom is an effective statistical framework that is capable of identifying methylcytosines and will be very useful for genome-wide investigations of DNA methylation.

Data sets
NCBI build 36.1 (hg18) was downloaded from the UCSC database (http://genome.ucsc.edu/) and used as the human reference genome. The whole-genome DNA methylation data from an anonymous male Asian Han Chinese (YH) were obtained from the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih. gov/geo; GSE17972). The silkworm sequence data were downloaded from the GEO database (GSE18315) and the Sequence Read Archive (SRA; SRP001159). The Aspergillus flavus reference genome sequence was downloaded from the Aspergillus flavus Genome Sequencing Project web site (http://www.aspergillusflavus. org/genomics/) and the DNA methylation data were obtained from GEO (GSE32177). The raw Illumina T29 RRBS data are available from GEO under accession number GSE55568.

Simulated data
Simulated bisulfite short reads with different levels of six factors (total methylation level, sequencing depth, bisulfite conversion rate, sequencing quality of base, number of misaligned reads, and mutation rate) were simulated from chromosome 10 of hg18 using BSSim (http://122.228.158.106/BSSim/). The default settings of the six factors in BSSim were used for the simulation as: sequencing depth 306; maximum mismatches 2; bisulfite conversion rate 0.998; high quality value ratio 95%; and SNP frequency for homozygote/ heterozygote 0.0005/0.001. Besides, different methylation levels of 86.53%, 2.03%, and 2.27% were set for the different types of cytosines CG, CHG, and CHH, respectively, where H is A, C, or T.
The methylation status and SNP genotypes of each cytosine were then retrieved from the BSSim simulated files.

Evaluation of read mapping and base distribution
Low quality reads (Q20) were eliminated for the raw data and the clean reads were aligned against the reference genome allowing a maximum of two mismatches. Uniquely mapped reads were retained for the following analysis. For a selected genomic location, the cytosines in that region were set as c i ,i[ 1,2, . . . ,n ð Þ , where n is the total number of cytosines in the region. Then, the observed base of a mapped read j at cytosine site c i in the genomic sequence was assumed as O ij , j[ 1,2, . . . ,m ð Þ ,O ij [ A,T,C,G ð Þ , where m is the total number of reads that cover site c i . The quality value of the observed base O ij was set to Q ij .

Basic statistical model
In the Bayesian inference model, the posterior probability of expecting base b i at site c i ,i[ 1,2, . . . ,n ð Þon the reference genome was expressed as: where m is the total number of reads covering site c i . Here, E ij was considered as the actual base according to the observed base O ij of the read when b i was estimated. The independent base was expressed as E ij~bi [ A,T,C,G ð Þ ,j[ 1,2, Á Á Á ,m ð Þ . When the sequencing error ratio is set to P e , the likelihood P O ij E ij À Á of the site c i was calculated as: Beside the sequencing error, other factors, including bisulfite conversion rate P t and methylated heterozygosity P m , also influence the composition of the bases derived from the reads. After bisulfite treatment, heterozygous methylcytosines P m , heterozygous cytosines 1{P m ð Þ , wrong-sequenced cytosines P e 3 , correct-sequenced cytosines 1{P e ð Þ, converted cytosines P t , and unconverted cytosines 1{P t ð Þ may appear simultaneously at identical cytosine sites in the sequences from distinct cells. Therefore, the likelihood P O ij E ij À Á was computed as: The prior probability P(E ij ) of each base b i at site c i on the genome was set as: In this way, the posterior probability derived from the Bayesian inference formula was obtained.

Parameter estimation
The key to the model is the accurate estimation of three parameters, P m , P t and P e , which can change the of base contribution component from the reads covering the cytosine site. To estimate the methylated heterozygosity P m , a linear fitting equation was built using the methylation level of a 1000bp genome region of single chromosome in which the methylation status has been reported to be highly correlated [24,28]. The P m of the midpoint in a 1000-bp region was estimated as: where x i is the methylation level (methylated reads/total reads) of site c i . The values of a and b were obtained by a least square procedure as: where x j is the methylation level of site d j in a 1000-bp region centered at sites c i and y j is the average methylation level of the 1000-bp region centered at sites d j . The region centered at c i contained n cytosine sites. The P m of cytosine sites in the 500bp head or tail of the chromosome were estimated using the average methylation level of the cytosine sites in the 1000-bp head or tail chromosome genome regions as: Where p is the total number of the cytosine sites located in the 1000-bp head or tail of the chromosome.
Þare the number of methylated reads and total reads at site c i , respectively. The sequencing error ratio P e , which is indicated by the quality value Q ij of the observed base O ij , was calculated as P e~1 0 { Q ij 10 [30]. The bisulfite conversion rate P t was determined initially by un-methylated lambda DNA spike-ins [25] or by calculating the C to T conversion rate for all cytosines in the CpH context [19,24].

Methylcytosine calling
After bisulfite treatment, the posterior probability of confirming the cytosine site as base C is actually the probability of cytosine methylation. The threshold T was set to ensure methylcytosine calling was precise, and was computed by a numerical iteration algorithm as: where N C and N p represent the number of cytosine and methylcytosine sites, respectively, on a chromosome. The false discovery rate (FDR) was set as 0.01. In the iteration, N p was calculated as the number of cytosine sites with a posterior probability above the T threshold. Set T 0~0 for the first iteration, and the posterior probability above the value T 0 of the cytosine would be the methylated cytosine. Then we can calculate N p , substitute it into (*) and obtain T 1 . For the second iteration, T 1 replaces T 0 , and acquire T 2 ; T i will tend to the invariant threshold as T after several iterations.

Validation
Total genomic DNA was extracted from the T29 cell line and converted by sodium bisulfite following an established protocol [31]. Seven batches of regions were selected randomly and amplified with nested RT-PCR primers (Table S5). The amplified products were sequenced on the Illumina MiSeq platform (Illumina Inc, San Diego, CA). The validation results are summarized in Table S3. Figure S1 Effects of high quality value ratio (Q20), high quality value ratio (Q0), mismatch number, and SNP frequency on Bycom performance on simulated data. Performance of Bycom and Lister with a range of values for the mismatch number (A), the high quality value ratio using Q20 cutoff (B), the SNP frequency (C), and the high quality value ratio using Q0 cut-off (D). Histograms indicate the number of identified methylcytosines, and lines indicate the false negative rate.    Text S1 Commands of the public software used in this manuscript.