Recall DNA methylation levels at low coverage sites using a CNN model in WGBS

doi:10.1371/journal.pcbi.1011205

Fig 1.

Structure of the RcWGBS model and results of applying RcWGBS to the H1-hESC and GM12878 datasets.

(A) The structure of the RcWGBS model. The DNA sequence and the DNA methylation levels were used as input features. Flanking DNA sequences, centered on each site and extending 50 bp upstream and downstream, were encoded with the 2-mer coding method. The resulting input feature of RcWGBS was a data matrix with a length of 100, a width of 5, and a height of 1. (B) The lower the coverage, the greater the difference between the DNA methylation levels in the down-sampled data and the original data. MAE denotes the mean absolute error. (C) Difference between the predicted and original DNA methylation levels under different features. The y-axes represent the mean absolute error. (D) RcWGBS imputation reduced the mean absolute error of the methylation calls in down-sampled H1-hESC and GM12878 data. The blue dots represent the difference between the DNA methylation levels of the down-sampled data and the unsampled original data, while the yellow dots represent the difference between the DNA methylation levels after RcWGBS imputation and the unsampled original data. A total of 22 groups of data were compared.
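
The 2-mer encoding mentioned in panel (A) can be illustrated roughly as follows. This is a minimal sketch and not the authors' code: the names `flank_window`, `two_mer_tokens`, `KMER_INDEX`, and `UNKNOWN`, the integer index scheme, and the treatment of ambiguous bases are all assumptions made for illustration. It shows only that a window of one site plus 50 bp on each side (101 bp) yields 100 overlapping 2-mers, which is consistent with the stated matrix length of 100, although the caption does not confirm this derivation. How the width of 5 is filled is not specified in the caption and is not modeled here.

```python
from itertools import product

# Hypothetical vocabulary: the 16 dinucleotides over A, C, G, T mapped to 0..15.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=2))}
UNKNOWN = -1  # assumed placeholder for 2-mers containing N or other symbols


def flank_window(chrom_seq: str, pos: int, flank: int = 50) -> str:
    """Return the site at 0-based `pos` plus `flank` bp on each side."""
    if pos - flank < 0 or pos + flank >= len(chrom_seq):
        raise ValueError("site is too close to the end of the sequence")
    return chrom_seq[pos - flank: pos + flank + 1]


def two_mer_tokens(window: str) -> list[int]:
    """Return the indices of the overlapping 2-mers in `window`."""
    window = window.upper()
    return [
        KMER_INDEX.get(window[i:i + 2], UNKNOWN)
        for i in range(len(window) - 1)
    ]
```

Calling `two_mer_tokens(flank_window(seq, pos))` on a sufficiently long sequence returns a list of 100 integers, one per overlapping dinucleotide.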

Table 1.

Coverage of down-sampled data.

Fig 2.

Comparison with METHimpute and BSmooth.

(A) The mean absolute error of the DNA methylation level between the raw unsampled data and the values predicted by RcWGBS, METHimpute, and BSmooth, respectively. (B) The Pearson correlation coefficient between the raw unsampled data and the values predicted by RcWGBS, METHimpute, and BSmooth, respectively.
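
The two metrics named in panels (A) and (B) can be computed as in the sketch below. It uses the standard definitions of mean absolute error and the Pearson correlation coefficient; the function names and the representation of methylation levels as paired lists of fractions between 0 and 1 are assumptions for illustration, and nothing here reproduces the authors' evaluation pipeline.

```python
import math


def mean_absolute_error(original, predicted):
    """Mean of |original_i - predicted_i| over paired sites."""
    if len(original) != len(predicted) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(original, predicted)) / len(original)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A lower mean absolute error and a Pearson coefficient closer to 1 both indicate predictions closer to the unsampled data, so the two panels are read in opposite directions.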
