Figure 1.
This figure outlines the workflow used in this study to derive quantitative scores of CpG island strength and to evaluate their performance as predictors of bona fide CpG islands. The arrows at the top describe the phases of the analysis, the cylinders correspond to input datasets (orange, blue, and brown cylinders) and results datasets (grey and green cylinders), and the rectangular boxes represent major computational steps. The sigma symbols in the box for calculation step 3 denote summation over the inputs. The figure is slightly simplified and shows a single CpG island map. In practice, the entire workflow was performed separately for three CpG island maps that differ in the repeat-exclusion strategy used (TJU, GGF, and GGM), and their performance was subsequently benchmarked (Figure 5).
Table 1.
Prediction Performance for DNA Methylation and Promoter Activity at CpG Islands
Figure 2.
Co-Localization between the Five Components of the Open Chromatin Score and the Three CpG Island Maps
(A) shows the relative frequency of overlap between epigenetically modified sites and CpG islands (percentage values).
(B) shows the degree of over-representation relative to a simulated case in which sites are uniformly distributed over the chromosomes (base-2 log scores). Yellow boxes correspond to frequent overlap, blue boxes to rare overlap. Abbreviations: H3D, histone H3K4 dimethylation; H3T, histone H3K4 trimethylation; H3A, histone H3K9/14 acetylation; DHS, DNase I hypersensitive sites; TFS, SP1 transcription factor binding; TJU, GGF, and GGM, the three CpG island maps used throughout this study. Because (B) is symmetrical as a result of averaging, only the upper right triangle of the matrix is reported. (A) is not symmetrical, as an example shows: 51.4% of all 578 known DNase I hypersensitive sites on Chromosomes 21 and 22 overlap with a GGM CpG island, while only 5.0% of all 5,913 GGM CpG islands overlap with an experimentally determined DNase I hypersensitive site.
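The asymmetry of (A) and the log-scale scores of (B) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the interval representation, the function names, and the toy coordinates are assumptions made for illustration, and averaging the two directional log scores is only one plausible way to obtain a symmetrical matrix.

```python
import math
from typing import List, Tuple

# Assumption: each feature is a half-open interval [start, end) on one chromosome.
Interval = Tuple[int, int]


def overlaps_any(query: Interval, targets: List[Interval]) -> bool:
    """Return True if the query interval shares at least one base with any target."""
    q_start, q_end = query
    return any(q_start < t_end and t_start < q_end for t_start, t_end in targets)


def overlap_fraction(queries: List[Interval], targets: List[Interval]) -> float:
    """Fraction of query intervals that overlap at least one target interval."""
    if not queries:
        return 0.0
    hits = sum(overlaps_any(q, targets) for q in queries)
    return hits / len(queries)


def log2_enrichment(observed: float, expected: float) -> float:
    """Base-2 log ratio of the observed overlap to the overlap expected by chance."""
    return math.log2(observed / expected)


def symmetric_score(a_vs_b: float, b_vs_a: float) -> float:
    """Hypothetical symmetrization: the mean of the two directional log scores."""
    return (a_vs_b + b_vs_a) / 2.0


# Toy data (hypothetical coordinates): two sites and four CpG islands.
sites = [(100, 200), (500, 600)]
islands = [(150, 160), (1000, 1100), (2000, 2100), (3000, 3100)]

site_view = overlap_fraction(sites, islands)    # half of the sites hit an island
island_view = overlap_fraction(islands, sites)  # a quarter of the islands hit a site
```

The two directional fractions differ because they are normalized by different totals, which is the same effect as the 51.4% versus 5.0% example in the caption.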
Table 2.
A Subset of CpG Islands Exhibits Highly Significant Overlap with Multiple Epigenetic Modifications Simultaneously
Table 3.
Prediction Performance for the Distinction between CpG Islands That Overlap with a Particular Epigenetic Modification and Those That Do Not
Figure 3.
ROC Curves Comparing the Performance of Four Prediction Scores and Three Sequence Criteria against DNA Methylation and Promoter Activity
This figure compares the prediction performance of four CpG island scores that are based on epigenome prediction (upper legend box) and of three simple sequence criteria (lower legend box). In (A), (C), and (E), overlap with unmethylated regions is used for evaluation, and in (B), (D), and (F), overlap with experimentally determined transcription start sites (as an indicator of promoter activity) is used instead. All graphs plot the true positive rate against the false positive rate in the form of ROC curves [27]. The scales on top of the plots display the threshold values for the combined epigenetic score that correspond to the tradeoff between false positive rate and true positive rate at each position. Three thresholds for the combined epigenetic score are highlighted by triangles: 0.5 (balance between sensitivity and specificity), 0.33 (high sensitivity), and 0.67 (high specificity). Averaged across all six graphs, the ROC area under the curve performance measure (i.e., the percentage of the unit square that lies below the ROC curve [27]) amounts to the following values: predicted unmethylated score, 65.4%; predicted promoter activity score, 74.8%; open chromatin score, 72.2%; combined epigenetic score, 75.8%; GC content, 67.1%; CpG observed-to-expected score, 70.6%; and CpG island length, 75.5%.
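For readers unfamiliar with the measure, the following Python sketch shows how an ROC curve and its area under the curve can be computed from per-island scores and binary labels. It is a generic illustration, not the code used to produce the figure: the function names and the toy scores are assumptions, and the area is obtained with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs for all thresholds.

    labels are 1 for positives (e.g., unmethylated islands) and 0 for negatives.
    Tied scores are processed together so that they yield a single ROC point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC analysis needs at least one positive and one negative")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every case that shares the current score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area below the ROC curve, as a fraction of the unit square."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical scores): three positives and three negatives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = area_under_curve(roc_points(toy_scores, toy_labels))  # 8/9, about 0.889
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the averaged values reported in the caption (65.4% to 75.8%) into context.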
Figure 4.
Box Plots Comparing the Promoter Strength between High-Scoring and Low-Scoring Promoter CpG Islands
This figure shows box plots of the average number of transcription start site tags per CpG island (as an indicator of promoter strength), restricted to those CpG islands that show any experimental evidence of promoter activity (i.e., at least three transcription start site tags fall within the CpG island). Separate box plots are drawn for CpG islands that fall into different intervals of the combined epigenetic score (i.e., 0 to 0.2, 0.2 to 0.4, etc.). The standard box plot format is used (boxes span the two central quartiles, whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box, and non-overlapping notches provide evidence of significantly different medians), and outliers are hidden.
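The box plot conventions described above can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the plotting code used for the figure: quartiles are computed with the inclusive interpolation method of the standard library, and the notch half-width of 1.58 times the interquartile range divided by the square root of n is the convention used by common plotting software, which the caption does not specify. The function name and the toy tag counts are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Sequence, Union


def box_plot_stats(values: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Compute quartiles, whisker ends, notch limits, and outliers for one box."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("at least two data points are required")

    # Assumption: inclusive (linear interpolation) quartiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Whiskers reach the most extreme points within 1.5 * IQR of the box.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    whisker_low = min(v for v in data if v >= low_fence)
    whisker_high = max(v for v in data if v <= high_fence)

    # Assumption: notch half-width of 1.58 * IQR / sqrt(n).
    notch = 1.58 * iqr / math.sqrt(len(data))

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
        "notch_low": median - notch,
        "notch_high": median + notch,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Toy tag counts (hypothetical) for the CpG islands in one score interval.
toy_stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this toy example the value 100 lies beyond the upper fence, so it would be treated as an outlier and hidden, while the upper whisker ends at 9.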
Figure 5.
Performance of the Combined Epigenetic Score Compared between CpG Island Maps That Use Different Repeat-Exclusion Strategies
This figure plots the precision (i.e., the percentage of experimentally supported bona fide CpG islands among all selected CpG islands) and the true positive rate (i.e., the percentage of experimentally supported bona fide CpG islands that are selected) against the total number of cases predicted as bona fide CpG islands, for every valid threshold on the combined epigenetic score. Evaluation criteria are absence of DNA methylation (A) and presence of promoter activity as indicated by experimentally determined transcription start sites (B). The three scales on top of each plot display the score thresholds that correspond to the number of CpG islands selected. Dashed lines show the three thresholds that were used to derive the final bona fide CpG island maps on the basis of the GGM dataset. Numbers on the x-axis are substantially lower in (A) than in (B) because the DNA methylation dataset covers only a random sample of unmethylated and methylated CpG islands, while the promoter activity dataset covers essentially all nonrepetitive CpG islands genome-wide.
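The two quantities plotted here can be sketched in a few lines of Python. This is a minimal illustration of the definitions given in the caption, not the code behind the figure: the function name and toy data are assumptions, and tied scores are simply taken in sorted order rather than grouped.

```python
from typing import List, Sequence, Tuple


def precision_and_tpr_by_rank(
    scores: Sequence[float], labels: Sequence[int]
) -> List[Tuple[int, float, float, float]]:
    """For each possible cutoff, return (n_selected, threshold, precision, tpr).

    labels are 1 for experimentally supported bona fide CpG islands, else 0.
    Islands are selected in order of decreasing score.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive case is required")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    rows = []
    true_positives = 0
    for n_selected, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / n_selected   # supported among selected
        tpr = true_positives / total_positives    # supported that are selected
        rows.append((n_selected, score, precision, tpr))
    return rows


# Toy example (hypothetical scores): three supported and three unsupported islands.
toy_rows = precision_and_tpr_by_rank(
    [0.9, 0.8, 0.7, 0.6, 0.4, 0.2],
    [1, 1, 0, 1, 0, 0],
)
```

Each row corresponds to one point on the x-axis of the plot: lowering the threshold selects more CpG islands, which can only raise the true positive rate but typically lowers the precision.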
Table 4.
Performance Comparison between the Combined Epigenetic Score and the CpG Island Length
Figure 6.
Parallelism between Specific DNA Characteristics and the Epigenetic and Functional State of CpG Islands
This figure illustrates the link between the genome sequence and the epigenome at CpG islands, which enabled us to predict epigenetic states from characteristics of the genome sequence. CpG islands in the human genome can apparently be ordered on a scale of increasingly open and transcriptionally competent chromatin structure (left) and simultaneously on a scale of characteristic DNA attributes (right), with high correlation between the two scales.