Accurate Promoter and Enhancer Identification in 127 ENCODE and Roadmap Epigenomics Cell Types and Tissues by GenoSTAN

doi:10.1371/journal.pone.0169249

Fig 1.

Overview of chromatin state annotation methods.

Comparison of features of GenoSTAN against three previous chromatin state annotation algorithms.


Fig 2.

Chromatin states fitted on a dataset using eight histone modifications, P300 and DNase-Seq (dataset 1) using GenoSTAN.

(A) GenoSTAN segmentations are shown with published segmentations using ChromHMM-ENCODE [11], Segway-ENCODE [11] and EpicSeg [28] at the TAL1 gene and three known enhancers. GenoSTAN-Poilog-K562 correctly recalls all known promoter and enhancer regions, whereas other methods frequently switch between promoter, enhancer, and other states. (B) Median read coverage of GenoSTAN-Poilog-K562 chromatin states (left), their number of annotated segments in the genome, their median width and distance to the closest GENCODE TSS (middle). The right panel shows recall of genomic regions by chromatin states.


Fig 3.

GenoSTAN with other published chromatin state annotation methods applied to four different datasets in K562.

(A) Description of the four datasets used for benchmarking. All methods were applied to dataset 1 with 18 states in this study. Datasets 2, 3 and 4 were used in previous studies [21, 23, 28]. Segmentations created by the authors of the respective studies were compared to GenoSTAN segmentations using the same number of states. (B-F) Performance of chromatin annotations on each of datasets 1, 2, 3 and 4 is summarized by the area under the recall-FDR curve for various genomic features. Cumulative FDR and recall are calculated using overlap at the state-segment level (B,C) or at the base-pair level (D-F) by successively adding states (in order of increasing FDR). S2, S3, S4 and S5 Figs show individual recall-FDR curves for all datasets and segmentations.
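The cumulative recall-FDR procedure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-state counts, the state names, the assumption that each reference feature is recovered by only one state, and the choice to start the curve at the origin before integrating are all hypothetical simplifications.

```python
from itertools import accumulate

def recall_fdr_curve(states, n_reference):
    """Build a cumulative recall-FDR curve.

    states: dict mapping state name -> (hits, total, recovered), where
      hits      = segments of the state that overlap a reference feature,
      total     = all segments of the state,
      recovered = reference features overlapped by the state.
    Assumes (for simplicity) that no reference feature is counted for two states.
    """
    # Order states by their individual FDR, lowest first.
    ordered = sorted(states.items(), key=lambda kv: 1 - kv[1][0] / kv[1][1])
    cum_hits = list(accumulate(s[0] for _, s in ordered))
    cum_total = list(accumulate(s[1] for _, s in ordered))
    cum_recovered = list(accumulate(s[2] for _, s in ordered))
    fdr = [1 - h / t for h, t in zip(cum_hits, cum_total)]
    recall = [r / n_reference for r in cum_recovered]
    return [name for name, _ in ordered], fdr, recall

def area_under_curve(fdr, recall):
    """Trapezoidal area under recall (y) versus FDR (x), starting at (0, 0)."""
    xs, ys = [0.0] + fdr, [0.0] + recall
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))

# Toy counts (made up for illustration only).
toy_states = {"Prom": (90, 100, 80), "Enh": (60, 100, 50), "Low": (10, 200, 20)}
order, fdr, recall = recall_fdr_curve(toy_states, n_reference=200)
auc = area_under_curve(fdr, recall)
print(order, fdr, recall, auc)
```

Because states are added in order of increasing individual FDR, the cumulative FDR never decreases, so the curve can be integrated directly along the FDR axis.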


Fig 4.

Comparison of GenoSTAN to other published ChromHMM segmentations from the Roadmap Epigenomics project.

GenoSTAN was learned on all 127 cell types and tissues (GenoSTAN-127) using the five core marks H3K4me1, H3K4me3, H3K36me3, H3K27me3, H3K9me3 and an input control (ChromHMM-15 was learned on the same data). To improve accuracy, the additional histone modifications H3K27ac and H3K9ac as well as DNase-Seq were used to learn another model (GenoSTAN-20) on a subset of 20 cell types and tissues where these marks were available. (A) Performance of chromatin states in recovering FANTOM5 CAGE tags in 127 cell types. CAGE tags were overlapped with chromatin states without the use of cell type information. Cumulative FDR and recall are calculated by successively adding states (in order of increasing FDR). (B) Performance of chromatin states in recovering GRO-cap transcription start sites in two cell types where GRO-cap data was available. (C) The same as in (B) for ENCODE HOT regions for five cell types where annotation of HOT regions was available. (D) Recall of FANTOM5 promoters and enhancers by predicted promoters and enhancers is plotted to assess how well models distinguish promoters from enhancers. (E) The fraction of predicted enhancer segments bound by individual TFs is shown for different studies. GenoSTAN enhancers are more frequently bound by TFs than those from other studies.


Fig 5.

Enrichments of genetic variants associated with diverse traits in enhancers and promoters are specific to the relevant cell types or tissues.

(A) Median SNP recall and frequency were calculated for enhancer states in different segmentations by restricting enhancer calls to a total genomic coverage of 2% (100 samples of random subsetting) to control for different numbers of enhancer calls between the segmentations. Error bars show the 95% confidence interval. (B) The same as in (A) but for promoters. (C) The heatmap shows the -log10(p-value) of significantly enriched traits in enhancer states (GenoSTAN-Poilog-127, p-value < 0.01, marked by ‘*’). Only cell types and tissues where at least one trait was significantly enriched are shown. P-values were adjusted for multiple testing using the Benjamini-Yekutieli correction.
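For readers unfamiliar with the Benjamini-Yekutieli correction mentioned in (C), a minimal sketch of the standard step-up adjustment is given below. The function name and the example p-values are hypothetical, and this is not claimed to be the code used for the figure.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic-sum factor that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted

# Made-up p-values for three trait/tissue tests.
example_adjusted = benjamini_yekutieli([0.01, 0.04, 0.03])
print(example_adjusted)
```

Compared with the Benjamini-Hochberg adjustment, the only difference is the extra factor c(m) = 1 + 1/2 + ... + 1/m, which makes the correction more conservative.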


Fig 6.

Promoters and enhancers have a distinctive TF regulatory landscape.

Co-binding (left) and enrichment of transcription factor binding sites (right) in chromatin states (GenoSTAN-Poilog-K562) for 101 transcription factors in K562 reveal TF regulatory modules with distinct binding preferences for promoters, enhancers and repressed regions. Co-binding is depicted as the number of binding sites of two TFs that co-occur in a chromatin state divided by the number of all binding sites of the two TFs (Jaccard index). For each TF, enrichments were normalized to sum to 1 across all 18 chromatin states of GenoSTAN-Poilog-dataset 1.
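The two quantities in this caption can be illustrated with a short sketch. It assumes, purely for illustration, that each TF's binding sites are represented as a set of chromatin-state segment IDs and that per-state enrichments are already available as non-negative numbers; the TF names, segment IDs and enrichment values are made up.

```python
def jaccard(sites_a, sites_b):
    """Jaccard index: shared sites divided by all sites of the two TFs."""
    union = sites_a | sites_b
    if not union:
        return 0.0
    return len(sites_a & sites_b) / len(union)

def normalize_enrichment(enrichment_by_state):
    """Scale one TF's per-state enrichments so that they sum to 1."""
    total = sum(enrichment_by_state.values())
    if total == 0:
        return {state: 0.0 for state in enrichment_by_state}
    return {state: value / total for state, value in enrichment_by_state.items()}

# Hypothetical segment IDs bound by two TFs.
tf_a_sites = {1, 2, 3, 4}
tf_b_sites = {3, 4, 5}
cobinding = jaccard(tf_a_sites, tf_b_sites)

# Hypothetical raw enrichments of one TF in three states.
normalized = normalize_enrichment({"Prom": 2.0, "Enh": 6.0, "Repr": 2.0})
print(cobinding, normalized)
```

Here the two TFs share 2 of 5 distinct segments, so the co-binding value is 0.4, and the normalized enrichments of the example TF are 0.2, 0.6 and 0.2.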
