Epigenetics is all you need: A transformer to decode chromatin structural compartments from the epigenome

doi:10.1371/journal.pcbi.1012326

Fig 1.

TECSAS workflow for predicting chromatin structure from epigenomic profiles.

(A) Diverse 1D epigenetic tracks (RNA-seq, histone modifications, transcription factor binding) are extracted from the ENCODE portal and segmented into 50 kbp loci. The TECSAS deep learning architecture predicts locus-wise structural annotations, including compartments, subcompartments, and potentially other features such as LADs, NADs, and SPADs. Prediction is based on learned correlations within the locus’s biochemical composition. (B) The TECSAS architecture begins with an input embedding layer that transforms the epigenomic profile into a higher-dimensional representation. A Transformer encoder then analyzes this representation, capturing complex relationships and long-range dependencies within the epigenomic data to understand the structural context. Finally, the output is decoded through a linear layer and a softmax layer, which assign a probability distribution over the possible structural annotations for each locus.
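
The segmentation step in panel A and the final softmax in panel B can be sketched as follows. This is a minimal illustration under stated assumptions, not the TECSAS pipeline: the function names `bin_track` and `softmax`, the bedGraph-like `(start, end, value)` input, and the coverage-weighted mean are choices made here for illustration. Only the 50 kbp locus size and the use of a softmax over annotation classes come from the caption.

```python
import math


def bin_track(intervals, chrom_length, locus_size=50_000):
    """Average a bedGraph-like track over fixed-size loci.

    intervals: iterable of (start, end, value), half-open, in base pairs.
    Returns one coverage-weighted mean signal value per locus.
    (Hypothetical helper; the aggregation rule is an assumption.)
    """
    n_loci = -(-chrom_length // locus_size)  # ceiling division
    totals = [0.0] * n_loci
    for start, end, value in intervals:
        end = min(end, chrom_length)
        pos = start
        while pos < end:
            idx = pos // locus_size
            # Split the interval at the next locus boundary.
            stop = min(end, (idx + 1) * locus_size)
            totals[idx] += value * (stop - pos)
            pos = stop
    means = []
    for i, total in enumerate(totals):
        # The last locus may be shorter than locus_size.
        length = min(locus_size, chrom_length - i * locus_size)
        means.append(total / length)
    return means


def softmax(logits):
    """Turn per-class scores into a probability distribution."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    track = [(0, 50_000, 2.0), (75_000, 100_000, 4.0)]
    print(bin_track(track, chrom_length=150_000))
    print(softmax([1.0, 2.0, 3.0]))
```

Each experiment binned this way contributes one value per locus, so a locus is described by a vector with one entry per epigenetic track.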

Fig 2.

Assessment of TECSAS prediction at 50 kbp and 25 kbp resolution for GM12878 and K562 cell lines.

(A) Confusion matrix comparing TECSAS predictions with experimentally derived subcompartment annotations for the GM12878 cell line at 50 kbp resolution. The diagonal elements represent the fraction of correctly predicted loci for each subcompartment. (B) Confusion matrix for A/B compartment predictions based on the inferred subcompartments in GM12878, demonstrating accurate compartment classification. (C) Distribution of confidence probabilities for each predicted subcompartment in GM12878. B1 and B2 subcompartments exhibit lower average confidence probabilities, reflecting their more complex epigenomic profiles. (D) Confusion matrix comparing TECSAS predictions with subcompartment annotations derived using the SLICE method for the K562 cell line at 25 kbp resolution, demonstrating the model’s ability to predict subcompartments at higher resolution. (E) Overall accuracy of TECSAS in predicting subcompartments for GM12878 and K562, comparing performance for all loci and for loci excluding transition regions. Excluding transition regions significantly improves prediction accuracy for both cell lines. (F) Fraction of successful and failed predictions within transition regions for GM12878 and K562, highlighting the difficulty of predicting subcompartments in these regions, which have mixed epigenomic signatures.
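
The quantities in panels A, B and E can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, the rows are normalized by the true label (one reading of "fraction of correctly predicted loci for each subcompartment"), and the compartment is assumed to be the first letter of the subcompartment label (for example A1 maps to A).

```python
from collections import Counter


def confusion_fractions(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix as nested dicts.

    matrix[t][p] is the fraction of loci with true label t that were
    predicted as p, so the diagonal holds the per-class hit fraction.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix[t] = {
            p: (counts[(t, p)] / row_total if row_total else 0.0)
            for p in classes
        }
    return matrix


def to_compartment(subcompartment):
    """Assumption: the compartment is the leading letter (A1 -> A)."""
    return subcompartment[0]


def accuracy(true_labels, predicted_labels):
    """Fraction of loci whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the paper.
    true = ["A1", "A1", "A2", "B1", "B1", "B2"]
    pred = ["A1", "A2", "A2", "B1", "B2", "B2"]
    classes = ["A1", "A2", "B1", "B2"]
    print(confusion_fractions(true, pred, classes))
    print(accuracy(true, pred))
    # Collapse to A/B before scoring, as in panel B.
    print(accuracy([to_compartment(x) for x in true],
                   [to_compartment(x) for x in pred]))
```

In the toy example every error stays within the same compartment, which is why the compartment-level accuracy is higher than the subcompartment-level accuracy.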

Fig 3.

The importance of epigenomic context and long-range interactions for accurate subcompartment prediction with TECSAS.

(A) Comparison of overall accuracy in predicting subcompartments between PyMEGABASE (PYMB) and TECSAS using both discretized and continuous signal intensities for epigenomic features. TECSAS achieves higher accuracy even with a limited epigenomic context. (B) Prediction accuracy as a function of the number of input experiments for both PYMB and TECSAS, showing that TECSAS consistently outperforms PYMB regardless of the number of features used (p-value < 10⁻⁹ between any PYMB and TECSAS distribution). (C) Mean accuracy of subcompartment predictions with increasing numbers of neighboring loci included in the input, demonstrating the significant improvement in accuracy as the epigenomic context expands. The maximum accuracy achieved is indicated by a star. (D) Subset of the attention map for a locus predicted as A1, showing the activation of nodes (green) corresponding to specific epigenomic features (red) and highlighting the model’s focus on relevant patterns within the local epigenomic context. (E) Full attention map for a locus predicted as B1, revealing the importance of long-range interactions and the model’s attention to distal regions with enriched epigenetic marks, particularly marks ≈350 kbp away from the locus of interest (L).
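
The idea in panel C, feeding the model a locus together with its neighbors, can be illustrated with a small helper that assembles such an input. This is a sketch under assumptions: the name `context_window`, the symmetric window, and padding with a constant value at chromosome ends are choices made here, not details taken from the paper.

```python
def context_window(features, index, n_neighbors, pad_value=0.0):
    """Collect the feature vectors of a locus and its neighbors.

    features: list of per-locus feature vectors (one value per experiment).
    Returns 2 * n_neighbors + 1 vectors centered on `index`; positions
    beyond the chromosome ends are filled with pad_value (an assumption).
    """
    n_features = len(features[0])
    window = []
    for i in range(index - n_neighbors, index + n_neighbors + 1):
        if 0 <= i < len(features):
            window.append(list(features[i]))
        else:
            window.append([pad_value] * n_features)
    return window


if __name__ == "__main__":
    # Three loci, two experiments each (toy values).
    features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(context_window(features, index=0, n_neighbors=1))
```

Assuming 50 kbp loci, a mark ≈350 kbp from the locus of interest sits 7 loci away, so a window built this way would need `n_neighbors` of at least 7 to include it.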

Fig 4.

3D implications of prediction accuracy on IMR-90.

(A) Compartment annotations from TECSAS, PYMB and experimental Hi-C around the chr4:36-37Mbp segment. (B) Representative structure of chromosome 4 from simulations based on TECSAS and PYMB predictions, highlighting the positioning of the chr4:36-37Mbp segment. (C) Distribution of radial positioning of the chr4:36-37Mbp segment on the simulated ensemble based on TECSAS and PYMB annotations.
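
A distribution like the one in panel C could be obtained by measuring, for every simulated structure, how far the segment lies from the center. The sketch below is illustrative only: the function name `radial_positions` is hypothetical, structures are assumed to be lists of (x, y, z) bead coordinates, and the center is taken to be the centroid of all beads, which may differ from the reference point used in the paper.

```python
import math


def radial_positions(ensemble, segment_indices):
    """Radial distance of a segment in each structure of an ensemble.

    ensemble: list of structures, each a list of (x, y, z) bead coordinates.
    segment_indices: bead indices belonging to the segment of interest.
    Returns one distance per structure, measured from the centroid of all
    beads (an assumption) to the centroid of the segment beads.
    """
    radii = []
    for structure in ensemble:
        n_beads = len(structure)
        centre = tuple(
            sum(bead[k] for bead in structure) / n_beads for k in range(3)
        )
        n_segment = len(segment_indices)
        segment = tuple(
            sum(structure[i][k] for i in segment_indices) / n_segment
            for k in range(3)
        )
        radii.append(math.dist(centre, segment))
    return radii


if __name__ == "__main__":
    # One toy structure with four beads; bead 2 plays the segment.
    toy = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, -3.0, 0.0)]
    print(radial_positions([toy], segment_indices=[2]))
```

Running this once per annotation source (TECSAS-based and PYMB-based ensembles) would give two lists of distances that can be compared as histograms.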

Fig 5.

Prediction of functional structural annotations by TECSAS highlights 3D structural bias due to nuclear body association.

(A) Confusion matrix for predicted LADs, NADs and SPADs against ground truth. (B) Distribution of A and B compartments for IMR-90 for each XAD and nonXAD. (C) Distribution of distance to lamina, speckles and nucleoli for loci predicted as LADs, SPADs and NADs, respectively, when projected onto 3D DNA-tracing experiments [36]. (D) Number of loci in the genome predicted as specific combinations of compartment, LAD, SPAD and NAD annotation; solid circles represent XAD and discontinuous circles represent nonXAD.

Table 1.

Trained models.

Training, validation and test sets were assigned for models trained and tested in the same cell type.
