Machine learning on multiple epigenetic features reveals H3K27Ac as a driver of gene expression prediction across patients with glioblastoma

doi:10.1371/journal.pcbi.1012272

Fig 1.

Schematic overview of epigenetics-driven gene transcription and epigenomics sequencing data processing.

Gene transcription depends on epigenetic mechanisms that can be grouped into four categories: A) Chromatin accessibility, B) Active transcription, C) Chromatin looping, D) Histone modifications. Sequencing counts within ±2.5 kilobase pairs (kbp) of the transcription start site (TSS) of each gene were measured and divided into 50 bins of 100 base pairs each to create a heatmap used as model input. Created with BioRender.com.
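
The binning described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `bin_counts_around_tss`, the per-base count list, and the choice to sum (rather than average) counts within a bin are all assumptions made for the example.

```python
def bin_counts_around_tss(per_base_counts, tss, flank=2500, bin_size=100):
    """Sum per-base counts into fixed-width bins spanning [tss - flank, tss + flank)."""
    start = tss - flank
    n_bins = (2 * flank) // bin_size  # 5000 bp / 100 bp = 50 bins
    bins = []
    for b in range(n_bins):
        lo = start + b * bin_size
        hi = lo + bin_size
        # Positions that fall outside the sequence contribute zero.
        clipped = range(max(lo, 0), min(hi, len(per_base_counts)))
        bins.append(sum(per_base_counts[p] for p in clipped))
    return bins


# Toy sequence of 10,000 positions with one count per base (hypothetical data).
toy_counts = [1] * 10_000
row = bin_counts_around_tss(toy_counts, tss=5_000)
print(len(row), row[0])  # → 50 100
```

Each gene then contributes one 50-value row per epigenetic feature, which is the row shown in the heatmap.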

Fig 2.

Patient dataset representation after preprocessing.

The figure depicts the standardized epigenetic marker values per gene for one patient and highlights the 3-dimensional arrangement of the datasets prior to model input. The X axis corresponds to the 50 bins of 100 bp counts for each feature, the Y axis to each gene’s 4 epigenetic features, and the Z axis to the arrangement of genes in the dataset.
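
A small sketch of this genes × features × bins arrangement is given below. The caption does not state how values were standardized, so the z-score pooled over all genes and bins of a feature is an assumption, as are the helper name `standardize_feature` and the toy values.

```python
import statistics

N_FEATURES, N_BINS = 4, 50


def standardize_feature(dataset, feature_idx):
    """Z-score one feature in place, pooling values over all genes and bins."""
    values = [v for gene in dataset for v in gene[feature_idx]]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against a constant feature
    for gene in dataset:
        gene[feature_idx] = [(v - mean) / std for v in gene[feature_idx]]


# Toy dataset (hypothetical): 3 genes, each a 4 x 50 matrix of raw binned counts.
dataset = [
    [[float(g + f + b) for b in range(N_BINS)] for f in range(N_FEATURES)]
    for g in range(3)
]
for f in range(N_FEATURES):
    standardize_feature(dataset, f)

# Z axis: genes, Y axis: features, X axis: bins.
shape = (len(dataset), len(dataset[0]), len(dataset[0][0]))
print(shape)  # → (3, 4, 50)
```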

Fig 3.

Cross-patient prediction methodology using the XGBoost model architecture.

The model input for training and validation is derived from a patient (GSC1) different from the one used for testing. As shown, the matrices are flattened before entering the model, which predicts the RNA-seq value. A) A functional view of the cross-patient experimental setup, with model training illustrated on the left side of the image and the transition to testing with the trained model on the right. B) A conceptual view of the cross-patient experimental setup, illustrating the dataset allocation and the number of observations per feature in the training and testing data.
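
The flattening and dataset allocation described here can be sketched as below. Only the data preparation is shown; the helper names, the toy patients, and the 80/20 train/validation split are assumptions for illustration, and a gradient-boosted regressor (the caption names XGBoost) would then be fit on the training rows.

```python
def flatten_gene(matrix):
    """Concatenate the rows of a features x bins matrix into one feature vector."""
    return [value for feature_row in matrix for value in feature_row]


def build_xy(patient):
    """Turn {gene: (matrix, rna_seq)} into aligned feature rows X and targets y."""
    genes = sorted(patient)
    X = [flatten_gene(patient[g][0]) for g in genes]
    y = [patient[g][1] for g in genes]
    return X, y


def toy_patient(n_genes, offset):
    """Hypothetical patient: each gene has a 4 x 50 matrix and one RNA-seq value."""
    return {
        f"gene{i}": ([[float(i + f + offset)] * 50 for f in range(4)], float(i))
        for i in range(n_genes)
    }


gsc1 = toy_patient(10, 0.0)  # stands in for the training patient
gsc2 = toy_patient(6, 0.5)   # stands in for the held-out patient

X_all, y_all = build_xy(gsc1)
split = int(0.8 * len(X_all))  # assumed 80/20 train/validation split within GSC1
X_train, y_train = X_all[:split], y_all[:split]
X_val, y_val = X_all[split:], y_all[split:]
X_test, y_test = build_xy(gsc2)  # the test patient is never seen during training

print(len(X_train[0]))  # → 200 (4 features x 50 bins)
```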

Fig 4.

Pearson correlation coefficient (PCC) results for the cross-patient regression models.

Our experimental results are compiled as the mean PCC scores over 10 runs of each model. The error bars indicate the standard deviation across those runs. Our cross-patient XGBoost-based regression model outperformed all other architectures when trained on GSC1 and tested on GSC2 (GSC1 → GSC2).
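
The summary statistic plotted here, the mean and standard deviation of PCC over repeated runs, can be computed as in the following sketch. The toy predictions are synthetic, and the use of the sample (rather than population) standard deviation is an assumption, since the caption does not specify it.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def summarize_runs(y_true, predictions_per_run):
    """Return (mean PCC, sample standard deviation of PCC) across runs."""
    scores = [pearson(y_true, pred) for pred in predictions_per_run]
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic stand-in for 10 seeded runs: true values plus seed-dependent noise.
y_true = [float(v) for v in range(1, 21)]
runs = []
for seed in range(10):
    rng = random.Random(seed)
    runs.append([v + rng.gauss(0.0, 0.5) for v in y_true])

mean_pcc, std_pcc = summarize_runs(y_true, runs)
print(f"PCC = {mean_pcc:.3f} +/- {std_pcc:.3f}")
```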

Fig 5.

Feature importance scores extracted from our cross-patient CIPHER model for GSC1 → GSC2.

The model identifies the H3K27Ac feature as the most important for the prediction of RNA-seq. The results visualized are the means over 10 experimental runs, with the error bars denoting the standard deviation.

Fig 6.

CIPHER model training generalizes to other GSC datasets.

The CIPHER model trained with the GSC1 dataset was then evaluated on the GSC H3K27Ac/RNA-seq data from Mack et al. [25]. In line with the other experiments in the study, each dataset was evaluated 10 times with 10 different seeds. These experiments revealed the similarity between GSC data from different sources.

Fig 7.

RNA-seq true versus predicted value scatter plots illustrate a similar predictive trend across GSC datasets.

A) GSC2, B) Mack-GSC7, C) Mack-GSC14, D) Mack-GSC18, E) Mack-GSC20, F) Mack-GSC25, G) Mack-GSC27, H) Mack-GSC35, I) Mack-GSC36, J) Mack-GSC38, K) Mack-GSC44.

Table 1.

Training and testing with H3K27Ac only resulted in an increase in mean PCC for most of the datasets.

Fig 8.

The H3K27Ac-only model produced RNA-seq true versus predicted value scatter plots similar to those of the prior experimental setup.

A) GSC2, B) Mack-GSC7, C) Mack-GSC14, D) Mack-GSC18, E) Mack-GSC20, F) Mack-GSC25, G) Mack-GSC27, H) Mack-GSC35, I) Mack-GSC36, J) Mack-GSC38, K) Mack-GSC44. Each plot shows the model run (out of the 10 runs per dataset) that produced the highest PCC for that dataset, and the axes represent RNA-seq count values after log2 transformation. The ranges of both the true and predicted values for each dataset closely follow those of the previous testing. Additionally, as in the previous testing, the histograms of predicted values differ in shape and size between panels but follow the same trends.

Fig 9.

An analysis of H3K27Ac standardized counts visualizes the similarities in signal shape contrasted by the variations at peaks.

H3K27Ac counts of all the GSC datasets around the TSS are visualized at the bin level. A) H3K27Ac signals for all genes, regardless of their RNA-seq values. All these GSCs have similar H3K27Ac epigenetic landscapes with two distinct peaks. B) H3K27Ac signals for “high” expressing genes (log2(RNA-seq values) ≥ 10). C) H3K27Ac signals for “low” expressing genes (0 < log2(RNA-seq values) < 5).
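
The per-bin comparison in panels B and C can be sketched as follows: genes are grouped by log2 expression and the standardized signal is averaged bin by bin within each group. The function names and toy signals are hypothetical; the high threshold (≥ 10) comes from the caption, while the bounds of the low group (between 0 and 5, exclusive) are treated here as an assumption.

```python
def mean_profile(signals):
    """Average a list of equal-length per-bin signals, bin by bin."""
    n = len(signals)
    return [sum(column) / n for column in zip(*signals)]


def split_by_expression(signals, log2_rna, high_min=10.0, low_max=5.0):
    """Group per-gene signals into 'high' and 'low' expression sets."""
    high = [s for s, e in zip(signals, log2_rna) if e >= high_min]
    # Assumed bounds for the low group: strictly between 0 and low_max.
    low = [s for s, e in zip(signals, log2_rna) if 0 < e < low_max]
    return high, low


# Toy data (hypothetical): 4 genes with 4 bins each instead of 50.
signals = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [0.0, 0.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
log2_rna = [12.0, 11.0, 2.0, 7.0]  # the last gene falls in neither group

high, low = split_by_expression(signals, log2_rna)
high_profile = mean_profile(high)
low_profile = mean_profile(low)
print(high_profile)  # → [2.0, 3.0, 4.0, 5.0]
print(low_profile)   # → [0.0, 0.0, 1.0, 0.0]
```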
