DeepG4: A deep learning approach to predict cell-type specific active G-quadruplex regions

doi:10.1371/journal.pcbi.1009308

Fig 1.

DeepG4 model architecture.

Here, one-hot encoding is a numerical encoding of a 201-bp DNA sequence as a 201 × 4 matrix where each column corresponds to a DNA letter (A, C, G or T), and for instance, a value of one in the first column corresponds to a letter A in the sequence at a given position. For one-hot encoding, colored cells indicate ones, while white cells indicate zeroes.
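The encoding described in this caption can be sketched in a few lines. This is a minimal illustration, not the DeepG4 implementation: the function name `one_hot_encode` is hypothetical, the A, C, G, T column order follows the caption, and leaving ambiguous bases (such as N) as all-zero rows is an assumption.

```python
# Column order follows the caption: A, C, G, T.
ALPHABET = "ACGT"

def one_hot_encode(sequence):
    """Return a len(sequence) x 4 matrix (list of rows) of zeroes and ones."""
    matrix = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        # Assumption: bases outside A/C/G/T (e.g. N) are left as all-zero rows.
        if base in ALPHABET:
            row[ALPHABET.index(base)] = 1
        matrix.append(row)
    return matrix

# A 201-bp sequence yields a 201 x 4 matrix; a short example is shown here.
encoded = one_hot_encode("ACGT")
```

Each row holds at most a single one, so the matrix for a 201-bp sequence has 201 rows and 4 columns.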


Fig 2.

Illustration of DeepG4.

A) Mapping of active G4 region sequences both in vitro and in vivo using NGS techniques. B) Deep learning model training using active G4 regions and control sequences. C) G4 activity prediction, evaluation and motif identification.


Fig 3.

Performance of DeepG4 in predicting active G4 regions (regions where G4s form both in vitro and in vivo).

A) Prediction performance of DeepG4. The model was trained and evaluated using HaCaT cell data. Predictions were evaluated on the testing set of sequences (same experiment as the training set) and on an independent set of sequences (from a different ChIP-seq experiment). The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC) are plotted. B) Genome browser view of HaCaT-trained DeepG4 predictions and G4 ChIP-seq around the KRAS gene in K562 cells. C) Genome browser view of HaCaT-trained DeepG4 predictions and G4 ChIP-seq around the C5orf34 gene in K562 cells. D) Prediction performance of DeepG4 trained using HaCaT data and evaluated on other cell lines. E) Genome-wide prediction performance of DeepG4 trained using HaCaT data and evaluated on other cell lines. Predictions are computed for every 200-bp bin of the genome. The area under the precision-recall curve (AUPR) is plotted. F) Prediction performance of DeepG4* trained using HaCaT data and evaluated on other cell lines. DeepG4* is identical to DeepG4 except that chromatin accessibility is not used as input. G) Genome-wide prediction performance of DeepG4* trained using HaCaT data and evaluated on other cell lines. H) Comparison of DeepG4 and DeepG4* prediction performance, in terms of accuracy and false discovery rate (FDR). I) Comparison of DeepG4 and DeepG4* genome-wide prediction performance, in terms of accuracy and FDR. J) Comparison of DeepG4 and DeepG4* promoter prediction performance, in terms of AUPR, accuracy and FDR.
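For readers unfamiliar with the metrics named in panels H to J, accuracy and FDR can be computed from binary labels and predicted scores as sketched below. This is a generic illustration using the standard definitions, not the authors' evaluation code: the function name and the 0.5 decision threshold are assumptions.

```python
def accuracy_and_fdr(labels, scores, threshold=0.5):
    """Return (accuracy, FDR) for binary labels and predicted scores.

    The 0.5 threshold is an assumption made for illustration.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and label:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(labels)
    # FDR is the fraction of positive calls that are wrong: FP / (FP + TP).
    called = tp + fp
    fdr = fp / called if called else 0.0
    return accuracy, fdr

# Two true G4 regions and two controls, with hypothetical scores.
acc, fdr = accuracy_and_fdr([1, 1, 0, 0], [0.9, 0.4, 0.8, 0.1])
```

With these hypothetical inputs, one of the two positive calls is a control, so both accuracy and FDR equal 0.5.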


Fig 4.

DNA motifs identified by DeepG4.

A) Variable importances of DeepG4 cluster motifs, as estimated by random forests. Kernel motifs were clustered with the RSAT matrix-clustering program to obtain cluster motifs. B) Multidimensional scaling (MDS) of DeepG4 motifs, using the matrix-clustering correlation matrix between kernel motifs as input. C) Logos of the cluster motifs with the highest variable importances. D) Number of kernel motifs containing one or more GG+ stretches. A GG+ stretch is defined as a stretch of 2 or more Gs in the motif consensus sequence. E) Number of kernel motifs containing G stretches, depending on stretch length. F) Average profiles measuring the enrichment of cluster motifs centered around active G4 regions or canonical G4 motifs.
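The GG+ definition in panel D (a run of 2 or more Gs in a consensus sequence) maps directly onto a regular expression. The sketch below is illustrative only: the function name and the example consensus string are hypothetical, and it is not the authors' analysis code.

```python
import re

def g_stretch_lengths(consensus, min_length=2):
    """Return the lengths of all runs of at least min_length consecutive Gs."""
    pattern = re.compile("G{%d,}" % min_length)
    return [len(match.group()) for match in pattern.finditer(consensus.upper())]

# Hypothetical consensus sequence with a GGG run, a GG run and an isolated G.
lengths = g_stretch_lengths("GGGAGGTG")
has_gg_plus = len(lengths) > 0  # panel D counts motifs with one or more GG+ stretches
```

Tallying the returned lengths across motifs gives the kind of per-length counts shown in panel E.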


Fig 5.

Genome-wide prediction of active G4 regions in tissues and cancers.

A) Genome browser view of DeepG4 predictions at the MYC and FUS genes in tissues and cancers. B) Relationship between DeepG4-predicted G4 activity and the number of mutations, depending on the mutation class. Cancer cohort abbreviations (e.g. MESO) are detailed in S1 Table. C) Annotations of predicted stable and variable active G4 regions. D) Mutation rates in BRCA breast cancer depending on predicted G4 region activity.
