Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models

doi:10.1371/journal.pcbi.1004590

Fig 1.

Overview of the methodology.

A) To identify functional CRMs, we searched for significant correlations between TF ChIP-seq tracks and TF target genes using i-cisTarget [28], and selected peaks (marked in green) that are located in the 20 kb regulatory space around up- or down-regulated TF target genes. B) Feature selection was performed on the set of functional CRMs to select TF and co-regulatory PWMs and data tracks. C) The performance of each of the 45 TF models was evaluated by 5-fold cross-validation, using the area under the precision-recall and receiver operating characteristic curves. D) The 45 learned classifiers were used to identify cis-regulatory somatic mutations that have an impact on the CRM score, defining a PRIME score (Predicted Regulatory Impact of a Mutation in an Enhancer).

Fig 2.

Cross-validation performance for 45 TF models.

A) Area under the precision-recall (AuPR) and receiver operating characteristic (AuROC) curves for different models. Mk, M1, M2, and M3 are estimated by 5-fold cross-validation. The M0 model does not use a training set; its AuROC and AuPR were obtained by varying the threshold of the PWM. B) Examples of precision-recall curves for ATF2 and BATF. Random Forest classifiers outperform PWM-based models. M3 models (using experimental data tracks) outperform M1 models (using sequence only).
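
As an illustration of how AuROC and AuPR can be obtained by varying a score threshold (as described for the M0 model), the following is a minimal sketch in plain Python. The function name `roc_pr_auc` and the example scores and labels are hypothetical and are not taken from the paper; AuROC is integrated with the trapezoid rule and AuPR with a step-wise sum (average precision), which is one common convention and may differ from the one used by the authors.

```python
def roc_pr_auc(scores, labels):
    """Return (AuROC, AuPR) by sweeping a threshold from high to low scores.

    labels are 1 for positives (e.g. functional CRMs) and 0 for negatives.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auroc = aupr = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this threshold before adding a curve point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        precision = tp / (tp + fp)
        auroc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid under ROC
        aupr += (tpr - prev_tpr) * precision              # step under PR curve
        prev_tpr, prev_fpr = tpr, fpr
    return auroc, aupr


# Hypothetical scores (e.g. best PWM match per region) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
auroc, aupr = roc_pr_auc(scores, labels)
print(f"AuROC={auroc:.3f} AuPR={aupr:.3f}")
```

In a cross-validated setting, the same computation would be applied to the held-out scores of each fold and the results averaged.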

Fig 3.

Feature importance.

A) Three examples of TFs, each with several (for NANOG and TP53) or one (for MYC) target CRMs, illustrating the feature importance in the Random Forest classifier of the M3 model. For NANOG, co-regulatory PWMs contribute more to the classification performance than the PWM of NANOG itself. For TP53, the contribution of the co-regulatory PWMs is not strong and the classification decision is largely based on the presence of strong binding sites of TP53 itself. For the MYC model, the most important features are regulatory tracks. B) Examples of decision trees from the ensemble. C) Averaged feature importance across trees, showing the contribution of various features to the classification decision. For example, TCF12 and ATF2 tracks are dominant for the NANOG model; for TP53 the most relevant features are motifs of the query TF (red), and particularly important ones are represented with logos. The colored region around the dashed line shows the standard deviation of the feature importance across trees.
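
The averaging in panel C can be sketched as follows: given one importance value per feature per tree, report the mean and the standard deviation across trees. This is a minimal illustration only; the feature names and numbers in `per_tree` are made up, and the function `summarize_importance` is a hypothetical helper, not part of the authors' pipeline.

```python
from statistics import mean, pstdev

# Hypothetical per-tree feature importances (one dict per tree in the forest).
per_tree = [
    {"TF_motif": 0.50, "co_motif": 0.30, "track": 0.20},
    {"TF_motif": 0.40, "co_motif": 0.40, "track": 0.20},
    {"TF_motif": 0.60, "co_motif": 0.20, "track": 0.20},
]


def summarize_importance(per_tree):
    """Return {feature: (mean, standard deviation)} across all trees."""
    summary = {}
    for feature in per_tree[0]:
        values = [tree[feature] for tree in per_tree]
        summary[feature] = (mean(values), pstdev(values))
    return summary


summary = summarize_importance(per_tree)
# Rank features by mean importance, highest first.
ranked = sorted(summary, key=lambda f: summary[f][0], reverse=True)
for feature in ranked:
    avg, sd = summary[feature]
    print(f"{feature}: {avg:.2f} +/- {sd:.2f}")
```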

Fig 4.

Validation of classifiers by genome-wide CRM prediction.

After genome-wide CRM scoring, removing the training CRMs, we evaluated the enrichment of ChIP-seq peaks of the corresponding TF, and the enrichment of motifs of the corresponding TF, within the top 1000 newly predicted CRMs. Enrichment is calculated by i-cisTarget [28], and represented as a Normalized Enrichment Score (NES). A) Significant enrichment of ChIP-seq peaks (orange color corresponds to NES>2.5) for 31/45 M1 models, compared to 17/45 of the Mk models. B) The motif of the respective TF is also enriched in the top 1000 newly predicted functional CRMs, for those in orange (NES>2.5).

Fig 5.

Regulatory impact score on simulated substitutions.

A) Nucleotide substitutions with higher PRIME scores are under constraint. B) An example of the E2F1 promoter, for which each possible substitution is evaluated by M0 and M1 models. The M1 model (Random Forest) identifies a 15 bp region that is highly vulnerable to mutations, while three different M0 models (using only the PWM) identify excessive numbers of false-positive substitutions, demonstrating the higher specificity of the Random Forest classifiers compared to single PWMs. C) Barplot showing an example from A): averaged phastCons scores as a function of the PRIME score threshold, for the E2F4 model. Error bars represent the standard error of the mean.
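
The quantity plotted in panel C can be sketched as the mean phastCons score, with its standard error, over all substitutions whose PRIME score reaches a given threshold. The data, the thresholds, and the helper name `conservation_by_threshold` below are hypothetical and chosen only to show the computation.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical (PRIME score, phastCons score) pairs for simulated substitutions.
substitutions = [
    (0.01, 0.1), (0.05, 0.2), (0.12, 0.4),
    (0.15, 0.6), (0.25, 0.8), (0.30, 1.0),
]


def conservation_by_threshold(substitutions, thresholds):
    """Return {threshold: (mean phastCons, standard error of the mean)}."""
    result = {}
    for t in thresholds:
        cons = [c for prime, c in substitutions if prime >= t]
        if len(cons) < 2:
            continue  # the sample standard deviation needs at least two values
        result[t] = (mean(cons), stdev(cons) / sqrt(len(cons)))
    return result


by_threshold = conservation_by_threshold(substitutions, [0.0, 0.1, 0.2])
for t, (avg, sem) in by_threshold.items():
    print(f"PRIME >= {t}: phastCons {avg:.2f} +/- {sem:.2f}")
```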

Fig 6.

Comparison of PWMs and Random Forest classifiers on the known TAL1 insertion.

We scored the known TAL1 enhancer insertion that occurs in the Jurkat cell line [6] with Random Forest (M1) and PWM (M0) MYB-specific models. As a control, we scored all SNVs and insertions in promoters across 498 breast cancer genomes with the same MYB models, to calculate a background distribution of impact scores. A) The distribution of background PRIME scores (i.e., delta Random Forest scores) and the observed PRIME score for M1, indicated by the orange arrow. B) The distribution of background PWM-delta scores (M0 model) and the observed score. C) Feature importance within the MYB model indicates that both MYB motifs and co-regulatory TF motifs contribute significantly to the classification decision; the most important co-regulatory motif is RUNX, a known co-regulatory factor of MYB. D) The known driver insertion in the TAL1 enhancer generates a gain of an H3K27Ac peak, whereas the known SNV in the TERT promoter does not. The red highlighted region indicates which samples harbor the respective cis-regulatory mutation.
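
The comparison in panels A and B amounts to computing a delta score (mutated minus reference sequence) and locating it within a background distribution of delta scores. The sketch below illustrates this with a toy scoring function; `toy_score`, the consensus `ACGTAC`, the sequences, and the background values are all hypothetical stand-ins, and a real application would replace `toy_score` with the trained classifier or PWM score.

```python
CONSENSUS = "ACGTAC"  # arbitrary toy motif, not a real TF consensus


def toy_score(seq, consensus=CONSENSUS):
    """Best fraction of matching positions over all windows (toy stand-in)."""
    k = len(consensus)
    return max(
        sum(a == b for a, b in zip(seq[i:i + k], consensus)) / k
        for i in range(len(seq) - k + 1)
    )


def delta_score(ref_seq, alt_seq, score=toy_score):
    """Impact of a variant: score of the mutated minus the reference sequence."""
    return score(alt_seq) - score(ref_seq)


def empirical_p(observed, background):
    """Fraction of background deltas at least as large as the observed one."""
    return sum(1 for d in background if d >= observed) / len(background)


ref = "GGACGTTCGG"
alt = "GGACGTACGG"  # single substitution T>A completing the toy motif
observed = delta_score(ref, alt)

# Hypothetical background deltas from other variants scored with the same model.
background = [0.0, 0.0, -0.17, 0.17, 0.0, 0.33, 0.0, -0.33, 0.0, 0.0]
p_value = empirical_p(observed, background)
print(f"delta={observed:.3f} empirical p={p_value:.2f}")
```

A positive delta corresponds to a predicted gain and a negative delta to a predicted loss under this sign convention.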

Fig 7.

Candidate cis-regulatory driver SNVs and insertions across 498 breast cancer genomes.

A) All SNVs and insertions with a high PRIME score (>0.3) (insertions are within the black box) found by M1 models in the regulatory regions around cancer-related genes and 167 TFs expressed in breast cancer (all significant PRIME scores with model-specific thresholds are provided in S5–S6 Tables). Values inside boxes indicate the recurrence, that is, the number of samples in which the variant was found across the 498 TCGA samples. B) An example of a high-scoring recurrent insertion that is predicted to generate a TP53 gain of target in the vicinity of SOX5. Z-scores of SOX5 gene expression are significantly higher (Wilcoxon rank sum test) in the 33 samples with the insertion, compared to samples without the insertion.
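
The expression comparison in panel B can be sketched as a two-sided Wilcoxon rank sum test on expression z-scores. The version below uses only the standard library, with a normal approximation and no tie correction, so it is a simplified illustration rather than the exact procedure used in the paper; the expression values and group sizes are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def rank_sum_test(x, y):
    """Two-sided rank sum test: returns (U for x, z, approximate p-value)."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, z, p


# Hypothetical expression values; the first four samples carry the insertion.
expression = [5.1, 6.0, 5.8, 6.3, 4.0, 4.2, 3.9, 4.5, 4.1, 4.3]
z = zscores(expression)
with_insertion, without_insertion = z[:4], z[4:]
u_stat, z_stat, p_val = rank_sum_test(with_insertion, without_insertion)
print(f"U={u_stat} z={z_stat:.2f} p={p_val:.4f}")
```

For small samples or data with many ties, an exact or tie-corrected test from a statistics library would be preferable to this approximation.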

Fig 8.

Scoring cis-regulatory variants in the HeLa cell line.

A) Scatter plot of PRIME scores (45 M1 models) for heterozygous SNVs in the HeLa cell line versus z-scores of H3K27Ac peak scores (the higher the z-score, the more exclusive the H3K27Ac signal is to HeLa, compared to 108 other samples). The arrow indicates an example SNV that generates a de novo JUN binding site (shown in C-D). B) Using high-scoring SNVs falling in acetylation peaks for each TF model, we plotted the fractions of gains and losses in dbSNP (polymorphisms) versus not in dbSNP (possibly somatic mutations). Oncogenic TFs that are important for HeLa, namely MYC, E2F7, JUND, and STAT1, have more gains than losses, specifically for variations not in dbSNP. Conversely, YY1, a known repressor related to cancer, has almost no gains in non-dbSNP variations, while dbSNP variations have an almost equal number of gains and losses. C) The H3K27Ac signal around the SNP that is predicted to generate a gain in JUN binding (PRIME = 0.21; z-score = 16.28) indeed shows a moderate exclusivity of H3K27Ac to HeLa. D) This position shows allele-specific binding of JUN, with ChIP-seq reads only for the variant allele that causes a gain in JUN binding sites.
