Fig 1.
The flowchart of the analysis.
Using DNA motifs to represent epigenetic states, including known TF motifs (TF motifs), histone-associated motifs (Histone motifs) and DNA methylation-associated motifs (Methyl motifs), we built a contextual regression (CR) model to predict regional mutation rates. Because the majority of the mutations are related to the local epigenetic state and independent of the disease state (grey dots), this CR model can quantify the relationship between DNA motifs and somatic mutation rates. Importantly, the CR model revealed the motifs most predictive of somatic mutations (right branch), and the predicted mutation values allowed classification of cancer types using the cancer-related regions with significantly higher mutation rates than predicted (left branch). In the scatter plot, each point represents a training/testing instance, i.e. the predicted and measured mutation rates of one genomic region. Mutation rates are plotted as log2(MutationRate+1), consistent with Fig 2C. The rows of the heatmap are important motifs and the columns are different cancer types.
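The log2(MutationRate+1) transform used for the plotted values can be sketched as follows. This is a minimal illustration only: the function name and the example rates are hypothetical and are not taken from the study.

```python
import math


def log_mutation_rate(rate):
    """Return log2(rate + 1); the +1 maps a region with no mutations to 0."""
    if rate < 0:
        raise ValueError("mutation rate must be non-negative")
    return math.log2(rate + 1)


# Hypothetical raw rates for three regions (illustrative values, not study data).
example_rates = [0, 1, 7]
transformed = [log_mutation_rate(r) for r in example_rates]
```

Adding 1 before taking the logarithm keeps regions without any observed mutation on the plot instead of sending them to negative infinity.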
Fig 2.
The contextual regression model successfully predicted somatic mutation rates in 13 tumor types.
(a) The structure of the contextual regression model; (b) For each tumor type, 10-fold cross validation was performed and the Pearson correlation coefficient was calculated between the predicted and measured values. "Training accuracy" and "Testing accuracy" represent the average Pearson correlation coefficients in the training and testing datasets, respectively. "Included for re-training" indicates which datasets were included for re-training the CR model after removing the cancer-related regions (i.e. the regions with mutation rates significantly deviating from the predicted values). "Testing accuracy of the re-trained model" represents the correlation obtained with the CR model re-trained through the iterative procedure (see Online Methods). Because the regions from a tumor type in the test set may overlap with the regions included in the merged dataset for training, the overlapping regions were removed and the resulting Pearson correlation coefficients are shown as "Testing accuracy after removing overlapping regions"; (c) The scatter plot for one fold of the 10-fold cross validation, in which chr1 and chr11 were left out as the testing set; (d) The scatter plot for prediction in the Lymph-CLL testing set using the re-trained CR model.
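The accuracy values in panel (b) are Pearson correlation coefficients averaged over folds. A minimal sketch of that computation, assuming each fold supplies paired lists of predicted and measured values (the helper names `pearson` and `mean_fold_correlation` are hypothetical, not the authors' code):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_fold_correlation(folds):
    """Average the per-fold correlations; folds is a list of (predicted, measured)."""
    scores = [pearson(predicted, measured) for predicted, measured in folds]
    return sum(scores) / len(scores)
```

Calling `mean_fold_correlation` once on the training folds and once on the testing folds would give the two kinds of average reported in the panel.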
Fig 3.
Analysis of the cancer-independent regions.
(a) The percentage of cancer-independent regions in the 13 cancer types. Each percentage is calculated as the number of cancer-independent, cancer-related or ambiguous regions divided by the total number of regions in a cancer; (b) Cancer-independent regions clustered using the contextual weights of the motifs. For each of the 13 cancer types, the identified cancer-independent regions were clustered into 10 clusters using the Manhattan distance between the feature contextual weight vectors as the distance metric. Each row is a motif with non-zero contextual weight, each column is a cluster, and each entry is the average of a motif's contextual weights over all the regions in a cluster. The clusters were further clustered into 10 groups; (c) The normalized mutation rate of each group, which is the z-score of mutation density (see Online Methods for details), varies significantly from the lowest in group A to the highest in group G; (d) The numbers of regions in the 10 groups; (e) The fold change of ChromHMM states in group A for each tumor type. The fold change for each ChromHMM state is defined as the percentage of the state in group A divided by the percentage of the state in all the regions in a specific cancer.
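Two quantities defined in this caption, the Manhattan distance between contextual weight vectors and the fold change of a ChromHMM state, can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and states are assumed to be given as per-state region counts.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two contextual weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def state_fold_change(group_counts, all_counts):
    """Per state: (fraction of group regions) / (fraction of all regions).

    Both arguments map a state name to a number of regions. States absent
    from the full set are skipped to avoid division by zero.
    """
    group_total = sum(group_counts.values())
    all_total = sum(all_counts.values())
    fold_changes = {}
    for state, n_all in all_counts.items():
        if n_all == 0:
            continue
        fraction_in_group = group_counts.get(state, 0) / group_total
        fraction_in_all = n_all / all_total
        fold_changes[state] = fraction_in_group / fraction_in_all
    return fold_changes
```

A fold change above 1 means the state is over-represented in the group relative to all regions of that cancer; a value below 1 means it is depleted.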
Fig 4.
Identification of important motifs in cancer-independent regions.
(a) The important features in the 10 groups. Blue and orange represent negative and positive contextual weights, respectively; (b) The number of important motifs in each group; (c) The percentages of motif categories in each group; (d) The average mutation rates around the motifs with positive contextual weights are higher than those around the motifs with negative contextual weights in the group A regions; (e) Mutation rates around the motif sites (1 kbp on each side of the motif) in the group A regions. The red and blue lines represent the motifs with positive and negative contextual weights, respectively. The motif site is at the center.
Fig 5.
Analysis of the cancer-related regions.
(a) The identified cancer-related regions in Breast-AdenoCA (red dots); (b) The enriched pathways for the cancer-related regions in Breast-AdenoCA; (c) The fold change and p-value for the motif disruption rates in the 13 tumor types. The red line marks a p-value of 0.05; (d) The fold change of ChromHMM states (defined as in Fig 3E) for the cancer-related regions in each tumor type; (e) The percentages of motif types that were significantly disrupted in cancer-related regions; (f) The classification model performance, shown as the confusion matrix on the testing dataset for the classification model using the 150 selected cancer-related regions. Rows and columns correspond to the true and predicted tumor types, respectively. Values are numbers of donors; diagonal entries count the donors classified correctly. For example, 31 Prost-AdenoCA donors were correctly classified.
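How a confusion matrix like the one in panel (f) is tallied can be sketched as follows, with rows as true types and columns as predicted types. The helper names and the toy labels are hypothetical and do not reproduce the study's data or classifier.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count donors for every (true type, predicted type) pair."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def correct_per_type(counts):
    """Diagonal of the confusion matrix: donors whose prediction matches the truth."""
    return {true: n for (true, predicted), n in counts.items() if true == predicted}


# Toy labels for five hypothetical donors (not study data).
toy_true = ["Prost-AdenoCA", "Prost-AdenoCA", "Breast-AdenoCA",
            "Breast-AdenoCA", "Breast-AdenoCA"]
toy_predicted = ["Prost-AdenoCA", "Breast-AdenoCA", "Breast-AdenoCA",
                 "Breast-AdenoCA", "Prost-AdenoCA"]
toy_counts = confusion_counts(toy_true, toy_predicted)
```

Off-diagonal entries of `toy_counts` record which tumor types are confused with one another, which is the information the heatmap layout makes visible.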