Predicting regional somatic mutation rates using DNA motifs
Fig 2
The contextual regression (CR) model successfully predicted somatic mutation rates in 13 tumor types.
(a) The structure of the contextual regression (CR) model; (b) For each tumor type, 10-fold cross-validation was performed and the Pearson correlation coefficient was calculated between the predicted and measured values. "Training accuracy" and "Testing accuracy" are the Pearson correlation coefficients averaged over the training and testing datasets, respectively. "Included for re-training" indicates which datasets were included when re-training the CR model after removing the cancer-related regions (i.e. the regions whose mutation rates deviate significantly from the predicted values). "Testing accuracy of the re-trained model" is the correlation obtained with the re-trained CR model from the interactive procedure (see Online Methods). Because regions in a tumor type's test set may overlap with regions in the merged training dataset, the overlapping regions were removed, and the resulting Pearson correlation coefficients are shown as "Testing accuracy after removing overlapping regions"; (c) Scatter plot for one fold of the 10-fold cross-validation, in which chr1 and chr11 were left out as the testing set; (d) Scatter plot of predictions for the Lymph-CLL testing set using the re-trained CR model.
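The accuracy measure described in panel (b) can be illustrated with a minimal sketch. It is not the CR model: an ordinary least-squares fit stands in for the predictor, the data are synthetic, and the names `pearson_r` and `kfold_pearson` are hypothetical. The folds here are random splits of regions, whereas panel (c) indicates that at least one fold was defined by held-out chromosomes, which this sketch does not reproduce.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def kfold_pearson(X, y, k=10, seed=0):
    """Average training and testing Pearson r over k random folds.

    Assumption: a linear least-squares model is used as a placeholder
    for the CR model, purely to show the evaluation loop.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    train_r, test_r = [], []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the fit includes an intercept.
        A_train = np.column_stack([X[train], np.ones(len(train))])
        A_test = np.column_stack([X[test], np.ones(len(test))])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        train_r.append(pearson_r(A_train @ coef, y[train]))
        test_r.append(pearson_r(A_test @ coef, y[test]))
    return float(np.mean(train_r)), float(np.mean(test_r))


# Synthetic stand-in data: 500 "regions" with 5 features each.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.8]) + rng.normal(scale=0.5, size=500)

train_acc, test_acc = kfold_pearson(X, y, k=10)
print(f"Training accuracy: {train_acc:.3f}, Testing accuracy: {test_acc:.3f}")
```

Averaging the per-fold coefficients, rather than pooling all predictions into a single correlation, matches the caption's description of "Training accuracy" and "Testing accuracy" as averages of Pearson correlation coefficients.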