Improved Prediction of Non-methylated Islands in Vertebrates Highlights Different Characteristic Sequence Patterns
Fig 1
Receiver Operating Characteristic curves show that the DNA sequence is highly predictive of non-methylated regions, and our SVM method achieves higher AUROC than other methods when predicting these regions.
Receiver operating characteristic curves for four classifiers: SVM (our spectrum kernel SVM), CpG ratio (the ratio of observed to expected CpG dinucleotides), UCSC CpG island predictions (a variant of the observed versus expected method with additional constraints), and Wu HMM (an HMM-based CpG island prediction method), together with an SVM trained on sequences with randomly shuffled labels, “SVM (random)”. The UCSC and Wu HMM methods are shown as points rather than curves because they provide only a set of genomic windows rather than scores for the whole genome, which is equivalent to choosing a single cutoff score for the other methods. The prediction was run five times with different random splits of training and test data, so five lines or points are shown for each method. Performance is very stable between runs, with the lines for each run almost perfectly overlapping. The average area under the curve across all five random splits is indicated in each panel.
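To illustrate why a score-based method traces a curve while a fixed set of predicted windows yields only a single point, the following minimal sketch scores a few windows by CpG ratio and computes both. It assumes the commonly used observed/expected form, (number of CpG × window length) / (number of C × number of G); the caption does not state the exact formula used in the study. The function names and the toy sequences and labels are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
from typing import List, Sequence, Tuple


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio of one window (assumed formula, see lead-in)."""
    seq = seq.upper()
    c, g, cpg = seq.count("C"), seq.count("G"), seq.count("CG")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both C and G; treat as 0
    return cpg * len(seq) / (c * g)


def roc_point(labels: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """(false positive rate, true positive rate) for one set of binary calls."""
    tp = sum(1 for y, p in zip(labels, predicted) if y and p)
    fp = sum(1 for y, p in zip(labels, predicted) if not y and p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def roc_curve(labels: Sequence[bool], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """ROC points obtained by sweeping the cutoff over every distinct score."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        points.append(roc_point(labels, [s >= cutoff for s in scores]))
    return points


def auroc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: four windows, True = non-methylated.
windows = ["CGCGCGTACG", "ATATATTTAA", "GCGCCGCGAT", "TTGCATGCAA"]
labels = [True, False, True, False]

scores = [cpg_obs_exp(w) for w in windows]
curve = roc_curve(labels, scores)            # score-based method: a full curve
fixed_calls = [s >= 1.0 for s in scores]     # window-only method: one cutoff
single_point = roc_point(labels, fixed_calls)

print("ROC curve points:", curve)
print("AUROC:", auroc(curve))
print("Single operating point:", single_point)
```

The single operating point coincides with one of the points on the swept curve, which is the sense in which a fixed set of predicted windows corresponds to choosing one cutoff score.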