Predicting host taxonomic information from viral genomes: A comparison of feature representations
Fig 8
Comparison of the ‘holdout’ and ‘all’ classifiers showing the signal loss.
Comparison of the holdout and standard (labelled ‘all’) classifiers for each dataset. For the majority of datasets there was a small loss in predictive power, implying that both classifiers are learning a shared signal. In a minority of cases there was a complete loss in predictive power, implying the lack of a common signal. Each row corresponds to a dataset and each column to a feature set. In the feature set labels, the letters indicate the genome representation and the number indicates the k-mer size. The genome representations are DNA (nucleotide sequence), AA (amino acid sequence of CDS regions), PC (physio-chemical properties, with each amino acid residue assigned to one of seven bins based on its physio-chemical properties), and Domains (presence of PFAM domains in the sequence). The colour indicates the AUC score for each classifier. All AUC scores of less than 0.5 were set to 0.5, i.e., no predictive signal.
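The AUC flooring described in the caption, and the resulting signal loss between the two classifiers, can be illustrated with a minimal sketch. The function names, the feature set labels, and the scores below are hypothetical and chosen only for illustration; they are not taken from the figure or from the authors' code.

```python
def floor_auc(auc, floor=0.5):
    """Report any AUC below `floor` as `floor` (no predictive signal)."""
    return max(auc, floor)


def signal_loss(auc_all, auc_holdout, floor=0.5):
    """Drop in floored AUC when moving from the 'all' to the 'holdout' classifier."""
    return floor_auc(auc_all, floor) - floor_auc(auc_holdout, floor)


# Hypothetical (all, holdout) AUC pairs for two made-up feature sets.
# These numbers are NOT results from the paper.
scores = {
    "DNA_4": (0.91, 0.86),  # small loss: a shared signal is plausible
    "AA_2": (0.88, 0.42),   # holdout falls below 0.5, so it is floored
}

losses = {name: signal_loss(a, h) for name, (a, h) in scores.items()}

for name, loss in losses.items():
    print(f"{name}: signal loss = {loss:.2f}")
```

Flooring before taking the difference means a holdout classifier that scores below chance is treated the same as one that scores exactly at chance, which matches how the colour scale in the figure is described.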