Predicting host taxonomic information from viral genomes: A comparison of feature representations

doi:10.1371/journal.pcbi.1007894

Fig 8

Comparison of the ‘holdout’ and ‘all’ classifiers showing the signal loss.

Comparison of the holdout and standard (labelled ‘all’) classifiers for each dataset. For the majority of datasets there was a small loss in predictive power, implying that both classifiers are learning a shared signal. In a minority of cases there was a complete loss in predictive power, implying the lack of a common signal. Each row corresponds to a dataset and each column to a feature set. In the feature set labels, the letters indicate the genome representation and the number indicates the k-mer size. Genome representations: DNA (nucleotide sequence); AA (amino acid sequence of CDS regions); PC (physicochemical properties, with each amino acid residue binned into one of seven bins based on its physicochemical properties); Domains (presence of Pfam domains in the sequence). The colour indicates the AUC score for each classifier. All AUC scores of less than 0.5 were set to 0.5, i.e., no predictive signal.
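The flooring rule and the notion of signal loss described in the caption can be illustrated with a minimal sketch. The function names, the label format (`DNA_4`, `AA_2`) and the scores below are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
def floor_auc(auc, floor=0.5):
    """Treat any AUC below `floor` as no predictive signal by raising it to `floor`."""
    return max(auc, floor)


def signal_loss(auc_all, auc_holdout, floor=0.5):
    """Difference between the floored 'all' and 'holdout' AUC scores."""
    return floor_auc(auc_all, floor) - floor_auc(auc_holdout, floor)


# Hypothetical AUC scores keyed by feature-set label (assumed label format:
# genome representation, underscore, k-mer size). Not values from the paper.
scores = {
    "DNA_4": {"all": 0.90, "holdout": 0.85},  # small loss: shared signal
    "AA_2": {"all": 0.80, "holdout": 0.42},   # holdout below 0.5: floored
}

losses = {
    label: signal_loss(s["all"], s["holdout"])
    for label, s in scores.items()
}
```

In this sketch a holdout score below 0.5 is floored before the difference is taken, so the largest possible loss for a feature set is the ‘all’ score minus 0.5.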

doi: https://doi.org/10.1371/journal.pcbi.1007894.g008