Predicting host taxonomic information from viral genomes: A comparison of feature representations

doi:10.1371/journal.pcbi.1007894

Fig 4

The effect of k-mer length on prediction across host taxonomic ranks for the bacteria datasets.

The boxplots show that prediction improves with increasing k-mer length for all representations of the genome and that prediction becomes more difficult at lower taxonomic ranks. Genome representation is indicated by colour and k-mer length by depth of colour: DNA, nucleotide sequence (blue); AA, amino acid sequence of CDS regions (orange); PC, physio-chemical properties, with each amino acid residue assigned to one of seven bins based on its physio-chemical properties (green); Domains, presence of PFAM domains in the sequence. Any AUC score below 0.5 was reset to 0.5, i.e., no predictive signal.
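As a rough illustration of two ideas in this caption, the sketch below counts overlapping k-mers in a sequence and applies the 0.5 floor to an AUC score. The function names `kmer_counts` and `floor_auc` are hypothetical and are not taken from the paper; this is a minimal sketch of the general technique, not the authors' pipeline.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence (nucleotide or amino acid).

    Hypothetical helper for illustration; the paper's own feature
    extraction may differ (e.g. in normalisation or handling of
    ambiguous characters).
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def floor_auc(auc, floor=0.5):
    """Reset AUC scores below the floor to the floor (no predictive signal)."""
    return max(auc, floor)


# Toy input, not data from the study.
counts = kmer_counts("ATGATGA", 3)
print(counts)
print(floor_auc(0.42), floor_auc(0.9))
```

The same counting function applies to any of the string representations (DNA, AA, or a sequence recoded into physio-chemical bins), since it only assumes a string and a value of k.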

doi: https://doi.org/10.1371/journal.pcbi.1007894.g004