Predicting host taxonomic information from viral genomes: A comparison of feature representations
Fig 4
The effect of k-mer length on prediction across host taxonomic ranks for the bacteria datasets.
The boxplots show that prediction improves with increasing k-mer length for all representations of the genome and that prediction becomes more difficult at lower taxonomic ranks. Genome representation is indicated by colour and k-mer length by depth of colour. Representations: DNA, nucleotide sequence (blue); AA, amino acid sequence of CDS regions (orange); PC, physio-chemical properties, with each amino acid residue assigned to one of seven bins based on its physio-chemical properties (green); Domains, presence of Pfam domains in the sequence. Any AUC scores of less than 0.5 were reset to 0.5, i.e., no predictive signal.
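As a minimal sketch of two operations the caption refers to, the Python below counts overlapping k-mers in a sequence and resets AUC scores below 0.5 to 0.5. The function names `kmer_counts` and `floor_auc` are hypothetical and are not taken from the paper; the authors' actual feature extraction and scoring code may differ.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence.

    Works on any string alphabet, so the same function applies to a
    nucleotide sequence, an amino acid sequence, or a sequence that has
    already been recoded into physio-chemical bins.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def floor_auc(auc, floor=0.5):
    """Reset AUC scores below the floor to the floor (no predictive signal)."""
    return max(auc, floor)


# Illustrative inputs only; these are not data from the paper.
counts = kmer_counts("ATGATG", 3)
scores = [floor_auc(a) for a in (0.42, 0.5, 0.8)]
```

Longer k-mers give a larger and sparser feature space, which is why k-mer length is treated as a separate factor from the choice of genome representation.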