Figure 1.
The pipeline for building profile HMMs from a set of curated viral protein sequences.
An initial set of protein sequences of interest is curated and reduced by collapsing high-identity sequences. The similarity between all pairs of remaining sequences is calculated using BLAST. Using the BLAST results, polyprotein sequences are inferred and removed. The Markov Clustering algorithm groups the remaining sequences into families. Sequences with extreme lengths are removed before multiple sequence alignments are generated for each family. Multiple sequence alignments are used to train profile HMMs. Statistics for each step in the generation of the vFams are in parentheses.
Figure 2.
Viral sequence recall for all vFams in cross-validation.
(A) A schematic of the cross-validation procedure is shown for a single vFam. The initial multiple sequence alignment (MSA) and HMM building are depicted for the vFam being tested (top left). Each sequence is removed from the vFam exactly once, and a validation MSA and validation HMM are built from the remaining sequences. A set of test sequences comprising a large set of non-viral sequences and all viral sequences across all vFams is aligned to the validation HMM, and the left-out sequence is evaluated. If the left-out sequence is matched by the validation HMM with an E-value ≤10, the sequence is considered “recalled” by the vFam (black). If the left-out sequence is recalled and additionally has a lower E-value than all test sequences not in the current vFam, the sequence is considered “strictly recalled” (red). The process is repeated for all “N” sequences in the vFam, and the vFam’s % recall and % strict recall are calculated. Each vFam was evaluated in this manner. (B) For each vFam in the cross-validation experiments, the percentage of recalled sequences (black) and the percentage of strictly recalled sequences (i.e., E-value less than non-viral controls; red) are plotted. The vFams are ranked by their percentage of strictly recalled sequences (x-axis). A threshold of 80% strict recall (dashed blue line) was used to filter the vFams to the best-performing subset. Scale bars below the x-axis show the number and fraction of vFams in the ranked set.
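The recall criteria in panel A can be illustrated with a minimal Python sketch. It assumes that, for each left-out sequence, the E-value reported by the validation HMM and the lowest E-value among test sequences outside the vFam have already been extracted from the search output. The names `HeldOutResult`, `is_recalled`, `is_strictly_recalled`, and `vfam_recall` are hypothetical and are not taken from the authors' pipeline; treating "no hit outside the vFam" as satisfying the strict criterion is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

RECALL_EVALUE = 10.0  # E-value threshold stated in the legend


@dataclass
class HeldOutResult:
    """Outcome of one leave-one-out round (hypothetical container)."""
    # E-value of the left-out sequence against the validation HMM;
    # None if the HMM did not report the sequence at all.
    held_out_evalue: Optional[float]
    # Lowest E-value among test sequences NOT in the current vFam;
    # None if no such sequence was reported (assumption: strict recall holds).
    best_outside_evalue: Optional[float]


def is_recalled(r: HeldOutResult) -> bool:
    """Recalled: reported by the validation HMM with E-value <= 10."""
    return r.held_out_evalue is not None and r.held_out_evalue <= RECALL_EVALUE


def is_strictly_recalled(r: HeldOutResult) -> bool:
    """Strictly recalled: recalled AND better than every outside sequence."""
    if not is_recalled(r):
        return False
    if r.best_outside_evalue is None:
        return True
    return r.held_out_evalue < r.best_outside_evalue


def vfam_recall(results: List[HeldOutResult]) -> Tuple[float, float]:
    """Return (% recall, % strict recall) over all N left-out sequences."""
    n = len(results)
    if n == 0:
        raise ValueError("a vFam must contain at least one sequence")
    recalled = sum(is_recalled(r) for r in results)
    strict = sum(is_strictly_recalled(r) for r in results)
    return 100 * recalled / n, 100 * strict / n


# Toy example with made-up E-values for a four-sequence vFam.
example = [
    HeldOutResult(1e-20, 1e-5),   # recalled and strictly recalled
    HeldOutResult(1e-3, 1e-10),   # recalled, but an outside sequence scores better
    HeldOutResult(None, 1e-4),    # not reported: not recalled
    HeldOutResult(5.0, None),     # recalled; no outside hit, so strict by assumption
]
percent_recall, percent_strict = vfam_recall(example)
```

Under this sketch, a vFam would pass the filter in panel B if its strict recall reached the 80% threshold; whether the published filter includes exactly 80% is not stated in the legend.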
Figure 3.
Viral sequence recall as a function of other vFam metrics.
For each vFam, the percentage of the vFam’s sequences correctly recalled by the HMM with a score better than all non-viral controls (% strict recall) in the cross-validation experiments is plotted as a function of (A) the number of sequences used to build the vFam; red box (zoomed and inset) highlights HMMs built from 40 or more sequences with strict recall less than 3%, (B) the length of the vFam, (C) the positional relative entropy in the vFam, and (D) the total relative entropy in the vFam.
Figure 4.
Performance of vFams and BLAST on metagenomic datasets.
A comparison of BLAST vs. HMMER for the detection of Human klassevirus 1, Santeuil nodavirus, and CAS virus. (A) Percent amino acid identity for 80 aa windows is shown between the Human klassevirus 1 and Aichi virus polyprotein sequences (green); genome coverage of correctly classified viral reads by BLAST and HMMER is shown in blue and orange, respectively; the difference in coverage (Δ coverage = HMMER coverage − BLAST coverage) is shown in black; the regions of the genome truly covered in the full dataset are shown in pink; a to-scale genome schematic of Human klassevirus 1 is shown below, depicting structural proteins (yellow) and non-structural proteins (blue). (B) The number of true positives vs. the number of false positives for the detection of Human klassevirus 1 is depicted for BLAST (blue) and HMMER (orange). (C) Percent amino acid identity for 84 aa windows is shown between ORF A and ORF α of Santeuil nodavirus and the RdRP and capsid proteins of the Striped Jack Nervous Necrosis virus (green) [no homolog of ORF δ was detected at the time of the discovery] [42]; genome coverage of correctly classified viral reads by BLAST and HMMER is shown in blue and orange, respectively; the difference in coverage (Δ coverage = HMMER coverage − BLAST coverage) is shown in black; the regions of the genome truly covered in the full dataset are shown in pink; a to-scale genome schematic of Santeuil nodavirus RNA-1 and RNA-2 is shown below, depicting ORF A (yellow), and ORF α and ORF δ (blue). (D) The number of true positives vs. the number of false positives for the detection of Santeuil nodavirus is depicted for BLAST (blue) and HMMER (orange).
(E) Percent amino acid identity for 33 aa windows is shown between the L protein, the glycoprotein, and the nucleoprotein of CAS virus and the L protein of Lymphocytic choriomeningitis virus, the glycoprotein of Lloviu virus, and the nucleoprotein of Lymphocytic choriomeningitis virus, respectively (green) [no homolog of the Z protein was detected at the time of the discovery] [12]; genome coverage of correctly classified viral reads by BLAST and HMMER is shown in blue and orange, respectively; the difference in coverage (Δ coverage = HMMER coverage − BLAST coverage) is shown in black; the regions of the genome truly covered in the full dataset are shown in pink; a genome schematic of the CAS virus L segment and S segment is shown below, depicting the Z and L proteins (yellow) and the glycoprotein (GPC) and nucleoprotein (NP) (blue), respectively. (F) The number of true positives vs. the number of false positives for the detection of CAS virus is depicted for BLAST (blue) and HMMER (orange).
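The Δ coverage track in panels A, C, and E is the per-position difference between HMMER and BLAST coverage. The Python sketch below shows one way such a track could be computed. It assumes that correctly classified reads are available as 0-based, half-open `(start, end)` intervals on the genome and that coverage means read depth per position; the legend does not specify either, and the function names `coverage` and `delta_coverage` are hypothetical.

```python
from typing import Iterable, List, Tuple


def coverage(genome_length: int, intervals: Iterable[Tuple[int, int]]) -> List[int]:
    """Per-position read depth from 0-based, half-open (start, end) intervals."""
    depth = [0] * genome_length
    for start, end in intervals:
        # Clip each read to the genome boundaries before counting.
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def delta_coverage(hmmer_cov: List[int], blast_cov: List[int]) -> List[int]:
    """Delta coverage = HMMER coverage - BLAST coverage, position by position."""
    if len(hmmer_cov) != len(blast_cov):
        raise ValueError("coverage tracks must span the same genome length")
    return [h - b for h, b in zip(hmmer_cov, blast_cov)]


# Toy example on a 10-position genome with made-up read placements.
hmmer_track = coverage(10, [(0, 4), (2, 6)])
blast_track = coverage(10, [(0, 4)])
delta_track = delta_coverage(hmmer_track, blast_track)
```

Positive values in `delta_track` mark positions where HMMER classified more reads correctly than BLAST, and negative values mark the reverse.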
Table 1.
Statistics on vFam performance on metagenomic datasets.