Figure 1.
Schematic of bioinformatics pipeline used for processing of NGS libraries.
High-quality reads, excluding ribosomal and mitochondrial sequences, were aligned against the taxonomy databases of NCBI using BLASTn (taxonomic classification). Unclassified or ambiguously classified reads, together with virus, phage, and HERV sequences, were assembled into scaffolds. Scaffolds were used to query the non-redundant protein database of NCBI using BLASTx to identify viral proteins with similarity to predicted polypeptides in our scaffolds (finding novel viruses). Given the large genomes of NCLDVs, hits to this class of viruses were reanalyzed with the profile hidden Markov model-based algorithm HHblits. PCR and Sanger sequencing were used to confirm the presence of novel viral-like sequences in our samples.
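The routing step between taxonomic classification and assembly can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the category labels, the `select_reads_for_assembly` function, and the read identifiers are all hypothetical, and the sketch assumes that each read already carries a single category derived from its BLASTn hits.

```python
from typing import Iterable, List, Tuple

# Hypothetical category labels; the legend names these read classes,
# but not how they were encoded in the actual pipeline.
ASSEMBLY_CATEGORIES = {"unclassified", "ambiguous", "virus", "phage", "herv"}


def select_reads_for_assembly(classified: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the IDs of reads whose category is passed on to scaffold assembly."""
    return [
        read_id
        for read_id, category in classified
        if category.lower() in ASSEMBLY_CATEGORIES
    ]


# Illustrative input only (made-up read IDs and labels).
example = [
    ("read1", "human"),
    ("read2", "virus"),
    ("read3", "unclassified"),
    ("read4", "bacteria"),
    ("read5", "HERV"),
]
selected = select_reads_for_assembly(example)
```

In this sketch, reads confidently assigned to human or bacteria are set aside, while everything that might harbor known or novel viral sequence is kept for assembly.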
Figure 2.
Viral read ends represent only a small fraction of libraries from plasma.
Pie charts: Classification of reads from each library into human, bacteria, virus, and unknown categories (HERV and phage sequences are not included because they were found at frequencies too low to be visible at the scale of the pie charts). The number of clean reads (high-quality sequences excluding ribosomal, mitochondrial, and low complexity sequences; in black), ambiguous reads (in brackets), viral reads (in red), and the density of viral reads per million (in blue) are indicated. Bar graphs: Red bars depict the total number of viral reads (normalized to 100%) and green bars represent the percentage thereof that corresponds to the most abundant virus in each library (N.I., not identified, indicates that no predominant virus was identified during BLASTn alignments).
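The two normalizations named in this legend can be illustrated with a short sketch. It assumes that the density of viral reads per million is computed relative to the number of clean reads in the library (the legend does not state the denominator explicitly), and the function names and counts below are hypothetical, chosen only for illustration.

```python
from typing import Dict, Optional, Tuple


def viral_reads_per_million(viral_reads: int, clean_reads: int) -> float:
    """Viral read density per million clean reads (assumed denominator)."""
    if clean_reads <= 0:
        raise ValueError("clean_reads must be positive")
    return viral_reads * 1_000_000 / clean_reads


def dominant_virus_share(counts: Dict[str, int]) -> Tuple[Optional[str], float]:
    """Return the most abundant virus and its percentage of all viral reads."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0  # no viral reads, so no predominant virus
    name = max(counts, key=counts.get)
    return name, 100.0 * counts[name] / total


# Made-up counts for a single library; not data from the study.
density = viral_reads_per_million(viral_reads=50, clean_reads=1_000_000)
top_virus, top_percent = dominant_virus_share({"HBV": 90, "TTV": 10})
```

Here the red bar corresponds to the full set of viral reads (100%) and the green bar to `top_percent`, the share taken by the single most abundant virus.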
Figure 3.
Classification of viral reads into families reveals high diversity of viruses in plasma samples.
Reads were classified into families according to the taxonomy databases of NCBI. Solid bars represent DNA libraries and open bars represent RNA libraries. Data are presented on a log10 scale. The names of the DNA viral families are shaded (bottom).
Figure 4.
NGS identified several classes of viruses that would otherwise remain undetected.
Panels A to C show the mapping of reads along the indicated viral genome. (A) Expected abundant viruses were recovered at high and uniform coverage, as illustrated for HBV (NC_003977.1) in library hbvP02D and HCV genotype 2 (NC_009823.1) in library hcvP05. (B) Unexpected abundant viruses were also found at high and uniform coverage, as illustrated for GBV-C/Hepatitis G virus (NC_001710.1) in library aihP01 and TTV 5 isolate TCHN-C1 (AF345523.1) in library nshP01D. (C) A few read ends mapping to HCV genotype 1 (NC_004102.1) were found in library nshP01, which initially tested negative for HCV in serological tests. (D) HCV was PCR amplified in one of the three samples pooled for construction of library nshP01 (upper panel). Additionally, TTV was found in samples NSH and AIH, but not in samples from patients affected by chronic hepatitis C (HCVP03 and HCVP05).
Figure 5.
A novel circovirus genome was assembled from sequences in sample AIH.
(A) The schematic of Scaffold2603 shows the 580 reads used for assembly and the location of the two predicted open reading frames. For the few reads that BLASTn matched to the YN-BtCV-1 (AEL87784.1) and RW-E (NC_013023.1) circovirus reference sequences, red segments on the aligned reads represent mismatches. (B) Phylogenetic tree derived from the alignment of the replicase amino acid sequences from the circoviruses named on the tree branches (the GenBank identifier is also indicated). The dendrogram was calculated using the neighbor-joining method and the bootstrap option (1000 replications) of the MEGA 4 software (http://www.megasoftware.net/mega4/mega.html). The lengths of vertical branches are arbitrary, while the lengths of horizontal branches are proportional to the calculated mutational distances. Numbers at nodes indicate percentage bootstrap scores.
Table 1.
Representative viral proteins identified as high scoring hits during BLASTx alignments of assembled scaffolds.