Is There Still Room for Novel Viral Pathogens in Pediatric Respiratory Tract Infections?

Viruses are the most frequent cause of respiratory disease in children. However, despite the advanced diagnostic methods currently in use, a specific pathogen cannot be detected in 20 to 50% of respiratory samples. In this work, we used a metagenomic approach and deep sequencing to examine respiratory samples from children with lower and upper respiratory tract infections that had previously been found negative for 6 bacteria and 15 respiratory viruses by PCR. Nasal washings from 25 children (out of 250) hospitalized with a diagnosis of pneumonia and nasopharyngeal swabs from 46 outpatient children (out of 526) were studied. DNA reads for at least one virus commonly associated with respiratory infections were found in 20 of 25 hospitalized patients, while reads for pathogenic respiratory bacteria were detected in the remaining 5 children. For outpatients, the samples were distributed into 25 DNA libraries (individual or pooled samples) for sequencing. In this case, at least one respiratory virus was identified in 22 of the 25 sequenced libraries, while pathogenic bacteria were detected in all the others but one. In both patient groups, reads for respiratory syncytial virus, coronavirus-OC43, and rhinovirus were identified. In addition, viruses less frequently associated with respiratory infections were also found. Saffold virus was detected in outpatient but not in hospitalized children. Anellovirus, rotavirus, and astrovirus, as well as several animal and plant viruses, were detected in both groups. No novel viruses were identified. Adding the deep sequencing results to the PCR data, 79.2% of 250 hospitalized and 76.6% of 526 ambulatory patients were positive for viruses, and all other children but one had pathogenic respiratory bacteria identified. These results suggest that, at least in the type of populations studied and with the sampling methods used, the odds of finding novel, clinically relevant viruses in pediatric respiratory infections are low.


Introduction
Acute respiratory infections (ARIs) are the most common illnesses in humans and are associated with significant morbidity and mortality in young children in developing countries and elderly people in developed countries. In children, 156 million episodes of pneumonia are recorded annually worldwide, of which more than 95% are reported in developing countries [1,2]. In 2008, 1.6 million children younger than 5 years died from pneumonia [3]. To try to reduce child mortality due to ARIs, it is important to perform a more accurate diagnosis of the pathogens associated with those deaths in children younger than 5 years of age [1].
Introduction of PCR-based diagnostic methods has increased the ability to detect respiratory viruses, which are responsible for most ARIs in young children [4,5,6]. Several respiratory viruses, such as influenza, parainfluenza virus, adenovirus, respiratory syncytial virus (RSV) and coronavirus (HCoV), have been known for some time as etiological agents of lower respiratory tract infections (LRTI). More recently, with the improvement of diagnostic methods, rhinovirus (RV), which had been thought to be mostly associated with mild-to-moderate upper respiratory tract infections (URTI), was also found to be associated with severe respiratory infections [7,8]. In addition, in the last decade several new respiratory viruses have been identified, such as human metapneumovirus (hMPV), HCoV-NL63 and -HKU1, human bocavirus (HBoV), parechovirus (HPeV), polyomavirus KI and WU, and enterovirus 104 and 109 [9,10,11,12,13,14,15]. In this regard, since even with state-of-the-art diagnostic tools most studies detect a virus in only 50% to 80% of upper and lower ARIs [4,5,6,16,17,18,19], the question arises whether more respiratory viruses are associated with ARIs than those currently known [20].
In this work, we analyzed by next generation sequencing (NGS) nasopharyngeal samples from children with LRTI and URTI that had been found negative for a panel of 21 respiratory pathogens (15 viruses and 6 bacteria) using commercial multiplex PCR methods. This study contributes to the description of the viral and bacterial populations present in nasopharyngeal samples from children with lower and upper ARIs using a metagenomic approach, which so far has been employed in limited studies [21,22,23], and suggests that the current diagnostic methods likely miss known respiratory pathogens, which might explain the relatively high proportion of undiagnosed cases.

Study populations and clinical samples
Two pediatric populations with symptomatic respiratory tract infections were included in this study. The first consisted of children with LRTI who required hospital admission due to clinical or radiological signs or symptoms of pneumonia in four different states of Mexico. Nasal washings with 1.5 ml of saline solution were collected from 250 children (male:female ratio, 1.43; age range, 1-76 months) between March 2010 and April 2011. The second population was composed of patients with symptomatic URTI who attended private practices in five different cities of the state of Veracruz, Mexico. Nasopharyngeal swabs (rayon-tipped, BD BBL) were collected from 526 children (male:female ratio, 1.27; age range, 0-191 months) from September 2011 to April 2012. All samples were placed in vials containing viral transport medium (1:1 in the case of nasal washings; Microtest M4-RT, Remel), sent frozen on blue ice either to the Institute of Biotechnology in Cuernavaca (URTI samples) or to the School of Medicine in Mexico City (LRTI samples), and stored at −70 °C until analyzed. All children were previously healthy, had no diagnosis of tuberculosis or signs of malnutrition, and were not immunocompromised. Administration of antibiotics before hospital admission was not registered; in outpatients no antibiotics were administered before sample collection. The children included in the study were those who arrived consecutively at the collection places during the study period, with no further selection. The study (project 186) was approved by the institutional review boards of the School of Medicine and the Institute of Biotechnology of the National University of Mexico and by the institutional review board and ethics committee of each participating hospital. Written informed consent was obtained from each parent or guardian prior to enrollment.

Pathogen detection
The respiratory specimens from hospitalized and outpatient children were previously screened for viruses using the xTAG Bioplex respiratory Viral Panel (Abbott, Rungis, France) (JI Santos et al., in preparation) and the Seeplex RV15 ACE detection kit (Seegene, Seoul, Korea) (Wong-Chew et al., in preparation), respectively. The virus-negative samples from both groups of patients were screened in this work by a multiplex PCR (Seeplex Pneumobacter ACE detection kit, Seegene, Seoul, Korea) for the presence of six bacteria commonly associated with respiratory infections: Streptococcus pneumoniae, Haemophilus influenzae, Chlamydophila pneumoniae, Legionella pneumophila, Bordetella pertussis, and Mycoplasma pneumoniae.

Nucleic acid extraction, amplification and barcode labeling
Genetic material from clinical samples was extracted with the PureLink Viral RNA/DNA kit according to the manufacturer's instructions (Invitrogen, Waltham, MA). Before extraction, samples (200 µl) were treated with Turbo DNAse (Ambion, Waltham, MA) and RNAse (Sigma, St. Louis, MO) for 30 min at 37 °C and immediately chilled on ice. Nucleic acids were eluted in nuclease-free water, aliquoted, quantified in a NanoDrop ND-1000 (NanoDrop Technologies, Waltham, MA), and stored at −70 °C until further use. Random-primer amplification of sample nucleic acids was performed essentially as described previously [24]. Briefly, reverse transcription was done using SuperScript III Reverse Transcriptase (Invitrogen, Waltham, MA) and primer-A (5'-GTTTCCCAGTAGGTCTCN9-3'). The complementary DNA (cDNA) strand was generated by two rounds of synthesis with Sequenase 2.0 (USB, USA). The cDNA obtained was then amplified with KlenTaq polymerase (Sigma, St. Louis, MO) using primer-B (5'-GTTTCCCAGTAGGTCTC-3') and 20 cycles of the following program: 30 sec at 94 °C, 1 min at 50 °C, 1 min at 72 °C. After cleaning the PCR products with the DNA Clean & Concentrator-5 kit (Zymo Research, Irvine, CA), DNA was digested with the GsuI restriction enzyme (Fermentas, Waltham, MA) for 2 h at 30 °C to remove sequences corresponding to PCR primers. After digestion, samples were purified again and used as starting material to prepare 300 bp-sized libraries using Illumina's Genomic DNA Sample Prep Kit with multiplex primers as suggested by the manufacturer (Illumina, San Diego, CA). Libraries were loaded in a flow cell (4 or 5 libraries per lane) and sequencing was performed by 72 cycles of nucleotide extension followed by acquisition of the multiplex code in a Genome Analyzer IIx. The datasets generated by the GAIIx were deposited in the European Nucleotide Archive, with study accession numbers PRJEB7390 and PRJEB7391 for URTI and LRTI samples, respectively.

Deep sequencing and sequence analysis
Image analysis and base calling were performed with the Illumina GAPipeline program (version 1.3.0) using standard parameters. To separate the samples, the pooled data from each lane were binned by barcode. In-house scripts were developed for the sequence analysis, including the following steps: i) Preprocessing. For each read, the adapter and the 5' and 3' bases with no-call sites (N residues) or low quality (Phred-like scores < 20) were trimmed. Then, low-complexity reads and reads shorter than 35 bases were removed. Finally, identical reads were collapsed into a single representative sequence to optimize analysis time. Only reads passing the preprocessing step were considered valid.
ii) Removal of host sequences. The program SMALT (Wellcome Trust Sanger Institute, 2012) was used to align the reads against human genome, mitochondrial, and bacterial ribosomal RNA sequences; reads matching with 90% coverage and 90% identity were removed.
iii) Taxonomic identification. To minimize CPU time, valid reads were first aligned to the NCBI nt databases for bacteria, fungi, and viruses using SMALT with 70% coverage and identity. Then, the reads that mapped were aligned with standalone BLASTn against the databases described above, using an E-value of 1e-03.
To avoid misclassification, the first 100 hits were obtained for each sequence. Reads that did not map were considered as unidentified.
iv) Taxonomic classification. To assign reads to the most appropriate taxonomic level the software MEGAN 4.70.4 was used, which assigns a read to the lowest common taxonomic ancestor of the organisms corresponding to the set of significant hits.
v) Assembly. Reads assigned to the same virus family were subsequently used for de novo assembly with Velvet 1.1.04 to increase the accuracy of classification. Each assembled contig was then aligned against the nucleotide databases with BLASTn.
vi) Detection of novel viruses. All unidentified sequences unaligned using SMALT nucleotide alignment were assembled de novo with Metavelvet modified by us to improve the assembly efficiency. First, we conducted exploratory assemblies of the reads using multiple hash lengths (k = 17-35). Then, additional assembly of all unused reads from the exploratory assemblies was done (k = 21). Finally, we assembled all contigs obtained from all exploratory assemblies and the unused reads assembly by using the program VelvetOptimiser. From this final assembly, contigs that were greater than 180 nt were directly compared with NCBI nr (non-redundant protein) database using BLASTx with an E-value of 100 in an attempt to identify novel viruses.
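As an illustration of steps i) and iv) above, the following is a minimal Python sketch, not the in-house scripts used in this work. It assumes reads are available as (sequence, Phred score list) pairs and that each significant hit has already been resolved to a root-to-leaf taxonomic lineage. The thresholds of 35 bases and Phred 20 come from the text; the single-base complexity filter and all function names are hypothetical, and adapter removal is omitted.

```python
from collections import OrderedDict

MIN_LEN = 35    # reads shorter than this are removed (from the text)
MIN_QUAL = 20   # Phred-like threshold for end trimming (from the text)

def trim_ends(seq, quals, min_qual=MIN_QUAL):
    """Trim 5' and 3' bases that are N or fall below the quality threshold."""
    def bad(i):
        return seq[i] == "N" or quals[i] < min_qual
    start, end = 0, len(seq)
    while start < end and bad(start):
        start += 1
    while end > start and bad(end - 1):
        end -= 1
    return seq[start:end]

def is_low_complexity(seq, max_fraction=0.8):
    """Hypothetical filter: flag reads dominated by a single base."""
    if not seq:
        return True
    return max(seq.count(b) for b in "ACGT") / len(seq) > max_fraction

def preprocess(reads):
    """Return an ordered mapping of unique valid reads to their copy number."""
    unique = OrderedDict()
    for seq, quals in reads:
        trimmed = trim_ends(seq, quals)
        if len(trimmed) < MIN_LEN or is_low_complexity(trimmed):
            continue
        # Identical reads collapse into one representative sequence.
        unique[trimmed] = unique.get(trimmed, 0) + 1
    return unique

def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all root-to-leaf lineages, or None."""
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca
```

For example, a read whose significant hits include both a rhinovirus A and a rhinovirus C lineage would be assigned to their shared genus rather than to either species, which is the behavior described for the lowest-common-ancestor assignment.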

Phylogenetic tree inference
Metagenomic contigs from specific viruses that were at least 150 nt long were phylogenetically characterized. The analysis required a different approach from that used for full-length genomes because metagenomic sequences are fragmentary and do not completely overlap. Therefore, for each virus, a database of complete genomes was first created using all sequences available in GenBank until January 2014. Then, a reference alignment of the sequences in this database was built with MUSCLE. Next, we combined the metagenomic contigs into a single large alignment using the software MAFFT with the option to align fragment sequences to a reference alignment. Finally, maximum likelihood trees were generated with 1000 bootstrap replicates using the MEGA program.

Pathogen detection
In previous studies we screened by RT-PCR for the presence of 15 respiratory viruses in nasal washings from 250 hospitalized children with a clinical diagnosis suggestive of viral pneumonia and in 526 nasopharyngeal samples from children with URTI (see Materials and Methods). Table 1 shows the frequency of the different viruses found in both types of samples. Among the viruses detected, considering both single and multiple infections, RSV-A and rhinovirus showed the highest frequency in both LRTI and URTI. At least one virus was detected in 71.2% (178/250) of LRTI (Santos et al., manuscript in preparation) and 71.5% (376/526) of URTI (Wong-Chew et al., manuscript in preparation). In 40 of the 250 LRTI samples (16%) a viral coinfection was found. Thirty-four of these samples had a dual infection, with the combinations RSV-A/RV and RSV-A/AdV being the most frequent, while 6 children had triple virus infections. In the case of URTI, 73 of the 526 samples (13.9%) showed a viral coinfection. Sixty-three of these samples had a dual infection, with the combinations AdV/EV and RV/CoV 229/N63 being the most frequent. Eight children had triple virus infections, and two were infected simultaneously with four viruses.
The virus-negative samples were screened by a multiplex PCR for the presence of six bacteria commonly associated with respiratory infections. In 64.7% (46/71, LRTI) and 68.7% (103/150, URTI) of the virus-negative samples at least one bacterial pathogen was found. The most frequent bacteria detected in children in both populations were S. pneumoniae (36 LRTI, 88 URTI) and H. influenzae (24 LRTI, 47 URTI); in a few cases C. pneumoniae (9 URTI) and M. pneumoniae (2 LRTI, 2 URTI) were also detected. In 37 children with URTI two different bacteria were found, and in 3 children 3 bacteria were detected. In the case of LRTI, 8 children had a mixed infection. It is important to keep in mind that bacterial colonization, frequently at low bacterial colony counts, may be detected by very sensitive laboratory tests and that, even more frequently than viruses, these bacteria may not be associated with acute disease.
After screening for common respiratory viruses and bacteria, 90% of children with LRTI and 91.3% with URTI had at least one pathogen identified. The remaining 25 (10%) hospitalized and 46 (8.7%) outpatient children remained negative for all the tested pathogens and were then characterized by next-generation sequencing (NGS).

Next-generation sequencing of negative samples
To search for either known or novel respiratory pathogens in the double-negative (virus and bacteria) samples, the nucleic acids in these samples were isolated, amplified by PCR, and sequenced using the Illumina platform, as described in Materials and Methods. The 25 samples from children with LRTI were sequenced individually (listed in Table 2). In the case of the URTI samples, 9 were sequenced individually, while the amount of DNA isolated from the other 37 samples was too low to be analyzed independently. These 37 samples were therefore used to prepare 16 pools for sequencing: 13 pools of two samples, 1 pool of three samples, and 2 pools of four samples (Table 3).
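The pooling scheme for the URTI samples can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are illustrative.

```python
# Library layout for the 46 double-negative URTI samples, as stated in the text.
individual = 9                 # samples sequenced on their own
pools = {2: 13, 3: 1, 4: 2}    # pool size -> number of pools of that size

pooled_samples = sum(size * count for size, count in pools.items())
libraries = individual + sum(pools.values())
total_samples = individual + pooled_samples

print(pooled_samples, libraries, total_samples)  # 37 25 46
```

The 16 pools plus the 9 individual samples account for the 25 sequenced URTI libraries and for all 46 double-negative outpatient samples.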
The total number of DNA reads and the valid unique reads obtained from each sample after passing the quality controls are shown in Tables 2 and 3. The valid reads were analyzed for the presence of sequences of human, bacterial, fungal, or viral origin. As expected, the most abundant reads were of human origin, representing 70% and 80% of the valid reads from LRTI and URTI patients, respectively (Fig. 1). Bacterial sequences made up the second largest data set, representing 15.2% of the sequence reads in LRTI and 8.5% in URTI. Viral sequences represented 0.56% and 0.57% of valid reads in LRTI and URTI, respectively, and only 0.05% of reads corresponded to fungi (Fig. 1). Finally, approximately 13% and 10% of the sequences in LRTI and URTI, respectively, could not be classified since no homolog was found (E-value 1e-03) or there were contradicting database hits. This category is referred to as 'undefined' in Figure 1. Of interest, despite the fact that the samples from LRTI and URTI were collected by different methods (nasal washings vs. swabs), and from children with different clinical syndromes and varying severity of respiratory disease, the proportion of sequences from different origins was very similar.
The undefined sequence reads from all samples were assembled, and contigs ≥180 nt were compared with the non-redundant protein database of GenBank (E-value 100) to find sequences that could be distantly related to known viral sequences and could thus represent novel viruses. Indeed, short sequences are less likely than long sequences to retrieve statistically significant similarities in BLAST searches, and sequence assembly into longer contigs helps to overcome this difficulty. As a result, all filtered contigs aligned either to bacterial or human proteins during BLASTx runs. An analysis revealed that the contigs that mapped to bacteria showed only 60-80% nucleotide identity to their best-matching reference, indicating that they most likely represent novel species within their corresponding genera and thus could not be classified during alignments with BLASTn. Nonetheless, the vast majority of reads (50% to 90%) were not assembled into contigs. The unassembled reads were low-complexity sequences or library artifacts such as adapter chimeras, suggesting that it is unlikely that they correspond to novel viruses. A small remaining number of sequences could not be assembled because of non-uniform read depth resulting from a non-uniform species abundance distribution.

Viruses detected by NGS in double-negative samples
DNA sequence reads from at least one virus commonly associated with respiratory infections were found in 20 of the 25 double-negative samples of LRTI patients (Table 4): 5 samples were positive for RSV reads, 11 samples for HCoV-OC43, and 9 for RV. In addition, 5 samples contained HBoV, and in 12 samples anelloviruses (torque teno virus, TTV; torque teno mini virus, TTMV; or torque teno midi virus, TTMDV) were also detected. Rotavirus, papillomavirus, and herpesvirus sequences were each identified once in the samples, and reads from several viruses of both animal (bat picornavirus, bovine viral diarrhea virus, bovine kobuvirus) and plant origin (potato virus Y, pepper mild mottle virus), as well as various bacteriophages, were also found (Table 4). Regarding bacteria, DNA sequence reads from S. pneumoniae were the most frequent, being present in all but one of the 25 samples sequenced, while M. catarrhalis, L. pneumoniae, and H. influenzae were less frequently found. DNA reads from other bacteria less commonly associated with respiratory infections were also detected (Table 4). Some of the samples had sequence reads corresponding to up to 8 different viruses or 15 different bacteria. Of interest, including the NGS results, 79.2% (198/250) of the samples had a respiratory virus detected, and in the remaining 52 samples at least one bacterium was found, such that all 250 samples from children with LRTI had a respiratory pathogen identified.
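The combined LRTI detection rate quoted above follows from adding the NGS-positive samples to the PCR-positive ones. The short sketch below reproduces that arithmetic using only figures stated in the text; the variable names are illustrative.

```python
# LRTI cohort: combine PCR and NGS virus detections (counts from the text).
total_lrti = 250
pcr_virus_positive = 178   # at least one virus by multiplex PCR
ngs_virus_positive = 20    # respiratory virus reads among the 25 double-negative samples

virus_positive = pcr_virus_positive + ngs_virus_positive
virus_rate = round(100 * virus_positive / total_lrti, 1)
without_virus = total_lrti - virus_positive

print(virus_positive, virus_rate, without_virus)  # 198 79.2 52
```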
DNA reads from one to five typical respiratory viruses were detected in 22 of the 25 sequenced double-negative individual and/or pooled samples from children with URTI (Table 5). The virus most frequently detected was RV, which was found in 19 of the pooled and/or individual samples; some of the samples had more than one type of virus, such that we found sequence reads from 4 RV subtype A, 4 subtype B, and 19 subtype C. One sample was positive for RSV, 3 for HCoV-OC43, 3 for human enterovirus A71, and 3 samples had HBoV. Of interest, 5 of the samples had DNA reads from Saffold virus, a virus recently described to be associated with respiratory infections. Also, among these samples we identified 3 containing herpesvirus, 5 papillomavirus, 2 human astrovirus, 4 rotavirus, and 10 anelloviruses (TTV, TTMV, TTMDV). Similar to what was found in LRTI, in children with URTI DNA reads of viruses from animal (white spot syndrome and bat picornavirus) and plant origin (Okra mosaic virus, capsicum chlorosis virus, cucumber mosaic virus, pepper

Genome assembly and phylogenetic analyses
To estimate the sequence coverage of the NGS-identified viruses, the sequence reads were assembled de novo, and all contigs were used to estimate the extent of the virus genome coverage. Significant coverage was obtained for several of the detected viruses. In patients with LRTI, the genomes of 14 viruses were assembled with coverage higher than 20% (Table 6). As an indication of the sensitivity of NGS and of the relative abundance of some viruses not detected by conventional PCR, we could assemble more than 95% of the genome of two RV strains, 98% of one HBoV, and 99.8% of one HCoV-OC43 strain. In the case of children with URTI, at least 50% of the genome was covered for 15 viruses, including RV, HEV, Saffold virus, and TTV (Table 7), with 8 of them having a genome coverage of more than 90%. For the DNA reads of the animal viruses identified in both populations of children the coverage ranged between 0.57 and 4.9%, and for plant viruses between 0.13 and 7.15%, the latter in the case of tomato mosaic virus (Tables 5A and B).
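The text does not specify how genome coverage was computed from the contigs. One common way, shown here as a hedged Python sketch under that assumption, is to map each contig to a reference genome, merge overlapping intervals, and divide the covered length by the genome length. The function name and the interval convention (0-based, end-exclusive) are hypothetical.

```python
def genome_coverage(contig_intervals, genome_length):
    """Percent of reference positions covered by at least one contig.

    contig_intervals: (start, end) pairs on the reference, 0-based, end-exclusive.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(contig_intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval before opening a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent contig: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / genome_length
```

With this definition, overlapping contigs are not counted twice, so three contigs spanning positions 0-100, 50-150, and 300-400 of a 1000 nt reference cover 25% of the genome.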
The assembled sequences from HCoV-OC43, RV, HBoV, Saffold, and anelloviruses were used to construct phylogenetic trees to determine the genetic similarity of the viruses characterized in this work with those in databases (see Materials and Methods). All viruses from LRTI and URTI grouped with clades formed by previously reported virus sequences (not shown). Anelloviruses could be readily classified as TTV, TTMV, or TTMDV, with some samples containing the three genera (Tables 4 and 5). All HCoV detected belong to species HCoV-OC43 and grouped with other known betacoronaviruses. HBoVs were all genotype 1, while the Saffold viruses detected in this work belonged to either genotype 2 or 3. Of interest, in the case of some RV, HBoV, Saffold, and anelloviruses, different contigs mapped to different clades, suggesting that recombination events are common in this type of viruses, as has been reported for RV [25].

Discussion
Improvements in diagnostic methods have increased the rate of identification of viral pathogens in different clinical conditions, such as gastrointestinal, respiratory, or neurologic infections. However, despite these advances there is still a significant number of cases (20-50%) in which the etiologic agents are believed to be viruses, but the agent is not identified [26]. Previously, we reported the presence of a respiratory virus in about 71% of nasal samples obtained from children with LRTI and URTI (Aponte et al., manuscript in preparation; see also the Pathogen Detection section above), using a PCR method able to detect 15 different respiratory viruses. After screening the virus-negative samples by PCR for the presence of respiratory bacteria, 89.6% of children with LRTI and 91.1% with URTI had at least one potential pathogen identified. After NGS analysis of the double-negative samples, the percentage of virus-positive samples rose to close to 80% in both patient populations, and essentially 100% of the samples had either a common respiratory virus or bacterium identified (in only one of 526 URTI samples were DNA reads from a potential pathogen not identified). It is interesting that 6 of the 8 samples from both LRTI and URTI that were negative for viruses after NGS had fewer than one million valid reads (Tables 2 and 3). Since the number of sequence reads directly correlates with the amount of nucleic acids present in the original sample [21], the absence of virus detection in these samples could represent false-negative results; it is likely that with a larger amount of sample, or deeper sequencing, respiratory viruses could also have been detected.
The samples that tested negative for viruses by PCR and were subsequently found to be PCR-positive for respiratory bacterial pathogens were not characterized by NGS, but it is reasonable to assume that a high percentage of them could also have been positive for viruses by deep sequencing. It is difficult, however, to determine with confidence which, if any, of the detected pathogens could be responsible for the clinical respiratory symptoms observed. The viruses or bacteria detected by these methods could be present in the patient in an asymptomatic carrier state or as causal agents of asymptomatic infections. Studies comparing the presence of respiratory pathogens in nasal specimens from healthy children will help to resolve this issue. In addition, PCR and NGS are such sensitive techniques that the presence of small amounts of viral targets may not necessarily have clinical relevance. An additional limitation of this study is the limited number of samples analyzed. Exploring the possibility of defining cutoff levels represents the next necessary step for diagnosing viral respiratory infections using molecular tests [27]. It is important to mention, however, that for several of the RV and HCoV detected in this study high genome sequence coverages were achieved. This observation indicates that a high number of DNA reads, and probably also of virus particles, were present in the samples. These viruses could have gone undetected by PCR due to mismatches in the diagnostic primers used.
Of interest, the classes of viruses found by NGS in patients with LRTI and URTI were very similar, although their frequencies were different in the two study populations. RV was more frequently found in URTI (19 of 25 samples) vs. LRTI (9 of 25 samples), while coronavirus was more represented in LRTI (11/25) than in URTI (3/25). Only one of the 46 samples of children with URTI was positive for RSV, while in 5 of 25 samples from children with LRTI RSV was detected. Saffold viruses, members of the Picornaviridae family and cardiovirus genus, were found only in children with URTI. Since their initial description in 2007, these viruses have been shown to circulate worldwide, occur early in life, and involve the respiratory and gastrointestinal tracts. The association of these viruses with clinical symptoms is under investigation and requires additional epidemiological studies to clarify their pathogenicity [28]. Anelloviruses (TTV, TTMV, TTMDV) were found in both LRTIs and URTIs. Members of this family ubiquitously infect humans and establish persistent infections, although causal disease associations are currently lacking [10]. It is interesting to note that common gastrointestinal viruses such as astrovirus and rotavirus were found in some of the samples. This is not surprising, though, since rotaviruses have long been suspected to reach the gastrointestinal tract via the mouth and nose. In fact, some rotavirus infections have been associated with respiratory symptoms [29]. Finally, we found low amounts of DNA reads corresponding to animal and plant viruses. The number of these types of viruses is larger than previously reported in respiratory samples [21], although plant viruses have been found more abundantly in human feces [30]. Both plant and animal viruses are thought to be derived either from consumed food or acquired from the environment.
The search for new viruses using NGS technologies in mammalian, avian, and in particular, human samples, has contributed to the identification of new viruses in animal reservoirs and in different conditions of disease [31]. However, the considerable effort invested in mammalian and avian virus detection has only resulted in the discovery of variants of virus species, sister species to known viruses, and rarely genera. These observations contrast with the recent efforts to discover arthropod viruses, which have yielded widely divergent taxa that sometimes have even defined novel families [32]. Altogether, these observations and the presence of DNA sequence reads from common respiratory viruses or bacteria in essentially 100% of the samples collected from children with LRTI and URTI suggest that there is limited potential for the discovery of so far undescribed, clinically relevant viruses associated with pediatric respiratory disease, at least in the type of populations studied and with the sampling and diagnostic methods employed.