Evolutionary Strategies of Viruses, Bacteria and Archaea in Hydrothermal Vent Ecosystems Revealed through Metagenomics

The deep-sea hydrothermal vent habitat hosts a diverse community of archaea and bacteria that withstand extreme fluctuations in environmental conditions. Abundant viruses in these systems, a high proportion of which are lysogenic, must also withstand these environmental extremes. Here, we explore the evolutionary strategies of both microorganisms and viruses in hydrothermal systems through comparative analysis of a cellular and viral metagenome, collected by size fractionation of high temperature fluids from a diffuse flow hydrothermal vent. We detected a high enrichment of mobile elements and proviruses in the cellular fraction relative to microorganisms in other environments. We observed a relatively high abundance of genes related to energy metabolism as well as cofactors and vitamins in the viral fraction compared to the cellular fraction, which suggest encoding of auxiliary metabolic genes on viral genomes. Moreover, the observation of stronger purifying selection in the viral versus cellular gene pool suggests viral strategies that promote prolonged host integration. Our results demonstrate that there is great potential for hydrothermal vent viruses to integrate into hosts, facilitate horizontal gene transfer, and express or transfer genes that manipulate the hosts’ functional capabilities.


Introduction
The deep subsurface below hydrothermal systems hosts a high diversity of archaea, bacteria, and viruses that must tolerate extremely variable environmental conditions. High-temperature, reduced hydrothermal fluids mix with cold, oxidized seawater both above and below the seafloor to establish strong gradients in temperature, pH, and chemical and mineralogical composition [1][2][3][4][5]. Wide variations in environmental parameters can occur over centimeter scales. Constant fluid flux throughout and above the subsurface transports organisms from one region to the next, exposing them to a range of environmental conditions. Gradients that dominate this environment create a highly diverse microbial community consisting of both archaea and bacteria [6]. Physical and chemical parameters vary according to fluid mixing and volcanic activity, leading to niche partitioning in microbial communities across both space [1] and time [7,8]. Moreover, hyperthermophiles are routinely cultured from fluids that exit at low temperatures (5-30uC) [9,10], indicating that organisms in vent systems are frequently flushed from their native habitats, most likely from the deep subsurface. Microbial communities in hydrothermal systems are known to form biofilms that coat mineral surfaces, including within high-temperature chimney structures [5]. Such biofilms, which are likely to occur within the subsurface as well, host high-density communities with potentially high contact rates between organisms.
The dynamic, diverse and dense nature of this habitat should foster frequent exchange of genes within the microbial community. Previous work with vent samples has shown that the genes responsible for this process, including transposases and integrases, were observed to occur at high frequency in cellular metagenomes from hydrothermal systems as compared to other environments [11,12]. Analysis of fully sequenced genomes of thermophiles, including many from vent systems, suggests that gene transfers occur more frequently among thermophiles than mesophiles or psychrophiles [13,14] and that these transfers sometimes cross domains [13][14][15]. The prevalence of horizontal gene transfer in vent systems may expand the functional repertoire of a given species, expanding the pangenome and providing access to different ecological niches. This expanded metabolic flexibility would provide a strong advantage in hydrothermal vent environments where fluid flux and environmental gradients expose communities to wide extremes in temperature, pH, and chemical composition.
Here, we use comparative metagenomics to elucidate the role that viruses play in facilitating gene flow and manipulating host genetic potential in hydrothermal systems. Viruses are known to play pivotal roles in the transfer of genes and the alteration of host phenotype, particularly in the pelagic oceans (see Breitbart 2012 [16] for review). Bacterial and archaeal viruses introduce foreign genetic material through transduction and expression of virally encoded genes during infection. Transduction, or virally-mediated horizontal gene transfer, occurs on a massive scale in the surface oceans. Up to 10 14 transduction events can occur per year in Tampa Bay estuary [17], and virus-like particles that serve as gene transfer agents (GTAs) may boost these transduction rates by one million-fold [18]. Viruses are known to encode auxiliary metabolic genes, or AMGs, which play critical roles in facilitating biochemical or metabolic processes [19]. For example, cyanophage transcribe and express photosynthesis genes during lytic infection of their cyanobacterial hosts [20][21][22][23], potentially to support the host during infection, or to redirect host metabolism to support phage deoxyribonucleotide biosynthesis [24]. Lysogenic viruses can have similar impacts on their hosts: the expression of genes encoded by integrated viruses (also known as proviruses, or prophage in bacteria) can manipulate host phenotype, such as in the case of the cholera toxin expressed by a prophage integrated in the Vibrio cholerae genome [25]. Selection should favor expression of genes within lysogenic viruses that enhance host fitness while the virus is integrated in the genome. For example, it has been hypothesized that proviruses express genes that suppress host metabolism to conserve resources under low-energy or lownutrient conditions [26].
Despite increasing evidence that viruses play a crucial role in manipulating host genotype and phenotype in the surface oceans, this phenomenon has yet to be explored in the dynamic environment of hydrothermal vents. Viruses are abundant in hydrothermal systems [27] and have the potential to infect many different taxa of bacteria and archaea [3]. It has been suggested that up to 80% of archaea and bacteria in the deep ocean contain proviruses in their genomes [28]. Induction experiments have suggested that proviruses are particularly abundant in the genomes of archaea and bacteria from hydrothermal vent fluids compared to those in water from the deep ocean or the deep chlorophyll maximum [29]. Considering the abundance of viruses in these systems, and lysogenic viruses in particular, several questions arise: do these viruses transfer genes between hosts? Do they express fitness factors while integrated in the host genome? If so, which genes are expressed? Do viruses contribute to host genomic plasticity and facilitate their adaptation to changing conditions? Does selection act differently on viral genes compared to cellular genes?
To address these questions, we used a cultivation-independent approach that provides a community-wide perspective of both the viral gene pool and the bacterial and archaeal gene pool (hereafter referred to as the ''cellular'' gene pool) in hydrothermal systems. Specifically, we analyzed the unamplified viral and cellular metagenomes of high-temperature diffuse flow hydrothermal fluid from Hulk hydrothermal vent in the Main Endeavour Field on the Juan de Fuca Ridge. We compared the relative content of each of these gene pools and inferred the modes of genetic interaction between viruses and their hosts. This analysis focused on a unique fluid sample that contained organisms native to a wide range of ecological niches within the gradient-dominated hydrothermal environment, all with the potential to come into contact through constant fluid flux. Given the potential for gene and viral exchange across these niches, these metagenomes can provide insights into interactions within the diverse communal gene pool of the hydrothermal vent microbial community.
Comparative analysis of the cellular and viral metagenomes from this sample addressed whether viruses have the potential to manipulate the physiology or metabolism of their hosts. The presence of genes facilitating horizontal gene transfer and lysogenic virus integration described the genetic potential for these processes in the vent environment. We compared the relative abundance of genes in the viral and cellular gene pools in order to determine the types of genes enriched in the viral gene pool relative to the cellular gene pool. Finally, we asked how evolution has shaped the viral and cellular gene pools by examining relative selection pressures on viral and cellular genes. Together, these analyses provide insight into the broader question of how evolution has shaped the genomes of viruses and their hosts in some of the more extreme environments of the planet.

Sample collection and DNA extraction
We collected a 170-L hydrothermal vent fluid sample from Hulk vent at the Main Endeavour Field on the Juan de Fuca Ridge (47u57.009 N, 129u5.819 W) as described previously [3]. No specific permissions were required for these locations or sampling activities. The vent fluid was obtained using a large barrel sampler equipped with two 100-L sterile bags. We placed the sample collection funnel atop a region of diffuse venting, adjacent to a colony of tube worms on the side of a large sulfide structure. While the tube worms were surrounded by fluid at measured temperatures of 13-30uC, the average temperature of the metagenome fluid sample was calculated from its silica chemistry to be about 125uC [3]. This result indicates that we most likely sampled fluid ranging from cool background seawater to high-temperature hydrothermal fluid (up to 300uC) from the sulfide structure adjacent to the sample site, illustrating the dynamic fluid flux of these systems. The organisms collected in the sample therefore represent a range of habitats in the hydrothermal environment, including psychrophiles, mesophiles and thermophiles, and aerobic, microaerophilic, and anaerobic organisms. Some of these organisms may have been derived from deep subsurface fluids, whereas others from entrained seawater. Having been sampled from the same site, these organisms have the potential to come into contact within the vent environment due to dynamic fluid flux. We included available metadata about this vent sample in Table S1.
We collected the cellular fraction by filtering the 170 L of hydrothermal vent fluid through three 0.22 ml Steripaks (Millipore, USA) while the sample and filtrate were held on ice. The filtrate was retained for subsequent virus sampling. Filters were frozen at -80uC while shipboard and until sample processing. We extracted DNA from one Steripak using a modified DNA extraction procedure described by Anderson et al. [1]. Briefly, DNA extraction buffer (0.1 M Tris-HCl, 0.2 M Na-EDTA, 0.1 M NaH2PO4, 1.5 M NaCl, and 1% cetyltrimethylammonium bromide) was added to each filter, then the filters were capped and freeze-thawed five times. Lysozyme (50 mg/mL solution), proteinase K (1% solution), and SDS (20% solution) were added to each filter and incubated. Lysate was removed from filters and centrifuged; DNA was extracted from the supernatant using a phenol/chloroform/isoamyl extraction method described by Anderson et al. [1].
For virus collection, we concentrated the sample filtrate using tangential flow filtration (30 kDa cutoff) to approximately 400 mL in a 4uC cold room, and froze concentrated filtrate into six aliquots at -80uC until further processing. One aliquot was further concentrated by adding 10% w/v polyethylene glycol 8000 (PEG), incubating overnight at 4uC, and centrifuging at 13 0006g for 50 min. The pellet was resuspended in Tris-EDTA buffer and incubated for 15 min with 0.7 volume of chloroform to lyse any remaining cellular contamination. Free DNA was removed by incubating with 10% DNAse I for 2 h at 37uC, then inactivated by adding EDTA to a final concentration of 0.02 M. The QIAamp MinElute Virus Spin Kit (Qiagen) was used to extract the viral DNA, yielding approximately 90 ng, which was not amplified for downstream sequencing. PCR tests of extracted viral DNA using universal 16S archaeal and bacterial primers showed no amplification from contaminating cellular 16S rRNA, whereas positive controls of DNA extracted from E. coli showed successful amplification.

Metagenomic sequencing
The viral metagenome was generated on a Roche Genome Sequencer FLX (GSFLX) with GS FLX Titanium 454 sequencing protocols by the Broad Institute. For the cellular metagenome, libraries were created using the Nexterra transposon-mediated method (Epicentre) at the Josephine Bay Paul Center at the Marine Biological Laboratory, then sequenced using Roche Titanium 454 sequencing protocols on a GSFLX. Both metagenomes are publicly available on the MG-RAST v3 database [30], with accession numbers 4469452.3 for the viral metagenome and 4481541.3 for the cellular metagenome. The viral metagenome was previously described in Anderson et al. [3] and was uploaded under MG-RAST v2, and is still available at 4448187.3.
We used TagCleaner [31] to trim tags from the 59 end of each sequence in the cellular metagenome. Assembly of both the viral and cellular metagenomes was conducted in Geneious [32] using the ''Medium Sensitivity'' method, with a word length of 14, a maximum gap size of 2, maximum gaps per read of 15, and maximum mismatches of 2. To classify sequences using di, tri, and tetranucleotide analysis, we created a boutique database of bacterial and archaeal virus sequences (Table S2) as a training set to accompany the existing cellular dataset in PhylopythiaS [33], which was used to identify archaea, bacteria, archaeal viruses and bacterial viruses. Metagenomes were assembled in Geneious prior to classification with PhylopythiaS; only contigs over 1000bp in length were used.

Enrichment of proviruses and mobile genetic elements
To identify the numbers of reads in each metagenome matching lysogenic viruses, metagenomes were compared to a database of sequences from the ''Prophages'' category in the ACLAME database [34]. To assess abundance of mobile genetic elements, we compared all metagenomes to a dataset of Pfam seed sequences [35] matching transposases, recombinases, resolvases and integrases, listed in Table S3, using tblastn with an e-value cutoff of 10-5 . The number of unique reads with a match to a sequence from the query sequence collection was tallied and normalized to the number of reads in the metagenome. Only metagenomes generated with 454 pyrosequencing were used for the analysis, so that all metagenomic reads had a length ranging from approximately 100 to 300 bp.

Relative enrichment of gene categories
We tallied gene categories by adding ''abundance'' counts for each functional category as defined by the KEGG Orthology database [36], the Clusters of Orthologous Groups (COG) database [37], or the SEED Subsystems database [38] in MG-RAST v3, using an e-value cutoff of 10 23 . For the combined analysis of 20 cellular metagenomes and 23 viral metagenomes, all abundance counts for either viral or cellular metagenomes were tallied together. We used Xipe-Totec [39], a nonparametric method of statistical analysis using a difference of medians analysis, to determine whether abundance differences were statistically significant. For this analysis, we used a confidence level of 95% and a sample size of 5000 to determine significance.

Fragment recruitment
Cellular metagenomic reads were recruited to genomes of hydrothermal vent isolates using NUCmer, part of the MUMmer 3.0 package [40], with the following parameters for the command line: -minmatch 10 -breaklen 1200 -maxgap 1000 -mincluster 50. Fragment recruitments were visualized using mummerplot in the MUMmer package. Coverage plots were created by using the show-coords command in MUMmer, then in-house Python scripts were used to calculate coverage for each base pair position. Coverage plots were created with a convolution function in numpy, using a moving average window size of 50000.

Calculation of dN/dS
Prior to calculation of dN/dS ratios for genes mapped by each metagenome, metagenomes were subjected to stringent error filtering using Prinseq [41] with the following parameters: minimum sequence length of 60bp; minimum mean quality score of 30; maximum number of allowed Ns per sequence of 4; and low-complexity threshold of 70 (using Entropy). The dN/dS ratio measures selection pressures by calculating whether the number of non-synonymous substitutions (dN) in a gene is greater or fewer than the number expected by chance compared to the number of synonymous substitutions (dS). A majority-rule consensus was calculated from the mapped reads; the number of possible synonymous or nonsynonymous substitutions was then tallied and compared to the number of actual synonymous and nonsynonymous substitutions.
We mapped reads from both the viral and cellular metagenomes to the vent isolate genomes using CLC Genomics Workbench with the criteria of 80% identity and 80% coverage, using previously established benchmarks [42]. Mapping results were exported in ACE format; dN/dS was calculated for each gene using the Python scripts described in Tai et al. [42]. Polymorphisms were only tallied for positions with a mapping depth of at least 5X; only genes with at least 100 nucleotides at 5X depth were included in the analysis. Redundant genes were deleted from the analysis; only the dN/dS value for the gene with higher coverage was retained. The files used to define gene coordinates were downloaded from JGI IMG, with the exception of T. kodakarensis, which was derived from a.gff file obtained from NCBI. The 95% confidence interval was calculated for all genes mapped by the viral and cellular metagenomes with dN/dS less than 1 (subject to purifying selection) using alpha = 0.05.

General features of the metagenomes
Metagenomic sequencing of the cellular and viral fractions from the large sample of hydrothermal fluid yielded a total of 808,051 and 231,246 sequence reads, respectively. The cellular metagenome contained reads from a wide variety of bacterial and archaeal taxa, including Thermococcales, methanogens, Marine Groups I and II, Proteobacteria, Bacteroidetes, Firmicutes, and many others, reflecting the wide range of ecological niches represented in the sample. Figure 1 indicates that approximately 31% of the cellular metagenome and 47% of the viral metagenome (virome) sequences had no matches to the M5NR database (classified here as ''Unknown'') [43]. An analysis of the viral metagenome alone, including the taxa and diversity of viruses and their general host range, has been published previously [3]. Close matches between CRISPR spacers identified in the cellular metagenome and sequences in the viral metagenome suggest an active and relatively recent relationship between the two gene pools (see File S1). Classification of the cellular and viral metagenomes showed that only 2% of the reads from the cellular fraction matched archaeal genes, whereas nearly 8% of the viral metagenome reads matched archaeal genes ( Figure 1). Nucleotide signature matching of assembled contigs longer than 1000 bp using PhylopythiaS [33] indicated that a disproportionate percentage of contigs in both metagenomes matched nucleotide compositional patterns of archaeal viruses, given that bacterial reads dominate both metagenomes ( Figure S1). The percentage of contigs matching bacterial nucleotide compositional patterns was greater than the percentage of contigs with archaeal patterns by a ratio of 2.8 to 1 in the cellular metagenome. However, contigs matching bacterial virus patterns outnumbered contigs with archaeal virus patterns by only 1.8 to 1 in the viral metagenome. Taken together, these observations suggest that archaeal viruses may be disproportionally abundant in the vent habitat compared to the relative abundance of bacteria to archaea.
Assembly of viral metagenome contigs yielded many short, high-coverage (.20X coverage) contigs shorter than 6 kb ( Figure  S2). The cellular metagenome did not exhibit contigs with such high coverage and short length. In contrast, several contigs in the viral metagenome had relatively low coverage but were quite long.
We annotated reads based on the best matches to the SEED database, and determined the taxonomy of each contig based on the annotation of the majority of the contained reads [3]. Many of the long, low-coverage contigs were annotated as bacterial or archaeal, whereas many shorter, high-coverage contigs were annotated as unknown ( Figure S2).
We considered the long, low-coverage contigs with cellular annotation to be more likely to be derived from cellular contamination, whereas the high-coverage, shorter contigs with unknown annotation are more likely to be derived from viruses. These high-coverage contigs may include certain viral genomes that occurred at high frequency.
While we attempted to eliminate cellular contamination through size fractionation and chloroform and DNAse treatment prior to sequencing, and 16S PCR tests of extracted viral DNA returned no amplification of cellular material, we sought to further reduce cellular contamination in silico. We created a ''viral subset'' of the viral metagenome consisting of reads assembled into contigs with a coverage of 8 or greater, or assembled into contigs annotated as viral or unknown. The aim of producing a subset was to remove low-coverage contigs more likely to be derived from cellular contamination. We chose a coverage cutoff of 8 because most long contigs annotated as bacterial or archaeal had a coverage of approximately 8 or lower ( Figure S2). The goal of this subset was not to generate a ''pure'' viral metagenome but rather to reduce the number of reads that may represent cellular contamination, which is frequently an issue in viral metagenomics [44,45]. While the high percentage of unknown contigs in viral metagenomes makes any attempts at removing contamination difficult, this subset eliminates reads from low-coverage contigs with bacterial or archaeal annotation that are more likely to be derived from cellular contamination. Except in specific cases noted below, the analyses presented here or in the supplementary figures include both the original virome as well as the ''viral subset'' so that both may be compared.

Assessing the potential for horizontal gene transfer
To assess the degree to which cells and viruses in hydrothermal ecosystems are capable of horizontal gene transfer or integration of proviruses, we determined the relative abundances of provirus genes (Table 1) and genes related to DNA transfer or mobilization (Table 2) in the hydrothermal vent cellular and viral metagenomes. We used the ''Prophage'' dataset in the ACLAME database [34] to identify provirus-related proteins in 22 pyrosequenced cellular metagenomes. These metagenomes were chosen to represent a range of aquatic and terrestrial environments, while controlling for sequencing method. Table 1 indicates that relative to the 22 cellular metagenomes, the hydrothermal vent cellular metagenome contained a high percentage of reads (approximately 4%) that match provirus-coding regions. This result provides compelling molecular evidence of abundant proviruses in cellular genomes in vents and complements descriptions of high proportions of lysogenic cells in vents and the deep ocean based upon mitomycin C induction experiments [28,29]. We also analyzed the relative abundance of mobile genetic elements in 40 viral and cellular metagenomes, also selected to represent a range of environments, and controlled for sequencing method. We found a relative enrichment of mobile genetic elements in both the viral and cellular gene pools at Hulk hydrothermal vent (Table 2). Thus, not only do the genomes of archaea and bacteria in vents encode a higher abundance of mobile elements, on average, than genomes native to other habitats, but vent viruses encode a high abundance of mobile elements as well. These data suggest that selection has favored enrichment of mobile elements in these  Table 1. Percent of reads in cellular metagenomes matching a protein in the ''Prophage'' grouping of the ACLAME database. cellular and viral genomes, potentially leading to increased rates of horizontal gene flow among cells and viruses.

Evidence of genomic plasticity in vent genomes
Having found evidence for gene transfer and viral integration in the genomes of vent cells and viruses, we sought evidence for either lysogenic virus integration or gene transfer events in other hydrothermal vent isolates. Of the 34 available fully sequenced vent genomes, 20 (59%) contain integrated viruses (Table S4). Most of these viruses encode capsid genes or genes such as DNA ligases, which have been identified before as being particularly abundant in the viral fraction of this hydrothermal vent sample [46]. However, identification of auxiliary metabolic genes (AMGs) in these viral genomes is difficult, partly due to the high abundance of unknown genes and partly because the boundaries between the viral and cellular genome are not always clearly delineated.
To better identify regions that have been transferred in vent genomes, we compared genomes from bacterial or archaeal isolates with sequences sampled directly from the environment. This strategy can identify potential hypervariable regions, or ''genomic islands,'' that display lower coverage than the rest of the genome [47][48][49]. Previous work with Haloquadratum walsbyi DSM 16790 [48] and Prochlorococcus genomes [47] used this technique to identify genomic islands that most likely represented regions of virally-mediated lateral gene transfer; in the case of Prochlorococcus, these genes are differentially expressed under light and nutrient stress [47].
We performed fragment recruitment of the hydrothermal vent cellular metagenome against the genomes of all isolates of hydrothermal vent bacteria and archaea available in the NCBI database. In most cases, the metagenome did not recruit to the isolate genomes with high enough coverage to yield useful data. Nautilia profundicola AmH successfully recruited reads at high coverage, but no metagenomic islands were found. Recruitment of the cellular metagenome to the longest contig (ABCJ01000001) of the draft genome of Caminibacter mediatlanticus TB-2, a chemolithotrophic, nitrate-ammonifying Epsilonproteobacterium that was isolated from the walls of a hydrothermal vent chimney on the Mid-Atlantic Ridge [50], yielded a number of regions with relatively low coverage (Figure 2). The first, a region with relatively low coverage between 160000 and 200000 bp, contains a series of CRISPR loci. CRISPR regions are dedicated to viral and plasmid immunity by effectively creating a library of previous infection, and therefore are highly specific to a given environment. Therefore recruitment to CRISPR loci should naturally yield lower coverage, particularly for a metagenome sampled in a different geographic location than this isolate. Aside from the CRISPR region, recruitment yielded two distinct genomic islands: one region of approximately 40 kbp, followed by a second lowcoverage region of approximately 15 kbp (Figure 2), separated from each other by a 20 kbp region that includes a ribosome. The first genomic island coincides with a region with relatively high GC content, which suggests this region was transferred into the C. mediatlanticus genome. A phage integrase gene is located approximately 58 kbp downstream of the 39 end of the first genomic island, though its presence there is not necessarily conclusive evidence that the region was introduced by a virus. The first genomic island begins near a tRNA gene, a common site for integration of horizontally transferred regions [51]. Many of the genes in both the first and second genomic islands encode proteins that interact with the environment, including sugar and nitrate membrane transporters, and proteins related to energy metabolism, including hydrogenases (Table S5). Possession of a diverse suite of hydrogenases can enhance metabolic flexibility in variable redox conditions, and has been observed in other Epsilonproteobacteria isolated from hydrothermal vents [52].
The presence of this hypervariable region indicates that genomic islands from genomes in vent environments encode genes related to environmental interactions and energy metabolism. The genomic island shown here is unique to a vent isolate from the Mid-Atlantic Ridge. The C. mediatlanticus strains present in our sample on the Juan de Fuca Ridge most likely have genomic islands of their own, though these cannot be identified without a fully sequenced strain from the Juan de Fuca Ridge. None of the sequenced strains from the Juan de Fuca Ridge had high enough coverage with our metagenome to identify genomic islands. However, the genomic island on C. mediatlanticus provides evidence of horizontal transfer of genes that facilitate metabolic flexibility in an environment that is very similar to the Juan de Fuca Ridge. From this we can hypothesize that genes related to environmental interactions and metabolic activity are transferred in our sample site.

Quantification of relative gene enrichment in the viral fraction
We next sought to determine whether there were differences in the relative enrichment of gene types in the cellular and viral fractions. In order to account for biases or omissions in various databases, we annotated reads in both the viral and cellular metagenomes according to three functional databases: the SEED Subsystems database [38], the Clusters of Orthologous Groups (COG) database [37], and the KEGG Orthology (KO) database [36] (Figures 3 and 4). Among all three databases, certain trends appeared. First, in all three cases the results indicated a high similarity in the overall relative abundances between functional genes in the cellular and viral fractions. A similar study of relative abundances of functional groups in viral and cellular metagenomes, conducted by Kristensen et al. [44] based on data collected by Dinsdale et al. [53], found a similar trend. While some of this is most likely due to cellular contamination in the viral fraction, as well as the presence of viruses in the cellular fraction, Kristensen et al. suggest this trend may also be due in part to the choice of available functional categories, which encompass functions that are generally cellular rather than viral. However, despite the likely presence of contamination in these datasets, direct comparison in this way enables us to compare the relative enrichment of certain types of genes in the viral and cellular gene pools. Comparisons between the viral subset and the cellular metagenome are shown in Figures 3 and 4; comparisons between the cellular metagenome, original virome, and viral subset are shown in Figure S3.
There was a statistically significant enrichment of reads matching phage, prophage, and transposable elements in the virome relative to the cellular metagenome (2.3% vs. 1.5%). These annotations were made according to the SEED Subsystems classification ( Figure 3A), and we used Xipe-Totec [39] to assess statistical significance. Most viral gene hits were to distantly related head-tail viruses and cyanophage. The low percentage of matches to viral sequences overall is likely due in part to the relative lack of environmental viral sequences in public databases. We also observed an enrichment of reads matching cofactors, vitamins, prosthetic groups and pigments in the viral fraction ( Figure 3A), which falls in line with studies identifying these types of genes on cyanophage genomes. Genes encoding vitamin B12 [53,54], and the pigments psbA/D [55][56][57][58] and pebS [23] have previously been identified on cyanophage genomes. These virally encoded genes are thought to supplement host metabolism during infection, and similar processes may occur in this vent ecosystem.
We observed a statistically significant enrichment of reads matching amino acids and derivatives in the cellular fraction relative to the viral fraction, according to the COG ( Figure 3B) and KO ( Figure 4A) databases. These include genes related to the biosynthesis and metabolism of various amino acids, and are evidently not commonly carried on viral genomes in vent ecosystems, suggesting that viruses rely on their hosts for amino acid biosynthesis.
We also observed enrichment in reads related to replication, recombination and repair in the viral fraction, as annotated by both the COG and KO databases ( Figure 3B, Figure 4A). These genes probably serve necessary functions in the synthesis and replication of viral DNA during the course of viral infection, and include genes like DNA and RNA polymerase, which are commonly encoded on viral genomes. The ''replication and repair'' category includes DNA ligases, which occur at very high abundances in the virome, as previously reported [46].
Finally, we also observed a statistically significant enrichment of genes related to energy metabolism as annotated by the KO database. Among reads annotated as matching the energy metabolism sub-category, 6.5% from the cellular metagenome and 8.4% from the viral subset were annotated as sulfur metabolism, whereas 8.2% of cellular versus 3.7% of viral metagenome reads annotated as belonging to methane metabolism ( Figure 4B). The enrichment of energy metabolism genes corre-sponds with what we might expect given the observation of auxiliary metabolic genes (AMGs) in viral genomes from the surface oceans. A similar enrichment in energy metabolism genes was found in the viral fraction relative to the cellular fraction in samples from the Indian Ocean [59]. Similarly, photosynthesis genes encoded in cyanophage are known to be expressed during viral infection [20,60,61], and modeling work has indicated that these photosynthesis genes can enhance host fitness [62,63]. For lytic viruses, energy metabolism genes may be expressed as a means to supplement host metabolism or to redirect resources for phage particle synthesis. Our analysis of fragment recruitment ( Figure 2) indicated that genes related to energy metabolism had been successfully transferred between genomes in the past, and viruses are potential vectors for such gene transfer. One potential explanation for this enrichment, in line with the fragment recruitment results, is that highly abundant proviruses in the vent system express these genes while integrated in the host genome. By providing their hosts with new or supplemental means of surviving a challenging, dynamic environment, these proviruses boost host fitness, and in turn, enhance their own fitness.
To compare these results with other cellular-viral metagenome comparisons, we conducted an analysis of 20 cellular metagenomes and 23 viral metagenomes from other environments, all sequenced with 454 pyrosequencing and annotated with the KO database. Each of the cellular metagenomes had at least one viral counterpart sampled from the same environment. Metadata regarding these metagenomes are summarized in Table S6; the functional profiling analysis is depicted in Figure S4. The overall patterns of relative gene abundance between the viral and cellular fractions were similar to those observed in our sample. Specifically, in other environments functional annotations of viral and cellular metagenomic reads were strongly correlated, but viral metagenomes were more enriched in reads annotated as nucleotide metabolism and replication and repair. However, the relative enrichment of genes related to energy metabolism in the viral fraction, while apparent in the vent environment, was not a universal characteristic of viruses in other environments.

Does selection operate differentially on viral and cellular genes?
Integral to the study of viral and cellular evolution is understanding how selection shapes the viral and cellular gene pools, and which genes are subject to stronger or weaker selection. Differing life strategies for cells and viruses, as well as disparate roles for functional genes within each of the respective gene pools, should leave different selective signatures on genes within each of these gene pools. To measure differential selection among viral and cellular genes, we calculated dN/dS ratios of genes encoded by the viral and cellular fractions. The challenge of calculating dN/dS ratios with shotgun metagenomic data is that the short sequences make it difficult to align long blocks of sequences to the same region of a gene. To circumvent this problem, we used the  [42] to calculate dN/dS ratios from metagenomic reads, in which sequencing reads are mapped to the genomes of previously sequenced isolates. We used 80% identity as the threshold for mapping to genomes, a convention established previously [42]. While some work suggests that microbial ''species'' share an average nucleotide identity of 95% across the genome [64], tiling at that percentage did not yield enough hits for statistical analysis. Previous work has also indicated that below 80% identity, the number of reads recruited drops drastically, implying a biological threshold at 80% similarity [42]. These sequences therefore define the ''population.'' However, since the sequences used for analysis may be derived from multiple taxa, and we do not know the specific phylogenetic relationship of these sequences to each other, this method cannot determine which polymorphisms have become ''fixed'' in the population. Instead, this method provides an indicator of diversification within the environmental gene pool.
Both metagenomes were mapped to pre-existing hydrothermal vent isolates as a high-throughput means to align reads to many genes at once. The mapping analyses and calculation of dN/dS ratios are calculated by tiling metagenomic reads to the genomes of Nautilia profundicola AmH, Thermococcus kodakarensis KOD1, Thermococcus onnurineus NA1, Caminibacter mediatlanticus TB-2 (contig ABCJ01000001), and Nitratiruptor sp. SB155-2, which represent abundant strains in the vent environment. We attempted to map the virome to several existing viral genomes from various environments, but none exhibited sufficient depth of coverage to calculate dN/dS ratios, indicating that the genes encoded by viruses from this hydrothermal system are vastly different from those sequenced previously.
Overall, the cellular metagenome mapped to 831 bacterial and 32 archaeal genes, with an average dN/dS of 0.22. This result indicates that genes encoded by cells in the vent environment are subject to purifying selection. The viral metagenome mapped to 85 bacterial and 106 archaeal genes, with an average dN/dS of 0.15, and the viral metagenome subset mapped to 39 bacterial and 25 archaeal genes, with an average dN/dS of 0.13 ( Figure 5). These dN/dS values are significantly lower than the dN/dS of genes matching the cellular metagenome, within a confidence interval of 95%. This pattern was consistent for each of the genomes mapped ( Figure S5). The viral and cellular metagenomes mapped to different genes in each of the strains listed above, and so slightly different sets of genes were used to make this calculation. However, the difference in overall dN/dS is not due solely to differences in the types of genes to which each metagenome mapped: when we examined the dN/dS for only the genes to which both metagenomes mapped, the calculated dN/dS was, on average, lower for the viral fraction compared to the cellular fraction ( Figure S6). There were no clear trends in dN/dS for different gene categories ( Figure S7).
These results indicate that both the viral and cellular fractions in this hydrothermal system are subject to purifying (negative) selection, but the viral gene pool is under stronger purifying selection than the cellular gene pool. One possible explanation for the overall difference in dN/dS between the viral and cellular gene pools is that very little variation in viral genes is permitted. In this scenario, viral genes are under such strong selection that deviations from the consensus protein sequence produce enough of a fitness difference to eliminate the viral mutant. The viral genes included in this analysis mapped to cellular genomes, and therefore are likely to be auxiliary metabolic genes carried on viral genomes, suggesting that AMGs carried by viruses are subject to purifying selection to a greater degree than their cellular counterparts.
This scenario becomes more complicated as a result of the relationship between virus and host. If a virus were primarily lytic, then a selective sweep could act directly on the viral particles, reducing phenotypic variation (and therefore nonsynonymous polymorphisms). If, however, a virus were primarily lysogenic, a larger proportion of time would be spent integrated in the genome of the host. In this situation, the selective sweeps would act on both host and virus, and we might not expect to see a significant difference in the dN/dS ratios between viral and cellular genes. The difference observed here may indicate that in this system, selection acts most strongly on viruses while they are in the process of replicating or when they are in virion form (free in the environment), and can therefore be selected separately from the host.

Conclusions
The dynamic, recirculating conditions of sulfide-hosted hydrothermal vent systems, combined with the vast diversity of archaea, bacteria, and their accompanying viruses, create an ecosystem with high potential for widespread sharing of the communal gene pool. The unique sample analyzed here, which most likely pulled fluid both from high-temperature subsurface hydrothermal fluid and from cold background seawater, included microorganisms from a wide range of ecological habitats within the vent system. Having been sampled from the same point source, this implies that these diverse microorganisms have the potential to come into close contact in these vent systems. In doing so, there is potential for these microorganisms to exchange genes as well as viruses. Mobile elements were abundant in both the viral and cellular metagenomes we obtained, likely reflecting the abundance of lysogenic viruses, which require integrases for viral genome insertion into the host, as well as high potential for horizontal gene transfer. In the genomes of vent inhabitants, we found evidence for horizontal transfer of genes for environmental interaction and energy metabolism, suggesting selection for enhanced phenotypic plasticity. Moreover, a slight enrichment of genes related to processes such as cofactor synthesis and energy metabolism suggests that viruses in this hydrothermal system carry auxiliary metabolic genes.
Any genes found in the viral gene pool must have some utility in order to be retained on small viral genomes. The genes encoded on viral genomes may be used by lytic viruses as a means to facilitate the manufacture of viral particles, as has been observed in cyanophage previously. However, the prevalence of lysogenic viruses in vent habitats and the selective signatures observed here suggest that these vent viruses are selected to spend much of their time as integrated proviruses rather than as free virions. In this case, auxiliary metabolism genes may be expressed by integrated proviruses to benefit the host. Selection should favor traits by which proviruses boost host fitness while the fitness of the virus and host are intertwined. In turn, host cells benefiting from provirusencoded genes may gain an adaptive advantage through enhanced metabolic flexibility, which may favor selection for cells harboring proviruses. This advantage complicates the symbiotic relationship between virus and host. While still capable of wreaking destruction upon the cells they infect, the data described here suggest a viral evolutionary strategy in which the virus-host relationship transcends from a parasitic relationship into a mutualistic one, as both host and virus seek to survive the dynamic, extreme environment in which they coexist.