Diversity in a Polymicrobial Community Revealed by Analysis of Viromes, Endolysins and CRISPR Spacers

The polymicrobial biofilm communities in Mushroom and Octopus Spring in Yellowstone National Park (YNP) are well characterized, yet little is known about the phage populations. Dominant species, Synechococcus sp. JA-2-3B'a(2–13), Synechococcus sp. JA-3-3Ab, Chloroflexus sp. Y-400-fl, and Roseiflexus sp. RS-1, contain multiple CRISPR-Cas arrays, suggesting complex interactions with phage predators. To analyze phage populations from Octopus Spring biofilms, we sequenced a viral enriched fraction. To assemble and analyze phage metagenomic data, we developed a custom module, VIRITAS, implemented within the MetAMOS framework. This module bins contigs into groups based on tetranucleotide frequencies and CRISPR spacer-protospacer matching and ORF calling. Using this pipeline we were able to assemble phage sequences into contigs and bin them into three clusters that corroborated with their potential host range. The virome contained 52,348 predicted ORFs; some were clearly phage-like; 9319 ORFs had a recognizable Pfam domain while the rest were hypothetical. Of the recognized domains with CRISPR spacer matches, was the phage endolysin used by lytic phage to disrupt cells. Analysis of the endolysins present in the thermophilic cyanophage contigs revealed a subset of characterized endolysins as well as a Glyco_hydro_108 (PF05838) domain not previously associated with sequenced cyanophages. A search for CRISPR spacer matches to all identified phage endolysins demonstrated that a majority of endolysin domains were targets. This strategy provides a general way to link host and phage as endolysins are known to be widely distributed in bacteriophage. Endolysins can also provide information about host cell wall composition and have the additional potential to be used as targets for novel therapeutics.

In an additional dimension of complexity, microbial biofilms also harbor phage populations that, in turn, have a significant impact on the entire community structure: exerting significant evolutionary selection, influencing metabolic capabilities and influencing overall growth and diversity [21][22][23]. Pioneering work by several groups [24][25][26][27] using high-throughput metagenomic DNA sequencing, metatranscriptomics and novel bioinformatics, have provided a tantalizing glimpse of extensive phage diversity from several natural environments.
Our knowledge of microbial communities in the alkaline siliceous hot springs of YNP is quite extensive at the biogeochemical, physiological, and more recently at the genomic/metagenomic level [28][29][30][31]. In contrast, information about the phage populations and their impact on microbial communities is much more limited [32,33]. Although this community is well suited to probe the dynamics of co-evolution of phage and microbial populations, availability of appropriate data has been lacking. Thus, our first objective was to build a database of phage DNA sequences (a virome) from photosynthetic microbial mats in YNP.
The important role of phage in the microbial mats of YNP is highlighted by the presence of the CRISPR-Cas (Clustered Regular Interspaced Short Palindromic Repeats, CRISPR ASsociated) adaptive immunity system in all three dominant phototrophs; Synechococcus sp. JA-2-3B'a (2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)  . While CRISPR-mediated adaptive immunity system is only one of many strategies used by cells to avoid phage attack [34] it is specific in linking host and phage relationships [32,35]. As new spacers are acquired into host CRISPR arrays at a certain rate and in a particular orientation, they are useful markers for analysis of host and co-existing phage populations [36,37]. Selection pressure is placed on the phage, to evade the host CRISPR defense system which relies on close nucleotide matching between acquired spacers and incoming phage sequence, and yet retain functionality [32,38].
A critical question to ask is if specific viral genes are preferentially targeted by the CRISPR--Cas system. Only a few studies have focused on CRISPR dynamics and viral targets in environmental settings [32,[39][40][41]. Comparative analysis of CRISPR spacers in cyanobacteria using metagenomes derived from microbial mats in Octopus and Mushroom Spring and viromes derived from the source water of these hot springs suggested that spacers were being actively acquired by the host cyanobacteria, and could be used as a marker for host-phage interactions over short time intervals [28,32]. Initial environmental surveys also suggested that endolysins might play an important role in the YNP microbial mat communities [32].
We generated a virome from the top photosynthetic microbial mat layer of Octopus Spring using the 454 Titanium sequencing platform. Accurate assembly of phage sequence is challenging so we developed a custom strategy to utilize the assembled contigs and analyze host-viral co-evolution. A three-tier module, called VIRITAS, was developed to analyze phage metagenomic sequences. This module has been integrated into MetAMOS as a separate workflow (-W viritas) [42]. Using this pipeline, we assembled phage contigs, and binned related contigs by tetranucleotide analysis and CRISPR spacer matching. CRISPR spacer matching to cyanobacteria highlighted an endolysin domain: Glyco_hydro_108 (PF05838), prompting a characterization of the endolysin domains in OS-V-09 and sequenced cyanophages. We found that OS-V-09 contained only a subset of the annotated endolysins found in fully sequenced cyanophages. Led by these findings, we expanded our search and found that phage endolysins are a frequent CRISPR target. This allows for a general strategy to link unknown host and phage. This combination of widespread phage distribution and CRISPR spacer targeting suggest endolysins may be useful marker genes. Phage endolysins can be host species specific; they provide information about the host cell wall composition and can be harnessed as a useful tool for cell lysis, and have the potential to be used as candidates for novel therapeutics.

Results and Discussion
Generation of the OS-V-09 virome by 454 sequencing A virome (hereafter referred to as OS-V-09: OctopusSpring-Virome-2009) was generated from a phage-enriched fraction of a microbial mat core sample taken from a 60°C region in Octopus Spring, Yellowstone National Park. DNA was extracted and whole genome amplification (WGA; also termed Multiple Displacement Reaction (MDA) was required to ensure sufficient sample for sequencing (Fig 1). Prior to sequencing, putative phage primers designed from sequence generated by Schoenfeld et al. 2008 [33] herein named OS-V-03 and BP-V-03 (S1 Table), indicated phage DNA was present. The extent of bacterial DNA present in the virome was judged to be low based on the faint 16S rDNA signal that was found using general V1-V3 16S rDNA primers [43] in contrast to the robust 16S rDNA signal in whole mat DNA extractions (S1 Fig). A DNA sequence dataset of 180,141,543bp consisting of 501,240 reads was generated. Read distribution had a mean length of 359bp (longest read was 1385bp) and run statistics met or exceeded all quality control checks ( Table 1).

Classification and identification of OS-V-09 reads
Reads generated from OS-V-09 were classified prior to assembly with VIRITAS via the Fragment Classification Program (FCP) [44]. Archaeal and Bacterial reads comprised 0.4% (1864 reads) and 23% (116344 reads) respectively, while the remaining reads 76.4% (383032 reads) had little to no homology to known bacterial or archaeal sequence ( Table 1). Archaeal reads were predominantly Crenarcheota and Euryarchaeota, which are known to be ubiquitous in many environments, including hot springs [45,46]. The most numerous identifiable reads belonged to the bacterial phylum, Chloroflexi [18% (61,470 reads)] which are abundant in the mat and are more easily lysed than cyanobacteria [47][48][49]. Of the 501,240 reads from the virome, only 52 reads (0.01%) contained partial 16S rDNA sequences based on HMMER 3.0 predictions [50] Identified reads which spanned the 16S were aligned to known species (S2 Fig). This is comparable to the 24 reads (0.07%) found in OS-V-03 and BP-V-03 viromes which were treated with 10U benzonase endonuclease for 30mins [33]. The presence of such a low percentage of contaminating 16S rDNA sequences in the dataset provided further confirmation that OS-V-09 represented a dataset depleted for bacterial sequences and enriched for phage sequences.

Assembly of phage reads with the VIRITAS pipeline
In an attempt to mitigate the challenges of de novo viral assembly [51,52] and MDA bias typical of Phi29 polymerase activity [53] while producing high quality contigs, we employed the SPADes assembler within the VIRITAS pipeline to assemble reads [54]. We were able to recruit 99.8% of the reads (500,128 of a total of 501,240 reads) in the final assembly ( Table 2). A total of 19,837 contigs were assembled, with an N50 value of 605bp ( Table 2). As expected, we observed an uneven read coverage of assembled contigs such that some regions were over-represented (345x coverage) or under-represented (25x coverage) (Fig 2A) which is typical of the activity of Phi29 polymerase [53].
Assembled contigs were run through MG-RAST [55] to generate a rarefaction curve. For comparison, metagenomic reads from Octopus and Mushroom Spring (OS-M-03 and MS-M-04), phage metagenome reads from OS-V-03 and BP-V-04, and the unassembled viral reads for OS-V-09 were also plotted (Fig 2B). Although OS-V-09 reads start to reach saturation, we clearly observe that the assembled OS-V-09 contigs are still in the exponential phase of the curve. This reflects what we would expect with Phi29 polymerase amplified sequences; the high coverage bias produces an artifact which suggests that saturation has been reached, as individual reads may oversample the same sequence. Binning of phage contigs by tetranucleotide analysis, and CRISPR spacer matching followed by visualization using Emergent Self Organizing Maps (ESOMs) To bin contigs, which did not assemble into a consensus sequence, yet may have come from related phage, we used tetranucleotide frequency analysis (TNF) which has been successful in binning sequences from isolated genomes, metagenomes, as well as prophages [22,39,56]. In TNF analysis many data points can be collected and are less likely to be affected by overall genome GC content, or nucleotide biases [57]. Only contigs greater than 1Kb in length, were clustered via TNF scripts [57], as the accuracy with which sequences are correctly assigned is correlated with contig length [58]. The frequency of the 256 tetramers (136 non-redundant) was calculated for viral reads as well as several well-characterized microbial mat members:  clusters; with some viral reads containing a fingerprint very similar to known bacterial genomes, while other contigs had a very unique pattern not associated with a known genome. To more clearly visualize bins, calculated TNF was input into the ESOM-Mapping tool [59] (Fig 3A). The five genomes (Synechococcus sp. JA-2-3B'a(2-13), Synechococcus sp. JA-3-3Ab, Meiothermus silvanus, Chloroflexus sp. Y-400-fl, and Roseiflexus sp. RS-1 could be clearly separated into distinct clusters, as expected. In the ESOM map, we could also clearly visualize the A) The tetranucleotide signature for viral contigs greater than 1Kb (navy), as well as 5K fragments from five genomes from fully sequenced mat species Synechococcus sp. JA-2-3B'a(2-13) (light pink), Synechococcus sp. JA-3-3Ab (salmon pink), Meiothermus silvanus (light grey), Chloroflexus sp. Y-400-fl (mint green) and Roseiflexus sp. RS-1 (yellow), was calculated via scripts from Dick et al [59]. Viral contigs clustered into three major groups (Cluster 1-3). B) Viral contigs with at least one CRISPR spacer hit re-coloured to reflect their host as shown in part a. Legend represents tetranucleotide frequency distances from valleys (blue) to peaks (white).
doi:10.1371/journal.pone.0160574.g003 viral contigs. The 2052 viral contigs (above 1Kb) fell into three main clusters: Cluster 1 included 171 viral contigs that were associated with the two Synechococcus genomes, Cluster 2 included 1175 contigs intermixed with the Roseiflexus RS-1 genome, and Cluster 3 included 706 viral contigs that were not closely associated with any host genome. In contrast, only a few viral contigs were associated with the Meiothermus silvanus or Chloroflexus sp. Y-400-fl genome clusters.
Next, to determine putative host-phage pairs, CRISPR spacer-protospacer matching information from dominant species present in the mat was overlaid on the ESOM maps. First, all CRISPR spacers from CRISPRdb in addition to spacers manually extracted via CRISPRfinder from relevant environmental datasets (S2 Table) were blasted against assembled viral contigs. We were able to identify the host for the three distinct clusters, labeled Cluster 1-3 (Fig 3A, S3 Table). Cluster 1 had 13 contigs which contained matches to Cyanobacterial CRISPR spacers. Cluster 2 had 116 contigs with CRISPR spacer hits to Roseiflexus spacers. Cluster 3 had only one CRISPR spacer match to a Cyanobacterial spacer. To visualize this subset of contigs with CRISPR spacer matches more clearly, viral contigs with at least one spacer hit are shown in Fig  3B, coloured to match their putative host. By using tetranucleotide binning in parallel with CRISPR spacer matching and ESOM visualization, we consolidated the dataset, grouping sequences which were not assembled, but which retained similar signatures, and identified the predicted hosts.

Identification of predicted ORFs and those containing CRISPR spacer hits in assembled contigs
ORFs were predicted in the assembled contigs via ORFfinder [60] with a minimum size of 300bp, and run through InterProScan [61] to detect identifiable domains (S4 Table). A total of 52,348 ORFs were identified (getorf -minsize 300), of which 9319 (i.e. 17.8%) contained domains identifiable by Pfam. As expected, a majority of predicted ORFs did not have any recognizable domain, which is a common feature of viral datasets. However, some ORFs were predicted to be of phage origin based on their annotation; including phage portal proteins, terminases, VirE and integrases, as well as host genes frequently observed to be carried by phage: methyltransferases, the most common gene observed in environmental phage enrichments [62] and PAPS_reductase (thioredoxin), an essential enzyme in prokaryotic sulfur assimilation pathways known to be carried in sequenced cyanophages [63]. To visualize ORF distributions across clusters, we generated a heat map based on all identified Pfam annotations to look for broad-scale similarities and differences (Fig 4A). Pfams with no hits are shown in light grey, while Pfams with 1 representative are shown in medium grey and those with 2 or more are shown in dark grey. We observe that individual bins share common Pfams (such as phage integrases and methyltransferases) while other domains are unique per cluster.
Identifiable ORFs containing CRISPR spacer matches were broken down by cluster ( Fig  4B). Cluster  Most CRISPR spacers mapped to hypotheticals or proteins of unknown function. A notable exception was the endolysin Glyco_hydro_108 (PF05838) and the closely associated PG_3 binding domain (Fig 5A). We focused on this domain for two reasons. First, it served as a test case to determine genomic diversity in the phage population, since each read represents the genome within an individual viral particle. Second, it allowed us to explore the potential for using it as a phage marker gene and identifying strategies for the identification of additional useful phage marker genes, as only a limited number of these have been established [64].

Identification of Glyco_hydro_108 domains
We identified 47 full length open reading frames that included the Glyco_hydro_108 (PF05838) and PG_3 (PF09374) domains from reads in several relevant datasets. These included previous YNP microbial mat metagenomes from Octopus and Mushroom Springs, a  , and OS-V-03 (indicated by the prefix "OCTOPUS_READ") in addition to outgroup Bordetella_phage_BPP-1 were identified via HMMsearch, aligned with MUSCLE, and the gene tree visualized using the MABL server (phylogeny.fr). Overlaid on the protein tree are significant nucleotide hits (greater than 70% ID over 85% length) to cyanobacterial CRISPR spacer CRISPR_II_YMBCR81TF-SP-2 (previously shown to target a Glyco_hydro_108 domain [Heidelberg, 2009] 93°C virome from Octopus and Bear Paw Springs, and the 60°C virome we generated from Octopus Spring (S1 Table). Sequences were aligned via Muscle in Jalview [65] (S4 Fig) and the phylogenetic relationship visualized using the MABL server [66]. Significant CRISPR spacer hits to cyanobacteria spacer YMBCR81TF_sp_2 were represented on the tree as coloured text to represent varying degrees of nucleotide identity (Fig 5B). This allowed us to make the following observations. First, within these datasets, we observe a range of spacer hit identities between 70-95%. We also observed high identity CRISPR spacer hits in data collected in 2003 as well as in 2009. The same sequences are present over a 6-year span that may indicate rapid turn-over rates, or reflect phage sequence persistence. Second, high percentage identity CRISPR spacer hits are found in several tree branches, and are not strongly correlated with either protein relatedness, sample location, or year the sample was taken. Third, we found recognizable Glyco_hydro_108 domain variants present in the dataset, yet not all contain a cyanobacterial CRISPR spacer hit. This could be because we have not reached saturation in cyanobacterial CRISPR spacer sequence databases or that these are endolysins present in phages that target other host species.

Endolysin Distribution in OS-V-09 Assembled Contigs
The additional four contigs containing Glyco_hydro_108 domains did not have CRISPR spacer hits to any known species, and could not be assigned a putative host (S4 Table). The presence of these untargeted endolysins might indicate that we have do not have adequate spacer coverage in hosts. Phage fecundity highly depends on successful host lysis, thus endolysins may be preferred targets of the CRISPR system as they are also under strong evolutionary pressure [67], making them an efficient target.

Phage endolysin domains present in OS-V-09 as compared to annotated cyanophages
To determine if endolysins can potentially be used as a phage marker gene, similar to the use of 16S rDNA to identify bacterial phyla, we identified phage endolysin domains across all sequenced cyanophage retrieved from JGI DOE IMG (img.jgi.doe.gov, last update June 2015, 3899 phages, 68 cyanophages). Endolysins are typically composed of two domains; a catalytic domain followed by a binding domain. These domains are modular, and can be found in multiple combinations [68]. For OS-V-09 contigs, OS-M-03/OS-M-04 reads, and metagenome reads, open reading frames greater than 300bp were identified by getorf, part of the EMBOSS software package [69], and searched with Markov Models via HMMsearch (S5 Table). There are 14 endolysin domains in cyanophage (S6 Table). We observed that only a subset of four catalytic endolysin domains (PF00182, PF05838, PF01464 and PF01551) overlapped between OS-V-09 and annotated cyanophages (Fig 6). Of note, the Glyco_hydro_108 (PF05838) domain was not found in previously sequenced cyanophages. This might suggest that endolysins are predictive of a particular phage-host lifestyle or environment, and could be useful as a diagnostic for host-phage relationships.

CRISPR targeting of phage endolysins is not exclusive to cyanobacteria
To determine if CRISPR spacer targeting of phage endolysins is a general phage strategy, or specific to cyanobacteria, a BLASTn was run with all known spacers in CRISPRdb against all annotated phage endolysin domains (as identified Olivieria et al 2013 [68]) in IMG. Significant hits (90%ID, evalue = e-5) were mapped to HMM logos [70] (Table 3). We observed that most phage endolysin domains contained CRISPR spacer domains, and that in some cases they are heavily targeted while a few domains have none (although this could be due to underrepresentation of CRISPR spacers). For comparison, we also analyzed a few other phage genes. VirE and a phage portal gene also contained some CRISPR spacer hits, although NinC a non-structural gene had no CRISPR spacer hits. As CRISPR spacer databases and phage Diversity in a Polymicrobial Community Revealed by Analysis of Viromes, Endolysins and CRISPRs databases get bigger, it will be possible to extend such studies to examine if there are preferred targets of the CRISPR spacer immunity system.

Co-evolution of Host and Phages in YNP communities
The alkaline siliceous hot springs of YNP have been extensively analyzed at the biogeochemical, physiological, and at the genomic/metagenomic level and provide a good model system in which to examine host-phage co-evolution dynamics [28][29][30][31]. The comparative genomic analysis of two Synechococcus species isolated from different temperature regions of Octopus Springs in YNP revealed that both contained CRISPR-Cas systems [32]. Two distinct CRISPR types, distinguished by their repeat sequences, were common to both genomes, although the spacers were unique in each genome. The genome of Synechococcus OS-A contained an additional third CRISPR type that appeared to be shared with other microorganisms that inhabit the mat and may have undergone horizontal gene transfer [32]. Comparative analysis of CRISPR spacers in cyanobacteria using metagenomes derived from microbial mats and viromes derived from the source water of these hot springs suggested that spacers were being actively acquired by the host cyanobacteria, and could be used as a marker for host-phage interactions over short time intervals [28,32]. In particular, a few host spacers matched regions of a putative viral lysozyme/endolysin suggesting that host spacer matches to viral sequences could be a powerful way to characterize putative viral-host relationships as well as gain insight into the strategies used by host to avoid phage attack and conversely to explore how phages evolve to evade host defenses. The microbial mat community is well suited to probe the dynamics of co-evolution of phage and microbial populations, but availability of appropriate data has been lacking [32,33]. Thus, our first objective was to build a database of phage DNA sequences (a virome) from the photosynthetic microbial mats in YNP.
We observed that CRISPR spacers matches were highly conserved even across a span of 6 years. Spacers curated from metagenomic sequence collected in 2003 contained hits to a 2009 virome with high conservation (Fig 5). Second, high percentage identity CRISPR spacer hits are found in several tree branches, and are not strongly correlated with either protein relatedness, sample location, or year the sample was taken. This could be explained by either very high or moderate CRISPR turn-over rates. To quantify these rates in a natural environment, our results suggest a time series with both monthly and yearly times scales would be most informative.

Technical challenges in phage genome assembly
De novo assembly of metagenomic sequences, in particular, phage-derived sequences, is a challenging computational task. In spite of recent technological advances, such as preassembly read-filtering by digital normalization and partitioning [71], and use of a variety of sequencing platforms to minimize the shortcomings of any one technique [72], reconstructing an entire genome, from a metagenome or virome sequence database remains an open problem [73]. In this environmental biofilm many genomes of highly similar strains are present and evidence suggests recombination is occurring at a high rate [28,74]. Such high strain level diversity can cause assemblers to fail or result in hybrid assemblies combining variations found in several similar species or strains. In contrast, highly conservative assemblers will break the assembly at regions of variation, which results in highly fragmented, non-cohesive assemblies. A further complication results from amplification artifacts introduced by Phi29 polymerase during MDA, yielding uneven coverage, breaking multiple assembly heuristics for resolving repeat structure, or resulting in chimeric reads [51,52].
By using SPADes, an assembler specifically tuned to address MDA artifacts, we were able to create robust viral assemblies. Binning of the contigs via tetranucleotide analysis and visualization by ESOM enabled us to group sequences and also allowed us to characterize the phage types present within the dataset. We were encouraged by the fact that clustering of viral contigs to host was robust and was corroborated by CRISPR spacer matching. This strategy was independent of the system used, and thus represents a general pipeline for viral sequence analysis. Furthermore, CRISPR spacer matches provided insight not only as to the host, but also into which ORFs were targeted. As our analysis was built upon datasets generated over the course of several years, we were able to observe CRISPR spacer turn-over. Although we only had a few "snap-shots" of sequence, a deeper targeted time-course dataset would allow us to differentiate between rapid or moderate turn-over in a natural environment. Further expansion of the cyanobacterial CRISPR spacer database will be required to get more insight into spacer acquisition dynamics.

Identification of phage proteins
Phage proteins are notoriously difficult to identify, often with no known homologues in sequence databases; the so-called viral "dark matter" [75][76][77]. Consequently, most phage genes are still annotated as "hypothetical" or of "unknown function" [57,78]. In contrast to bacteria, no universally conserved gene (such as the 16S rDNA) exists in phage, hindering attempts to identify phage genomes or survey abundance in natural environments [79]. In cyanophage, structural markers such as capsid, portal, or tail sheath proteins have been used to determine viral abundance across time and sampling locations, while ribonucleotide reductases have been recently posited as a phage marker candidate with a broad host range [80]. Tracking dynamics of these genes allows for inference of the viral impact on host and the frequency at which host cells are infected [81,82]. However, using a single marker gene approach, has several drawbacks: sequences that are divergent, or have undergone inter or intra-genic recombination may not be identified, even with degenerate primers; PCR amplification may introduce biases, so that the amplified genes are not representative of the natural population distribution; rare phage sequences may not be amplified at all. One means of mediating these shortcomings is the use of a "panel" of phage gene markers [83].

Endolysins as useful phage marker genes
Endolysins make an attractive candidate to add to the panel of phage marker genes. Endolysins are highly specialized, exquisitely timed hydrolytic components involved in successful release of phage particles from infected cells, and have been recently characterized for all doublestranded sequenced bacteriophage [67,68]. Endolysin classification is dependent on their mode of action, with four types discovered to date: lysozymes and transglycosylases cleave glycosidic bonds between amino sugars in the cell wall, while amidases and endopeptidases cleave crosslinking oligopeptide bonds [68].
Endolysins contain regions of high conservation, as well as variable regions [68], not unlike the golden standard of 16S rDNA used for bacteria. Sequences containing regions of conservation allow for robust assemblies, even of highly diverged variants, while the variable regions allow for fine scale resolution. In addition to yielding information about the phage, endolysins also simultaneously reveal information about their host specifically about the cell wall composition.
We show that endolysins are frequently targeted by spacers. Experiments under laboratory conditions have shown that CRISPR spacers can be enriched for specific gene targets, in particular, an endolysin domain was found to be over-represented as a spacer target [37]. In this study we show that in a natural community, phage endolysins were targeted by the host and this may represent a general strategy in host-phage interactions ( Table 3) although further analysis will be required to establish that this is a common mechanism.

Practical applications of Endolysins
The role that endolysins play in creating and maintaining a biofilm is not straightforward. Lysins can be key factors in helping to prime biofilm formation, by producing an initial extracellular DNA scaffolding, which is later predominantly replaced by exopolysaccharides [84]. However, timing is crucial, as lysins can also destroy a more mature biofilm [85]. Such a delicate balance in timing adds a further layer of complication to the host-phage relationship. Many species are pathogenic only in a biofilm state, and endolysins represent a potential novel source of antimicrobials, effective against infections which may be resistant to antibiotics [86]. Some endolysins display broad-host ranges, while others have "near-species specificity" of domains [87]. Species specificity can also be engineered, with inducible lysins specifically targeted to particular species for optimal breakage, or transformation, such as cyanobacterial targeted via a green light inducible T4 phage holin/endolysin [49]. Endolysins can also be targeted to disease causing members of a community, while leaving the remaining consortia intact. This strategy was effective with targeting of Clostridium with a bacteriophage endolysin delivered via probiotic species [88,89]. Lastly, while endolysins can diffuse freely across the cell membrane of gram negative species, "artilysins" i.e. endolysins that have been have also been engineered to target the outer membrane of gram negative species, have also shown great promise as novel antibiotics [90].

Materials and Methods
Generation of Viral DNA Sequence (Virome; OS-V-09) The uppermost 1-2mm green layer was excised from a 2009 microbial mat core sample (8mm diameter) from Octopus Spring (stored at -80°C until use) and re-suspended in 50 mL 10mM Tris, 1mM EDTA by vigorous vortexing. Cells were pelleted at 6000 x g for 10 minutes in a Sorvall GS5C. The supernatant was passed sequentially through 0.45μm and 0.2μm filters (Nalgene, Thermo Scientific) to remove remaining intact cells and any cellular debris. One mL aliquots were centrifuged at 50,000K (Beckman TL-100 Ultra Centrifuge, TLA 100.3) for one hour to concentrate viral particles. Viral DNA was amplified from these enrichments via a Phi29 polymerase (GenomiPhi, GE) in two independent technical replicates (Fig 1). Amplified viral DNA was subjected to a panel of Syn OS-BSand Syn OS-A specific primers and universal bacterial 16S primers for the V1-V3 variable regions [91]. Putative viral primers designed from sequence generated by Schoenfeld et al. 2008 [33] herein named OS-V-03 and BP-V-03 (S1 Table), indicated an enrichment of viral sequences (S1 Fig). To reduce random biases, two technical replicate MDA reactions were pooled and sent for sequencing with 454 Titanium technology at the Genome Sequencing and Analysis Core Resource (http://genome.duke.edu/cores/sequencing/ at Duke University) resulting in a DNA sequence database of 180,141,543bp, consisting of 501,370 reads, with a read distribution median length of 425bp (the longest read was 1385bp, and the shortest was 40bp) and run statistics met or exceeded all quality control checks ( Table 1). This dataset was named OS-V-09.

Identification of CRISPR arrays present in Metagenome Reads and Extraction of Novel CRISPR spacers
To identify CRISPR arrays in previously generated thermophilic microbial mat datasets, MS-M-04, OS-M-03, OS-V-03, BP-V-03, LIBGSS_012136, and LIBGSS_012135, were pipelined through CRISPRfinder and identified repeats were subjected to a BLASTN against the nr database to determine the species from which they originated [95]. Identified spacers were manually inspected to remove any spurious spacer calls, such as repeat-rich sequences, that are not associated with CRISPR loci. A total of 1546 spacers were identified with Synechococcus sp.-like repeats, and a total of 2828 spacers with Roseiflexus sp.-like repeats, and 1455 spacers with Chloroflexus sp.-like repeats from these datasets. (S2 Table).

Assembly of viral reads with SPADes
OS-V-09 reads were fragmented in silico into 100bp fragments (from both the left and right) in preparation for input into SPADes3.7.1. Any "reads" smaller than 100bp were discarded. (only-assembler-s1 OS-V-09 -sanger OS-M-04).

Mate-pair read recruitment
To mine all available information from previously published sequences, we recruited mate-pair reads from similar environments: MS-M-04, OS-M-03, OS-V-03, BP-V-03 (Table 1) in an attempt to generate additional scaffolds. The majority of the recruited mates validated the assembled contigs, but did not extend contig length or assemble into new contigs.

Rarefaction
Rarefaction curves were generated in MG-RAST [55] via blastn against GenBank using a maximum e-value of 1e-5, a minimum identity of 60%, and a minimum alignment length of 15aa.

Phage annotation pipeline
Getorf part of the EMBOSS software package [69] was used to extract open reading frames over 300bp in length (getorf-minsize 300). Predicted open reading frames were pipelined through InterproScan [61] to identify recogniseable domains.

Tetranucleotide Analysis and Emergent Self Assembling Map generation
Tetranucleotide frequency was calculated using scripts from Dick et al (https://github.com/ tetramerFreqs/Binning) [57] from assembled contigs larger than.1Kb in length (number). Contigs less than 1Kb often result in "noisy" signatures and were excluded from further analysis. The gplots heatmap.2 R function was then utilized to generate the heat map based on hierarchical clustering of tetranucleotide frequency of the assembled contigs. Hclust function was used to order the tree diagram through the distance between the rows. (S3 Fig). Emergent Self Assembling Maps (ESOM) were created to better visualize the clustering within the viral dataset. The ESOM was anchored by including several known genomes of organisms found in the microbial mat community, namely, Synechococcus sp. JA-2-3B'a(2-13) (NC_007776), Synechococcus sp. JA-3-3Ab (NC_007775), Roseiflexus sp. RS-1 (NC_009523), Chloroflexus sp. Y-400-fl (NC_012032), and Meiothermus silvanus (NC_014212).

Mapping of CRISPR Spacers onto Assembled Contigs
CRISPR spacers were BLASTed against assembled contigs (S3 Table). Matches were considered significant greater than e-5 for 90% [personal communication, David Paez]

Distribution of Glyco_hydro_108 domains in relevant datasets
Full length open reading frames including the Glyco_hydro_108 (PF05838) and PG_3 (PF09374) domains from relevant datasets (S1 Table) were aligned via Muscle in Jalview [65]. Trees were visualized with the MABL server (phylogeny.fr). Significant CRISPR spacer hits were overlaid as coloured dots or bars to indicate CRISPR spacer hits analyzed from Heidelberg et al 2009 [32] with varying degrees of nucleotide identity (Fig 4).

OS-V-09 contains a subset of known endolysin catalytic domains in sequenced cyanophages
Endolysin domains of interest for sequenced genomes (as characterized by Oliveira, et al [68]) were retrieved from pre-computed functional annotation with HMMER 3.0 in IMG. For OS-V-09, all open reading frames were extracted with getORF (-minsize 300) via command line. Hmmsearch (defaults) was used to search sequences with raw HMM models (pfam.xfam. org). Counts are shown in S6 Table. Plots were generated in Circos (http://mkweb.bcgsc.ca/ tableviewer/).