Transcriptomic Characterization of an Infection of Mycobacterium smegmatis by the Cluster A4 Mycobacteriophage Kampy

The mycobacteriophages, phages that infect the genus Mycobacterium, display profound genetic diversity and widespread geographical distribution, and possess significant medical and ecological importance. However, most of the majority of functions of mycobacteriophage proteins and the identity of most genetic regulatory elements remain unknown. We characterized the gene expression profile of Kampy, a cluster A4 mycobacteriophage, during infection of its host, Mycobacterium smegmatis, using RNA-Seq and mass spectrometry. We show that mycobacteriophage Kampy transcription occurs in roughly two phases, an early phase consisting of genes for metabolism, DNA synthesis, and gene regulation, and a late phase consisting of structural genes and lysis genes. Additionally, we identify the earliest genes transcribed during infection, along with several other possible regulatory units not obvious from inspection of Kampy's genomic structure. The transcriptional profile of Kampy appears similar to that of mycobacteriophage TM4 but unlike that of mycobacteriophage Giles, a result which further expands our understanding of the diversity of mycobacteriophage gene expression programs during infection.


Introduction
Mycobacteriophages are dsDNA viruses that infect bacteria of the genus Mycobacterium. First isolated from soil samples in 1946 [1], the mycobacteriophages have been studied as a system for detection and diagnosis of tuberculosis [2][3][4], as a source of molecular tools for engineering Mycobacteria [5][6][7][8], and as a model for bacteriophage evolution and genetic diversity [9][10][11][12][13][14]. Although the mycobacteriophage host with the most clinical relevance is Mycobacterium tuberculosis, the most common host used for mycobacteriophage study is Mycobacterium smegmatis, a fast-growing model for M. tuberculosis.
Largely due to science education initiatives such as the SEA-PHAGES and PHIRE programs, an abundance of genomic data exists for the mycobacteriophages. As of the middle of Progression of Viral Transcript Abundance RNA-Seq was performed before infection and after addition of virions both before (5,15,30, and 60 minutes after infection) and after (120 minutes after infection) predicted lytic burst with two biological replicates at each time point. To prevent reads from M. smegmatis from incorrectly mapping to regions of Kampy with moderate homology, reads were mapped simultaneously to the Mycobacterium smegmatis and mycobacteriophage Kampy genomes. All reads (that successfully mapped) were mapped uniquely. We detected viral mRNA transcripts as early as five minutes after infection, at which point they constituted 1.3% of the total fraction of non-ribosomal reads mapped to either of the Kampy or Mycobacterium smegmatis genomes (Fig 2). Fifteen minutes into infection, Kampy mRNA constituted 52% of such mapped nonribosomal reads. The percentage of viral mRNA increased to approximately 85% of mapped  non-ribosomal reads by 30 minutes after infection and stabilized to 92% of mapped non-ribosomal reads from 60 minutes after infection onward.

Patterns of Viral Gene Expression
At five (Fig 3a) and fifteen minutes (Fig 3b) post-infection, gene expression was observed almost exclusively on the right arm of the genome, which contains genes involved in DNA replication (DNA polymerase, DNA primases, DnaB-like helicases), integration (recombination endonuclease VII), nucleotide metabolism (deoxycytidylate deaminase, thymidylate synthase/ thyX, ribonucleotide reductase, phosphoribosyl transferase), host immunity repression, and DNA methylation. Specifically, 86.2% and 99.9% of viral gene expression at five and fifteen minutes post-infection, respectively, occurred upstream of gene 35, as measured by integrating read depth from the beginning of gene 36 to the end of the genome. At both five and fifteen minutes after infection, gene 78, a DNA methylase, is notably (2X-10X) more highly expressed than any other gene with known function.
Five minutes after infection, expression was largely confined to genes 82 through 88, all of which encode hypothetical proteins, and with the highest expression of any gene occurring in gene 83, encoding another hypothetical protein. Fifteen minutes post-infection we observed a leftward shift in expression, indicated by a roughly 3-fold increase in expression of genes 50 through 82 (Table 1). At this time, gene 78 (DNA methylase) was the most highly transcribed gene, and the other DNA methylase (gene 77) was the 13th-most highly transcribed. Other highly expressed genes (Reads Per Kilobase per Million mapped reads (RPKM) > 5,000) at 15 minutes after infection included a metallophosphoesterase (gene 52), a DnaB-like helicase (gene 63), a RecB-like helicase (gene 67), a phosphoribosyl transferase (gene 62), and a ribonucleotide reductase (gene 50).
Thirty minutes into infection we observed a further shift in transcriptional activity towards the center of the genome (Fig 3c). While Kampy genes 50-88 were most highly expressed at fifteen minutes after infection, at thirty minutes, a region spanning genes 36 through 55 dominated expression. Metallophosphoesterase (gene 52) and ribonucleotide reductase (gene 50) remained very highly expressed (average RPKM 45,027 and 39,010, respectively), but the expression of both DNA methylases decreased relative to that of other genes. We also observed transcriptional activity on the left side of the genome, which primarily contains genes involved in virion construction (terminase, portal protein, capsid maturation protease, scaffolding protein, major capsid protein, major and minor tail subunits, tail assembly chaperones, and tapemeasure protein) and virion escape (two lysins and a holin).
Sixty minutes post-infection, nearly all (~95%) gene expression was observed in the left arm of the genome (Fig 3d). At this time, the highest single peak of expression occurred in a noncoding region upstream of gene 1, although the region upstream of gene 88 remained strongly expressed. Genes 1-31 were all considerably more expressed (18-fold by integrated transcript coverage) than the right-arm genes, with the highest expression spanning genes 14 through 22, as well as a large spike at the 3' end of gene 31 (a minor tail protein). The most highly expressed genes at this stage were a scaffolding protein (gene 14), a major capsid protein (gene 15), and a major tail subunit (gene 22).
By two hours into infection we observed gene expression on both sides of the genome. Expression was consistent with a superposition of expression patterns at thirty and sixty minutes into infection with the pattern dominated by the latter. This suggests transcription from a second round of viral infections as well as a large number of residual transcripts from the initial infection (Fig 3e). Visualization with principal component analysis (PCA) (Fig 4) suggests that gene expression patterns two hours after infection are intermediate between those at thirty   minutes into infection and those at sixty minutes into infection. Notably, the genes with the highest expression from the right-hand arm of the genome are gene 50 (ribonucleotide reductase), gene 51 (hypothetical protein) and gene 52 (metallophosphoesterase), while both DNA methylases were expressed to a lesser degree. This pattern is more similar to expression seen at 30 minutes after infection than at either 5 or 15 minutes, consistent with a latent period of approximately 90 minutes.
Notably, at all time points, including 60 minutes after infection, where expression is largely confined to the left arm of the genome, read coverage was highest in an uncalled region upstream of gene 88 at the far end of the right arm of the Kampy genome (41%, 51%, 16%, 5%, and 14% of transcription as measured by integrated base coverage at 5, 15, 30, 60, and 120 minutes after infection, respectively). Another region at the opposite end of the genome has similarly high read coverage at later stages. These regions display no coding potential according to the bioinformatic tools Glimmer and GeneMark. BLAST comparison shows that the left-and right-hand peaks are conserved within 100% and 96% of sequenced cluster A4 phages, respectively, but are not conserved between other subclusters or clusters. Proteomic analysis revealed no peptide matches to these regions in any translation frame. These data suggest that there are transcribed regions of the genome that are not translated during any stage of viral infection. The high level of expression of these regions could be partially attributable to the presence of a large 5' read pileup, indicating a transcriptional start site near 50,400 bp and another near 200 bp [30].
Centroid-based clustering with the Xmeans algorithm suggests 16 clusters of gene expression (S1 Table) [32]. No patterns are apparent from Xmeans clustering based on known functional similarity or genomic proximity. Promoter discovery using DNA master (http:// cobamide2.bio.pitt.edu/) and BPROM [33] yielded no predicted promoters with scores above 0.6 and 1, respectively, in intergenic regions with the correct orientations. Promoter discovery likely failed because both of these programs use consensus sequences for sigma factor binding sites that are known in E. coli but not in M. smegmatis.

Confirmation of bioinformatically predicted coding regions
To confirm predicted gene annotations in Kampy previously identified only through bioinformatic detection of coding potential, we performed proteomic analysis on samples from 30 minutes into infection, 60 minutes into infection, and 120 minutes into infection. We confirmed the presence of 19 translated open reading frames (ORFs): metallophosphoesterase (gene 52), portal protein (gene 12), major capsid protein (gene 15), major tail subunit (gene 22), scaffolding protein (gene 14), tail assembly chaperone (gene 24) and 12 ORFs corresponding to hypothetical proteins ( Table 2). We then compared detected peptides to bioinformatically predicted genes. In all but one case the bioinformatic predictions were verified, with peptides appearing in expected reading frames. Unexpectedly, at 60 minutes after infection (but not 30 or 120 minutes after infection) we detected a protein with several common modifications of the proteins were also detected by the PLGS such as methionine oxidation and carbamidomethylation of the cysteines by the iodoacetamide alkylating agent corresponding to a single continuous ORF spanning the end of gene 66 and the beginning of gene 65, but out of frame with respect to either gene annotation (genomic position 42,331-42,037, reverse orientation). This region was matched with a ProteinLynx Global SERVER (PLGS) score of 1013 and 68% coverage, both scores implying very high confidence in the call. No conserved domains were discovered by comparison of this region to known conserved protein families using HHPred [34]. BLAST searches did not reveal any known proteins with similar sequences. Both genes 65 and 66 show high coding potential according to Glimmer and GeneMark, while the region covered by this frameshifted product does not (in the shifted frame).
Of the proteins we detected, 12 are expressed in the early-phase arm including metallophosphoesterase (gene 52) and 11

Discussion
Nearly a thousand mycobacteriophage genomes have been fully sequenced, illuminating the landscape of genetic diversity among these phages. The transcriptional profiles of mycobacteriophage infection, however, remain poorly understood. We performed global gene expression analysis of the cluster A4 mycobacteriophage Kampy's infection of Mycobacterium smegmatis, describing the sequence and approximate timing of transcriptional events during infection.
Viral transcription begins as early as 5 minutes following infection, and by 30 minutes after infection, viral mRNA constitutes the vast majority (>80%) of the transcriptional pool. The dominance of viral transcripts could be caused by highly active viral transcription, degradation of the host's transcriptome, repression of host transcription, or, more likely, a combination of these mechanisms. As RNA-Seq data are normalized to the total number of mapped reads, they cannot be used to quantify total transcript abundance. Therefore, our data only reveal changes in ratios of expression, not absolute differences between conditions. Experiments that assay RNA degradation and the rate of new transcription will be required to distinguish among the possibilities listed above.
The most closely related phage to mycobacteriophage Kampy that has been subjected to transcriptional analysis is the well-studied cluster A2 mycobacteriophage L5. Lee et al. have shown that L5 has two primary arms of transcription driven in part by two strong promoters, one at each end, which drive transcription inwards [35,36]. While Kampy's transcriptional pattern is not surprisingly similar to that of L5, comparison of the exact temporal sequence of gene expression between Kampy and L5 is uninformative because L5 is a temperate phage and can form lysogens. The cluster K2 mycobacteriophage TM4, however, is strictly lytic and therefore can be more appropriately compared to Kampy. Similar to both L5 and Kampy, the TM4 genome is separated into a left arm, containing genes primarily involved in virion assembly, and a right arm, containing genes involved in DNA replication and metabolism. Unlike Kampy and L5, every gene in TM4 is transcribed entirely left to right throughout the entire genome [37]. Ford et al. observe translation of early-phase proteins between 10 and 20 minutes after infection, followed by the production of late-phase proteins sometime between 30 and 60 minutes after infection. The transcriptional program of Kampy appears to follow a similar progression, with activation of the early-phase genes 5 minutes following infection and the late-phase gene expression onset between 30 and 60 minutes after infection.
While mycobacteriophages Giles and Kampy share a common bacteriophage genome architecture consisting of a contiguous set of early-phase genes and a contiguous set of late-phase genes, transcription of these sets is strikingly different [29]. Transcription in Giles is largely unidirectional, with 82% of genes transcribed left to right at all time points measured [27]. In contrast, we observed changes over time in Kampy's directional transcriptional patterning. Genes 1-24 are exclusively transcribed in a left to right manner and genes 49-88 are similarly transcribed exclusively right to left, regardless of the time point. However, the region in the middle of the genome (genes 25-48) is transcribed in both directions in proportions that vary with respect to the progression of infection.
Although a two-promoter model could be used to explain much of Kampy's transcriptional profile, clustering analysis suggests that this is insufficient to completely explain observed dynamics and additional transcriptional units remain to be identified. Indeed, the presence of higher levels of coverage at 30 and 60 minutes after infection at genes 13-23 and 48-56 respectively suggest putative transcriptional units for further investigation.
While we observed agreement between biological replicates at most times, at thirty minutes into infection, expression varied considerably between replicates. Specifically, early-phase transcripts from the right hand arm dominated expression in one replicate, whereas in the other replicate we observed transcription across the entire genome. These data suggest that a shift in gene expression occurs around thirty minutes into infection, which would be consistent with the expression timing of TM4 [37]. The variability between replicates may also suggest variability in the exact timing of expression. Higher temporal resolution sampling would quantify biological variability in infection progression from early-to late-phase gene expression.
Transcription beyond thirty minutes was dominated by late-phase transcription from the left arm of the genome. However, we still observed low levels of early-phase transcription from the right arm at these times, a finding consistent with those of Dedrick et al. for mycobacteriophage Giles and of Lavigne et al. in an infection of Pseudomonas aeruginosa by LUZ19 [27,28]. As previously noted for virus-host transcript ratio, our data only reveal differences in ratios of expression, not absolute differences. It is therefore unclear whether early phase gene expression remains constant but much lower than subsequent late phase expression or, alternatively, whether early phase genes are actively repressed or degraded later in infection.
Notably, for each side of the genome, when that side of the genome is transcribed, we observed abundant transcription of a few hundred bases on that particular side of the genome upstream of any region with coding potential. With the exception of one sample at 120 minutes, these peaks consistently contained the highest levels of transcription on each side of the genome. Read pileups at transcriptional start sites are known to occur in RNA-Seq due to overrepresentation of 5' ends after RNA fragmentation [30]. We propose that the high coverage observed on either end of the Kampy genome are mostly likely the result of transcriptional start sites for two promoters.
In addition to transcriptomic analysis, we provide protein-level confirmation of the presence of eighteen previously annotated genes. Twelve of these confirmed proteins have no known function, as well as no known homology outside the mycobacteriophages. This work provides the first experimental confirmation of the translation of these twelve genes from twelve distinct protein phams with a total of 1837 members, whose existence was previously only predicted through bioinformatic detection of coding potential. Our proteomic analysis also identified a protein product spanning parts of genes 66 and 65 that was out-of-frame with respect to both genes. This product is encoded immediately downstream of a site similar to a translational frameshift site previously reported in mycobacteriophage TM4 (CCGAAAA and GGGAAAA, respectively) [31]. Further study will be required to determine whether this product is truly functional.
Using RNA-Seq and mass spectrometry, we have broadly described the transcriptional program of a cluster A4 mycobacteriophage, Kampy. Our analysis shows much greater similarity between the infection dynamics and timing of Kampy and the K2 cluster mycobacteriophage TM4 than between Kampy and the cluster Q mycobacteriophage Giles. Our high-throughput sequencing approach suggests functional studies that can further clarify mechanisms of mycobacteriophage gene regulation during infection.
For all infections, log-phase liquid-culture M. smegmatis was incubated with high-titer lysate of mycobacteriophage Kampy. For latent period experiments, Kampy was added at 1:10 multiplicity of infection (MOI) to ensure complete adsorption of viral particles, thus reducing background virions that would be detected during the assay. For RNA-Seq and proteomic experiments, Kampy was added at a 10:1 MOI to maximize the fraction of infected hosts. Infected cells were grown in a shaking incubator at 37°C.

Latent Period Experiment
M. smegmatis log-phase liquid-culture was combined with high-titer lysate of mycobacteriophage Kampy at 1:10 MOI. Infections were incubated for 5 minutes at 37°C in a shaking incubator to allow adsorption of viral particles, and then pelleted by centrifugation for 3 minutes at 5,000G at 4°C. Cells were washed three times with 1 ml room temperature phage buffer (10 mM Tris pH 7.5, 10 mM MgSO4, 4% w/v NaCl) to remove unadsorbed phage particles, centrifuging at 5,000 G for 3 minutes at 4°C between each wash. Pellets were resuspended in 40 ml Middlebrook 7H9 media supplemented as described above. Infected cells were incubated at 37°C in a shaking incubator for the duration of the experiment. At each measured time point, 200 μl of infected culture was serially diluted in phage buffer, added to 500 μL of uninfected M. smegmatis, and incubated at room temperature for 5 minutes before plating for measurement of viral titer.

RNA preparation
For each time point, 15 ml of infected M. smegmatis culture was pelleted by centrifugation for 10 minutes at 400 G at 4°C. Pellets were flash frozen in liquid nitrogen and stored at -80°C until further processing. Frozen cell pellets were resuspended in 1 ml TRIzol (Life Technologies). Cells were then disrupted as previously described [27]. Total RNA was extracted from the TRIzol solution with bromochloropropane, washed with isopropanol, precipitated with 70% ethanol, and resuspended in nuclease-free water.
To prepare samples for RNA-Seq, total RNA was incubated with recombinant DNase I (Ambion) and RNasin (Promega) for 30 minutes at 37°C. DNase I was inactivated by addition of EDTA (5 mM) and 10 minute incubation at 70°C.

RNA-Seq
Ribosomal RNA was depleted from prepared samples by subtractive hybridization using the RiboZero kit for gram-positive bacteria (Illumina) and depleted RNA was checked for integrity and rRNA depletion with the Bioanalyzer 2100 (Agilent). Libraries for sequencing were prepared from rRNA-depleted RNA using the Ion Total RNA-Seq Kit v2 (Ion Torrent) and sequenced on an Ion Torrent PGM according to the manufacturer's instructions. A total of 4,926,133 reads were generated across all conditions.
Reads with intact adapters were trimmed of adapters using the Torrent Server browser. Trimmed reads were mapped simultaneously to the M. smegmatis (NCBI NC_018289.1) and mycobacteriophage Kampy (GenBank KJ510414.1) genomes using the "RNA-Seq analysis" tool in CLC Genomic Workbench 7.0.x. To test for differential expression of genes between timepoints, count tables of reads mapping to Kampy genes were exported and rRNA and tRNA entries were removed. These count tables were analyzed using the DESeq package in Bioconductor R [38,39]. P-values were corrected for multiple hypothesis testing using Benjamini-Hochberg correction with α = 0.05 [40].
Cluster analysis was performed with the Xmeans algorithm implemented in the standalone program kmeans [32]. Analysis was performed with 88 maximum centers and otherwise recommended default arguments.

Mass Spectrometry
Whole protein was extracted from infected M. smegmatis culture as previously described [41]. Samples were first brought to pH of 8.0 with 1M ammonium bicarbonate. Samples were then incubated with 10 mM final concentration of tris(2-carboxyethyl)phosphine (TCEP) (Sigma-Aldrich) for 10 minutes at 95°C followed by alkylation in the dark with 15 mM iodoacetamide (IAA) (Sigma-Aldrich) for one hour. Samples were then digested with sequencing grade trypsin (Promega) at a 1:50 enzyme to substrate ratio overnight at 37°C. After overnight digestion, samples were then brought up to a final pH of 2.0 using trifluoroacetic acid (TFA), flash frozen, and then lyophilized until further use.
Prior to mass spec analysis, lyophilized samples were brought up in 1 mL of aqueous buffer containing 2% acetonitrile and 0.1% TFA. Samples were then applied to a C18 TopTip (Glygen) according to manufacturers instructions. Briefly, each C18 TopTip was first equilibrated with 100% acetonitrile followed by washing with buffer containing 2% acetonitrile and 0.1% TFA. The trypsin digested samples were then applied to the TopTip followed by subsequent washing with 2% acetonitrile and 0.1% TFA. Samples were eluted from the C18 TopTip using an aqueous solution containing 60% acetonitrile with 0.1% TFA and were then subjected to concentration to approximately 10μl volume using a speed vac.
Mass spec analysis was performed on a nanoACQUITY Xevo G2 QToF equipped with a TRIZAIC UPLC source (Waters Corp). Samples were loaded onto an Acquity HSS T3 Trizaic nanoTile with dimensions of 85 μm x 100 mm. Samples were analyzed using label-free dataindependent acquisition mass spectrometry with subsequent data analyzed using ProteinLynx Global Server V2.3 with a minimum of 3 and 7 ion matches per peptide and protein, respectively. The data was searched against a six-frame translation of the mycobacteriophage Kampy genome.