Identification of Novel sRNAs in Mycobacterial Species

Bacterial small RNAs (sRNAs) are short transcripts that typically do not encode proteins and often act as regulators of gene expression through a variety of mechanisms. Regulatory sRNAs have been identified in many species, including Mycobacterium tuberculosis, the causative agent of tuberculosis. Here, we use a computational algorithm to predict sRNA candidates in the mycobacterial species M. smegmatis and M. bovis BCG and confirmed the expression of many sRNAs using Northern blotting. Thus, we have identified 17 and 23 novel sRNAs in M. smegmatis and M. bovis BCG, respectively. We have also applied a high-throughput technique (Deep-RACE) to map the 5′ and 3′ ends of many of these sRNAs and identified potential regulators of sRNAs by analysis of existing ChIP-seq datasets. The sRNAs identified in this work likely contribute to the unique biology of mycobacteria.


Introduction
The genus Mycobacterium contains many clinically relevant pathogens, including Mycobacterium tuberculosis and Mycobacterium leprae, the etiologic agents of tuberculosis (TB) and leprosy, respectively. M. tuberculosis alone was responsible for 8.7 million incident cases and 1.4 million deaths globally in 2011 [1]. The treatment of TB has become increasingly difficult due to its high drug resistance and adaptability; hence, the development of new and more effective treatments for TB is imperative.
In a previous study, we used computational predictions from the SIPHT (sRNA Identification Protocol using High-throughput Technologies) [17] algorithm to identify 144 sRNA candidates in M. bovis BCG. We selected 34 conserved sRNA candidates which we experimentally confirmed by Northern blot [13]. In the current study, we expanded our search to include all the remaining SIPHT predictions for M. bovis BCG as well as all SIPHT predictions for M. smegmatis (these were not explored in the previous study). By combining SIPHT predictions with large-scale Northern blot validation, we have identified 23 and 17 novel sRNAs in M. bovis BCG and M. smegmatis, respectively. Thus, we have substantially increased the number of experimentally validated sRNAs in mycobacterial species. We also analyzed existing ChIP-seq datasets to identify possible regulators of sRNA expression, and used Deep-RACE, a technique that combines high throughput RNA-seq with Rapid Amplification of cDNA Reads (RACE) [18], to identify sRNA 59 and 39 ends. Lastly, it is worth noting that this work is one of the first efforts to better coordinate genome annotation; all sRNA candidates identified in this study have been renamed with a new nomenclature [19].

Strains and Plasmids
M. bovis BCG (Pasteur strain, Trudeau Institute), and M. tuberculosis H37Rv were grown in mycomedium (as previously reported, [13]). M. bovis BCG and M. tuberculosis cultures were grown for 7 days, with shaking, to late-log phase. Cultures of M. smegmatis MC 2 155 were grown shaking at 37uC, in trypticase soy media supplemented with 0.05% Tween 80 for 18 hours with shaking (late-log phase).

Phylogenetic Selection of Computationally Predicted sRNA Candidates
Small RNA candidates of M. smegmatis were predicted using the SIPHT program with the same parameters as described previously [17,20]. SIPHT identifies potential sRNA candidates based on the presence of intergenic sequence conservation upstream of putative Rho-independent terminators. SIPHT has been widely applied in sRNA studies [21][22][23], and its reliability has been tested and compared with other algorithms [24].

RNA Isolation and Northern Blot Analysis
RNA was isolated as previously reported [13]. Northern blot analysis was performed as previously reported [13]; probes were designed according to SIPHT predicted sequences and tested in M. bovis BCG, M. smegmatis and M. tuberculosis [13]. All the oligonucleotides that were used in this study are listed in Table S1.

ChIP-seq Analysis
We analyzed existing ChIP-seq datasets for 55 M. tuberculosis transcription factors extracted from a previous study [25]. ChIPseq peak positions were compared to the 59 end positions of M. bovis BCG and M. tuberculosis sRNAs from the current study and two previous studies [12,15]. For M. bovis BCG sRNAs, we first identified the equivalent region of the M. tuberculosis H37Rv genome. Possible sRNA regulators were selected if the ChIP-seq peak was located within 100 bp upstream and 20 bp downstream of an sRNA 59 end.

Deep 59 and 39 RACE
Deep 59 RACE and Deep 39 RACE were performed as previously described [18] with the following exceptions. Deep 59 RACE libraries and Deep 39 RACE libraries were pooled and sequenced together using an Ion Torrent 316 chip (Wadsworth Center Applied Genomic Technologies Core Facility). For Deep 59 RACE, sequence reads were identified by the presence of the expected adapter sequence at the read 59 end. Adapter sequences were removed and reads of .40 nt were mapped to the reference genomes using BWA [26]. For Deep 39 RACE, sequence reads were identified by the presence of the expected adapter sequence.   Figure S1. Lane 1 and 2 indicate total RNA samples extracted from M. smegmatis and M. bovis BCG, respectively. We used Phi-X174/Hae III Marker for the size prediction. The probes we used for this analysis are listed in Table S1. doi:10.1371/journal.pone.0079411.g002 Adapter sequences were removed. The oligo-dT stretch was removed by identifying the first consecutive pair of bases not including a ''T'' and removing all sequence upstream of this. Sequences of .40 nt were mapped to the reference genomes using BWA [26]. For both Deep 59 RACE and Deep 39 RACE, 59 and 39 ends were identified as the position with the most sequence reads, and with a minimum of 5 reads. Sequences of all primers used for Deep RACE are listed in Table S2.

Results and Discussion
Prediction of sRNAs in silico using SIPHT Using SIPHT, we identified 93 candidate sRNAs in M. smegmatis (refseq: NC_008596) (Table S3) and 144 candidate sRNAs in M. bovis BCG (refseq: NC_008769) (Table S4). Tables S3 and S4 include a detailed description of the predicted coordinates, orientations, sizes and neighboring upstream and downstream genes. Northern probes were designed according to SIPHT prediction. Figure 1 summarizes the overall approach that was employed in this work for sRNA identification and confirmation.

Novel sRNAs Identified in Mycobacterium Smegmatis
All 93 M. smegmatis sRNA candidates were tested by Northern blot using oligonucleotides in both orientations; expression was confirmed for 18 sRNA (listed in Table 1; see blot pictures in Figure 2 and Figure S1). One of them (Sm32/33) was identified in recent work as IGR-1 with similar size, coordinates and same orientation [27]. Thus, 17 M. smegmatis sRNAs identified here have not been experimentally demonstrated in any previous studies. In our previous study [13], we reported homologs of 6 M. smegmatis sRNA candidates (Sm32/Sm33, Sm35, Sm46, Sm47, and Sm74) in M. bovis BCG (Mpr13/Mcr14, Mpr20, Mpr3, Mpr4, and Mpr5, respectively). These were confirmed directly in M. smegmatis by Northern blotting in current study and listed in the 17 novel confirmed sRNAs. A homologue of Sm76 was previously identified in M. tuberculosis by RNA-seq [15] and microarray analysis [14] but not otherwise experimentally confirmed. All of the validated sRNAs were in the same orientation to that predicted by SIPHT. This suggests that the sequence specificity of SIPHT for this prediction is higher than in our previous work, in which 9 out of 37 of the validated sRNAs were in the opposite orientation to the prediction [13]. All confirmed sRNAs were assigned gene names according to a recently-proposed nomenclature [19]. Given the practical convenience of testing RNA from both species simultaneously to search for novel sRNA candidates, we used the designed probes for sRNA detection in M. smegmatis to also probe expression of these candidates in M. bovis BCG and M. tuberculosis. Although our focus was to validate M. smegmatis predictions, we fortuitously discovered homologues of 9 candidates in M. bovis BCG and 4 candidates in M. tuberculosis (Table 1, Figure  S2). Since these probes were not specifically designed for the other two species, lack of detection could be due to either the absence of sRNA expression or to non-optimization of the probe sequence that was used for hybridization to the targeted region in the M. bovis BCG and M. tuberculosis genome. Also, differences in culture medium might contribute to the low number of expressed homologous sRNAs of M. smegmatis in M. tuberculosis as expression of these sRNAs could be specific to different conditions in M. tuberculosis. Given our focus in sRNA identification, specific conditions that could lead to differences in sRNA expression will be explored in future work.

Novel sRNAs Identified in Mycobacterium Bovis BCG
Twenty-one of the sRNA candidates for M. bovis BCG (Bo12, Bo15, Bo41, Bo52, Bo58, Bo67, Bo68, Bo75, Bo80, Bo85, Bo99, Bo100, Bo111, Bo113, Bo115, Bo117, Bo122, Bo125, Bo126, Bo137, and Bo139) were previously identified, under the nomenclature Mpr 1-21, respectively [13]. Forty-six other candidates were also tested previously but showed no signal; therefore, only the remaining 77 candidates were tested using Northern blotting analysis in this study, and we confirmed expression of 23 new sRNA candidates (Figure 3 and Figure S3). A homologue of Bo46 was previously identified in M. tuberculosis by RNA-seq [15] but not otherwise experimentally validated. All of the validated sRNAs were in the same orientation as that predicted by SIPHT. We also applied the probes to M. smegmatis and M. tuberculosis and identified 20 and 5 sRNA homologues, respectively (Table 2; Figure 3; Figure S3). All the confirmed sRNAs in M. bovis BCG and M. tuberculosis are listed in Table 2, along with the new nomenclature for sRNAs.

Deep-RACE Identifies sRNA 59 and 39 Ends
We used Deep-RACE, a previously described approach that combines conventional RACE and deep sequencing to identify 59 and 39 ends of selected RNAs [18,28]. In total, we identified 59 ends for 9 sRNAs and 39 ends for 21 sRNAs. Examples are shown in Figure 4. For some sRNAs we identified multiple 59/39 ends. Multiple 59 ends could be due to multiple transcription start sites or RNA processing. Multiple 39 ends could be due to RNA processing or may indicate imprecise Rho-dependent termination of transcription.

Size Comparisons between Experimental and Prediction Analysis
As noted in our earlier study [13], the predicted size of the candidate sRNAs correlates only weakly with experimental observations. Only about 17% of the confirmed sRNAs were within 10% of their predicted sizes. Additionally, in many cases, multiple bands were detected by Northern analysis, suggesting the presence of multiple start sites, multiple termination sites, and/or sRNA processing. This is consistent with the Deep RACE data (Figure 4; Figure S4). Deep RACE identified both 59 and 39 ends for seven sRNAs. In these cases, the sizes determined by Deep RACE are similar to those confirmed by Northern blotting.

Location of sRNAs with Respect to Genes
To investigate the potential roles of the novel sRNAs, we mapped them all to the latest annotated genome (National Center for Biotechnology Information, NCBI). Although we aimed to find intergenic sRNAs, half of the candidates we identified in this study overlap partially or entirely protein-coding genes in either the  Figure S2. Lane 1 and 2 indicate total RNA samples extracted from M. bovis BCG and M. smegmatis, respectively. We used Phi-X174/Hae III Marker for the size prediction. The probes we used for this analysis are listed in Table S1. doi:10.1371/journal.pone.0079411.g003 sense or antisense orientation ( Table 1, Table 2). We categorized sRNAs into different classes according to their position relative to adjacent coding regions. Where possible, we used 59/39 end information from Deep-RACE data. For sRNAs that have only one end mapped by Deep-RACE, the other end was estimated according to the length confirmed by Northern blotting analysis (Figure 4). For sRNAs that have neither end mapped by Deep-RACE, the farthest possible ends were estimated according to Northern blotting analysis and the sRNAs would be categorized as ''not determined'' if multiple class options exist.
The location of sRNAs relative to protein-coding genes also gives clues as to their function. Regulatory sRNAs that are completely intergenic typically function by base-pairing with distally-encoded mRNAs; however, some of the sRNAs are close to the 59 end or 39 end of adjacent genes, suggesting possible alternative regulatory roles. sRNAs antisense to ORFs or UTRs can regulate expression of the overlapping gene [29]. sRNAs located within UTRs or ORFs in the sense orientation may be degradation products or mRNAs or could be important cis-acting regulatory elements such as riboswitches. sRNAs can be transcribed independently or generated by processing of mRNA UTRs. Several features of the sRNAs identified in this work are consistent with the sRNAs being independently transcribed from their own promoters. First, the Northern blots showed no evidence of larger bands that could correspond to pre-processed mRNAs. Second, 13 sRNAs (Sm35, Sm42, Sm67, Sm68, Sm74, Bo13, Bo32, Bo60, Bo71, Bo73, Bo81, Bo118, Bo130) are orientated away from the surrounding genes. Third, 5 sRNAs (Sm64, Sm82, Bo47, Bo105, Bo132) are located .200 bp from the nearest gene start/stop. Nineteen sRNAs are close to (,200 bp) upstream or downstream coding regions (Sm11, Sm19, Sm32/33, Sm46, Sm47, Sm49, Sm76, Bo27, Bo29, Bo35, Bo46, Bo48, Bo53, Bo78, Bo82, Bo86, Bo87, Bo96, Bo101) and four (Sm38, Sm41, Sm90, Sm93) overlap coding regions in   [12]. The green and blue graphs indicate the relative number of sequence reads mapping to the plus and minus strands, respectively. The yellow graphs indicate the sum of plus and minus strand reads. Annotated genes are shown as gray arrows. sRNAs are shown as red triangles. doi:10.1371/journal.pone.0079411.g005 the sense orientation. It is formally possible that these sRNAs are generated by mRNA processing or premature termination, although the Northern blot analysis argues against this. Regardless, sRNAs processed from mRNAs could still have important regulatory functions [3,30,31]. Indeed, a recent study identified 39 UTRs as an abundant source of regulatory sRNAs in Salmonella enterica [32]. Alternatively, sRNAs generated by processing of mRNAs could indicate cis-acting regulatory elements such as riboswitches.

Likely Regulators of sRNAs Identified by Analysis of ChIPseq Datasets
The regulation of sRNAs can provide important clues as to their biological functions. However, very little is currently known about regulation of mycobacterial sRNAs. The genome-wide binding profiles of many M. tuberculosis transcription factors have recently been determined using ChIP-seq and these data are publicly available [25]. Although we identified sRNAs in M. bovis BCG, it is highly likely that these sRNAs are conserved in M. tuberculosis given the extremely high similarity of the M. bovis BCG and M. tuberculosis genomes [33]. Hence, we searched existing ChIP-seq datasets of M. tuberculosis for transcription factors that bind close to sRNA 59 ends, including sRNAs identified in earlier studies [12]. We identified 10 ChIP-seq peaks (indicative of a transcription factor binding site) located between 100 bp upstream and 20 bp downstream of sRNA 59 ends (Table S5). Thus, we have identified likely examples of sRNA regulation. In some cases, the ChIP-seq peak is also close to the start of an annotated protein-coding gene. Hence, the transcription factor may regulate the protein-coding gene rather than the sRNA. Nevertheless, in four cases, the ChIPseq peak is unambiguously associated with an sRNA 59 end. The two examples with highest ChIP-seq signal are shown in Figure 5. For each of these examples, the transcription factor is otherwise uncharacterized.

Conclusion
In summary, we have identified 17 novel sRNAs in M. smegmatis and 23 novel sRNAs in M. bovis BCG, verified 59 and 39 ends, and list these sRNAs according to a recently-proposed annotation nomenclature. Our analysis of sRNA position relative to proteincoding genes suggests various potential roles for these sRNAs in gene regulation. Although the specific biological function of these, and all other known mycobacterial sRNAs, is not understood, we speculate that some of these sRNAs contribute to the biology of pathogenic mycobacterial species. Future studies will focus on the functional characterization of these novel sRNAs.