RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”

Genome structural annotation, i.e., identification and demarcation of the boundaries for all the functional elements in a genome (e.g., genes, non-coding RNAs, proteins and regulatory elements), is a prerequisite for systems level analysis. Current genome annotation programs do not identify all of the functional elements of the genome, especially small non-coding RNAs (sRNAs). Whole genome transcriptome analysis is a complementary method to identify “novel” genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. In particular, the identification of non-coding RNAs has revealed their widespread occurrence and functional importance in gene regulation, stress and virulence. However, very little is known about non-coding transcripts in Histophilus somni, one of the causative agents of Bovine Respiratory Disease (BRD) as well as bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis. In this study, we report a single nucleotide resolution transcriptome map of H. somni strain 2336 using RNA-Seq method. The RNA-Seq based transcriptome map identified 94 sRNAs in the H. somni genome of which 82 sRNAs were never predicted or reported in earlier studies. We also identified 38 novel potential protein coding open reading frames that were absent in the current genome annotation. The transcriptome map allowed the identification of 278 operon (total 730 genes) structures in the genome. When compared with the genome sequence of a non-virulent strain 129Pt, a disproportionate number of sRNAs (∼30%) were located in genomic region unique to strain 2336 (∼18% of the total genome). This observation suggests that a number of the newly identified sRNAs in strain 2336 may be involved in strain-specific adaptations.


Introduction
Systems biology approaches are designed to facilitate the study of complex interactions among genes, proteins, and other genomic elements [1,2,3]. In the context of infectious disease, systems biology has the potential to complement reductionist approaches to resolve the complex interactions between host and pathogen that determine disease outcome. However, a prerequisite for systems biology is the description of the system's components. Therefore, genome structural annotation or the identification and demarcation of boundaries of functional elements in a genome (e.g., genes, non-coding RNAs, proteins, and regulatory elements) are critical elements in infectious disease systems biology.
Bovine Respiratory Disease (BRD) costs the cattle industry in the United States as much as $3 billion annually [4,5]. BRD is the outcome of complex interactions among host, environment, bacterial, and viral pathogens [6]. Histophilus somni, a gramnegative, pleomorphic species, is one of the important causative agents of BRD [6]. H. somni causes bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis [7]. H. somni strain 2336, the serotype used in this study and isolated from pneumonic calf lung, has a 2.2 Mbp genome and 2044 predicted open reading frames (ORFs), of which 1569 (76%) have an assigned biological function.
Genome structural annotation is a multi-level process that includes prediction of coding genes, pseudogenes, promoter regions, repeat elements, regulatory elements in intergenic regions such as small non-coding RNAs (sRNA), and other genomic features of biological significance. Computational gene prediction methods such as Glimmer [8] or GenMark [9] use Hidden Markov models which are based on a training set of well annotated genes. Although these methods are quite efficient, they often miss genes with anomalous nucleotide composition and have several well-described shortcomings: because bacterial genomes do not have introns, detecting gene boundaries is comparatively difficult; due to the usage of more than one start codon, computational genome annotation methods may predict overlapping ORFs [10]; prediction programs use arbitrary minimum cutoff lengths to filter short ORFs, which may lead to under-representation of small genes. In case of sRNA (small non-coding RNA) prediction, the lack of DNA sequence conservation, lack of a protein coding frame, and the limited accuracy of transcriptional signal prediction programs (promoter/Rho terminator prediction) confound computational prediction [11,12].
Computational prediction methods are a ''first pass'' genome structural annotation. Whole genome transcriptome studies (such as whole genome tiling arrays [13,14,15] and high throughput sequencing [16,17]) are complementary experimental approaches for bacterial genome annotation and can identify ''novel'' genes, gene boundaries, regulatory regions, intergenic regions, and operon structures. For example, a transcriptomic analysis of Mycoplasma pneumoniae identified 117 previously unknown transcripts, many of which were non-coding RNAs, and two novel genes [18]. Transcriptome analyses identified novel, non-coding regions in other species, including 27 sRNAs in Caulobacter crescentus [15], 64 sRNAs in Salmonella Typhimurium [17], and a large number of putative sRNAs in Vibrio cholerae [16]. sRNAs found in pathogen genomes are known to be involved in various housekeeping activities and virulence [19].
In this study we used RNA-Seq for the experimental annotation of the H. somni strain 2336 genome and to construct a single nucleotide resolution transcriptome map. Novel expressed elements were identified, and where appropriate, computational predictions of previously described gene boundaries were corrected.

Mapping of reads onto the H. somni genome
In 2008 the complete genome sequence of the H. somni strain 2336 became available (GenBank CP000947). The 2,263,857 bp circular genome has a GC content of 37.4%, and 87% of the sequence is annotated to coding regions. The genome has 2065 computationally predicted genes, of which 1980 are protein coding. We sequenced the transcriptome of H. somni using Illumina RNA-Seq methodology, and obtained 9,015,318 reads, with an average read length of approximately 76 bp. We mapped approximately 9.4% reads onto the reference DNA sequence of H. somni strain 2336 using the alignment program Bowtie [20]. To determine expressed regions in the genome, we estimated the average coverage depth of reads mapped per nucleotide/base. We used pileup format, which represents the signal map file for the whole genome in which alignment results (coverage depth) are represented in per-base format. Regions where coverage depth was greater than the lower tenth percentile of expressed genes were considered significantly expressed [21]; in the current study, this corresponded to a coverage depth of 7 reads/bp in pileup format.
As another measure for estimating background expression level, we analyzed the coverage in the intergenic regions of the genome. We assumed that at least half of the intergenic region is not expressed (considering the presence of known expressed regions, such as 39 and 59 UTR of genes, intergenic region of the operons, and sRNAs) and calculated the coverage, which corresponded to #6 reads per base, lower than our first cutoff estimate. We retained the most conservative cutoff for expression, i.e., 7 reads per base for describing the expression map of H. somni. Nucleotides in the genome sequence with coverage depth above our threshold value were considered to be expressed. This resulted in the generation of a whole genome transcriptome profile of H. somni 2336 at a single nucleotide resolution. Figure 1 show the steps involved in the analysis of expressed intergenic regions.

Expression in the intergenic region of the genome
We compared the RNA-Seq based transcriptome map with the available genome annotation to identify expressed, novel, and intergenic regions in the genome. Promoters and terminators were predicted across the genome to add confidence to the identified novel elements. For the first time, we report the identification of 94 sRNAs (Table 1) in the H. somni genome. The start and end for sRNA in Table 1 refer to the boundaries of transcriptionally active regions (TAR, putative sRNAs). Of these, twelve were similar to wellcharacterized sRNA families that are described in many bacterial species, such as tmRNA, 6S, and FMN ( Figure 2). The total of 82 novel sRNAs reported in this study has not been reported earlier.
The majority of the identified sRNAs (.75%) were shorter than 200 nucleotides (length range 70-695 nucleotides). The average GC content of sRNA at 39.3% was slightly higher compared with the 37.4% GC content of the genome. Promoters within 50 nt upstream/downstream of the TAR boundaries were predicted for 68 sRNA. Similarly, Rho-independent transcription terminators were predicted within 50 bp upstream/downstream of 40 sRNA. Figure 3 shows the depth of coverage for one of the identified novel sRNA ''HS46'' viewed in the Artemis genome browser [22].
BLAST analysis of the sRNA sequences against the nonredundant, nucleotide database at NCBI revealed that 31 of the sRNA sequences were unique to the H. somni 2336 genome. Another 41 were highly conserved (.95% identity with .95% coverage) only in H. somni strain 129PT, which is a commensal, preputial isolate. A set of 11 sRNAs were conserved in the related Pasteurellaceae family, which includes genomes such as P. multocida, H. influenzae, H. parainfluenzae, and H. ovis. Only 11 sRNAs were conserved in distant bacterial genomes from genera Streptococcus, Clostrodium, Actinobacillus, Vibrio, and others. This lack of sRNA sequence conservation beyond the species could indicate that sRNA sequences are under strong selection pressure, and that they could be responsible for the adaptation of many species to different environmental niches.
We searched all H. somni sRNA sequences against the Rfam database [23] to determine their putative functions. We found that 12 sRNAs were homologs to well characterized sRNAs in other genomes. The identified functional categories included FMN riboswitches, gcvB, glycine, intron_gpII, lysine, alpha_RBS, LR-PK1, isrK, MOCORNA, RNaseP_bact_a, tmRNA, and 6S. sRNAs for which no Rfam function could be predicted represent a completely novel set of non-coding sRNAs. Functions of these novel sRNA need to be determined by further experiments.

Identification and characterization of novel genes
We evaluated the coding potential of all expressed intergenic regions, by conducting BLASTX based sequence searches against the non-redundant protein database at NCBI followed by manual analysis and interpretation. We identified 38 novel protein coding regions ( Table 2). The average length of the identified novel proteins was around 60 amino acids (ranged from 19 to 135 amino acids). The majority of the novel proteins (30) were conserved hypothetical proteins present in related species such as H. somni 129PT, M. haemolytica, and H. influenzae. Some of the novel proteins had predicted functions, such as DnaK suppressor protein, toxic membrane protein TnaC, and predicted toxic peptide ibsB3 ( Table 2). Figure 4 shows an example of a novel protein ''HSP7'' that is similar (74% similarity and 100% coverage) to a putative, phage-related DNA-binding protein of Neisseria polysaccharea.

Corrections made to the existing genome annotation
The single nucleotide resolution map described in this study enabled us to correct the start site for five genes based on the current genome annotation (Table 3). These genes were annotated as phospholipid synthesis protein, ribosomal protein S2, aconitate hydratase 2, peptide chain release factor 2, and DUF411, a protein of unknown function. Based on evidence from RNA-Seq data, we performed a BLAST comparison with other phylogenetically similar proteins to confirm the new gene boundaries (Table 3).

Non-functional start codons and frameshifts
The comparison of the transcriptome map of the H. somni genome with predicted proteins revealed the presence of frameshift mutations. Four genes have non-functional start codons, resulting in a predicted protein, truncated at the amino terminus (based on BLAST comparison with homologous proteins in other species), although full length mRNA was present. An example is presented for the gene ''HSM_0748'', annotated as ''Alpha-Lfucosidase'' ( Figure S1). The other three genes, HSM_0603, HSM_1666 and HSM_1668, encode a hypothetical protein, type III restriction protein res subunit, and CTP synthase, respectively. Two genes with frameshifts causing protein truncations (based on BLAST comparison with homologous proteins) are HSM_1385 (beta-hydroxyacyl dehydratase, FabA) and HSM_1744 (alcohol dehydrogenase zinc-binding domain protein). The transcriptome map revealed a full length mRNA for these two genes that code for truncated proteins.

Gene expression and operon structures
Our transcriptome map of H. somni identified expression from 1636 (approximately 80%) of the predicted genes. The expressed genes were distributed evenly across all TIGRFAM functional categories (Table S1). The transcriptome map allowed identification of operon structures at a genome scale, critical for identifying co-expressed genes and for understanding coordinated regulation of the bacterial transcriptome. We identified co-expression for 452 pairs (total 730 genes) of H. somni genes ( Table S2) that were transcribed together and constituted a minimal operon. By joining consecutive overlapping pairs of co-expressed genes, we identified 278 distinct transcription units (Table S3).
We compared our experimentally identified co-expressed genes with computationally predicted operons. The overlap between computational prediction of co-expressed genes using DOOR [24] and this study was 86% (394 gene pairs) (Table S4). Thus, our dataset validates expression of 394 computational gene-pair predictions. We identified 59 new gene pairs that are co-expressed and were not predicted by DOOR, which could be part of unidentified, new operon structures. For example, further in-depth analysis indicated a new operon consisting of three genes: HSM1354, HSM1355 and HSM1356, annotated as ribosomal protein L20, ribosomal protein L35, and translation initiation factor IF-3 respectively, which were not predicted computationally ( Figure 5). The orthologs of these genes are well known to form a functional operon of ribosomal proteins (IF3-L35-L20) in Escherichia coli [25].

Discussion
In this study using RNA-Seq we describe the whole genome transcriptome profile of H. somni 2336, a bovine respiratory disease pathogen. The single nucleotide resolution map helped uncover the  structure and complexity of this pathogen's transcriptome and led to the identification of novel, small RNAs and protein coding genes as well as gene co-expression. Prokaryotic genome annotation is performed often using computational gene prediction programs [8,9]. However, these prediction algorithms are not able to identify the non-coding sRNAs, antisense transcripts, and other small proteins. To overcome the shortcomings of computational genome structural annotation, various experimental methods are used for identification of novel expressed elements [13,14,15,16,17,18,26,27,28]. Deep transcriptome sequencing (RNA-Seq) has emerged recently as a method that enables the study of RNA-based structural and regulatory regions at the genome scale. RNA-Seq technology has many advantages compared with existing array based methods for transcriptome analysis. In particular, RNA-Seq does not require     probes, so the process is free from probe design issues or bias from hybridization issues. Also, the transcriptome coverage from RNA-Seq is very high [29,30]. RNA-Seq was demonstrated to be effective for the discovery of bacterial non-coding RNAs, accurate operon definition, and correction of gene annotation [27,31,32]. Therefore, in the current study, we used RNA-Seq for profiling H. somni 2336 transcriptome.
Mapping of RNA-Seq reads onto the H. somni genome sequence resulted in more than 94% coverage with at least one read per base. This observation is consistent with the reported 94% genome expression in Bacillus anthracis, 89.5% in Sulfolobus solfataricus, and 95% in Burkholderia cenocepacia, studied under one or more experimental growth conditions using RNA-Seq [32,33,34]. These results indicate that most of the bacterial genome sequence is expressed at some basal level. To identify significantly expressed regions above this baseline, we used two alternative methods (discussed in Results section) to estimate the background expression. Both methods yielded similar results (6-7 reads per base). We selected the higher stringency cutoff of 7 reads per base to minimize the number of false positives.
We identified a total of 95 sRNAs in the H. somni genome. Twelve of these were predicted by Rfam [23] and are similar to  conserved sRNA (e.g., 6S, tmRNA, FMN) in other bacterial species, which helps validate our approach. The 83 novel H. somni sRNAs may have housekeeping function, regulatory activity, or participate in virulence as described in other pathogenic bacteria [19,35,36]. The identified sRNAs did not show any location specific bias across the genome. Similarly, genes known to be associated with virulence are known to be scattered across bacterial genomes [37,38]. However, the tendency to form clusters was observed with sRNAs, which could indicate that functionally related sRNAs tend to be located in close proximity. The RNA-Seq based transcriptome map of H. somni identified 38 novel protein coding genes that were missed by the initial annotation. The average length of the proteins coded by these genes exceeds 60 amino acids, suggesting that length based cutoff was not the main reason that these genes were missed by computational gene prediction programs. The novel protein coding genes identified in the current study could serve as a training set to improve gene prediction algorithms.
The transcriptome map helped to identify incorrect annotation of start codons in the genome. Transcriptional mapping does not provide direct evidence of translational start sites. However, location of identified transcriptional start sites suggest that the annotated start codons are incorrect, an observation that is confirmed by BLAST comparisons against homologous genes in other bacterial species. Transcriptional mapping revealed genes where the 59 untranslated sequence extended well beyond the translational start. BLAST comparisons indicated that these genes have either nonsense or missense base changes relative to homologous genes in other bacterial species, causing apparent ''truncated'' proteins compared with those in other species. Further work is needed to determine whether these 59 untranslated regions serve regulatory functions or they are vestigial.
RNA-Seq data enabled us to determine operon structures at a genome scale, and it allowed identification of some operons not predicted by the computational operon prediction method. Operon structures that include genes not expressed under the experimental growth condition used in the current study, could not be identified. Our results support the notion that using a combination of experimental operon identification by RNA-Seq and computational prediction can improve operon identification in bacterial genomes [39].
For the first time, we report the RNA-Seq based transcriptome map of H. somni 2336 and describe novel expressed regions in the genome. Whereas the results are interesting, we are aware of the limitations of the study. Because the RNA-Seq protocol was not strand specific, we could not determine the strand specificity of expressed novel transcripts. Therefore, Table 1 lacks information about sRNA orientation in the genome. Because strand specific information was missing, we could not describe antisense expression in the genome. For protein coding genes, we derived strand specificity based on alignment of the BLAST hit. Despite this shortcoming, we identified novel expressed regions and transcriptional patterns across the whole genome at a high coverage, which is not possible by other transcriptome analysis methods.
Overall, this study describes RNA-Seq based transcriptome map of H. somni for identification of functional elements in a pathogen of importance to agriculture. Our genome-wide survey predicts numerous, novel, expressed regions that need biological characterization for understanding disease pathogenesis. Description of all functional elements in the H. somni system is a prerequisite for conducting holistic systems approaches to understand the complex pathogenesis of bovine respiratory disease.

RNA isolation and sequencing
We propagated H. somni 2336 on three TSA-blood plates (with 5% sheep red blood cells) for 16 hr or until a fresh lawn of cells was visible. IBC approval was not required for acquiring the plates as they were purchased through a commercial vendor: Fisher Scientific (Pittsburgh, PA), and manufactured by Becton Dickinson Diagnostic Systems, (Franklin Lakes, NJ). We washed the plates with brain heart infusion (BHI) broth, adjusted the culture to an OD620 nm = 0.8, and supplemented with RNAprotect reagent. The cells were harvested by centrifugation and stored at 280uC. We extracted total RNA using the RNeasy mini kit (Qiagen, Valencia, CA) following the manufacturer's protocol. Total RNA was treated with RNase-free DNAse (Invitrogen, Carlsbad, CA). Using Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA), we determined the RNA integrity number (RIN) of total RNA to be greater than 8. MICROBExpress TM Kit (Ambion, TX, USA), which specifically removes rRNAs, was used for mRNA enrichment. Small RNAs (i.e., tRNA and 5S rRNA) are not removed with this enrichment step (confirmed by Bioanalyzer).
We used 100 ng enriched mRNA with Illumina mRNA-Seq sample preparation kit (Illumina, San Diego, CA) for library construction following the manufacturer's protocols. Briefly, mRNA was fragmented chemically by divalent zinc cations and randomly primed for cDNA synthesis. After ligating paired-end sequence adaptors to cDNA, we isolated fragments of approximately 200 bp by gel electrophoresis and amplified. We

Mapping and analysis of Illumina reads
We checked all Illumina reads for quality, and removed sequence reads containing ''Ns''. Custom perl script was written to convert Illumina reads into fastq format. The script ''fq_all2std.pl'' from MAQ [40] converted fastq format to Sanger fastq format. Reads in sanger fastq format, were mapped onto the Histophilus somni 2336 genome sequence (GenBank Accession number. CP000947) using the alignment tool Bowtie [41], allowing for a maximum of two mismatches. The reads that mapped to more than one location were discarded. We used Samtools [42] to convert data into SAM/BAM format, and to generate alignment results in a pileup format. Pileup format provides the signal map file and has per-base format coverage. Custom perl scripts were written to calculate the background expression. Processed data was deposited in GEO with the accession number GSE29578.

Analysis of intergenic regions of H. somni genome
We used in-house perl scripts to extract novel expressed intergenic regions to identify novel small RNAs, riboswitches, and putative novel proteins. sRNA ,70 bp in length were discarded to minimize the number of false positives. For each novel expressed region, BLAST sequence searches were performed against the non-redundant protein database at NCBI to identify potential protein coding regions. Intergenic regions within predicted operons [24] represent expressed regions and can be mis-classified as sRNAs. Therefore, these regions were excluded.
We analyzed BLAST results manually, to identify novel protein coding regions and start codon corrections. If no protein coding region was found in the intergenic expressed regions, the presence of a promoter or a rho-independent terminator allowed us to classify the regions as sRNA. Bacterial promoter sequences were predicted by Neural Network Promoter Prediction program (http://www.fruitfly.org/seq_tools/promoter.html) [43]. Rho-independent transcription terminators were identified using the program TransTermHP [44]. For functional annotation, all identified identified sRNA sequences were searched against the Rfam database [23]. sRNA sequence conservation among other genomes was determined by blastn searches against nonredundant nucleotide database at NCBI. We mapped sRNAs, along with additional features, onto genome browsers like IGV [45] and Artemis [46] for further visualization, manual analysis, and interpretation.

Analysis of annotated regions of H. somni genome
Gene expression: expressed reads with coverage above background were mapped onto the annotated genes of H. somni 2336. Genes that had a significantly higher proportion of their length (.60%) covered by expressed reads were considered to be expressed.
Operons: RNA-Seq can identify and predict operon structures in bacteria. We considered two or more consecutive genes to be part of an operon, if they fulfilled the following criteria: (a) they are expressed; (b) they are transcribed in the same direction; and (c) the intergenic region between the genes is expressed. Overlapping pairs of such genes were joined together to identify large operon structures. We used in-house perl scripts for the analyses. Figure S1 Mutated start codon. The Figure shows that the predicted protein coding frame (MH_748) is shorter at the 59 end than the corresponding transcript level shown by the RNA-Seq coverage. Although the transcript is longer near 59 end, no start codon is found in that region which might be a result of the mutation in that region of the start codon. This was further validated using homology searches of the full length transcript which shows high homology (95% Identity and .95% coverage) to a alpha-L-fucosidase protein from M. haemolytica PHL213. (TIF)