StreptoBase: An Oral Streptococcus mitis Group Genomic Resource and Analysis Platform

The oral streptococci are spherical Gram-positive bacteria of the phylum Firmicutes. They are among the most common causative agents of bacterial infective endocarditis (IE) and are also important agents of septicaemia in neutropenic patients. The Streptococcus mitis group comprises 13 species, including some of the most common human oral colonizers such as S. mitis, S. oralis, S. sanguinis and S. gordonii, as well as species such as S. tigurinus, S. oligofermentans and S. australis that have only recently been classified and are poorly understood at present. We present StreptoBase, a specialized free resource focusing on the genomic analysis of oral species from the mitis group. It currently hosts 104 S. mitis group genomes, including 27 novel mitis group strains that we sequenced using the high-throughput Illumina HiSeq platform, and provides a comprehensive set of genome sequences for analysis, particularly comparative analysis and visualization of both cross-species and cross-strain characteristics of S. mitis group bacteria. StreptoBase incorporates in-house designed bioinformatics web tools, such as the Pairwise Genome Comparison (PGC) tool and the Pathogenomic Profiling Tool (PathoProT), which facilitate comparative pathogenomic analysis of Streptococcus strains. Examples are provided to demonstrate how StreptoBase can be used to compare the genome structure of different S. mitis group bacteria and the putative virulence gene profiles of multiple streptococcal strains. In conclusion, StreptoBase offers access to a range of streptococcal genomic resources and analysis tools and will be a valuable platform to accelerate research on streptococci. Database URL: http://streptococcus.um.edu.my.


Materials and Methods
Datasets
Seventy-seven genome sequences of S. mitis group bacteria were downloaded from the public NCBI database. We also included 27 novel S. mitis group genomes generated in our laboratory in a sequencing project. All 27 strains were clinical isolates from individuals with dental plaque or infective endocarditis from different geographical locations (Table 1). Of these, 14 were isolated in the United Kingdom, 10 in the United States, 2 in Australia and 1 in Denmark (Table 1). S. sanguinis NCTC 7863 is also known as ATCC 10556, while S. gordonii Blackburn and Channon are designated NCTC 10231 and NCTC 7869, respectively. Additionally, several of these S. mitis group strains, including JPIIBBV4, JPIIBV3, JPIBVI, LRIIBV4, DGIIBVI and DOBICBV2, have been described previously [9]. The isolation of strain M99 was described in a study of mechanisms of platelet aggregation by oral streptococci [10]. The other two oral isolates, SK120 and SK184, were described by Mogens Kilian and colleagues in their 1989 taxonomic study of 'Viridans' streptococci [11].
Briefly, the 27 S. mitis group genomes were sequenced on the Illumina HiSeq 2000 next-generation sequencing platform. Reads were pre-processed by quality trimming (Phred score Q20) and assembled using CLC Genomics Workbench V6.5 (CLC BIO Inc., Aarhus, Denmark). In general, these assemblies showed high N50 values and low contig numbers, indicating high-quality genome assemblies. The assembled mitis group genomes have GC contents of 35% to 45% and an average genome size of approximately 2 Mb (Table 2).
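For readers unfamiliar with the N50 statistic used above as an assembly quality indicator, the following minimal Python sketch shows how it is commonly computed from a list of contig lengths. It is an illustration only; the function name `n50` is ours, and it is not the code used by CLC Genomics Workbench or by the authors.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L of the contig at which, going from the longest
    contig downwards, the running total first reaches at least half of
    the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (not real assembly data).
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 together with a lower contig count generally indicates a more contiguous assembly, which is the sense in which the two measures are used in the text.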
To facilitate comparative analysis across different S. mitis group genomes, consistency in annotation is important. Therefore, we annotated all 104 genome sequences using the Rapid Annotation using Subsystem Technology (RAST) pipeline, a well-established and fully open web-based engine that supports annotation of both complete and draft genomes [12]. The RAST pipeline identifies distinct genome components, including protein-coding genes, ribosomal RNAs (rRNAs), transfer RNAs (tRNAs) and pseudogenes, and predicts gene function. RAST annotation works by mapping genes to their corresponding subsystems and metabolic reconstructions. Moreover, it assigns protein functions according to their relatedness to the subsystems of the FIGfams database. Using the RAST pipeline, we predicted 213,268 coding sequences (CDSs), 5,140 RNAs and 4,542 tRNAs across all 104 mitis group genomes.
To systematically predict the subcellular localization of each RAST-predicted gene product, we used PSORTb version 3.0 [13]. PSORTb is an efficient, open-source tool that supports high-precision, proteome-scale prediction with refined localization sub-categories. The predicted subcellular localization sites were calculated from the values of feature variables that reflect sequence characteristics, and each protein was then assigned to a candidate site according to its estimated likelihood. Besides the subcellular localization information, we also ran an in-house Perl script to estimate the GC content, hydrophobicity and molecular weight of each gene or protein.
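The in-house Perl script is not reproduced here. As a hedged illustration of the kind of per-sequence calculation involved, the Python sketch below computes GC content for a nucleotide sequence and a mean hydropathy (GRAVY) score for a protein using the Kyte-Doolittle scale. The choice of the Kyte-Doolittle scale and the function names are our assumptions; the original script may use a different hydrophobicity measure. Molecular weight, which would be a similar per-residue sum, is omitted for brevity.

```python
# Kyte-Doolittle hydropathy values per amino acid (assumed scale; the
# original in-house script may use a different one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gc_content(dna):
    """Return the percentage of G and C bases in a nucleotide sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def gravy(protein):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence.

    Characters outside the 20 standard amino acids (e.g. '*' or 'X')
    are ignored.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("no standard amino acids in sequence")
    return sum(values) / len(values)


# Toy sequences (not taken from StreptoBase).
print(gc_content("ATGC"))
print(gravy("AILV"))
```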

Database structure, composition and implementation
StreptoBase was designed to provide a wide range of useful information and functionalities (Fig 1). For instance, StreptoBase provides users with background information about S. mitis group species. The homepage contains a summary box with the genome information stored in the database, such as the number of species, strains, CDSs, RNAs and tRNAs (Table 4), which is useful before users proceed to further genome details and downstream analyses. Furthermore, we have compiled information on S. mitis group species from various sources, for example news and conferences, blogs and recently published papers, which is available on the StreptoBase homepage. By clicking on the "Browse" menu, users will see the list of 11 S. mitis group species along with their respective numbers of draft and complete genomes, while each "View Strains" button lets users view all available Streptococcus genomes of a particular species. On the "Browse Strains" page, a summarized genome description that includes genome size (Mbp), GC content (%) and a list of contigs, genes and rRNAs of each strain is tabulated. The "Details" button gives access to further data on a particular strain in the "Browse ORF" page, such as the complete list of ORFs in the genome, their corresponding functions, and the start and stop chromosomal positions of each ORF/gene. To display all information about an ORF or gene, users can click on the "Details" button associated with the ORF. This opens the "ORF Detail" page, which displays the gene type, start and stop positions, nucleotide length, amino acid sequence, functional classification, SEED subsystem (if available), direction of transcription (strand), subcellular localization, hydrophobicity (pH) and molecular weight (Da).
Streptococcus Genome Browser (SGB). StreptoBase is equipped with a real-time, interactive Streptococcus Genome Browser (SGB), which was customised from JBrowse [14], a well-established, fast, JavaScript-based genome browser. SGB supports navigation of genome annotations and visualization of the location of genes and their flanking genomic regions in a selected Streptococcus strain. Users can browse genes or genomic regions with smooth, rapid, animated transitions; SGB avoids discontinuous page transitions and provides efficient panning and zooming of a specific genomic region in each Streptococcus genome. Furthermore, users can turn the DNA, RNA and CDS tracks on or off during navigation, giving them flexibility in customizing what is shown in the SGB viewer. We have also implemented a "Search" feature in the genome browser page, which is not provided by JBrowse and allows users to quickly search for a gene by keyword or ORF ID.
Real-time keyword search engine. Because StreptoBase hosts an extensive number of genes and their annotations, and this information will grow periodically, a real-time keyword search engine is provided. StreptoBase runs on an Apache web server under the Linux operating system, which manages the comprehensive Streptococcus genomic data housed in the database. The PHP framework CodeIgniter version 2.1.3 was used to implement a model-view-controller (MVC) architecture, which separates application data, presentation, and background logic and processing into three distinct modules. With this design, all Streptococcus-related source code and biological data are organized clearly, which facilitates the future addition of new Streptococcus genomes to the existing database system. For biological data storage and management, we used MySQL version 14.12 to store the extensive Streptococcus genome information in a well-designed database schema. The back-end processes of StreptoBase are run by Perl, Python and R scripts, which support our integrated bioinformatics tools.
Additionally, users are able to download all the Streptococcus genome sequences, ORF annotation details in table format, ORF sequences, RNAs and CDSs as well as nucleotide and amino acid sequences via the "Download" menu.

Database features and incorporated bioinformatics tools
The S. mitis group species are important colonizers of the oral cavity, and are occasionally associated with serious infections [15]. In addition, these organisms have recently been suggested to play important roles in the pathogenesis of influenza [8]. Therefore, the genomic study of diverse S. mitis group bacteria is essential in order to understand how these microorganisms move from a commensal lifestyle in the mouth to pathogenesis. However, no specialized genome database exists for comparative genomics of the wide array of S. mitis group genomes. While most biological genome databases focus only on genome content and genetic variation, we identified a need for functional bioinformatics tools to investigate virulence determinants within genomes through comparative pathogenomics, as well as to compare genome content and genetic variation within the S. mitis group bacteria.
Pairwise Genome Comparison (PGC) tool. We designed and customised a web-based PGC tool for S. mitis group bacteria that performs pairwise comparisons between two user-selected Streptococcus genomes. The PGC tool lists the Streptococcus genomes available in StreptoBase, allowing users to choose two of them for cross-strain or cross-species comparison. Alternatively, users can upload their own genome sequences, either nucleotide or protein, and compare them with the Streptococcus genomes in StreptoBase.
Briefly, the PGC pipeline is built on NUCmer, which is designed to align whole-genome sequences, and Circos, a well-established tool for genome visualisation. Once users submit a job to our server, PGC calls NUCmer to align the user-selected genomes. In-house scripts then process the genome alignment output and generate the input files that are passed to Circos, which draws a circular ideogram layout of the alignments. Unlike the conventional linear display of alignments, the circular layout shows the relationship between pairs of positions, with karyotypes and links encoding the position, size and orientation of the related genomic elements.
Three user-defined parameters are provided in the PGC web interface: minimum percent identity (%), merge threshold (bp) and link threshold (bp). The minimum percent identity cut-off defines a homologous region (represented by links/ribbons in the Circos plot) between the two compared genomes. The merge threshold merges two links/ribbons that lie within the user-defined distance of each other, and the link threshold removes any mapped/homologous regions shorter than the user-defined cut-off. A histogram track in the outer ring of the circular plot indicates the percentage of mapped regions, allowing users to quickly identify potential indels (indicated by white gaps) and mapped regions (indicated by green charts) between the two aligned genomes. The PGC pipeline is implemented in Perl scripts. It produces two types of output: the NUCmer alignment results and a high-quality Circos plot (SVG format). Users can freely download these results from the PGC result page for publication or further analyses.
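The effect of the three PGC parameters can be sketched in a few lines of Python. This is an illustrative sketch, not the PGC Perl code: the `Link` record, the function name `filter_and_merge`, the order of the three steps, and the rule that two links are merged only when the gap is within the threshold on both genomes are all assumptions made for the example. The default values mirror the settings used in the case study in the text (80% identity, 2000 bp merge threshold, 1000 bp link threshold).

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """One aligned region between a reference and a query genome (hypothetical record)."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    identity: float  # percent identity of the alignment


def filter_and_merge(links: List[Link],
                     min_identity: float = 80.0,
                     merge_threshold: int = 2000,
                     link_threshold: int = 1000) -> List[Link]:
    # Step 1: keep only alignments at or above the minimum percent identity.
    kept = sorted((l for l in links if l.identity >= min_identity),
                  key=lambda l: (l.ref_start, l.qry_start))

    # Step 2: merge neighbouring links whose gap is within the merge
    # threshold on both genomes (forward orientation assumed).
    merged: List[Link] = []
    for link in kept:
        if merged:
            prev = merged[-1]
            ref_gap = link.ref_start - prev.ref_end
            qry_gap = link.qry_start - prev.qry_end
            if 0 <= ref_gap <= merge_threshold and 0 <= qry_gap <= merge_threshold:
                merged[-1] = Link(prev.ref_start, link.ref_end,
                                  prev.qry_start, link.qry_end,
                                  min(prev.identity, link.identity))
                continue
        merged.append(link)

    # Step 3: drop links shorter than the link threshold on the reference.
    return [l for l in merged if l.ref_end - l.ref_start + 1 >= link_threshold]


# Toy alignments (coordinates are invented for illustration).
toy_links = [
    Link(1, 900, 1, 900, 95.0),
    Link(1500, 2600, 1500, 2600, 90.0),
    Link(10000, 10400, 9000, 9400, 99.0),
    Link(20000, 25000, 20000, 25000, 70.0),
]
print(filter_and_merge(toy_links))
```

In this toy input the low-identity alignment is discarded, the first two alignments are merged into one ribbon, and the short 401 bp alignment is removed, leaving a single link to draw.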
The existing Microbial Genome Comparison (MGC) tool uses an in silico genome subtraction method to identify genetic elements specific to a group of strains [16]. Whereas the PGC tool uses genome files and NUCmer to perform pairwise genome alignment, the MGC tool uses in silico fragmented genome sequences and performs BLASTN on groups of queries. The VISTA Browser, which is well known for its biological applications, provides pre-computed pairwise and multiple whole-genome alignments using both global and local alignments [17]. In contrast to the circular plots and histograms generated by the PGC tool, the VISTA Browser displays alignment results as a VISTA track in graph format to show conserved regions. Additionally, the open-source Java-based Artemis Comparison Tool (ACT) requires users to first generate a comparison file that identifies homologous regions between an assembly and a reference genome, using programs such as BLASTN, TBLASTX or MUMmer, and then load it into ACT [18]. The comparative ACT visualization is performed using Artemis components. By contrast, our PGC tool offers a single workflow that performs the pairwise genome alignment and immediately displays the resulting Circos plot.
To demonstrate the utility of PGC, we compared S. mitis B6 (complete genome) and S. mitis 17/34 (draft genome) as a case study (Fig 2). The parameters were set to a minimum percent identity of 80%, the default link threshold of 1000 bp and a merge threshold of 2000 bp. S. mitis B6 was isolated in Germany, whereas S. mitis 17/34 was isolated from the urethra of a Russian patient with urethritis. Based on the generated PGC plot, the two S. mitis genomes share high overall similarity, as most of their genomic regions could be aligned (Fig 2). One feature of the PGC plot is that putative indels can be identified quickly from the gaps in the plot, supported by the information displayed in the histogram track. For instance, two of the gaps (Fig 2) indicate genomic regions that are absent from the S. mitis 17/34 genome. The outer circular bar of the plot shows the genome size scale, which is approximately 2 Mb for both S. mitis genomes. Based on the gap observed in Fig 2 (indel 'A'), the gene loss occurred close to position 400,000 bp.
Next, we examined the genes located at indel 'A' in S. mitis B6 (Fig 2) by visualising this region using SGB. We identified many phage-related genes associated with this region. To examine the region further, we used PHAST (PHAge Search Tool) to identify and annotate prophage sequences within the S. mitis B6 genome (Zhou et al., 2011). A 56 kb intact prophage with 82 CDSs and a GC content of 39.9% was detected from 390,924 bp to 446,969 bp. Since S. mitis B6 is a complete genome, these base-pair positions can be mapped directly onto our B6 annotation file. According to the PHAST results, this intact prophage of S. mitis B6 comprises phage-associated genes including phage integrase protein, phage CI-like repressor, phage binding protein, phage portal protein, SPP1 family phage head morphogenesis protein and phage capsid proteins. Therefore, we suggest that S. mitis B6 might have recently acquired this intact prophage. The graphical display of the intact prophage with its different types of phage-related genes is shown in Fig 3.
At indel 'B' on the PGC plot in Fig 2, we found a 24 kb incomplete prophage with a GC content of 39.17% located from 1,356,040 bp to 1,380,128 bp. Interestingly, this incomplete prophage in the S. mitis B6 genome contains a complete atp operon regulated by the CcpA protein. The genes of the atp operon are shown in Table 5. These genes encode ATP synthases, which oral streptococci commonly possess for adaptation to the acidic host environment by creating a more alkaline internal environment. This protective mechanism is especially critical for acid-sensitive streptococcal glycolytic enzymes [19]. Hence, acquisition of this atp operon, carried by the incomplete prophage, via horizontal gene transfer may have supported the commensal status of S. mitis B6 by helping to maintain the optimal pH for the bioenergetic processes of its cells.
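Because S. mitis B6 is a complete genome, relating the prophage coordinates reported in the text to annotated ORFs amounts to a simple interval-overlap query. The Python sketch below illustrates that step under stated assumptions: the ORF identifiers and coordinates in `toy_orfs` are invented placeholders, not entries from the B6 annotation, and the function name `genes_in_region` is ours.

```python
def genes_in_region(orfs, region_start, region_end):
    """Return ORFs that overlap the closed interval [region_start, region_end].

    `orfs` is a list of (orf_id, start, stop) tuples with start <= stop,
    in the same coordinate system as the region.
    """
    return [orf for orf in orfs
            if orf[1] <= region_end and orf[2] >= region_start]


# Placeholder ORFs (hypothetical IDs and coordinates).
toy_orfs = [
    ("orf_0001", 380000, 389000),   # upstream of the region
    ("orf_0002", 390500, 392000),   # overlaps the left boundary
    ("orf_0003", 400000, 401500),   # fully inside
    ("orf_0004", 450000, 451000),   # downstream of the region
]

# Coordinates of the intact prophage reported in the text.
hits = genes_in_region(toy_orfs, 390924, 446969)
print([orf_id for orf_id, _, _ in hits])
```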
Pathogenomics Profiling (PathoProT) tool. PathoProT was designed to predict virulence genes by comparing Streptococcus amino acid sequences against the Virulence Factors Database (VFDB) [20]. PathoProT uses the stand-alone BLAST tools downloaded from the NCBI website. VFDB (Version 2012) currently hosts a set of 19,775 experimentally verified virulence genes originating from a wide range of bacterial species, providing a useful resource for sequence homology searches. Through the online web form, users can select a list of Streptococcus strains for comparative analysis and set the cut-offs for the BLAST search, namely sequence identity and sequence completeness. By default, the PathoProT pipeline uses 50% sequence identity and 50% sequence completeness to search for and identify orthologous virulence genes across the selected Streptococcus genomes. However, users can apply their own cut-offs to achieve the stringency appropriate for their analyses.
Briefly, the PathoProT pipeline was implemented mainly in Perl. Each RAST-predicted protein (query sequence) in the user-selected genomes is searched against VFDB, and in-house Perl scripts process the BLAST outputs and identify putative virulence genes based on the user-defined parameters. The filtered BLAST results are consolidated into a matrix recording the presence or absence of virulence genes (rows) in each Streptococcus strain (columns). Finally, PathoProT passes this matrix to our in-house R scripts for hierarchical clustering (complete-linkage algorithm) and generation of a heat map for visualisation. The Streptococcus strains are sorted by their virulence gene profiles (Fig 4) and dendrograms are drawn, so that users can gauge the relationships among closely related S. mitis group species/strains and see which virulence genes form noticeable clusters. This comparative pathogenomics pipeline therefore provides insight into the virulence gene profiles of different Streptococcus species. To our knowledge, no existing bioinformatics tool offers the same functionality as PathoProT, which is to predict and compare virulence genes across bacterial genomes of different species.
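The filtering and matrix-building steps of PathoProT can be illustrated with a short Python sketch. It is not the PathoProT Perl code. The hit records are hypothetical dictionaries rather than real BLAST output, the field names are ours, and defining "completeness" as the aligned length divided by the length of the VFDB subject sequence is an assumption. The hierarchical clustering and heat map, which the pipeline performs in R, are not shown.

```python
from collections import defaultdict


def passes_cutoffs(hit, min_identity=50.0, min_completeness=50.0):
    """Apply identity and completeness cut-offs to one hit (both in percent)."""
    completeness = 100.0 * hit["align_len"] / hit["subject_len"]
    return hit["identity"] >= min_identity and completeness >= min_completeness


def presence_absence_matrix(hits, strains, min_identity=50.0, min_completeness=50.0):
    """Build a gene-by-strain 0/1 matrix from filtered hits.

    Rows are virulence genes (sorted by name) and columns follow the
    order of `strains`.
    """
    carriers = defaultdict(set)
    for hit in hits:
        if passes_cutoffs(hit, min_identity, min_completeness):
            carriers[hit["vf_gene"]].add(hit["strain"])
    genes = sorted(carriers)
    matrix = [[1 if strain in carriers[gene] else 0 for strain in strains]
              for gene in genes]
    return genes, matrix


# Hypothetical hits: strain names and numbers are placeholders.
toy_hits = [
    {"strain": "strain_A", "vf_gene": "geneX", "identity": 92.0, "align_len": 290, "subject_len": 300},
    {"strain": "strain_B", "vf_gene": "geneX", "identity": 45.0, "align_len": 295, "subject_len": 300},
    {"strain": "strain_B", "vf_gene": "geneY", "identity": 78.0, "align_len": 200, "subject_len": 250},
    {"strain": "strain_A", "vf_gene": "geneY", "identity": 80.0, "align_len": 100, "subject_len": 250},
]

genes, matrix = presence_absence_matrix(toy_hits, ["strain_A", "strain_B"])
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Raising or lowering the two cut-offs changes which hits count as "present", which is how users tune the stringency of the resulting profile.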
To demonstrate the functionalities of PathoProT, we present a comparative pathogenomics study among the S. mitis group bacteria using a threshold of 50% for both sequence identity and coverage to give an insight into their virulence gene profiles. Based on the generated PathoProT heat map, a number of putative virulence genes appear to be
Table 5. The ATP synthases within the atp operon of S. mitis B6.
Another conserved virulence gene, slrA, encodes streptococcal lipoprotein rotamase A, which is one of the major surface proteins expressed by S. pneumoniae. This protein is an important cyclophilin which modulates the biological function of virulence proteins during the first stage of pneumococcal infection [29]. It is likely that the slrA gene promotes invasion of host cells and facilitates colonization and adherence in S. mitis group bacteria [30,31]. Furthermore, it has been reported that deficiency in slrA reduces bacterial virulence due to its impact on adherence to and internalization by epithelial and endothelial cells [29]. Likewise, the conserved lmb gene encodes a laminin-binding protein which was first identified in Streptococcus agalactiae [32]. Virtually identical adhesins were later discovered in both Streptococcus suis [33] and Streptococcus pyogenes [34,35]. The lmb adhesins have been proposed to contribute to bacterial pathogenesis via invasion of damaged epithelium [36]. Overall, many surface lipoproteins and adhesins that are important in virulence and pathogenic infections are highly conserved across the S. mitis group bacteria.
According to the phylogenetic tree generated on the left side of the PathoProT heat map (Fig 5), the mitis group can be clearly divided into two clades: S Clade (S. sanguinis, S. gordonii, S. parasanguinis, S. australis, S. cristatus and S. oligofermentans) and M Clade (S. mitis, S. infantis, S. tigurinus, S. oralis and S. peroris). This grouping indicates that the species within M Clade are closely related to one another, as are the species within S Clade. Interestingly, we found that the rgp genes can be used to differentiate the two clades in the heat map: these marker genes are present in all S Clade species but absent from all M Clade species.
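The idea of a clade marker gene, present in every member of one clade and absent from every member of the other, can be expressed as a simple set test on a presence/absence table. The Python sketch below is illustrative only: the gene and strain names are placeholders, the data are invented, and the function name `clade_markers` is ours.

```python
def clade_markers(carriers_by_gene, clade_a, clade_b):
    """Return genes carried by all members of clade_a and by no member of clade_b.

    `carriers_by_gene` maps each gene name to the set of strains that
    carry it.
    """
    a, b = set(clade_a), set(clade_b)
    return sorted(gene for gene, carriers in carriers_by_gene.items()
                  if a <= carriers and not (b & carriers))


# Invented presence/absence data with placeholder names.
toy_carriers = {
    "gene1": {"S1", "S2", "S3"},              # all of clade S, none of clade M
    "gene2": {"S1", "S2", "S3", "M1", "M2"},  # conserved in both clades
    "gene3": {"S1", "M1"},                    # scattered
    "gene4": {"M1", "M2"},                    # all of clade M only
}

print(clade_markers(toy_carriers, ["S1", "S2", "S3"], ["M1", "M2"]))
print(clade_markers(toy_carriers, ["M1", "M2"], ["S1", "S2", "S3"]))
```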
The rgp gene cluster (B, C, D, F and G) is responsible for the synthesis of rhamnose-glucose polysaccharide (RGP) in Streptococcus mutans. Notably, similar genes have been found to be involved in rhamnan synthesis in Escherichia coli [37]. In fact, it has been suggested that E. coli and S. mutans share a common pathway for rhamnan synthesis based on their similarities in RGP synthesis [37]. The function of rgpB is to transfer the second rhamnose residue to a rhamnose residue on N-acetylglucosamine linked to the lipid carrier; rgpF then catalyzes the transfer of the third rhamnose residue to the second rhamnose residue of the resultant glycolipid carrier. rgpB and rgpF presumably work alternately in the elongation of the rhamnan chain. Homologous rhamnosyl transferases of rgpB and rgpF have been detected in Streptococcus thermophilus (STER1436) and Streptococcus gordonii (SGO1022). On the other hand, the rgpC and rgpD genes encode putative ABC transporters specific for RGP (homologous to STER1434 in S. thermophilus and SGO1024 in S. gordonii), which play a role in polysaccharide export [37]. The rgpG gene (S. gordonii SGO1723 homolog) initiates RGP synthesis by transferring N-acetylglucosamine-1-phosphate to a lipid carrier [38].
The rgp genes are also implicated in pathogenesis in several Streptococcus species. For instance, rgp plays an essential role in bacterial virulence as well as in eliciting an inflammatory response in S. suis [39]. Induction of infective endocarditis by S. mutans has been reported to be triggered by rgp genes via nitric oxide release [40], platelet aggregation [41] and resistance to phagocytosis by human polymorphonuclear leukocytes [42]. Therefore, the S Clade species of the S. mitis group, which produce these rhamnose-rich polymers, might exhibit a different pattern of pathogenesis from M Clade species in order to establish greater virulence and increased survival in host cells. A recent study has identified the Sanguinis group of streptococci as a common causative agent of transient bacteremia, which can potentially lead to infective endocarditis. This group has also been reported in a few cases of virulent septicemic infection in neutropenic patients [43].
Sequence search tools. We have incorporated two types of BLAST engines, standard BLAST and VFDB BLAST, into StreptoBase to search for the Streptococcus strains closest to a query strain. These BLAST searches are based on the stand-alone BLAST tool [44] downloaded from NCBI. Both BLAST engines support three BLAST functions, namely BLASTN, BLASTP and BLASTX. Users can define the genome completeness (%) and genome identity (%) on the BLAST submission forms. These specialized BLAST tools let users perform similarity searches of their query sequences against Streptococcus genome sequences and gene sequences (standard BLAST) as well as against the virulence genes of VFDB (VFDB BLAST), which allows users to examine whether their genes of interest are potential virulence genes using a sequence homology approach.

Future work and conclusion
With advances in NGS technology, further Streptococcus species and strains will be sequenced, creating an urgent need to store, browse, retrieve and analyze vast amounts of genome data and to develop specialized tools for comparative analyses of these genomes.
Here we have described and demonstrated the functionalities of StreptoBase, particularly our in-house designed bioinformatics pipelines for the analysis of Streptococcus genomic data.
This specialized biological database will be constantly updated in order to provide the latest genome updates and research developments associated with the Streptococcus genus, and to ensure the accuracy and usefulness of the S. mitis group species genome data and annotation. We anticipate that StreptoBase will serve as a useful resource and analysis platform particularly for comparative analyses of the S. mitis group genomes for research communities. We encourage other researchers or research groups to offer suggestions and share their annotations, opinions, and curated data with us at girg@um.edu.my.

Availability and system requirements
StreptoBase is available online at http://Streptococcus.um.edu.my. Users can download and visualize all sequences and annotations described in this paper on the StreptoBase website. Strains that have not already been deposited in the NCTC or ATCC culture collections are available on request from NSJ. This analysis platform is compatible with multiple types of browsers, including Internet Explorer 8.x or higher, Mozilla Firefox 10.x or higher, Safari 5.1 or higher, Chrome 18 or higher and any equivalent browser software. The web site is best viewed at a screen resolution of 1024 × 768 pixels or higher.