Analysis of the complete genome sequence of Nocardia seriolae UTF1, the causative agent of fish nocardiosis: The first reference genome sequence of the fish pathogenic Nocardia species

Nocardiosis caused by Nocardia seriolae is one of the major threats in the aquaculture of Seriola species (yellowtail, S. quinqueradiata; amberjack, S. dumerili; and kingfish, S. lalandi) in Japan. Here, we report the complete nucleotide genome sequence of N. seriolae UTF1, isolated from a cultured yellowtail. The genome is a circular chromosome of 8,121,733 bp with a G+C content of 68.1% that encodes 7,697 predicted proteins. Among the predicted genes of N. seriolae UTF1, we found orthologs of virulence factors of pathogenic mycobacteria and human clinical Nocardia isolates involved in host cell invasion, modulation of phagocyte function and survival inside macrophages. These virulence factor candidates provide an essential basis for future studies by the fish nocardiosis research community on the pathogenic mechanisms of N. seriolae at the molecular level. We also found many potential antibiotic resistance genes on the N. seriolae UTF1 chromosome. Comparative analysis with the four existing complete genomes, N. farcinica IFM 10152, N. brasiliensis HUJEG-1, N. cyriacigeorgica GUH-2 and N. nova SH22a, revealed that 2,745 orthologous genes were present in all five Nocardia genomes (core genes) and 1,982 genes were unique to N. seriolae UTF1. In particular, the N. seriolae UTF1 genome contains a greater number of mobile elements and genes of unknown function, which account for the differences in structure and gene content from the other Nocardia genomes. In addition, many of the N. seriolae UTF1-specific genes were assigned to the ABC transport system. Because of limited resources in ocean environments, these N. seriolae UTF1-specific ABC transporters might facilitate adaptation strategies essential for survival in the marine environment. Thus, the availability of the complete N. seriolae UTF1 genome sequence will provide a valuable resource for comparative genomic studies of N. seriolae isolates, as well as provide new insights into the ecological and functional diversity of the genus Nocardia.


Introduction
Members of the genus Nocardia are Gram-positive, non-motile and aerobic actinomycetes belonging to the family Nocardiaceae. This genus contains more than 90 recognized species, which are widely distributed in both aquatic and terrestrial habitats [1]. Many species of this genus are known as causative agents of nocardiosis in humans and a variety of animals, causing various clinical diseases and, in some cases, high mortality rates [2,3]. In aquatic environments, four species of Nocardia, N. asteroides, N. seriolae, N. salmonicida and N. crassostreae, have also been found in diseased aquatic animals [4].
In Japan, members of the genus Seriola, including yellowtail (S. quinqueradiata), amberjack (S. dumerili), and kingfish (S. lalandi), are the most produced and economically important aquaculture fish species. Nocardiosis, caused by N. seriolae [5] (initially reported as N. kampachi [6]), is one of the most serious economic threats in the Seriola aquaculture. N. seriolae also infects other fish species including both marine and freshwater fishes and is found in other Asian countries [7]. To date, only two antibiotics, sulfamonomethoxine and sulfisozole sodium, are licensed for treatment of N. seriolae infections in Japan [8,9]. Although these antibiotics are valuable for the control of nocardiosis, there are some concerns about the emergence of antibiotic-resistant strains and environmental impacts. Vaccination is thought to be another effective strategy for control of nocardiosis. However, the intracellular parasitic nature of N. seriolae makes development of vaccines for the disease difficult [10].
Complete genome sequences of pathogenic bacteria provide a powerful tool for understanding their biology, including mechanisms of bacterial pathogenicity and their drug-resistant properties, as well as for the development of new genetic and molecular approaches for disease control strategies [11]. So far, four Nocardia species have been fully sequenced, including three agents of human nocardiosis, N. farcinica IFM 10152 [12], N. brasiliensis HUJEG-1 [13] and N. cyriacigeorgica GUH-2 [14], and a rubber and gutta-percha-degrading strain, N. nova SH22a isolated from a root of Couma macrocarpa [15]. Although two draft sequences of N. seriolae isolates, ZJ0503 [16] and N-2927 [17] have been reported recently, these draft genome sequences consist of a large number of contigs, 319 contigs in 315 scaffolds for ZJ0503 [16] and 339 large contigs (>500bp) for N-2927 [17]. Therefore a complete genome sequence of N. seriolae is essential for a robust annotation, overall genome organization and comparative genomics of this species [18].
In this study, using Single Molecule, Real-Time (SMRT) DNA sequencing [19,20], the complete genome nucleotide sequence of N. seriolae UTF1, isolated from a yellowtail that succumbed to nocardiosis in Japan, was determined and annotated. We explored the virulence factors and antibiotic resistance gene candidates in the N. seriolae UTF1 genome. In addition, to investigate genomic diversity, the N. seriolae UTF1 genome sequence was compared with the four existing complete genomes of Nocardia. This is, to the best of our knowledge, the first report of the complete genome of a fish pathogenic Nocardia species. This genomic information will provide a reference genome data set of N. seriolae that could serve as a basis for understanding the ecological and functional diversity of the genus Nocardia in future comparative studies.

Materials and methods
Genome sequencing and assembly
N. seriolae UTF1 was originally isolated from a cultured yellowtail (Seriola quinqueradiata) in 2008 (Miyazaki Prefecture, Japan). This isolate was cultured in Brain Heart Infusion Broth (Difco, Sparks, MD, USA) at 25˚C for 5 days under constant shaking at 150 rpm. After treatment with lysozyme, genomic DNA was extracted using the Maxwell Cell DNA Purification Kit (Promega, Madison, WI, USA). The nucleotide sequence of N. seriolae UTF1 was determined on the Pacific Biosciences (PacBio) RS sequencing platform (Pacific Biosciences, Inc., CA, USA) at Tomy Digital Biology Co., Ltd (Tokyo, Japan). Briefly, genomic DNA (7 μg) was sheared using the g-TUBE (Covaris Inc., MA, USA) and a library was prepared using a DNA Template Prep Kit v2.0 (Pacific Biosciences) according to the manufacturer's instructions. The library was run in four Single Molecule, Real-Time (SMRT) cells on a PacBio RS sequencer (Pacific Biosciences) using P4C2 chemistry and a 120 minute data collection mode. The PacBio RS platform generated 227,796 sequence reads (mean read length: 4,756 bp, N50 read length: 7,364 bp) totaling 1,083,416,572 bp, providing 133-fold sequencing coverage of the genome (Table 1). De novo assembly of the sequence reads was performed using the SMRT Analysis v2.2.0 software package (Pacific Biosciences) and resulted in two contigs with lengths of 6,093,704 bp and 2,035,833 bp.
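As a rough illustration of how the coverage and N50 statistics quoted above relate to the raw numbers, the following Python sketch recomputes the fold coverage from the total read bases and the final chromosome length, and shows one common definition of N50. The function names are hypothetical and are not part of SMRT Analysis; the genome length used is the finished chromosome size reported in this paper.

```python
from typing import Iterable


def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases divided by genome length."""
    return total_bases / genome_size


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty input


# Values taken from the text above.
TOTAL_READ_BASES = 1_083_416_572
CHROMOSOME_LENGTH = 8_121_733
INITIAL_CONTIGS = [6_093_704, 2_035_833]

coverage = fold_coverage(TOTAL_READ_BASES, CHROMOSOME_LENGTH)
contig_n50 = n50(INITIAL_CONTIGS)
print(f"coverage: {coverage:.1f}x, contig N50: {contig_n50} bp")
```

The read N50 of 7,364 bp cannot be recomputed here because the individual read lengths are not listed; the same `n50` function would apply to them.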
For the genome sequence finishing, gaps between the two contigs were amplified by a long-range PCR method. Both gap-flanking sequences encoded ribosomal RNA (rRNA) operons, and therefore the PCR primer sets were designed outside of the rRNA operon regions. Using the gap-flanking PCR primer sets shown in S1 Table, long-range PCR was conducted using Phusion High-Fidelity PCR master mix with HF buffer (Thermo Fisher Scientific, Waltham, MA) following the manufacturer's protocol. The long-range PCR amplification was performed using 50 ng of extracted total genomic DNA of N. seriolae UTF1 with an initial denaturation step of 30 s at 98˚C, followed by a two-step PCR procedure (35 cycles of 98˚C for 10 s and 72˚C for 3 min) and 10 min of final extension at 72˚C. The long-range PCR successfully amplified fragments of approximately 7-8 kb corresponding to both gap regions, and the two PCR products were purified with the QIAquick PCR Purification Kit (QIAGEN, Hilden, Germany). The purified long-range PCR products (100 ng) were shotgun-sequenced using the Ion PGM platform (Thermo Fisher Scientific) and the sequence reads were assembled using V-GAP [21]. These two assembled sequences aligned to the two contigs, and consequently the complete nucleotide sequence of N. seriolae UTF1 comprising a circular chromosome was determined. The final sequence was submitted to DDBJ under accession number AP017900.

Genome annotation
The complete genome sequence of N. seriolae UTF1 was annotated using the Rapid Annotations using Subsystems Technology (RAST) server v2.0 with SEED data [22] and using BLASTP [23] against the NCBI RefSeq protein data [24] (E-value threshold of 1E−5). Functional categories of the predicted genes of N. seriolae UTF1 and the four other Nocardia spp. were assigned with the Clusters of Orthologous Groups of proteins (COGs) database [25] using COGsoft [26] and with the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database using the KEGG Orthology And Links Annotation (BlastKOALA) program [27]. Virulence factors of N. seriolae UTF1 were predicted by BLASTP against the Virulence Factor Database (VFDB) set A (a core dataset that covers genes associated with experimentally verified VFs) [28] and the virulence gene data set of N. farcinica IFM 10152 at the Nocardia farcinica Genome Project Page (http://nocardia.nih.go.jp/), using a cut-off E-value of 1E−5. These BLAST results were then filtered using the criteria of query coverage per HSP (qcovhsp) greater than 80% and sequence similarity greater than 50%. The organization of the mce operon structures was identified based on the annotation result from the RAST server and visualized using Easyfig version 2.1 [29]. Genes for antibiotic resistance were predicted using the Antibiotic Resistance Genes Database (ARDB) [30] with default parameters.
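The three filtering criteria for virulence factor candidates can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes a tab-separated BLASTP table with five columns (query ID, subject ID, E-value, qcovhsp, percent positives), and it assumes that "sequence similarity" corresponds to the percentage of positive-scoring matches. The example rows and identifiers are invented.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    qcovhsp: float     # query coverage per HSP, in percent
    similarity: float  # percent positives (assumed meaning of "similarity")


def parse_hits(handle: TextIO) -> Iterator[Hit]:
    """Read the assumed five-column, tab-separated BLASTP table."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield Hit(row[0], row[1], float(row[2]), float(row[3]), float(row[4]))


def passes_filters(hit: Hit, max_evalue: float = 1e-5,
                   min_qcovhsp: float = 80.0, min_similarity: float = 50.0) -> bool:
    """Apply the E-value cut-off, then the coverage and similarity criteria."""
    return (hit.evalue <= max_evalue
            and hit.qcovhsp > min_qcovhsp
            and hit.similarity > min_similarity)


# Hypothetical rows: only the first satisfies all three criteria.
EXAMPLE_TABLE = (
    "orf_a\tvf_1\t1e-30\t95\t72.5\n"
    "orf_b\tvf_2\t1e-3\t99\t90.0\n"
    "orf_c\tvf_3\t1e-20\t60\t88.0\n"
    "orf_d\tvf_4\t1e-12\t85\t45.0\n"
)

candidates = [hit.query for hit in parse_hits(io.StringIO(EXAMPLE_TABLE))
              if passes_filters(hit)]
print(candidates)
```

The thresholds are exposed as parameters so that the effect of relaxing or tightening any single criterion can be checked independently.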

Comparative genomics
To identify and characterize the gap regions in the previously reported N. seriolae draft genome sequences, the contig sequences of N. seriolae ZJ0503 (GenBank: NZ_JNCT01000000, 319 contigs) and N-2927 (GenBank: NZ_BAWD02000000, 339 contigs) were aligned to the complete sequence of the N. seriolae UTF1 genome using MUMmer version 3.22 [31]. The sequences of the uncovered regions in the comparisons with ZJ0503 and with N-2927 were subjected to BLASTX searches against the NCBI RefSeq database (E-value threshold of 1E−5). Four available complete genome sequences of the genus Nocardia, N. farcinica IFM 10152 (GenBank: AP006618), N. brasiliensis HUJEG-1 (GenBank: CP003876), N. cyriacigeorgica GUH-2 (GenBank: FO082843) and N. nova SH22a (GenBank: CP006850), were used for the comparative genomic analysis with N. seriolae UTF1. For visualization of circular genome comparisons, the BLASTN-based ring image was generated by BLAST Ring Image Generator (BRIG) version 0.95 [32], with N. seriolae UTF1 as a reference. Dot plots of complete nucleotide sequences were generated using MUMmer version 3.22 with the mummerplot script and the Unix program gnuplot [31]. Average Nucleotide Identity (ANI) and Amino Acid Identity (AAI) were calculated using the ANI calculator and AAI calculator, respectively (default settings) [33]. The orthologous and species-specific genes were identified using OrthoMCL [34]. Comparisons of the functional profiles of N. seriolae UTF1 and the four other complete Nocardia genomes were carried out following the method of Verma et al. (2014) [35] with some modification. The top 50 SEED subsystems [36,37] from the RAST analysis and the KEGG modules assigned by BlastKOALA were clustered hierarchically by the abundance of gene content for each category among the Nocardia genomes using Cluster 3.0 software [38]. The results were visualized with Java Treeview version 1.1.6r4 [39].

Assembly and general genomic features
The relatively large genome size (6-10 Mbp) with high GC contents (approximately 70%) of genus Nocardia [40] makes the completion of their genome sequences difficult. Because of the genomic complexity, draft assemblies of N. seriolae from an Illumina MiSeq platform [16] or a Roche 454-GS Junior System combined with the Illumina reads (NCBI Sequence Read Archive, DRX020602) [8,17] consist of a large number of contigs. It should be noted that our initial draft sequence assembly of the N. seriolae UTF1 genomes using a 454 GS-FLX+ System resulted in a total of 134 scaffolds comprising 365 contigs (unpublished data). The PacBio RS platform, a third-generation sequencing technology and based on single-molecule real-time (SMRT) sequencing, can achieve unbiased GC coverage with extremely long reads [19,20]. This platform has been employed successfully for sequencing in complex bacterial genomes such as those with extremely high GC content genomes [41] and with multiple chromosomes containing more repetitive sequences [42]. In this study, we determined the complete genome nucleotide sequence of N. seriolae UTF1 using the PacBio RS.
Our de novo assembly with 133-fold genome coverage of PacBio RS long reads (mean read length: 4,756 bp, N50 read length: 7,364 bp) produced two large contigs with lengths of 6,093,704 bp and 2,035,833 bp (Table 1). The flanking sequences at both sides of the two assembled contigs encoded ribosomal RNA genes. Since each copy of the rRNA operon (16S-23S-5S rRNA) is approximately 5 kb long, the current sequence reads with an average length of 4,756 bp may not have been able to fully span these rRNA operon regions. It should be noted that the newest PacBio RS II sequencer generates an average read length of 10-15 kb [43], and therefore reads from the new sequencer are expected to allow assembly across these rRNA operon regions. After closing the gaps with a long-range PCR method, the complete nucleotide sequence of N. seriolae UTF1 comprising a circular chromosome of 8,121,733 bp with a G+C content of 68.1% was determined (Fig 1 and Table 2). The complete genome of N. seriolae UTF1 contains 7,697 predicted coding DNA sequences (CDSs) with an average length of 909 bp, 4 rRNA operons, and 62 transfer RNA (tRNA) sequences (Table 2). The genome sizes of the fully sequenced Nocardia genomes range from 6,021,225 bp for N. farcinica IFM 10152 to 9,436,348 bp for N. brasiliensis HUJEG-1, and the numbers of CDSs range from 5,491 for N. farcinica IFM 10152 to 8,414 for N. brasiliensis HUJEG-1 (Table 2). The N. seriolae UTF1 genome has four rRNA operons and 62 tRNAs, whereas three rRNA operons and 49-53 tRNAs have been found in the other four fully sequenced Nocardia genomes (Table 2).
The genome sequences of N. seriolae UTF1 and the previously reported N. seriolae isolates were quite similar. The N. seriolae UTF1 genome showed 99.99 and 99.95% ANI with ZJ0503 [16] and N-2927 [17], respectively. It should be noted that the ANI between N-2927 and U-1 [8], the most recently reported draft genome of an N. seriolae isolate, was 100%. Therefore, we did not use the U-1 genome for further comparisons and analysis. To clarify the uncovered regions (gaps) in the previously reported N. seriolae genome sequences, the contig sequences of N. seriolae ZJ0503 [16] (319 contigs) and N-2927 [17] (339 contigs) were aligned to the complete sequence of the N. seriolae UTF1 genome. As a result, 300 uncovered regions (average length: 1,373 bp) were detected in the draft genome of ZJ0503, while 297 uncovered regions (average length: 1,502 bp) were detected in the draft genome of N-2927. These gap sequences were subjected to BLASTX searches against the NCBI RefSeq database (E-value threshold of 1E−5) and, as a result, 294 (98.0%) for ZJ0503 and 290 (97.6%) for N-2927 had significant BLAST hits. The BLAST results revealed that most assigned genes (76.9% for ZJ0503 and 73.1% for N-2927) were mobile element-related, such as those for transposase, endonuclease DDE and integrase (S1 Fig). The comparative genomic map of N. seriolae UTF1 with ZJ0503 and N-2927 also showed that these mobile element-related genes are interspersed across the N. seriolae UTF1 genome and coincide with the gap regions of ZJ0503 and N-2927 (S2 Fig). Since the nucleotide length of these genes was more than 1 kb, these repeat sequences have a significant influence on de novo assembly when relatively short reads (several hundred bp) from MiSeq and 454-GS Junior are used [8,16,17]. In the present study, the PacBio long reads (mean read length: 4,756 bp, N50 read length: 7,364 bp) could span these repeated sequences and achieve completion of the N. seriolae UTF1 genome.
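The detection of uncovered regions described above amounts to finding the parts of the reference that no draft contig aligns to. The following Python sketch shows that interval logic under simplifying assumptions: alignments are given as 1-based, inclusive (start, end) coordinates on the reference, and the circular chromosome is treated as linear. It is not the MUMmer-based procedure itself, and the toy coordinates are invented.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive reference coordinates


def uncovered_regions(ref_length: int, aligned: Sequence[Interval]) -> List[Interval]:
    """Return the reference intervals not covered by any aligned interval."""
    gaps: List[Interval] = []
    next_uncovered = 1  # first position not yet known to be covered
    for start, end in sorted(aligned):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= ref_length:
        gaps.append((next_uncovered, ref_length))
    return gaps


def mean_length(regions: Sequence[Interval]) -> float:
    """Average length of a list of inclusive intervals."""
    return sum(end - start + 1 for start, end in regions) / len(regions)


# Toy example: a 100-bp reference with three partially overlapping alignments.
toy_gaps = uncovered_regions(100, [(1, 20), (15, 40), (61, 90)])
print(toy_gaps, mean_length(toy_gaps))
```

Applied to real alignment coordinates, the resulting gap list would give the number and average length of uncovered regions; the proportions with BLAST hits quoted above follow directly from the counts (294 of 300 and 290 of 297).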

Overview of N. seriolae UTF1 virulence factors
Like causal agents of human nocardiosis, N. seriolae is considered to be an intracellular pathogen that invades and grows within host cells, including phagocytes [44,45]. The intracellular nature of this bacterium makes it difficult to control the disease. As a first step toward understanding the virulence factors and pathogenic properties of N. seriolae UTF1, we conducted BLASTP searches of N. seriolae UTF1 CDSs against the Virulence Factor Database (VFDB) [28] and a well-annotated virulence gene data set of N. farcinica (http://nocardia.nih.go.jp/) (E-value < 1E−5, sequence length overlap > 80% and sequence similarity > 50%). The VFDB search identified 173 CDSs as candidate virulence factors (S2 Table). In addition, almost all of the N. farcinica putative virulence genes were found in the N. seriolae UTF1 genome (Table 3). Mammalian cell entry (Mce)-family proteins, virulence factors of Mycobacterium tuberculosis (class Actinobacteria), confer the ability to enter mammalian cells and survive inside macrophages [46]. The genome of M. tuberculosis contains four mce operons, each comprising eight genes arranged in an identical manner (two yrbE genes, A and B; six mce genes, A, B, C, D, E and F) [47,48]. Mce proteins are found in diverse Actinobacteria including Nocardia spp. [49]. Six copies of mce operons have been found in the three human clinical isolates (N. farcinica IFM 10152 [12], N. brasiliensis HUJEG-1 [13] and N. cyriacigeorgica GUH-2 [14]), whereas 14 mce operons have been found in N. nova SH22a, which was isolated from a plant root and has the ability to degrade rubber and gutta-percha [15]. Recently, Carrillo-González et al. (2016) demonstrated the importance of Mce proteins for Nocardia pathogenesis from a whole-genome comparison of an attenuated N. brasiliensis HUJEG-1 and the parental strain [50]. In the N. seriolae UTF1 genome, we found eight complete mce loci with nucleotide lengths of 6,814 bp to 8,964 bp (Figs 1 and 2). It should be noted that the locus mce3 has two extra genes (endonuclease DDE and orf3406) between mce3E and mce3F (Fig 2). However, the influence of these two extra genes on the function of the mce3 operon is unclear. Amino acid sequence similarities of N. seriolae UTF1 Mce1 proteins with those of the other four Nocardia species and two other Actinobacteria, Rhodococcus equi and M. tuberculosis, are shown in S3 Table.
Invasin also plays a role in attachment to and penetration into host cells by several bacterial species [51,52]. Nocardia species also possess an invasin gene [12,13], and the N. seriolae UTF1 ORF-4357 is very similar to the N. farcinica IFM 10152 invasin (80.4% similarity) (Table 3). Since Nocardia species are facultative intracellular pathogens, they most likely use this protein for entry into host cells. Further studies are required to determine the function of Nocardia invasin in host cell entry.
A phagosome, a cellular compartment, is essential for intracellular killing and digestion of pathogenic microorganisms [60]. Nucleoside diphosphate kinase (Ndk) and protein tyrosine phosphatase A (PtpA) arrest macrophage phagosomal maturation for the intracellular survival and persistence of pathogenic mycobacteria [61]. The N. seriolae UTF1 ORF-5937 and ORF-5498 are homologous to N. farcinica IFM 10152 ndk (57.6%) and ptpA (61.4%), respectively (Table 3). Since iron is present at very low concentration and in an insoluble state within macrophages, efficient iron-acquisition systems are required for pathogenic bacteria to survive [61]. Putative N. seriolae UTF1 metal importer-related genes (ideR and nbtA-G, S, T) were identified (Table 3), and may contribute to the ability of N. seriolae to survive in fish tissues, including within macrophages. Other virulence factor candidates were also found in the N. seriolae UTF1 genome. The antigen 85 (Ag85) complex of M. tuberculosis is a family of fibronectin binding proteins (Fbp) that plays an essential role in the pathogenesis of tuberculosis [62]. The Ag85 complex consists of three proteins (Ag85A, Ag85B and Ag85C, encoded by the genes fbpA, fbpB and fbpC) that possess mycolyltransferase activity involved in the final stages of mycobacterial cell wall assembly [63]. The N. seriolae UTF1 genome was found to have at least seven putative fbp genes: three fbpA (ORF-144, ORF-145 and ORF-147), two fbpB (ORF-146 and ORF-7590) and two fbpC (ORF-148 and ORF-150). Among these seven fbp gene candidates, the protein sequences of ORF-145 (fbpA), ORF-7590 (fbpB) and ORF-150 (fbpC) were most similar to N. farcinica IFM 10152 fbpA (79.9%), fbpB (75.9%) and fbpC (73.0%), respectively (Table 3). TlyA proteins of pathogenic bacteria have virulence-associated hemolytic activity [64,65], and the N. seriolae UTF1 ORF-2873 encodes a TlyA with 83.2% similarity to that of N. farcinica IFM 10152 (Table 3).
In contrast to PtpA (mentioned above), PtpB is considered to be nonessential for the phagosome-arresting function of M. tuberculosis [61]. On the other hand, Zhou et al. (2010) reported that M. tuberculosis PtpB suppresses the innate immune response by inhibiting the signaling pathway involved in interleukin-6 (IL-6) production, and promotes host cell survival by activating the Akt pathway, thereby supporting bacterial survival in macrophages [66]. The N. seriolae UTF1 genome contains a ptpB gene (ORF-2644), which encodes a protein similar to that of N. farcinica IFM 10152 (70.7%) (Table 3).
Overall, the whole-genome analysis of N. seriolae UTF1 reveals that the genome contains known virulence genes of mycobacteria and human clinical Nocardia isolates for host cell invasion, modulation of phagocyte function and surviving inside the macrophages. Therefore, the presence of these virulence genes in the N. seriolae UTF1 may explain their ability to survive intracellularly and within macrophages. The virulence gene set of N. seriolae UTF1 we present provides the material basis for further study of their pathogenic mechanisms at the molecular level. In addition, the complete genome sequence of N. seriolae UTF1 can be utilized for comparing the genomes between pathogenic and non (less)-pathogenic isolates to more fully resolve the genes responsible for N. seriolae pathogenesis in future studies.

Potential antibiotic resistance genes of N. seriolae UTF1
In general, Nocardia spp. are naturally resistant to many antibiotics, including most β-lactams [7,67-69]. The N. seriolae UTF1 genome encodes at least 14 β-lactamases, while the number of β-lactamase genes is one for N. farcinica IFM 10152 [12], 29 for N. brasiliensis HUJEG-1 [13] and 12 for N. cyriacigeorgica GUH-2 [14]. In addition, as found in the draft genomes of N. seriolae [8,16,17], the N. seriolae UTF1 genome has one vancomycin resistance and two fluoroquinolone resistance gene candidates according to the RAST annotation. To explore further antibiotic resistance genes in the N. seriolae UTF1 chromosome, all of its CDSs were compared with the Antibiotic Resistance Genes Database (ARDB) [30]. This analysis identified 20 CDSs as candidate antibiotic resistance genes, which were classified into 16 antibiotic resistance gene types (Table 4). The presence of these antibiotic resistance genes in N. seriolae UTF1 may, in part, explain the difficulty in treating diseases caused by N. seriolae.
It has been reported that N. seriolae isolates can be divided into two phenotypic groups based on α-glucosidase (α-glu) activity (α-glu-positive or -negative) [7,68]. These two groups showed different oxytetracycline (OTC: a tetracycline antibiotic)/erythromycin (Em: a macrolide antibiotic) susceptibility profiles. Most of the α-glu-positive isolates were OTC-resistant and Em-sensitive, while most of the α-glu-negative isolates were OTC-sensitive and Em-resistant [69,70]. N. seriolae UTF1 is an α-glu-positive isolate that exhibited resistance to OTC and sensitivity to Em. Ismail et al. (2011) [69] found that OTC-resistant strains of N. seriolae possess tet(K) and/or tet(L) gene(s), while the Em-resistant strains possess mef(A) and msr(D) genes. The tet(K) and tet(L) genes are generally found on small transmissible plasmids [71,72], while mef(A) and msr(D) genes are encoded in chromosomes of Gram-positive bacteria and are associated with conjugative transposons [73]. Despite obtaining 1.1 Gbp (133-fold genome coverage) of PacBio RS long reads in this study and 247 Mbp (30-fold genome coverage) of 454 reads (unpublished data), we could not find any plasmids or tet(K) and tet(L) gene sequences. The absence of these genes might have been caused by plasmid elimination during culture propagation. On the other hand, the N. seriolae UTF1 chromosome was found to have two tetracycline resistance gene candidates, otr(B) (ORF-1750) and tcr3 (ORF-6293) (Table 4), both of which are found in Streptomyces spp. (Actinobacteria) [71], suggesting that these two genes might confer some degree of resistance to OTC in N. seriolae.
As expected, mef(A) and msr(D) genes were not found in the N. seriolae UTF1 chromosome. However, the results from ARDB include three candidates for macrolide efflux pump genes: carA (ORF-4637), macB (ORF-1987, ORF-4813, ORF-6113 and ORF-6376) and srmB (ORF-4948) (Table 4). Since N. seriolae UTF1 is an Em-sensitive isolate, these three candidates are not likely to be involved in Em resistance of N. seriolae UTF1. In particular, MacB requires TolC and MacA for its function in E. coli [74,75], but the N. seriolae UTF1 genome lacks both genes. Thus, further information on the antibiotic susceptibility profiles of more N. seriolae isolates, as well as their genomic sequences, is needed for an accurate view of their antibiotic resistance genes, and the complete genome sequence of N. seriolae UTF1 can be used as a reference for these surveys in future studies.

Genome comparison with fully sequenced genomes of the genus Nocardia
Four available complete genome sequences of the genus Nocardia, N. farcinica IFM 10152, N. brasiliensis HUJEG-1, N. cyriacigeorgica GUH-2 and N. nova SH22a, were used for comparison of the genomic structure with that of N. seriolae UTF1. From the comparative genomic map (Fig 1), we found that there were no large-scale variations among the genomes, but a considerable number of non-homologous regions are scattered around the N. seriolae UTF1 genome. Most of these non-homologous regions were linked to mobile element-related genes (transposase, endonuclease DDE and integrase) (Fig 1), suggesting that they serve as plastic and variable regions of the N. seriolae UTF1 genome. The whole-genome alignments of N. seriolae UTF1 with the other four Nocardia genomes displayed an X-shaped distribution across the origin of replication (Fig 3), which is explained by the fork replication theory [76]. By contrast, the alignments among the three agents of human nocardiosis, N. farcinica IFM 10152, N. brasiliensis HUJEG-1 and N. cyriacigeorgica GUH-2, showed only a few visible symmetric inversions and a diagonal line with a slope of approximately 1 (S3 Fig). On the other hand, the alignments of N. nova SH22a with the three human clinical isolates were arranged in an X-shaped distribution, as in the case of N. seriolae UTF1. Thus, the degree of symmetrical inversion indicates that the three human clinical isolates are genetically close to each other, while N. seriolae UTF1 and N. nova SH22a are genetically distant from these species.
The ANI- and AAI-based pairwise comparisons between each pair of genomes are shown in Table 5. Typically, the ANI values between genomes of the same bacterial species are above 95%, while values below 75% are too divergent to be compared based on this measurement [33]. For the latter case, AAI provides a much more robust resolution. AAI cut-offs for the genus and species boundaries have been estimated to be 55-60% and 85-90%, respectively [33]. The ANI and AAI between N. seriolae UTF1 and the other four Nocardia genomes ranged between 79.21-79.88% and 68.96-69.62%, respectively. Consistent with the results of the whole-genome alignments, the three human clinical isolates showed higher ANI and AAI values among themselves (81.19-81.61% ANI and 73.61-74.99% AAI) than in the comparisons with N. nova SH22a (79.43-79.89% ANI and 68.06-69.82% AAI) and with N. seriolae UTF1 (described above). The ANI and AAI between N. seriolae UTF1 and N. nova SH22a were also relatively low (79.63% ANI and 69.17% AAI). Overall, the ANI and AAI values indicate that the three human clinical isolates form a genetically similar group, and that N. seriolae UTF1, an agent of fish nocardiosis, is genetically distant from these species as well as from N. nova SH22a, an environmental isolate.
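The cut-offs cited above can be turned into a small decision helper. The Python sketch below is only an illustrative simplification: the boundary ranges (55-60% and 85-90% AAI) are reported as ranges, so values falling inside them are labeled as boundary cases rather than assigned to either side, and the function names and labels are hypothetical.

```python
def interpret_ani(ani_percent: float) -> str:
    """Coarse reading of an ANI value using the 95% and 75% guides cited in the text."""
    if ani_percent >= 95.0:
        return "same species"
    if ani_percent < 75.0:
        return "too divergent for ANI; use AAI"
    return "different species, comparable by ANI"


def interpret_aai(aai_percent: float) -> str:
    """Coarse reading of an AAI value using the genus and species boundary ranges."""
    if aai_percent >= 90.0:
        return "same species"
    if aai_percent >= 85.0:
        return "species boundary"
    if aai_percent >= 60.0:
        return "same genus, different species"
    if aai_percent >= 55.0:
        return "genus boundary"
    return "different genus"


# Extreme values reported for N. seriolae UTF1 against the four other genomes.
for ani, aai in [(79.21, 68.96), (79.88, 69.62)]:
    print(ani, interpret_ani(ani), "|", aai, interpret_aai(aai))
```

Under these assumed labels, all reported comparisons among the five complete genomes fall in the "same genus, different species" range, whereas the 99.95-99.99% ANI values among N. seriolae isolates fall in the "same species" range.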
We also focused on the similarity of the functional profiles among the five complete Nocardia genomes. Fig 4 shows a dendrogram constructed based on the top 50 subsystems from the RAST analysis. Interestingly, although a distant genetic relationship was observed between N. seriolae UTF1 and the three agents of human nocardiosis through whole-genome comparisons, the analysis revealed that the functional repertoire of N. seriolae UTF1 is closer to that of the three agents of human nocardiosis than to that of N. nova SH22a (Fig 4). Similar results were obtained with the dataset of overall subsystems (S4 Fig) and the KEGG modules assigned by BlastKOALA (S5 Fig). These findings suggest that the closer functional relationship between the agents of human and fish nocardiosis may be associated with their adaptations to infected animal hosts and their pathogenic properties.
Fig 3. Dot plot comparisons of N. seriolae UTF1 genome against N. farcinica IFM 10152 (A), N. brasiliensis HUJEG-1 (B), N. cyriacigeorgica GUH-2 (C) and N. nova SH22a (D). Nucleotide-based alignments were performed with MUMmer version 3.22 and dot plots were generated by the mummerplot script and the Unix program gnuplot [31].
doi:10.1371/journal.pone.0173198.g003
Table 5. Average nucleotide identity (ANI, upper grids) and average amino acid identity (AAI, lower grids) values (in percent) calculated between Nocardia genomes.
In summary, when the five complete genomes of Nocardia are compared, we found that N. seriolae UTF1 is genetically distant from the three agents of human nocardiosis, but has a similar functional repertoire. Further studies are required to more fully resolve the phylogenetic relationships of Nocardia spp. [40] and the differentially enriched pathways according to their habitat and lifestyle, when a greater number of completely sequenced and thoroughly annotated clinical and environmental Nocardia isolates become available.

Finding functional features in N. seriolae UTF1 genome
The complete sequence of the N. seriolae UTF1 genome has been determined for the first time for a marine Nocardia species, and therefore it is interesting to investigate the characteristic features of its genome. According to the COG classifications, the distribution of COG categories is mostly similar among Nocardia genomes (Fig 5). In N. seriolae UTF1, the numbers of genes in the categories 'Mobilome: prophages, transposons (X)' and 'Unknown' (not assignable to COG categories) are higher than those in the other four Nocardia genomes (Fig 5). Notably, genes related to 'Mobilome: prophages, transposons (X)' are quite abundant in the N. seriolae [...] Table). According to the comparative genomic map in Fig 1, these mobile element genes are interspersed throughout the N. seriolae UTF1 genome, and their sequences correspond to the variable regions (Fig 1). Overall, the abundance of genes related to mobile element proteins and of genes of unknown function in the N. seriolae UTF1 genome can partially explain the divergence of its genome structure and gene content from the other Nocardia genomes (Figs 1 and 3).
Fig 6. Identification of N. seriolae UTF1-specific genes (A) and their functional annotations (B). The orthologous and species-specific genes were identified using OrthoMCL [34] (A). The protein sequences of 1,982 N. seriolae UTF1-specific genes were functionally annotated with metabolic information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database using KEGG Orthology And Links Annotation (BlastKOALA) program [27] (B).
To further characterize the genomic features of N. seriolae UTF1, we focused on the functional characterization of the N. seriolae UTF1-specific genes. Based on the OrthoMCL clustering, 2,745 orthologous genes were present in all five Nocardia genomes and 1,982 genes were unique to N. seriolae UTF1 (Fig 6A). Of the 1,982 N. seriolae UTF1-specific genes, 217 genes (10.9%) were annotated in the KEGG database. The proportions of KEGG categories among the N. seriolae UTF1-specific genes show some differences compared to those among all N. seriolae UTF1 genes (Fig 6B). The 'Environmental Information Processing' category (21%) is the most abundant among the N. seriolae UTF1-specific genes, followed by 'Carbohydrate metabolism' (14%) (Fig 6B). Focusing on the abundant KEGG modules, 14 genes are assigned to the ABC transport system (Module ID: M00254 and M00258) (Table 6). Because of limited resources in ocean environments, marine bacteria require various efficient transport systems to capture essential nutrients [77]. Therefore, the N. seriolae UTF1-specific transport system-related genes identified in the present study may have a role in adaptation to the marine environment. In addition, most of the N. seriolae UTF1-specific genes (89.1%) were not assignable to KEGG. Further study of such transport system-related genes and of the genes of unknown function will provide unique insight into the adaptation of N. seriolae to fish hosts in the aquatic environment.
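The distinction between core and strain-specific genes can be sketched from ortholog clusters as follows. This is a hedged illustration of the general idea, not the OrthoMCL procedure or its output format: clusters are assumed to be available as a mapping from a group ID to (genome, gene) pairs, and the group IDs, gene IDs and genome labels below are invented for the toy example.

```python
from typing import Dict, List, Set, Tuple

Member = Tuple[str, str]  # (genome label, gene ID)


def classify_groups(groups: Dict[str, List[Member]], genomes: Set[str],
                    target: str) -> Tuple[List[str], List[str]]:
    """Split ortholog groups into core groups and groups specific to one genome."""
    core: List[str] = []
    specific: List[str] = []
    for group_id, members in groups.items():
        present = {genome for genome, _gene in members}
        if present == genomes:
            core.append(group_id)      # at least one member in every genome
        elif present == {target}:
            specific.append(group_id)  # members only in the target genome
    return core, specific


GENOMES = {"UTF1", "IFM10152", "HUJEG-1", "GUH-2", "SH22a"}
TOY_GROUPS = {
    "grp1": [("UTF1", "u1"), ("IFM10152", "f1"), ("HUJEG-1", "b1"),
             ("GUH-2", "c1"), ("SH22a", "n1")],
    "grp2": [("UTF1", "u2"), ("UTF1", "u3")],
    "grp3": [("UTF1", "u4"), ("SH22a", "n2")],
    "grp4": [("IFM10152", "f2"), ("HUJEG-1", "b2"), ("GUH-2", "c2")],
}

core_groups, utf1_specific = classify_groups(TOY_GROUPS, GENOMES, "UTF1")
kegg_annotated_pct = round(100 * 217 / 1982, 1)  # counts taken from the text
print(core_groups, utf1_specific, kegg_annotated_pct)
```

In a real analysis, genes left unclustered would also count as strain-specific, so the specific-gene total would combine target-only groups with such singletons.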

Conclusions
The complete nucleotide sequence of N. seriolae UTF1 consists of a circular chromosome of 8,121,733 bp with a G+C content of 68.1% and 7,697 predicted CDSs. The genome possesses known bacterial virulence genes that function in host cell invasion, modulation of phagocyte function and survival within macrophages. The detected candidate virulence factors provide a novel resource for further study of the pathogenic mechanisms of N. seriolae at the molecular level in the fish nocardiosis research community. We also found many antibiotic resistance genes on the N. seriolae UTF1 chromosome, suggesting natural resistance of this bacterium to many drugs. Our comparative analysis with the four existing complete Nocardia spp. genomes revealed that the N. seriolae UTF1 genome structure and gene content differ from those of the other Nocardia genomes due to a large number of mobile element genes. In addition, there are homologs of many transporters among the N. seriolae UTF1-specific genes, allowing us to speculate on their role in adaptation to the marine environment. Thus, we expect that the complete genome of N. seriolae UTF1 can be used as a reference sequence not only for N. seriolae isolates, but also for comparative genomic studies of the genus Nocardia, as an example of a marine fish pathogen, to provide insights into the ecological and functional diversity of this genus [1] in the near future.