De novo transcriptome analysis of Bagarius yarrelli (Siluriformes: Sisoridae) and the search for potential SSR markers using RNA-Seq

Background The yellow sisorid catfish (Bagarius yarrelli) is a carnivorous freshwater fish that inhabits the Honghe River, Lanchangjiang River and Nujiang River of southern China and other Southeast Asian countries. However, the publicly available genomic data for B. yarrelli are limited. Methodology and principal findings Illumina Solexa paired-end technology produced 1,706,456 raw reads from muscle, liver and caudal fin tissues of B. yarrelli. Nearly 5 Gb of data were acquired, and de novo assembly generated 14,607 unigenes, with an N50 of 2006 bp. A total of 9093 unigenes showed significant similarities to known proteins in public databases: 4477 and 6391 of B. yarrelli unigenes were mapped to the Gene Ontology (GO) and Clusters of Orthologous Groups (COG) databases, respectively. Moreover, 9635 unigenes were assigned to 242 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. In addition, 8568 microsatellites (simple sequence repeats, SSRs) were detected, and 31 pairs of polymorphic primers were characterized using wild populations of B. yarrelli from the Nujiang River, Yunnan Province, China. Conclusion/Significance These sequences enrich the genomic resources for B. yarrelli and will benefit future investigations into the evolutionary and biological processes of this and related Bagarius species. The SSR markers developed in this study will facilitate construction of genetic maps, investigations of genetic structures and germplasm polymorphism assessments in B. yarrelli.


Introduction
The yellow sisorid catfish (Bagarius yarrelli, Osteichthyes, Siluriformes, Sisoridae) [1] is distributed mainly in southwestern China, including the Nujiang, Lanchangjiang, and Yuanjiang Rivers, Southeast Asia, including Vietnam, Laos, Thailand and Burma, South Asia, and India [2]. Its common names are "yellow fish" in the Yuanjiang River region, "tiger fish" in the Nujiang River region, and "surface melon fish" in the Lanchangjiang River region. Though wild resources of B. yarrelli are currently under threat of population decline due to environmental changes and overfishing, the fish is consumed by locals, which has resulted in a high price. Previous investigations have largely addressed its biological characteristics [3], noting that B. yarrelli is a benthic carnivorous fish. Its physiology [4] results in a protein content that is higher than that of conventional fish, including Cyprinus carpio, Carassius auratus, Ctenopharyngodon idellus, and Hypophthalmichthys molitrix. Artificial propagation of B.yarrelli succeeded for the first time on 24 March, 2012 and the larvae were hatched 20 hours after fertilization at a water temperature of 27˚C [5]. In addition, mitochondrial and nuclear DNA sequencing have been utilized to study the species' genetic phylogeny, including 16S rRNA [6], cytochrome oxidase [7], ND6 [8] and rag1 and rag2 [9]. Random amplified polymorphic DNA (RAPD) and microsatellite (simple sequence repeat, SSR) marker technologies have also been applied to study the population genetic diversity of B. yarrelli [10][11]. However, the available gene sequences and molecular markers are extremely limited, and as of August 15, 2017, only 37 nucleotide sequences were available in the NCBI GenBank database.
In investigations of model and non-model fish species, next-generation sequencing techniques, such as 454 pyrosequencing technology and Illumina paired-end sequencing technology [12], have provided abundant data at a low cost. Recently, High-throughput sequencing technology has been broadly implemented for many species, for example, sesame [13], blunt snout bream [14], common carp [15], butterfly [16], lake sturgeon [17], and mud crab [18].
To acquire comprehensive gene sequences of B. yarrelli, next-generation sequencing technology was used to perform transcriptome sequencing with a pool of tissues (this fish originates from the Honghe River,Yunnan Province,China) and the first sequencing library was constructed. We then performed de novo assembly and gene annotation and utilized a set of SSR markers to assess the genetic diversity of a wild B. yarrelli population (these fish originate from the Nujiang River, Yunnan Province, China). These findings provide a very useful genomic resource for subsequent investigation of B. yarrelli and related species at the levels of biochemistry, molecular biology and genetics.

Sequencing and de novo assembly
Approximately 1,706,456 raw reads were obtained from muscle, fin and liver tissues of B. yarrelli specimens obtained from the Honghe River in Hekou County, Yunnan Province, China. The entire length of the raw reads was 5.119 Gb. Raw reads were processed to eliminate lowquality reads and then de novo assembled into 151,911 contigs using Trinity software, with an average length of 644 bp. The resulting final assembly contained 14,607 unigenes, with an average length of 1216 bp and the N50 length was 2006bp (Table 1). N50 is an indicator of the assembly effect, which is calculated by sorting the assembly fragments from large to small and starting to add their length values; when the cumulative sum is 50 percent greater than the total length, the final cumulative length of the segment is the N50 value. The raw reads in this study were submitted to the NCBI Short Read Archive (SRA) database under accession number SRR5943894.

Gene identification and open reading frame (ORF) search
The numbers of contigs and unigenes in this study were 151,911 and 14,607, respectively. Approximately 9093 unigenes yielded BLASTX hits to known proteins in the NR database; 4477 matched to the Gene Ontology (GO) database, 6391 matched to the Clusters of Orthologous Groups (COG) database and 9635 matched to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The final unigenes and annotation information are provided in S1 File and S1 Table. The predicted protein number using Transdecoder (http://transecoder. fourceforge.net/) in Trinity software, which was set at a length of greater than 80 aa, was 9310; the data are listed in S2 File.

KEGG pathway analysis
A total of 9635 unigenes were classified into 242 KEGG pathways (S2 Table).  Table 2.

PCR amplification and polymorphisms of genic SSRs
Primer3 software [21] was used to design 3951 SSR primer pairs for 8568 SSR sequences and 90 primer pairs were randomly selected and synthesized by Sangon Biotech Co., Ltd.     De novo transcriptome analysis of Bagarius yarrelli and the search for SSR markers using RNA-Seq

Samples, RNA isolation, and library preparation
The samples examined in this study were caught from Honghe River in Hekou County, Yunnan Province, China (103˚56'21E, 22˚31'51N) by professional fishers using a fishnet. The fish were killed using MS-222 (finquel) followed by a stick to their heads. The samples were dissected on ice using scissors. Muscle, liver, and caudal fin tissues were excised and rapidly frozen at -80˚C. A wild population of 40 B. yarrelli individuals was obtained from the area downstream of the Nujiang River in Yunnan Province, China. These 40 individuals were used to test microsatellite polymorphisms. Total RNA was extracted using TRIzol (Invitrogen, USA) following the manufacturer's protocol, and total RNA from tissues was pooled and purified to obtain mRNA using oligo (dT) magnetic beads. The quality results of B. yarrelli RNA are shown in S7 File. cDNA libraries were analyzed after preparation for Illumina sequencing.

Sequence assembly
An Illumina TruSeq RNA Sample Prep Kit was applied to construct cDNA libraries in accordance with the manufacturer's protocol. The libraries were then sequenced using the HiSeq 2500 sequencing platform (TruSeq SBS Kit v3-HS, Illumina). Low-quality sequences, trimming primers and adaptors were filtered. In other words, the reads with sequences containning the adaptor sequence, the paired reads with the amount of N contained in single-end sequencing reads exceeding 10% of the read length and the paired reads with a low-quality(less than 5) base number of the single-end sequencing reads greater than 50% of the read length were excluded. High-quality sequences (>Q20) were assembled using Trinity software (r20131110) [22]. The ending contigs were embedded in scaffolds based on paired-end information using TGI Clustering (TGICL) tools [23].

Identification of microsatellite markers
To detect SSR markers, the MISA tool (http://pgrc.ipk-gatersleben.de/misa/) was used with the default settings: the smallest numbers of repeats were five for tri-, tetra-, penta-and hexanucleotides, six for di-nucleotides, and 10 for single nucleotides. Altogether, 90 pairs of specific primers were designed with Primer3 [21]. These SSR primer pairs were synthesized by Shanghai Sangon Biological Engineering Technology Service Co., Ltd. (Shanghai, China). Forty individuals from the wild population were used to screen these microsatellite loci through PCR. The PCR reaction system and the cycling parameters for PCR amplification were performed as described by M Du et al [12]. After staining with ethidium bromide, a 8% non-denaturing polyacrylamide (w/v) gel was used to identify the PCR products. POPgene 1.32 software was applied to perform the genetic analysis. All SSR markers examined in unigenes are listed in S4 File. Polymorphism microsatellite loci and genetic parameters are provided in S5 File.

DNA extraction and electrophoresis
The normal phenol-chloroform method described previously [24] was used to extract genomic DNA from the caudal fin tissues of the 40 wild B. yarrelli individuals. The DNA quality and concentration were assessed by 1% agarose gel electrophoresis and then use of a Thermo Scientific NanoDrop 2000 spectrophotometer, respectively.

Discussion
The next-generation sequencing technique is used to obtain large amounts of transcriptome data from a known or an unknown reference sequence organism [25] because it is inexpensive and rapid [26]. In this study, we use the high-throughput Illumina Solexa paired-end technique for de novo assembly of a transcriptome using clean reads for B.yarrelli,a species with limited sequence data in public databases. Here, the N50 length and the average length of unigenes of B.yarrelli were 2006bp and 1216bp, respectively, indicating that our assembly was effective and accurate. Similar results have been found for speices such as mud crab (Scylla paramamosain), in which the N50 length and the average length of contigs are 639bp and 606bp, respectively [18], sesame (Sesamum indicum L.), in which the N50 length and the average length of unigenes are 1901bp and 1127bp, respectively [13], whitefly (Bemisia tabaci), in which the average length of scaffolds is 266bp [27], and sweet potato (Lpomoea batatas), in which the N50 length and the average length of unigenes are 765bp and 481bp, respectively [28]. Illumina sequencing produced 1,706,456 raw reads for B.yarrelli.The 14607 unigenes yielded in the present study will facilitates further research on the physiology, biochemistry, and molecular genetics of B.yarrelli or related species. SSR markers have been extensively used to perform genetic diversity research, gene mapping and population genetic analysis in many species [29][30]. NGS technology is a powerful tool for identifying microsatellites for plant or animal organisms [31][32]. Here, a total of 8568 microsatellites were identified based on the B.yarrelli dataset. Di-nucleotide repeats were the most frequent type, and mono-nucleotide repeats were the second most frequent type, followed by the tri-nucleotide type, which is somewhat similar to previous reports. For example, di-nucleotide repeats as the most common type in EST data was also found in mud crab (S. paramamosain) (N = 8161, 42.9%) [18] and blunt snout bream (Megalobrama amblycephala) (N = 3107,62.7%) [32]. The top di-, mono-, and tri-nucleotide motifs in this study were AC/ GT, A/T and CTC/GAG in B.yarrelli, respectively.
In this study, 90 primer pairs were randomly designed to assess the successful amplification ratio of these SSRs by Primer 3. Thirty-one (33.3%) loci showed polymorphisms across panels of 40 individual B.yarrelli fish from the Nujiang River. The ratio (33.3%) of polymorphism microsatellites investigated in this study is lower than that of a previous study [11] (47%), in which the core motifs of the microsatellite loci only contained five bases, possibly because the core motifs of the microsatellite loci examined in this study are shorter than five bases. From a genetic perspective, the average observed heterozygosity (H O ), average expected heterozygosity (H E ), and the average polymorphic information content (PIC) can reflect the genetic diversity and inheritance patterns of a population from multiple angles [33]. We generally believe that PIC >0.5 denotes a high polymorphism rate, 0.25<PIC<0.5 denotes a moderate polymorphism rate, and PIC<0.25 denotes a low polymorphism rate [34]. In this study, the average observed heterozygosity (H O ), average expected heterozygosity (H E ), and average PIC value were 0.19, 0.3588 and 0.2954, respectively, indicating moderate genetic diversity among Nujiang River populations of B.yarrelli. Here, 8568 SSRs were defined in our dataset, and PCR primers could be designed for further research on the germplasm polymorphism [31], comparative genomics [35], and functional genomics of this fish.

Conclusions
This study constitutes the first analysis of the transcriptome of B. yarrelli using HiSeq 2500 technology, generating 14,607 unigenes. A total of 8568 sequences contained SSR motifs. Thirty-one of 90 primer pairs were assessed in 40 wild B. yarrelli via PCR amplification, and the loci exhibited polymorphisms in the wild populations from Nujiang River of China. The large quantity of transcriptome and SSR data for B. yarrelli obtained in this study will facilitate future detailed genetic studies.