Classification of Takifugu rubripes, T. chinensis and T. pseudommus by genotyping-by-sequencing

Takifugu rubripes is more expensive than other species of the genus because of its high protein content and special flavor. However, it is easily confused with imported T. chinensis and T. pseudommus because they have similar morphological characteristics. We identified single nucleotide polymorphism (SNP) markers of T. rubripes by genotyping-by-sequencing (GBS) and evaluated their ability to distinguish among T. rubripes, T. chinensis, and T. pseudommus. In all, 18 polymorphic SNPs were subjected to phylogenetic analyses of the three Takifugu species. Additionally, we subjected a second set of samples to Sanger sequencing to verify that the polymorphic SNPs could be used to evaluate the genetic variation among the three Takifugu species. A phylogenetic tree that included the analyzed sequence of set A, which is referred to as the reference sequence, and a validation sequence of set B with 18 SNPs were produced. Based on this phylogenetic tree and STRUCTURE analyses, T. rubripes, T. chinensis and T. pseudommus have low genetic variation and should be considered the same gene pool. Our findings suggest that further studies are needed to estimate the genetic association of the three Takifugu species.


Introduction
The family Tetraodontidae comprises 187 species of 28 genera, among which 33 species of 10 genera have been reported in the Republic of Korea (ROK) [1,2]. Puffer fish of the genus Takifugu (Tetraodonitiformes, Tetraodontidae) are distributed mainly in the Western Pacific Ocean. These are valuable fish species and about 2,200-3,800 tons are caught wild and over 8,000 tons are imported for consumption in the ROK annually [3]. Although several Takifugu species are consumed in East Asia, T. rubripes is the most economically important due to its special flavor [4]. However, the domestic wild catch of T. rubripes has dramatically decreased due to overfishing, spawning ground destruction, and contamination. To meet demand, this species is imported from China, India, and Japan live or in a frozen, dry form (http://www. mof.go.kr). The identification and classification of these fish species had raised some controversy. Especially, T. rubripes, T. chinensis, and T. pseudommus have the same habitat and a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 similar morphological characteristics [5]. Comparison of their meristic characters indicates overlapping morphological characteristics, such as the number of dorsal, vertebral, anal, and pectoral fin rays [6]. T. rubripes, unlike T. chinensis and T. pseudommus, has irregular round black spots and white patterns in front of the caudal fin on the sides of the body. T. chinensis has no such black marks while T. pseudommus has white spots scattered on a black background on the dorsal and lateral sides of the body [2,7]. However, species identification based on external evidence can lead to controversy. In addition, species identification is hampered by distribution in the form of processed products, such as pellets, sashimi, cooked food, and puffer-fish skin. Therefore, the relationships among T. rubripes, T. chinensis and T. pseudommus must be evaluated using other characteristics.
DNA-based identification methods, including molecular markers, have been applied to identify the various marine species [8]. The used technique for identification was performed in the pacific tuna using allozyme [9], the horse mackerel using mitochondrial DNA (mtDNA) [10], the Indian sciaenidsr using randomly amplified polymorphic DNA (RAPD) [11], the gadoid using restriction fragment length polymorphism (RFLP) [12]. Of the nuclear single locus, microsatellites or simple sequence repeats (SSRs) have been used for individual and population genetics because of their high abundance, variability, and neutrality [13]. For instance, polymorphic loci have been used to distinguish T. rubripes from T. chinensis and T. pseudommus [14]. Although microsatellites facilitate examination of genetic variation and genetic structure, tools that use such markers have disadvantages such as inconsistencies in allele size calling caused by use of automatic sequencers and errors in size determination [15]. Among the various types of markers in use, single nucleotide polymorphisms (SNPs) are the most abundant in a genome and the most suitable for identification and genotyping [16]. Genotyping-by-sequencing (GBS) enables the identification of SNPs for high-throughput screening [17]. Compared to next-generation sequencing (NGS), GBS is a simpler and more rapid method for generating SNP data [18]. It has important advantages such as simplified library preparation, high-level genome coverage, and cost effectiveness if an appropriate restriction enzyme is used [17]. It has been used for discovery genome mapping, quantitative trait loci (QTL) analyses, and genetic stock identification for marine species such as the Chinook salmon Oncorhyncus tshawytscha and the threespine stickleback Gasterosteus aculeatus [19,20]. No study of the genetic diversity of marine organisms based on SNPs and GBS has been reported.
In this study, we developed a new set of SNPs by GBS to evaluate the genetic variation among T. rubripes, T. chinensis, and T. pseudommus. To enhance our understanding of the genetic relationships among these three species, we created a phylogenetic tree and analyzed the population structure. We also estimated the reproducibility of the GBS results for species identification. These results will facilitate the identification of these three morphologically similar species.

Sample collection and DNA extraction
Refrigerated fish were purchased at the cooperative fish market from May 2018. These samples were sent to the Marine Fish Resource Bank of Korea at Pukyong National University for morphological identification by a specialist and deposit as reference samples. The 96 samples of set A included 38 of T. rubripes, 28 of T. chinensis, and 30 of T. pseudommus and were used to identify SNPs by GBS. The samples included wild T. rubripes from Sinan-gun, Incheon in Republic of Korea (ROK), cultured T. rubripes from China and Japan; wild T. chinensis from Mokpo, Gunsan in ROK; and wild T. pseudommus from Boryeong, Mokpo in ROK (Table 1). Sample set B, which included wild T. rubripes from Sinan and Incheon, was used to verify the SNPs.
Pieces of fin from raw fish samples were fixed in 99% alcohol and genomic DNA was extracted using the DNeasy 96 Blood & Tissue Kit (Qiagen) according to the manufacturer's instructions. DNA was quantified using a NanoDrop ND-1000 instrument (Thermo Fisher, Barrington, IL). The DNA concentration was normalized to 10 ng/μL and used for library preparation.

Analysis of COI gene
To analyze genetic variation among the three Takifugu species, the polymerase chain reaction (PCR) products of the cytochrome oxidase subunit I (COI) gene in mitochondrial DNA were obtained using the universal primers VF2 (5 0 -TCA ACC AAC CAC AAA GAC ATT GGC AC-3 0 ) and FishR2 (5 0 -ACT TCA GGG TGA CCG AAG AAT CAG AA-3 0 ). The PCR mixture consisted of 1× PCR buffer, 250 μM each dNTP, 2.5 U Ex-Taq DNA polymerase (TaKaRa), 20 ng DNA sample, and 10 pmol universal primers in a total volume of 20 μL. PCR amplification was performed as follows: 10 min of denaturation at 95˚C, 35 cycles (45 s at 95˚C, 45 s at 56˚C, 1 min at 72˚C) of amplification, and a final extension for 5 min at 72˚C using an ABI 2729 Thermocycler. PCR products were visualized by gel electrophoresis and purified using an Expin PCR SV (GeneAll, ROK). COI sequences were amplified by PCR using the Big Dye Terminator v. 3.1 Cycle Sequencing Kit (Applied Biosystems) and ABI 3730XL DNA analyzer (Applied Biosystems, Waltham, MA).

High-quality library production
The in silico digestion was performed with ApeKI, PstI, and ApeKI-PstI to select the appropriate enzyme for the T. rubripes DNA. GBS libraries were constructed using ApeKI (GCWGC) and based on a protocol modified from Elshire et al. [17]. The oligonucleotides consisting of a common adapter and the upper and lower strands of each barcode adapter were diluted individually in 50 μM TE and annealed in a thermocycler (95˚C for 2 min; ramp down to 25˚C at 0.1˚C/s; 25˚C for 30 min; 4˚C hold). Barcodes and common adapters were diluted in 10× adapter buffer (500 mM NaCl, 100 mM Tris-Cl) to 10 μM, mixed at a one-to-one ratio, and then 2.4 μL was added to a 96-well PCR plate. DNA samples (100 ng/μL) were added to the wells containing the individual adapters. The samples (DNA plus adapters) were digested at 75˚C with ApeKI (New England Biolabs, Ipswich, MA) overnight. Adapters were ligated to sticky ends by adding 30 μL solution containing 10× ligase buffer and T4 DNA ligase (200 units) to each well. The samples were incubated at 22˚C for 2 h and 65˚C for 20 min to inactivate T4 DNA ligase. Sets of 96 digested DNA samples, each using a different barcode adapter, were combined (5 μL each) and purified using a QIAquick PCR Purification Kit (Qiagen) following the manufacturer's protocols and eluted in a final volume of 50 μL. The amplified restriction fragments from each library were added to 50 μL of a mixture of 2 μL pooled DNA fragments, Herculase II Fusion DNA Polymerase (Agilent), and 25 pmol of the following primers: (A) 5 0 -AAT GAT ACG GCG ACC ACC GAG ATC TAC ACT CTT TCC CTA CAC GAC GCT CTT CCG ATC T-3 0 (58 mer) and (B) 5 0 -CAA GCA GAA GAC GGC ATA CGA GAT CGG TCT CGG CAT TCC TGC TGA ACC GCT CTT CCG ATC T-3 0 (61 mer). PCR was performed using a Life ECO Thermal Cycler (Bioer Technology Co.) with the following parameters: initial denaturation at 95˚C for 2 min, 16 cycles (95˚C for 30 s, 62˚C for 30 s, and 68˚C for 30 s) of amplification, and a final extension at 68˚C for 5 min. The amplified sample pools represented a sequencing library. The libraries were purified as above, except that the final elution volume was 30 μL. To evaluate the amplified fragment size, 2 μL was loaded onto an 1.5% agarose gel with running conditions of 200V for 25min and library quality was checked using an Agilent Tape station with a high-sensitivity DNA chip.

Preprocessing
Raw sequences were de-multiplexed using the barcode sequences and trimmed using an SEEDERS in-house python script [21]. While filtering reads that contain ambiguous bases in the barcode, this script divides the raw Illumina FASTQ file into the 96 individual FASTQ files based on the barcode sequences. Reads were also trimmed using the cutadapt version 1.8.3 [22] if the sequence contained the common adapter.
The de-multiplexed reads were trimmed using DynamicTrim and LengthSort programs in the Solexa QA package v.1.13 [23]. The sequence near the primer that had Phred quality score fell below Q = 20 (or error probability 0.05) were removed. In addition, 5 0 and 3 0 stretches of ambiguous 'N' nucleotides were trimmed. Finally, poor-quality sequence reads, along with the DynamicTrim phred score was <20 bases and the LengthSort process used a short read length of <25 bases [23], were discarded.

Alignment & detection of SNP and InDel
Burrows-Wheeler Aligner (BWA, 0.6.1-r104) software was used to align the clean reads to the reference genome [23]. The BWA default values for mapping were used, except for the following: seed length (-l) = 30, maximum differences in the seed (-k) = 1, number of threads (-t) = 16, mismatch penalty (-M) = 6, gap open penalty (-O) = 15, gap extension penalty (-E) = 8. Mapped reads were extracted from the resulting BAM file using SAMtools v. 0.1.16 for further analyses [24]. The high mapping quality assured reliable mapping of the reads, which is important for variant calling. SNPs were called only at the variable positions with a minimal mapping quality (Q) of 30 using the varFilter command. The minimum and maximum read depths were 3 and 95, respectively. A SEEDERS in-house python script considering biallelic loci was used to select significant SNPs from the called SNPs positions [25]. Depending on the ratio of SNP/InDel reads to mapped reads, the variant type was classified into three categories: homozygous SNP/InDel for > 90%, heterozygous SNP/InDel for 40-60%, and other [21,26]. To control the quality of markers, a missing proportion (MSP) < 0.3 and a minor allele frequency (MAF) > 0.1 were selected [27].

Experimental validation of SNPs
Common SNPs were selected by comparing the nucleotide sequences of the SNP loci of the samples of the same variety, and the polymorphic SNP loci were selected by comparing common SNPs among the varieties. When selecting common SNPs in the same variety, �50% missing data were allowed.
A subset of selected polymorphic SNPs was used for validation by Sanger sequencing. Primers were designed to produce 500-600 bp amplicons containing at least one putative SNP using Primer3 v. 2.3.5 [28]. Each 20 μL PCR reaction comprised 1× PCR buffer, 2.5 U Ex-Taq DNA polymerase (TaKaRa), 250 μM each dNTP, 10 pmol each SNP marker, and 20 ng genomic DNA template. PCR was performed using an ABI 2720 Thermocycler as follows: 10 min of denaturation at 95˚C, 35 cycles (45 s at 94˚C, 45 s at 58˚C, and 45 s at 72˚C) of amplification, and a final extension for 5 min at 72˚C. The PCR products were assessed by 2% agarose gel electrophoresis and purified with Expin PCR SV (GeneAll Biotechnology, Seoul, ROK). The purified PCR products were amplified using an ABI BigDye Terminator v. 3.1 Cycle Sequencing Kit (Applied Biosystems), and run on an ABI 3730XL DNA analyzer (Applied Biosystems). The acquired bidirectional sequences and the reference sequence were aligned in SeqMan software (DNASTAR).

Data analysis
The COI sequences were assembled using ClustalW multiple alignments with BioEdit software v. 7.2 (http://www.mbio.ncsu.edu/bioedit/bioedit), and haplotypes and polymorphisms were analyzed using DnaSP v. 5.1 [29]. Genetic relationships were inferred by the maximum-likelihood method with 1,000 replications based on the Tamura-Nei model using MEGA v. 6 [30]. Analyses of population genetic structure were assessed to SNP dataset by Bayesian modelbased cluster analyses in STRUCTURE v. 2.3.4 [31]. The analyses were based on the admixture and correlated allele frequencies models. Clusters (K) were estimated in five replicates using K values of 1 to 10, 10,000 of burn-in period and 10,000 iterations for each run. STRUCTURE analyses were performed on the polymorphic SNP dataset of T. rubripes, T. chinensis, and T. pseudommus individuals to evaluate the population structure of these three species.

Genetic variation of COI gene
The 141 individuals of the three Takifugu species were identified as two or more species based COI sequence of T. rubripes (Acc. No. AP006045), T. chinensis (KY514072), or T. pseudommus (KY514075). Sequence analyses of the 600 bp COI gene fragment identified 15 polymorphic sites with sequence variation, which resulted in 10 haplotypes (Table 2). There were seven, six, and seven haplotypes in the T. rubripes, T. chinensis, and T. pseudommus samples, respectively. Haplotype analyses demonstrated no variation in alleles among T. rubripes, T. chinensis, and T. pseudommus. Most individuals of the three Takifugu species were of Hap_1. The 30 individuals of the Shimonoseki cultured population were of one haplotype, which had a nucleotide diversity of 0%.

In silico genome digestion and sequencing raw data
To obtain restriction fragments of 200-500 bp, we performed in silico enzymatic digestion of the T. rubripes genome using ApeKI, PstI, and ApeKI-PstI. ApeKI was predicted to generate 344 K fragments, PstI 52 K fragments, and ApeKI-PstI 97 K fragments (Fig 1). Therefore, we conducted in silico digestion using ApeKI (GCWGC) with a protocol modified from that of Elshire et al. [17].
Sequencing on the Illumina Hiseq 2500 platform generated 573,121,290 paired-end unique reads and 57,885,250,290 bp. The number of raw data reads for each sample after de-multiplexing using the barcode sequence was 537,661,624 (93.81%). After de-multiplexing, removing the barcode and adaptor sequences, and quality trimming, the number of reads mapped to the reference genome was 425,121,518, which accounted for >90% of the annotated genes.

Selection of useful SNPs and phylogenetic analysis
In all, 638,028 SNP matrix loci were combined using the raw SNPs of each sample including 96 of the genus Takifugu. These SNP markers were subjected to a filtration for MAF and missing data and 16,577 loci with MAF> 10% and <30% missing data were obtained. These SNPs were analyzed common SNP loci referring SNP location common within the same species in integrated SNP matrix of the 96 sample, resulted in 210,656 of T. rubripes, 141,481 of T. chinensis and 117,348 of T. pseudomonas and 159,284 comparable common SNP loci were selected. Among these, 18 polymorphic SNP loci which showed the interspecies polymorphism at the same SNP location among Takifugu were selected. Eighteen SNPs were used to construct a maximum-likelihood phylogenetic tree (Fig 2) with 1,000  Fig 2). To analyze genetic differentiation confirmed the results exhibited by STRUCTURE outputs predicted K = 2 as the most likely value of clusters to distinguish T. rubripes clade (Fig 3A). STRUCTURE analyses based on all individuals showed that T. chinensis and T. pseudommus belong to one cluster while T. rubripes formed a separate cluster(Orange), with the exception of three individuals (Fig 3B). This indicates that T. rubripes could be differentiated from T. chinensis and T. pseudommus. Next, we validated these data by Sanger sequencing using sample set B.

Experimental validation of SNPs
Eighteen SNP loci identified based on the GBS library were amplified for sample set B. Eighteen SNP details and information were shown in Table 3. These loci were validated from 45 individuals by Sanger sequencing. A phylogenetic tree of the 18 SNPs was produced using the reference sequence of set A and the validation sequence of set B. The tree indicated that 89% of all individuals identified as T. rubripes in set B were grouped with T. chinensis and T. pseudommus (Fig 4). This is in contrast with the grouping of only three T. rubripes individuals in the T. chinensis and T. pseudommus clade. STRUCTURE analyses showed that the T. rubripes cluster of set B was not separated from T. chinensis and T. pseudommus (Fig 5).

Discussion
SNPs are the most frequently used polymorphisms for simultaneous identification and genotyping [32]. Genome-wide SNP discovery may be performed using GBS, restriction-associated DNA sequencing, or reduced representation libraries. GBS offers ultra-high-throughput sequencing, requires a small amount of DNA (100 ng), and enables a high level of multiplexing [33]. Selection of appropriate restriction enzymes is important for GBS because this influences the properties of the libraries and the extent of SNP coverage [34]. The initial protocol involved one enzyme and was subsequently modified to use two [35]. A two-enzyme GBS protocol reduces genome complexity by avoiding sequencing of repetitive regions and generates a more uniform library for sequencing than a one-enzyme GBS protocol [36]. Therefore, we performed genome digestion using both one-and two-enzyme protocols. The restriction enzyme was selected based on a prior report [34]. Using only a single enzyme, ApeKI yielded more fragments than the other enzymes (Fig 1). In addition, use of ApeKI resulted in identification of more SNPs than the other restriction enzymes, but with an imbalanced density distribution, probably because of its sensitivity to methylation [37]. GBS typically results in a large amount of missing data because of its limited sequencing depth [38]. To increase data quality, Liu et al. [38] proposed modulation of high-quality SNPs with < 20% missing data. However, this approach was at risk of loss of some SNPs, so Lin et al [39] suggested <50% missing data. Therefore, common SNPs were composed of data having less than 50% missing values. GBS-SNP could simplify the alignment problem in species with a high level of genetic diversity. Indeed, GBS is suitable for population diversity, germplasm characterization, breeding, and trait mapping in diverse organisms [17]. In addition, it can resolve phylogenetic relationships and genetic diversity in tomato, rice, and lentil [40,41]. In our study, 18 polymorphic SNPs were identified to evaluate the genetic relationships among three Takifugu species. Phylogenetic analyses based on all individuals of set A indicated that the T. rubripes clade, with the exception of three individuals from Sinan, Incheon, was distributed in the T. chinensis and T. pseudommus clade (Fig 2). STRUCTURE analyses using the same individuals yielded the same result (Fig 3B). Most T. rubripes individuals of set A were from aquaculture in Japan; the remaining individuals were caught wild in the ROK and from culture in China, including three individuals in the T. chinensis and T. pseudommus clade (Fig 2). Therefore, the three T. rubripes individuals from Sinan, Incheon, which were grouped in the T. chinensis and T. pseudommus clade, have a genetically complex structure or unclear phenotype, such as an interspecies hybrid. Indeed, one interspecies hybrid was 99-100% genetically identical to T. rubripes, T. chinensis, and T. pseudommus and had morphological features of both T. rubripes and T. chinensis [2]. In addition, we speculate that most T. rubripes in set A were from aquaculture in Japan, and so had low genetic diversity. Thus, T. rubripes was genetically different from T. chinensis and T. pseudommus. All of the cultured T. rubripes individuals were of one haplotype in COI region (Table 2). Cui et al. [14] reported that cultured T. rubripes had lower genetic diversity (0.7832) than wild T. rubripes and T. pseudommus (0.8661 and 0.8272, respectively).
We used sample set B to validate the ability of the polymorphic SNPs to evaluate genetic variation among the three Takifugu species. Unlike the results obtained using sample set A (Fig 2), most T. rubripes in set B grouped with the T. chinensis and T. pseudommus clade (Figs  4 and 5). This indicates no genetic variation among T. rubripes, T. chinensis, and T. pseudommus. A variety of DNA-based markers, such as microsatellites, mitochondrial DNA, and randomly amplified polymorphic DNA (RAPD), have been used to evaluate the genetic relationships among T. rubripes, T. chinensis, and T. pseudommus. Reza et al. [7] performed sequence analyses of mitochondrial and nuclear genes of T. rubripes and T. chinensis, and the results indicated that both are of the same species but have different phenotypes. In addition, Reza et al. [42], using microsatellites and the mitochondrial control region, reported that T. rubripes and T. chinensis could be considered the same species. In addition, RAPD, mitochondrial 16S ribosomal RNA gene sequences, and microsatellites have indicated that T. rubripes and T. pseudommus are genetically similar [43]. Moreover, Baek et al. [2] reported that the COI sequence of T. rubripes, T. chinensis, T. pseudommus, and an interspecies hybrid were 99-100% similar. Therefore, further studies of the genetic relationships among the three Takifugu species are warranted. we could not identify the genetic difference among the three species of Takifugu species, whole genome sequence analysis by NGS might be necessary for this purpose. In this study, we used GBS-SNP to assess the genetic relationships among T. rubripes, T. chinensis, and T. pseudommus. Our results suggest that these species exhibit a very low level of genetic variation and could be considered one species, in agreement with previous reports.
All 18 SNPs were selected because they were polymorphic in three Takifugu species and expected to differentiate the species. However, the validation process showed that these three Takifugu species could not be differentiate as in case of COI comparison. Therefore, there is not species specific marker until now and further study for molecular maker is need or these three species could be regard as subspecies of one species as remarked in the discussion.