Genome-Wide SNP Discovery, Genotyping and Their Preliminary Applications for Population Genetic Inference in Spotted Sea Bass (Lateolabrax maculatus)

Next-generation sequencing and the collection of genome-wide single-nucleotide polymorphisms (SNPs) allow identifying fine-scale population genetic structure and genomic regions under selection. The spotted sea bass (Lateolabrax maculatus) is a non-model species of ecological and commercial importance and widely distributed in northwestern Pacific. A total of 22 648 SNPs was discovered across the genome of L. maculatus by paired-end sequencing of restriction-site associated DNA (RAD-PE) for 30 individuals from two populations. The nucleotide diversity (π) for each population was 0.0028±0.0001 in Dandong and 0.0018±0.0001 in Beihai, respectively. Shallow but significant genetic differentiation was detected between the two populations analyzed by using both the whole data set (FST = 0.0550, P < 0.001) and the putatively neutral SNPs (FST = 0.0347, P < 0.001). However, the two populations were highly differentiated based on the putatively adaptive SNPs (FST = 0.6929, P < 0.001). Moreover, a total of 356 SNPs representing 298 unique loci were detected as outliers putatively under divergent selection by FST-based outlier tests as implemented in BAYESCAN and LOSITAN. Functional annotation of the contigs containing putatively adaptive SNPs yielded hits for 22 of 55 (40%) significant BLASTX matches. Candidate genes for local selection constituted a wide array of functions, including binding, catalytic and metabolic activities, etc. The analyses with the SNPs developed in the present study highlighted the importance of genome-wide genetic variation for inference of population structure and local adaptation in L. maculatus.


Introduction
Considering the ongoing worldwide depletion of most marine populations [1], accurate estimates of population demographic parameters are often necessary for fisheries management [2,3]. In the past decades, tens to hundreds of neutral markers have been used for population genetic inference [4][5][6]. However, the applications for recently isolated populations of marine species with shallow genetic structure and large effective population size have been limited.
In recent years, advances in high-throughput reduced-representation genome sequencing (RRGS) technology have provided an unprecedented opportunity to conduct population genomic studies in both model and non-model organisms. Restriction-site associated DNA tag sequencing (RAD-seq) is a powerful RRGS protocol [21,22]. RAD-seq approach has been successfully applied in a variety of organisms to identify resources of genome-wide SNPs, including both plants [23,24] and animals [25,26]. The advantages of RAD-seq in efficiency, costs and accuracy have revolutionized the field of population genetics and facilitated population structure inferences and local adaptation studies at a genome wide scale [27].
The spotted sea bass, Lateolabrax maculatus, belongs to the family Moronidae (Perciformes) [28,29]. Lateolabrax maculatus is distinguished newly described species from the Japanese sea bass, L. japonicus and is characterized by many clear black dots on lateral body region [30]. It is widely distributed along coasts of the Bohai Sea, Yellow Sea, East China Sea and South China Sea, reaching south to borders between China and Vietnam and north to Southeast coast of South Korea [31,32]. L. maculatus is a species of high commercial value and mainly found in moving water of inshore rocky reefs. Population decline of L. maculatus has been recorded due to overfishing and habitat deterioration resulting from anthropogenic activities [33,34]. Although previous population genetics studies using both mitochondrial DNA (mtDNA) sequences and microsatellites showed some genetic structuring between populations of L. maculatus [32,33], fine-scale population structure still remains to be revealed by genomic-wide genetic data. Moreover, the Northwest Pacific marginal seas provide an excellent natural system for studying local adaptation. The Northwest Pacific marginal seas are relatively young postglacial ecosystems (< 10 000 years) and characterized by environmental gradients [32]. For example, the average annual sea surface temperature ranged from 10.9°C in Bohai Sea to 26.5°C in South China Sea (data provided by the National Oceanic and Atmospheric Administration; NOAA). As a widely distributed marine fish species in the Northwest Pacific, populations of L. maculatus may experience divergent selection in heterogenous environments. Furthermore, naturally spawned fry of L. maculatus were captured from coasts of China, Korea, and Taiwan and transported to different regions of China, Japan and Korea for cage cultivation in the past three decades [35,36]. The development of a set of appropriate molecular markers will also facilitate the scientific management of the genetic resource and the avoidance of the genetic disturbance of the natural populations caused by the occasional escape of cultured individuals.
In the present study, we generated a novel resource of genome-wide SNPs for L. maculatus by paired-end sequencing of restriction-site associated DNA (RAD-PE) for 30 individuals collected from two populations across its distribution range in China. The SNPs were then used to evaluate the levels of genetic diversity and population divergence between the two populations. Outlier tests were also conducted to detect loci under putative selection. Finally, function annotation of the outlier loci was performed to determine whether the potentially adaptive loci localized to known genes or conserved genomic regions.

Ethics statement
The field studies did not involve any endangered or protected species. Lateolabrax maculatus is not protected by Chinese law. No fishing license was required for collection of samples from all locations. It is a commercially harvested species in China. The fish were collected by trawling by local fishermen for commercial purposes and were already dead when collected. No of the authors was involved in the collection of the fish. Animal Ethics Committee approval was not needed because no handing of live animals was involved.

Sample collections and DNA extraction
Samples were collected from two separate locations of heterogenous environments in May 2014: one from coast of Beihai, Guangxi Province (21°24' N, 109°05' E, T a = 26.5°C, T a , average annual sea surface temperature) and the other from Dandong, Liaoning Province (39°52' N, 124°19' E, T a = 10.9°C). Muscle tissue samples of a total of 30 individuals (16 from Beihai and 14 from Dandong) were collected and preserved in 96% ethanol for DNA extraction. Genomic DNA was extracted from~100 mg muscle tissue using a standard phenol-chloroform extraction protocol [37]. Samples were treated with RNase A to produce pure, high molecular weight, RNA-free DNA. Quality and concentration of DNA samples were measured by a Nanodrop TM 2000 (Thermo Scientific) spectrophotometer and a Qubit 1 2.0 fluorometric quantitation. The optimal concentration was no less than 50 ng/μL, and the total DNA recovered was more than 2 μg.

Library preparation and sequencing
RAD-PE libraries were prepared using the protocol outlined by Baird et al. [21] and Etter et al. [38]. Genomic DNA from each individual was digested with high fidelity restriction enzyme EcoRI (G^AATTC). Then, Illumina P1 adapter containing individual-specific index (6 bp) was ligated to the digested products. The adapter-ligated DNA was sheared and separated by electrophoresis on a 2% agarose gel. Fragments in the 200-600 bp size range were collected using a MinElute Gel Extraction Kit (QIAGEN, Beijing). After treating double-stranded DNA ends with blunt-ending enzymes and adding 3'-adenine over-hangs, a modified Illumina P2 adapter was ligated. Finally, the libraries were enriched by high-fidelity PCR amplification (8-12 cycles). RADs for each individual were sequenced on an Illumina HiSeq TM 2500 sequencing platform at Novegene in Beijing, China. Due to the unavailability of existing genomic information for the diploid L. maculatus, one individual was deep sequenced (approximately 32× coverage) to assemble reliable contigs as a reference assembly for downstream alignment and SNP calling.
using the following criteria: 1) removing reads with adapter contamination; 2) reads with 10% unidentified nucleotides were removed; 3) reads with > 50% bases having phred quality < 5 were removed; 4) putative duplication reads were removed to reduce the impact of PCR artifacts on allele frequency estimation; 5) reads were checked for presence of the partial EcoRI motif (^AATTC).
For the reference individual, the remaining first reads with restriction enzyme recognition site after quality control were clustered into RAD cluster tags using cd-hit-est [39]. A maximum of three mismatches between reads was allowed, which corresponded to~3% of the single-end read length (125bp) [40]. RAD cluster tags with less than 10 or more than 400 reads (approximately 20× of the average read coverage) were discarded. The paired-end reads associated with each RAD cluster tag were extracted and the sequences were sent to the assembly program Velvetopt [41] to construct scaffolds using adjacent contigs identified by paired-end information.

Read alignment, SNP discovery and filtering
Allowing one permissible alignment per pair read, quality-filtered reads of each individual were aligned to the assembled reference sequences using BWA (version 0.6.2) with default parameters (mismatch penalty 4; gap open penalty 6) [42]. Following the alignment, SNP calling was performed by a conservative Bayesian approach as implemented in the SAMTOOLS package [43]. SNPs were further filtered to maximize data quality according to the following criteria: (i) bi-allelic SNPs; (ii) an average phred score > 20; (iii) coverage depth 4 and 100; (iv) missing ratio within each population < 20%; (v) a global minor allele frequency (MAF) 0.05 in the two pooled populations. Considering the high proportion of paralogous sequence variant (PSVs), only SNPs with F IS values between -0.3 and 0.3 and observed heterozygosity values < 0.5 were retained for subsequent analyses [44]; (vi) one SNP was randomly chosen from each RAD tag for subsequent population genetic analyses.

Outlier tests
Two F ST -based outlier tests were applied to identify loci that showed divergent patterns of differentiation compared to neutral expectations, and therefore have been potentially affected by selection. First, polymorphic loci were screened for outliers using the coalescent method of Beaumont & Nichols [45] as implemented in LOSITAN [46]. LOSITAN was run using parameter setting of 100 000 simulations, confident interval of 0.995, false discovery rate (FDR) of 0.05, subsample size of 28, attempted F ST of 0.055 and simulated F ST of 0.052. Second, outlier SNPs were also detected by using the Bayesian simulation approach of Beaumont & Balding [47] as implemented in BAYESCAN [48]. BAYESCAN runs were implemented using default values for all parameters, including a prior odds value of 10, with 100,000 iterations and a burn-in of 50,000 iterations. Loci were considered under selection with a FDR of 0.05.

Genetic diversity and population differentiation
The VCFTOOLS package [49] was used to estimate observed (H O ) and expected (H E ) heterozygosity for each population. The loci with minimum depth of 4 were generated using ref_map.pl in Stacks version 1.32 [50]. Then the nucleotide diversity (π) for each population was calculated by the POPULATIONS program (-r 0.8 -m 4-min_maf 0.05) based on these loci. The whole data set, the neutral SNPs and the putatively adaptive SNPs were used to assess the current distribution of genetic variation by using the Bayesian model-based clustering program of Admixture version 1.2.3 [51]. Furthermore, relationships among individuals within and between populations were calculated and visualized using the NetView P version 0.6 software at a knn = 10 [52]. NetView P is a network analysis pipeline designed for detecting and visualizing complex population structure based on genome-wide SNPs [53]. The VCF files were reformatted with PGDSpider version 2.0.1.1 [54]. F ST values between populations based on different datasets were calculated using ARLEQUIN version 3.5.1.3 [55], and significance was determined using 10 000 permutations.

Population assignment tests
Assignment power of four data sets was evaluated with leave-one-out tests in GeneClass version 2.0 [56] to compare the influence of number of SNPs and relative divergence of SNPs on assignment accuracies. These data sets included (i) the complete putative outlier SNPs (298); (ii) 298 randomly chosen SNPs from the complete neutral data set; (iii) 20 randomly chosen SNPs from the complete neutral data set; and (iv) 20 randomly chosen SNPs from the complete putative outlier data set. Individuals were considered to be assigned to a population if the assignment probability to that population was higher than to the other population.

BLASTX analyses and GO annotation
Contigs containing the outlier SNPs were used as queries in nucleotide searches with BLASTX against the non-redundant protein database of bony fishes at the National Center for Biotechnology Information (NCBI) website (E-value < 1.0E-6). In case of multiple hits, the best match was selected for each outlier containing contig. Gene ontology (GO) functional annotation of the contigs with significant BLASTX hits were obtained using BLAST2GO suite (http://www. blast2go.com/b2ghome) [57], which conducts BLAST similarity searches and maps GO terms to the homologous sequences detected. Only ontologies with E-value < 1.0E-6, annotation cutoff > 55 and a GO weight > 5 were considered for annotation.

RAD tag sequencing and data filtration
RAD-PE sequencing generated 24.29 million raw read pairs (6.07 G (gigabases) raw data) for the reference individual. After quality filtering, 23.57 million clean read pairs (5.89 G clean data) with the effective rate of 97.03% were retained. After removal of PCR duplicates and only keeping read pairs with the partial EcoRI motif (AATTC), 19.50 million reads were finally retained, presenting a clean duplication rate of 11.36% and digestion ratio of 93.35%, respectively (Table 1). For the 29 normally sequenced individuals, sequencing of the RAD libraries generated a total of 169.26 million raw read pairs (45.43 G raw data) (S1 Table). After quality control, a total of 160.8 million clean read pairs (43.18 G clean data) was retained, which presented an average effective rate of 95.0%. Of the retained read pairs, an average of 5.52 million read pairs per individual were kept after removing putative duplication reads and reads without intact EcoRI cutting sites (average clean duplication rate of 20.11% and digestion ratio of 95.12%, respectively). Overall, the data showed a high phred quality (phred score 20 89.47%; phred score 30 81.38%), a stable GC content ranging from 38.67% to 41.7% and a high digestion rate from 76.62% to 98.25%. The Raw RAD-seq reads pairs have been deposited in the Sequence Read Archive database under Accession no. SRP072011.

Assembly of the reference sequence
Allowing for a maximum of three mismatches, a total of 3.43 million cluster tags were generated. After removing those cluster tags with less than 10 or more than 400 reads, a total of 223 573 cluster tags containing 15.1 million pair reads were retained. In total, the resulting reference assembly consisted of over 285 408 contigs (~113 million nucleotides) with an N 50 size of 509 bp and a GC content of 40.11% (S1 File). After the filtered pair-end reads were realigned onto the assembled contigs, an average depth of 31.56× was obtained and approximately 87.22% of the reference assembly was covered by four or more reads ( Table 2).

SNP discovery and analysis
Prior to any quality filtering, a total of 1 184 075 putative SNPs were detected among 30 individuals. After retaining bi-allelic loci with phred score 20, a total of 1 052 835 SNPs were left. Applying a minimum coverage of four reads and the missing ratio within each population < 20%, a total of 109 307 SNPs were retained. After removing SNPs with a global MAF < 0.05, 64 008 SNPs were left. After only keeping loci with F IS values between -0.3 and 0.3 and H O < 0.5 in both populations, 42 733 SNPs were finally retained (Table 3; S2 File). The  Table). About 61% of the retained SNPs were proved to be transitions, corresponding to an observed transition / transversion ratio of 1.59 (Fig 1).

Outlier detection
A total of 42 733 SNPs were included in both tests for outliers. Using LOSITAN, a total of 3 122 SNPs were identified as outliers possibly under divergent selection after applying a significance level of 0.995. A total of 356 outlier SNPs representing 298 unique contigs were detected by BAYESCAN, all of which were part of those identified using LOSITAN (Fig 2; S4 File). To remove linkage disequilibrium, only one SNP was randomly chosen from each RAD tag for subsequent population genetic analyses, which produced a final data set of 22 648 SNPs. Admixture results based on all three different SNP data sets (whole, neutral, and outlier SNPs) showed that individuals from Dandong and Beihai were clearly separated from each other (Fig 3). Besides, the network of the two populations agreed well with structure detected in the Admixture analyses and genetic break between Beihai and Dandong was clearly visualized in the network topology (Fig 4). F ST between the two populations was small but significant based on the whole data set (F ST = 0.0550, P < 0.001) and neutral SNPs (F ST = 0.0347, P < 0.001). As expected, F ST estimation based on the outlier SNPs yielded a much larger value (F ST = 0.6929, P < 0.001).

Population assignment
Assignment accuracy was 100% by using both the complete outlier data set and the equal number of neutral data set. The accuracy based on 20 randomly chosen outlier SNPs,( 93.8%) was higher than that based on 20 randomly chosen neutral SNPs ( 78.6%) ( Table 4).

BLASTX analysis and GO annotation
BLASTX analysis of the 298 contigs harboring outlier SNPs against various bony fish genomes resulted in significant hits to 40 fish species. BLASTX similarity results showed that 55 of the 298 contigs corresponded to known proteins in the UniProt database (E-value 1.0E-6). Functional categorization of the annotated sequences involved in binding and recognition, catalytic and metabolic activities, etc (S3 Table). GO functional annotation of the 55 contigs with significant BLASTX hits yielded GO terms for 22 contigs (40.0%), which were classified into 25 functional groups in three functional categories: molecular function, biological process, and cellular component (Table 5 and Fig 5). Some contigs were classified into more than one functional category, which resulted in the sum of the contig ratio in each category exceeding 100%. Among the contigs categorized as cellular components, 36.67% were classified as cell and 36.67% as cell part. The majority of the contigs categorized as molecular functions was associated with binding (50%) and catalytic activity (41.67%). Most of the contigs categorized as biological process were involved in cellular process (60%) and metabolic process (50%).

Discussion
In present study, we developed a genome-wide SNP resource of L. maculatus using RAD-PE method. To our knowledge, this was the first report about the generation of such a large panel of novel SNPs for L. maculatus. Furthermore, we highlighted the potential advantages of the genome-wide SNPs for inference of population divergence and candidate adaptive markers detection of L. maculatus.
Large-scale SNP identification, genetic diversity, and population genetic structure As a newly described species from L. japonicus, the limited number of available molecular markers has constrained population genetic studies of L. maculatus in the past 10 years. Only 37 polymorphic microsatellites were developed [33,58]. In addition, the complete mitochondrial genome of L. maculatus was also available in GenBank [59]. Most previous population genetic studies of L. maculatus were based on a handful of microsatellite markers, mitochondrial sequence analysis, and random amplified polymorphic DNA (RAPD) markers, which obtained inconsistent results [32,33,60,61]. The transition/transversion ratio was 1.59, which suggested a small influence of sequencing error on calling SNP. Similar transition/transversion ratios have also been observed in the great tit (1.7:1 [62]) and the European eel (1.65:1 [3]). In the absence of a reference genome for L. maculatus, the contigs generated using paired-end RAD data provided sufficient flanking region around SNPs for design of high-throughput SNP genotyping arrays. This approach has been proved successful for SNP assay design simultaneous with SNP discovery in several studies [38,63,64].
The nucleotide diversity was 0.0028 in Dandong and 0.0018 in Beihai. Similar level of variations was identified in the other marine species, such as European eel (π = 0.00529) and small Genome-Wide SNP Discovery for Lateolabrax maculatus yellow croaker (π = 0.00105) [3,65]. The higher nuclear genome-wide nucleotide diversity in Dandong than in Beihai was consistent with the results of previous mtDNA study. By using mtDNA control region sequences, Liu et al. [32] found that northern populations of L. maculatus generally showed higher nucleotide diversities than southern ones, with the lowest one found in Beihai. All these results was consistent with the hypotheses that the glacial refugium of L. maculatus was located in the basin of East China Sea and the genetic diversity is expected to be higher in the ancestral population than in the derived population. Our genome-wide SNP data set demonstrated high power in resolving population genetic structure of L. maculatus. Both the Structure and NetView P analyses with the whole SNP dataset revealed a clear separation of distinct genetic clusters corresponding to the two geographic populations. However, no genealogical clustering that corresponded to sampling localities was detected by using mtDNA control region sequences [32]. Previous population genetic and phylogeographic studies based on traditional markers demonstrated that most marine fishes generally show low levels or absent of genetic differentiation among geographic regions due to high dispersal potential and an absence of physical barriers [66][67][68]. The high resolution of genome-wide SNPs has sufficient power to detect population structure even when genetic differentiation is low, as it is  typical for marine species. The advantage of genome-wide SNPs over traditional genetic markers in population genetic analyses has been increasingly reported in marine fishes with high gene flow [13][14][15], which highlighted the utility of genome-wide data in delineating shallow population structure. The genome-wide panel of high quality SNPs generated will facilitate further population genomic and phylogeographic studies on L. maculatus.

Population assignment
In the present study, both the putative outlier loci and neutral loci were powerful in population assignment of L. maculatus. In the past three decades, naturally spawned fry of L. maculatus were captured from coasts of China, Korea, and Taiwan and distributed to different regions of China, Japan and Korea for cage cultivation [35,36]. Escaping of cage-cultured L. maculatus imported from China has been reported in various localities around western Japan, where the spotted sea bass is vigorously cultured [31]. These new informative SNPs, especially the outliers, would be useful for increasing accuracy when assigning individual L. maculatus to population-of-origin in aquaculture using naturally spawned fry, which would facilitate the scientific management and sustainable exploitation of the genetic resource of natural populations of L. maculatus. Since the two populations analyzed in the present study were geographically distant and genetically differentiated, screening of further samples from geographically close localities will be required to assess the accuracy reported in this study. Non-neutral markers can be useful for individual and composition assignment [69]. Indeed, the 20 randomly chosen outlier SNPs performed better than the 20 neutral SNPs in L. maculatus. Outlier loci have also been proved successful for individual and compositional assignment in various fishes. For example, Larson et al. [18] demonstrated that outliers identified by RAD-seq in Chinook salmon (Oncorhynchus tshawytscha) can be used to create high-resolution panels for genetic monitoring and population assignment.

Local adaptation
Recently, the advent of high-throughput DNA sequencing technology provides a novel approach for investigating local adaptation in natural populations of marine fishes [14,18,70]. BLASTX analyses of the outlier-containing contig sequences revealed that only 55 out of 298 (18.4%) highly divergent contigs were located in functional genes or genomic regions, suggesting that most of the putative outlier SNPs detected in L. maculatus were located in unknown proteins and non-coding genomic regions influenced by selection through genetic hitchhiking [48]. The BLASTX annotated contigs in the present study are involved in metabolism, growth, immunity and biorhythm. Contig_1733564 was annotated as an E3 ubiquitin-protein ligase gene (HERC1), which is involved in the ubiquitin mediated proteolysis. contig_1782285 (diacyllycerol kinase zeta isoform x1 gene, DGKZ) is a gene involved in pathways for glycerolipid metabolism and glycerophospholipid metabolism. Contig_612117 (C-terminal binding protein gene, CtBP) is a key transcriptional coregulator in adipose tissue, which works with several different partner proteins to regulate the development of both white and brown adipocytes [71]. Contig_1242038 (lipase maturation factor 2 gene, LMF2) may be required for maturation and transport of active lipoprotein lipase through the secretary pathway. Contig_2602294 (deathassociated protein kinase 1-like gene, DAPK1) is an important regulator of the cellular antiviral response [72]. Contig_3052505 was annotated as aryl hydrocarbon receptor nuclear translocator-like 2 (ARNTL2), which is an essential component within the clock gene regulatory network. Contig_628717, contig_2583004, contig_432419, and contig_525464 were annotated as zinc finger protein gene (ZNF), which was reported to play broad-spectrum cellular functions in eukaryotic cells biology [73]. Meanwhile, other studies of marine fishes also found the same or similar functional candidate genes potentially important for local adaptation, such as transcription factor (contig_474018), Golgi apparatus protein (contig_1075810) [70] and zinc finger protein, RNA-directed DNA polymerase from mobile element jockey (contig_1450955; contig_1699444), RNA-directed DNA polymerase from mobile element jockey-like (con-tig_105161), RNA-directed DNA polymerase from transposon BS (contig_283613) [65]. The consistent results suggested that these candidate genes may play important roles in local adaptation. Moreover, GO functional annotation of 22 out of the 55 contigs with significant BLASTX hits demonstrated that the majority of the contigs categorized as molecular functions was associated with binding and catalytic activity, and most of the contigs categorized as biological process were involved in cellular process and metabolic process, indicating that these outliers are likely to be biologically relevant for adaptation of populations to local environments. Species that occupy heterogeneous environments (i.e. temperature) along their geographical distribution experience spatially varying selective pressure, which often result in local adaptation of ecologically important traits [74]. The two L. maculatus populations were collected from the Yellow Sea and the South China Sea with highly heterogeneous environments. Indeed, variance in ecologically important life history traits such as growth rate, size at maturity and spawning season have been observed among populations of L. maculatus [75,76]. Since L. maculatus re-colonized the extensive continental shelf of the China sea from glacial refugium in the East China Sea after the Last Glacial Maximum (LGM [32]), these putative adaptive outliers suggested that natural populations adapt to local environments could have occurred after LGM. Guo et al. [70] analyzed > 30 000 SNPs based on a pooled RAD-seq approach from 10 populations of Baltic three-spined sticklebacks and provided strong evidence for heterogenic genomic divergence driven by local adaptation along an environmental gradient in this postglacial ecosystem. We recommend that further population genomic studies use multi-populations across the distribution of L. maculatus and couple the allele frequencies with environmental data to pinpoint regions of the L. maculatus genome under selection.
Supporting Information S1