Genome-wide SNP analyses reveal population structure of Portunus pelagicus along Vietnam coastline

The blue swimming crab (Portunus pelagicus Linnaeus, 1758) is one of the commercially exploited crab fishery resources in Vietnam. This is the first study to provide a broad survey of genetic diversity, population structure and migration patterns of P. pelagicus along the Vietnamese coastline. The crab samples were collected from northern, central and southern Vietnam. Here, we used a panel of single nucleotide polymorphisms (SNPs) generated from restriction site-associated DNA sequencing (RADseq). After removing 32 outlier loci, 306 putatively neutral SNPs from 96 individuals were used to assess fine-scale population structure of blue swimming crab. The mean observed heterozygosity (Ho) and expected heterozygosity (He) per locus was 0.196 and 0.223, respectively. Pairwise Fst and hierarchical AMOVA supported significant differentiation of central and northern from southern populations (P<0.01). Population structure analyses revealed that P. pelagicus in the south is a separate fisheries unit from the north and center. Contemporary migration patterns supported high migration between northern and central populations and restricted genetic exchange within the southern population. In contrast, historic gene flow provides strong evidence for single panmictic population. The results are useful for understanding current status of P. pelagicus in the wild under an environment changing due to natural and anthropogenic stresses, with implications for fisheries management.


Introduction
The tropical to subtropical Vietnamese coastal zone is divided into the Gulf of Tonkin in the North, the central coast, the southeast coast and the Gulf of Thailand in the South [1,2]. The exclusive economic zone (EEZ) covers about 1 million km 2 and 3260 km of coastline along the East Sea (the Vietnamese name for the South China Sea). In winter, currents flow in a North East-South West direction while in summer, ocean currents flow from the South West-North East [3,4] with the eddies existing at the southern and central parts of the Vietnamese coastline PLOS  and treated with RNase (100 mg/mL) to remove residual RNA. Extracted DNA was eluted three times (100 μl elution/time) to get better DNA quality. All elutions were assessed using gel electrophoresis (1% agarose gel). The best elution (sharp, high weight molecular bands, no smear) was selected to determine the concentration by Qubit 1 2.0 Fluorometer (Invitrogen). Selected DNA templates (100 ng, concentration � 3 ng/μl) were then purified using AMPur-eXP (Agencourt) beads using a 2:1 template to bead volume ratio with the beads left in. Purified DNA from each crab individual was simultaneously digested with two restriction enzymes: MboI and Sau3AI (NEB). Each digestion was performed in 25 μl reactions: 2.5 μl SmartCut Buffer (10X), 0.5 μl MboI and 0.5 μl Sau3AI (5 unit/μl), and 21.5 μl of DNA template (eluate from the beads). Digestions were incubated at 37˚C for 3 h to overnight, and then 65˚C for 20 min, cleaned with PEG solution (10 g PEG, 7.3 g NaCl, plus water up to 49 ml), and eluted with 20.1 μl Illumina Resuspension Buffer.
EzRAD library preparation. Cleaned digestions were inserted directly into the Illumina TruSeq nano DNA library Prep kit following the Sample Preparation v2 Guide starting with the "Perform End Repair" step for one-third volume reactions (Supplement S1 [38]). Digested libraries were end-repaired, 350 bp size-selected by SP bead. Firstly, SP bead:H20 (1.5:1) were added to removed >550 bp fragments, the supernatant collected and applied to 10 μl SP bead to subsequently remove <350 bp fragments". The 3' ends of selected libraries were then adenylated and Illumina adapters were ligated to the digested genomic DNA samples. PCR reactions were performed using a total volume of 15 μl including 1.5 μl Illumina PCR Primer Cocktail, 6 μl Illumina Enhanced PCR Mix, 1.875 μl ddH 2 O and 5.625 μl DNA libraries. Biorad thermocyclers (Icycler) were used under the following temperature program: initial denaturation at 95˚C for 3 min, followed by 8 cycles of 98˚C for 20s, 60˚C for 15s and 72˚C for 30s. Final extension was done at 72˚C for 5 min and the soaking temperature was set to 4˚C.
PCR products (The 400-500 bp fragments of which 120 bp are the ligated adapters) were inspected using a 1.5% agarose gel with ethidium bromide and bands were visualized under UV transilluminator. PCR products were purified using SP Beads (1:1), and quantified using qPCR. DNA libraries were sequenced as paired-end 100 bp runs on HiSeq 2500/4000 system (Illumina) in Texas A&M University Corpus Christi Genomics Core Laboratory, USA.

Pop ID
Outlier loci detection and Linkage-disequilibrium (LD) analysis. Our final filtered panel of SNPs was run in BayeScan v2.1 [65] under default parameter settings to identify loci under divergent or balancing selection. A false discovery rate (FDR) correction of 0.05 was applied [66].
LD was measured as the squared pairwise correlation coefficient between loci (r 2 ) calculated using the 'LD' function in the R package 'genetics' [67]. Selected outlier clusters (SOC) and Compound outlier clusters (COC) were identified by LD network analysis using R package 'LDna' [68], optimal value of φ and |E|min parameter and LD threshold was set up for SOC. LD network were constructed using the R package 'igraph' [69].
All loci putatively identified by either programs were removed from the dataset to generate a panel of neutral SNPs.
Genetic diversity and relatedness. Numbers of alleles (N a ), effective numbers of alleles (N e ), expected (H e ) and observed (H o ) heterozygosity, and inbreeding coefficients (G IS ) were calculated for each sampled population and over all populations across the Vietnam coastline using GenAlexv6.5 [70] and GenoDive v.2.0b27 [71].
High levels of relatedness can impact analyses of population structure and estimates of population size, so relationships between individuals were estimated with the R package 'related' [72] using the dyadic ( [73] and triadic [74] maximum likelihood estimators and allowing for inbreeding. For both estimators 95% confidence intervals were calculated with 500 bootstrap events for each pairwise comparison. Potential pairs were identified as exhibiting a related value. Due to the imbalanced numbers of related pairs among populations leading to reduced sample size and avoiding positive bias in estimates due to underestimating relatedness in the overall population [75], further analyses were run with two datasets, one containing all individuals (with related pairs) and one with one putative individual removed per related pair (related individuals removed).
Analyses of population structure. Pairwise comparisons of Fst values between P. pelagicus populations werecomputed in ARLEQUIN [76] to test for significant differentiation among sampled sites. All p-values underwent FDR correction to avoid false positives resulting from multiple comparisons [66]. A hierarchical analysis of molecular variance (AMOVA) was performed to test for significant population structure within species, following two group options: geographically-defined populations (northern, central and southern) and combined individuals from the north and center into a single population, and considering individuals from the south as a separate population (northern-central and southern) using the program ARLEQUIN.
We tested for population connectivity and structure in the program Structure v2.3.4 [77,78] using a model-based Bayesian clustering method to infer the number of lineages, K, in a dataset. Structure was run to test K values of 1 through 4 with 10,000 iterations of burn-in followed by 5,000 Markov Chain Monte Carlo (MCMC) steps, using the correlated allele frequencies admixture model. The optimal value of K was evaluated using the Evanno method [79] by Structure Harvester v0.6.94 [80]. A Discriminant analysis of principal components (DAPC) was performed using the R package 'adegenet' [81]. This analysis provides a graphic description of the genetic divergence among populations in multivariate space.
Migration patterns. Historic gene flow between populations was estimated using the Bayesian inference implemented in MIGRATE-n v3.6.11 [82]. MIGRATE-n's implementation of coalescent theory measures migration 4 × Ne generations in the past [32,83]. Sample sizes were reduced for each population to obtain 200 loci genotyped in 100% of individuals used for the analysis. The run was performed using 500,000 recorded genealogies sampled every 100 steps, preceded by a burn-in of 20,000. Four hot chains were used with temperatures: T1 = 1.0, T2 = 1.5, T3 = 3.0 and T4 = 1.0x10 6 . After optimization, the maximum mutation-scaled effective population size (θ) prior was set at 0.1 while the maximum mutation-scaled migration (M) prior was set at 20,000. Five hypotheses of migration among populations were tested: (1) symmetric migration rates between all sites (Panmixia Model), (2) non-symmetric migration rates between all sites (Full Model) (3) migration between all sites only from the north to the south (North-South Model), (4) migration between all sites only from the south to the north (South-North Model), (5) migration occurring only between neighboring, north-center sites but no migration between south population (South Separate Model). The most likely model was chosen using the Bezier ln produced by Migrate-N according to Beerli et al. (2009) [84]. To elucidate the recent migration patterns, estimate relative migration levels (Nm) between populations were calculated based on neutral SNPs using divMigrate function [85] of R package "diveRsity" [86]. Gene flow patterns were visualized using network graphics produced using the R package "qgraph" [87].
Ethics Statement: All crab were collected from fish markets or through normal fishing activities and therefore within the guidelines of approved IACUC procedures, and did not need sampling permission in Vietnam. This study did not involve protected or endangered species Data Archiving: Upon acceptance, the unmodified sequence data in FASTQ format used in this research along with corresponding metadata will be uploaded and archived in the publicly accessible Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/).

SNP discovery and filtering
Results of 165 libraries of P. pelagicus along the Vietnamese coastline generated 604123297 reads with a reading length of 101 bp. The optimal reference assembly of 3280843 bp was constructed from 9583 RAD tags. Initially, 107115 raw SNPs were detected. After filtering steps, 96 individuals were successfully genotyped at 338 valid SNPs. Information on individuals removed and SNPs retained at each step of filtering and data analysis is presented in S2 Table. Outlier loci detection BayeScan identified thirteen SNPs as outliers (q<0.05, α>0, FDR � 0.05) from the panel of 338 putative SNPs used to detect selection footprints (Fig 2D). LD network presented one selected outlier cluster (SOC) including 32 loci (φ = 1 and |E| min = 30, λ min = 0.79, LD threshold = 0.39) (Fig 2A-2C). The outlier loci detected by BayeScan were included in the SOC of LD network. In total, 32 loci were removed from the SNP panel and the 306 remaining loci were assumed to be neutral.

Genetic diversity and relatedness
Genetic diversity of P. pelagicus is presented in Table 1 Analyses of genetic relationships between individuals revealed 23 pairs of putative half siblings and 5 pairs of putative full sibling ( Table 2) following removal of 16 individuals (S2 Table). Both full and half siblings (22 pairs) occurred abundantly within the southern population, while the other sibling pairs occurred in remaining sampling sites ( Table 2).

Population structure and migration patterns
AMOVA results ( Table 3) of two hierarchical arrangements (3 populations versus 2 populations) and with two data set (with related pairs and related individuals removed) showed the majority of the variation (80.91-87.5%) in P. pelagicus was found within individuals, and highly significant in all cases (F IT = 0.125-0.19, P<0.001). The proportion of variance explained by differences among populations (F ST ) were larger in the two-populations (17.43% with related pairs and 15.03% when related individuals removed) than in the three-pops arrangements (11.29% and 8.06%, respectively). It is clear that related individuals contributed to the percentage of variation according to different clustering of populations, however, in all cases the difference were highly significant (P<0.001). With all arrangements and two datasets, among individuals within populations (F IS ) differentiation were not significant.
Pairwise Fst values between southern population to northern and central populations showed statistically significant genetic differentiation (P<0.001) in all arrangements, and data sets ( Table 4). In three-population clustering, the southern population showed more differentiation with the northern (0.199 with related pairs and 0.181 with related individuals removed) The STRUCTURE analysis, plotted with a K of 2 as chosen by the Evanno method, also showed a clear distinction between the south and the remaining two populations. The similar patterns were observed either with related pairs or related individuals removed from SNPs panels. The southern population was assigned to a first lineage with high certainty (98.4% and 98% composition of the "red" lineage and 1.6% and 2% of "green". Northern and central populations were assigned to the second lineage with the north represented by a dominance (98.2% and 98%) of "green" lineage, and central exhibiting a mixing of "green/red" with percentages of 82.5/17.5 and 80/20 (Fig 3A left and right).
The Discriminant analysis of principal component (DAPC) showed a clear distinction between the southern population from northern and central populations in both neutral SNPs data sets (Fig 3B). In the dataset with removed related individuals, the northern and central populations were somewhat separated ( (Fig 3B, right). However, DAPC analysis based on the 32 under-selected loci showed similar results to neutral related pairs SNPs (Fig 3C). Genetic studies on P. pelagicus Historic migration results strongly supported the Panmixia model based on the highest Bezier approximation score (ln = -115475.9) in which migration was maintained among all sites with random mating between crab individuals ( Table 5). The analyses disclosed that there was no mating restriction between crab individuals in the history supposed to be over 1000s of years [32,83]. The populations were able to share genetic material either through larval dispersal due to currents or via migration of adult crabs. Directional migration relative rates among recent P. pelagicus populations range from 0.1 to 1 (Fig 3D). Among these, asymmetric directional migration seems to have occurred from southern to northern and central populations, however, bootstrap analysis (nbs<0) showed that directional migration was not significant. Migration from northern to central, however, involved significant asymmetric migration (nbs>0) (S1 Fig).

Discussion
The fine scale population structure of swimming crab, applicable in fisheries management was investigated in both putative neutral and outlier loci. Overall, the current analysis based on SNP panels (including or removing related individuals) all showed similar results. The genetic patterns appear to indicate that P. pelagicus in northern-central and southern areas of the Genetic studies on P. pelagicus Vietnamese coastline maintain distinct populations. Significant pairwise Fst comparisons showed strong genetic differentiation between southern to central and northern populations. Furthermore, the hierarchical AMOVA results supported two regional clusters with higher proportions of variation compared to a three-population arrangement ( Table 3). Structure and DAPC analyses clearly divided the populations into two subdivisions (northern-central and southern). Outlier SNPs, which represented higher genetic differentiation, and respected providing better resolution to detect fine-scale population structure, identified the same patterns as neutral loci, suggested neutral loci themselves may reflected geographical adaptation ( [88]. P. pelagicus is well known as a migratory species, both in adult and larval stages. Male Genetic studies on P. pelagicus and female crabs can move between estuaries and open oceans for spawning and/or responding to lowered salinities [7]. As spawning of P. pelagicus occurs year-round [11], following the northeast and southwest monsoons, crab larvae may be dispersed by surface currents (Fig 1) along the coast from the Gulf of Tonkin up to the Gulf of Thailand and vice versa. However, all analyses revealed the consistent patterns of non-connectivity from the south to the remaining P. pelagicus populations. The Vietnamese coastline in the East Sea is influenced by seasonally complex water circulations, which result in upwelling and anticyclonic/cyclonic eddies along the south and central coasts [3,4,89]. In general, eddies may limit larval dispersal, acting as a larval retention system [90,91] and maintaining divergence in marine populations [92]. Winds, together with tidal and Mekong river discharge (6000-12000 m 3 /s) [93,94] were reported as the factors involved in the upwelling, and separate currents in the southern shelf of Vietnam. That may further well explain restricted gene flow in the southern population. Analyses of contemporary gene flow demonstrated the limited genetic exchange in P. pelagicus from the south, while the extensive migration occurring along the northern and central coasts. The migration relative rate (Nm) indicated 10 fold greater migration between northern and central populations than from these sites to the south. What makes this more interesting is that significant asymmetric migration from northern to central populations (S1 Fig). Monsoon-induced currents and eddies reported in the central coast [3] make the central region a potential population sink. In contrast, estimates of historical gene flow provided strong evidence for a single panmictic population. This may indicate historical patterns of connectivity were different to those detected today. Vietnamese coasts are currently undergoing dramatic changes due to human activities that heavily affect ecosystems and organisms [1,2]. These human induced disturbances such as overexploitation and habitat degradation/fragmentation as well as coastal pollution may prevent larval transport and dispersal by inducing broad-scale larval mortality [33] and obstructing adult migration [28,95], which may be one of the leading causes of current population isolation.
In term of genetic diversity, the lowest value was detected in the southern population, in concordance with high inbreeding coefficient (0.265) as well as related pairs ( Table 2). This heterozygosity deficiency is also recorded in P. pelagicus populations in Malaysia [50], and in other marine and freshwater organisms due to widespread habitat loss, degradation and fragmentation [32,96,97]. Significant relatedness and sib-ships have been observed in marine populations due to biophysical larval behavior [98,99], self-recruitment [31,100], and overexploitation/restocking [101]. Kien Giang was the main harvested area of P. pelagicus in Vietnam, high level of inbreeding and relatedness, and significant genetic differentiation may indicate that local recruitment originates from a limited pool of successful reproductive adults, and reflect somewhat the pressure of overexploitation on crab populations.
This was the first study to apply the powerful technique of over a hundred SNP markers to infer the natural and/or manmade barriers to gene flow in Portunus pelagicus. The population  [53], shown using mitochondrial makers. According to Lemopoulos et al. (2018) [102], RADseq-generate SNPs outperformed microsatellites (and possibly other markers) for investigating individual-level genotypes, and can be applied to studies of small-scale population structure such as the swimming crab in Vietnam. Looking at the swimming crab in the Indo-Pacific region overall, different patterns and levels of genetic structure of P. pelagicus have been detected, such as significant genetic differentiation [44,45,47] as well as high gene flow [50,51]. Highly restricted gene flow is mainly reported due to geographic distributions (even at a fine scale) [46], while connectivity is explained by adult migration (such as for spawning), larval dispersal [42], and a lack of physical barriers in the marine environment [50]. In case of Vietnamese P. pelagicus, the complex natural and anthropogenic biophysical factors may be driving restricted gene flow along the coastline. However, closely related individuals found in the southern population may affect current results such as reducing the sample size (in the case of related individuals removed), or creating an artificial population structure (when included related pairs). However, the two data sets analyses give the same structure, so we can also confirm an accurate reflection of results for a true phenomenon in this species. P. pelagicus can therefore be considered two fisheries and conservation management units. The factors driving current connectivity patterns of P. pelagicus are complex, and cannot accurately be identified. P. pelagicus is likely at risk from inbreeding and subpopulation isolation, and subsequently poor adaptive potential. The management for this species should be careful to ensure that overfishing and habitat degradation do not further affect the vitality of existing populations. Immediate actions such as a seasonal ban on catching crabs in the autumn and late summer to increase successful spawning [32], establishment of marine reserves to reduce genetic losses [101], and coastal pollution control to increase numbers of breeding individuals and larval dispersal [28,33]. Moreover, gear regulation, habitat monitoring and restoration might be one of the most effective ways to manage healthy populations. The appropriate explanation for the high rate of self-recruitment observed in the southern swimming crab remains open. Periodic surveys on genetic diversity, and seascape research [100] should be conducted to provide an overall temporal and spatial view of crab populations. This study of P. pelagicus highlighted the important of conservation genetic studies using advanced genomics for information-lacking geographic zones such as Vietnam East Sea. These results also provide important baseline measures of diversity that can be used for future genetic surveys as well as for monitoring responses of P. pelagicus for environmental changes and temperature rises due to climate change.  Table. Sample sites and size of Portunus pelagicus with successful sequences, pre-analyzed (de novo assembly, mapping) and analyses of population structure. Carapace width (CW) and weight (W). Abbreviation for sampling locations as shown in Table 1