New diagnostic SNP molecular markers for the Mytilus species complex

The development of diagnostic markers has been a long-standing interest of population geneticists as it allows clarification of taxonomic uncertainties. Historically, there has been much debate on the taxonomic status of species belonging to the Mytilus species complex (M. edulis, M. galloprovincialis and M. trossulus), and whether they are discrete species. We analysed reference pure specimens of M. edulis, M. galloprovincialis and M. trossulus, using Restriction site associated DNA (RAD) sequencing and identified over 6,000 SNP markers separating the three species unambiguously. We developed a panel of diagnostic SNP markers for the genotyping of Mytilus species complex as well as the identification of hybrids and interspecies introgression events in Mytilus species. We validated a panel of twelve diagnostic SNP markers which can be used for species genotyping. Being able to accurately identify species and hybrids within the Mytilus species complex is important for the selective mussel stock management, the exclusion of invasive species, basic physiology and bio-diversity studies.


Introduction
Blue mussel (Mytilus edulis, Linneaus, 1758) has been integral part of humans' diet for millennia, their shells have been found in middens dated back to the late Mesolithic periods, 5,000 B. C. [1]. Today, Europe is a major contributor to mussel's production, supplying over a third of the total commercial outputs. Aquaculture is by far the main source of this commodity and it is responsible for over 90 percent of total landings. M. edulis and Mytilus galloprovincialis (Lamarck, 1819) are the two main species cultivated in Europe with an output of 550,000 tonnes and € 900 million per year [2]. These two species, together with the Baltic mussel (Mytilus trossulus, Gould, 1850), belong to the "Mytilus species complex". Hybridisation between these species has been observed across the world; e.g., in the Pacific Ocean [3][4][5][6], in the Irish Sea [7][8][9] and Scotland where all combinations of species hybrids have been identified [10][11][12][13]. M. trossulus has been often associated with lower meat yield, thinner shell and reduced shelf life compared with M. edulis [10,14] and it is therefore considered undesirable within the European aquaculture context. Being able to accurately identify species and hybrids within the a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Mytilus species complex is therefore important for the management of a potentially economically damaging species.
Hybridisation and introgression between species are common evolutionary phenomena [15][16][17]. Introgression arises from repeated backcrossing with fertile hybrids, allowing stable integration of genomic material of one species into the genome of another species without a significant deleterious effect on fitness [18]. Accurate identification of species (including cryptic or complex species) is important from commercial, conservation and research viewpoints and can be significantly impaired when hybridisation or introgression occur.
There are some distinguishing morphological features that could be employed for marine mussel species (Mytilus spp.) identification, such as shell colour, shape, texture and size [19][20][21]. However, a range of biotic and abiotic factors, including hydrodynamic conditions [22], water temperature and salinity [23], can affect these features making individuals often morphologically similar, especially in sympatric populations [24][25][26].
The development of diagnostic markers has been a long-standing interest of population geneticists as it allows clarification of taxonomic uncertainties. The first tools used to distinguish between mussel species were allozymes. Numerous allozyme markers have been developed and used for Mytilus species identification [10,14,23,27]. Nonetheless, low as well as high variation between individual of the Mytilus complex render the technique only marginally more useful than the identification by morphological features [28]. Over the past three decades, a range of species diagnostic markers have been developed for single locus genotyping of Mytilus species of which the most routinely used is the nuclear DNA marker Me15/16 [29]. Genotyping with the Me15/16 is favoured due to its simple methodology of PCR amplification and identification of a size-specific gene fragments unique for each of the Mytilus species [30]. Single locus genotyping delivers more accurate identification of Mytilus species. than studying morphology or allozymes, but has limited potential for analysing patterns of hybridisation or genome introgression [17]. Multilocus genotyping, using genome wide panels of single nucleotide polymorphism (SNP) markers by comparison, allows for a far better understanding of introgression [31][32][33][34][35]. There are only few studies on multilocus genotyping in Mytilus species complex [12,36], which are however limited to small number of markers used to resolve population structures based on allele frequencies.
In this study, we have employed an easy and rapid de novo SNP discovery method to develop genome-wide species-specific markers to genotype the Mytilus species complex, which also identify hybrids and introgressed individuals in field populations. More accurate characterisation of Mytilus specie populations' structure will aid improved conservation and aquaculture management strategies.

Ethics statement
Animal (mussels) handling and collection was done under Marine Science Scotland (Scottish Government) authority and following both Marine Science Scotland and University of Stirling Ethical recommendations and guidance.
M. galloprovincialis were sourced from Slovenia [Bay of Piran (BP)]; and M. trossulus were acquired from Penn Cove (PC), USA (Table 1). Additional adults M. trossulus from Bras d'Or Lake (BDL), Canada, and juvenile mussels (approximately 15-months old) from Loch Etive (LET), Scotland were also obtained and used for markers validation purposes (Table 1). Tissue samples (gill/mantle from adults; all body tissues from juveniles) were taken and stored in 99% ethanol at -20˚C.

RAD library preparation and sequencing
A total of 40 pure specimens (21 M. edulis, 15 M. galloprovincialis and 4 M. trossulus) were chosen for species reference library construction (S1 Table). The RAD library was prepared as originally described in Baird et al. [39] and comprehensively detailed in Etter et al. [40], with minor modifications [41]. Briefly, each sample (0.25 μg DNA) was digested at 37˚C for 40 min with the high-fidelity restriction enzyme PstI that recognises the CTGCA|G motif (New England Biolabs; NEB) using 6 U PstI per μg genomic DNA in 1× Reaction Buffer 4 (NEB) at a final concentration of about 1 μg DNA per 50 μL reaction volume. The samples (12 μL final volume) were then heat-inactivated at 65˚C for 20 min. Individual specific P1 adapters, each with a unique 5 or 7 bp barcode (S1 Table), were ligated to the PstI digested DNA at 22˚C for 15 min by adding 0.6 μL (DNA samples) 100 nmol/L P1 adapter, 0.15 μL 100 mmol/L rATP (Promega), 0.25 μL 10× Reaction Buffer 2 (NEB), 0.125 μL T4 ligase (NEB, 2,000 U/μL) and reaction volumes made up to 15 μL with nuclease-free water for each sample. After heat-inactivation at 65˚C for 20 min, the ligation reactions were slowly cooled to room temperature (over 1 h), then combined in appropriate multiplex pools. Shearing (Covaris S2 sonication) and initial size selection (100 to 800 bp) by agarose gel electrophoresis [41] was followed by gel

Genotyping RAD alleles
Reads of low quality (i.e., with an average quality score less than 20), that lacked the restriction site or had ambiguous barcodes were discarded. Retained reads were sorted into loci and genotypes using Stacks v1.13 [42]. Stacks assigns loci based on nucleotide positions in RAD tags using a likelihood-based algorithm [43] to separate actual SNPs from SNPs likely to have arisen from sequencing error. Using the default parameters for de novo assembly pipeline, a minimum stack depth of 5 and a maximum of 2 mismatches were allowed per locus in an individual, with no more than 1 mismatch between alleles. Informative RAD markers were kept only when presenting a maximum of three SNPs and one to three alleles present in all three species and at least 50% of the samples. Diagnostic species-specific markers were Informative RAD markers with a fixed allele within species but presenting different allele between at least two of the three species.

Phylogenetic analysis
Sequencing data from filtered Informative RAD markers was combined into a single alignment of alleles (composite genotype) for a total of 40 individuals used in RAD library construction. Phylogenetic trees were constructed with RAxML (Randomised Axelerated Maximum Likelihood), using the RAxML v8.0.0 [44]. Maximum-likelihood phylogenetic trees were inferred using the GTR+CAT nucleotide substitution model [45] and bootstrap support values estimated from 10,000 replicate searches of randomly generated trees.

Data analysis
Data analysis was carried out using R v3.3.2 [46] and an associated R/adegenet package v1.4-1 [47] for Principal Component Analysis (PCA) and Discriminant Analysis of Principal Components (DAPC). PCA creates simplified models of the total variation within the dataset and DAPC identifies clusters of genetically related individuals [48].

SNP-assay design
Each tested locus comprised two alleles that were identifiable by the presence of a SNP. One allele was diagnostic for a single species, while the other allele was shared by the remaining two species. For primer design to be feasible, the SNP of interest at a given locus needed to be at least 20 bp from the end of a given sequence. This allowed enough sequence for compatible primers to be designed. SNP assays were designed and manufactured for use with KASP genotyping technology by LGC Genomics Ltd. (S2 Table).

Population structure
Structure v2.3 [49] was used to identify distinct genetic populations from multilocus (SNP) data, assigning individuals to populations, and identifying admixed individuals. The "Admixture Model" was used assuming that each genotyped individual could have mixed ancestry, inheriting some fraction of its genome from ancestors in a different population. This would be an assumption made for reference populations of pure species, while the other could have migrants that have interbred with native individuals.

RAD library sequencing
High throughput sequencing of these 40 individuals produced 574,728,488 raw reads in total (four HiSeq lanes and one MiSeq lane). MiSeq technology was used to adjust the libraries.
After the removal of low-quality and incomplete reads, 71.9% of the total raw reads were retained (413,377,018 reads). As only M. galloprovincialis has a published draft genome (NCBI assembly GCA_001676915.1) of over 1 million contigs and that only 35% of the reads from M. galloprovincialis samples were aligned to it, a de novo approached was used to assemble the RAD tags. A total of 3,253,798 RAD tags were detected (S1 Table).

Sequence analysis
The number of RAD tag detected per individual was relatively consistent, ranging from 59,000-313,000 RAD tag (Table 1). There were two exceptions among M. edulis individuals with significantly lower number of tags obtained (RB_01 and PC_01, which had 18,220 and 5,459 RAD tags respectively) which was most likely caused by low quality DNA resulting in effecting the library preparation efficiency. Between 14% and 15% of the RAD tags were polymorphic. To identify robust genetic markers and to minimise the proportion of erroneous data, all informative markers were filtered to show only those with one to three alleles and a maximum of three SNPs, and which were detected in all three species and at least 50% of the samples. A total of 14,212 SNPs spread across 6,220 informative RAD markers were identified (some loci had more than one SNP), and used in subsequent analyses. A reduced set of markers, 378 SNPs spread across 365 diagnostics species-specific markers, were filtered out as the informative RAD markers when exhibiting fixed allele within species, but presenting different allele between at least two of the three species (S4 Table).

Phylogenetic reconstruction
The phylogenetic tree constructed from the composite genotypes of 6,220 informative RAD markers shared alleles (14,212 SNPs) showed three distinct clusters, accurately delineating the three species (3 sites for M. edulis, one site for M. galloprovincialis and M. trossulus) that were used for library construction (Fig 1A).

Marker selection
In order to better capture the Mytilus species complex structure the ability of each marker to discriminates each "pure" species, a principal component analysis (PCA) was conducted from 6,220 informative RAD markers using R/adegenet (Fig 1B). Three distinct clusters were separated using the first two components (88.7% of cumulative variance) despite the small number of samples examined. A second PCA was applied on the reduced set of 365 diagnostics species-specific markers, to ensure that they kept their discrimination power (Fig 1C). Subsequently, the Discriminant Analysis of Principal Components (DAPC) sorted species diagnostic loci by their "loading values", values based on the population coverage at each locus [47]. Loci with the highest presence had the highest loading values and, thus, were assumed to be less likely to be false positive, improving their reliability as potential diagnostic markers. Loci with the highest "loading values" were preferred; as long as they matched the other selection criteria for KASP assay development.

SNP assay optimisation
All SNP assays were designed for use with KASP genotyping technology by LGC Genomics Ltd. Assay optimisation was carried out with the 40 samples used in RAD library construction. A total of 12 SNP assays, three assays per Mytilus species, were successfully optimised (S2 Table): E1, E2 and E3 (M. edulis); G1, G2, G3 and G4 (M. galloprovincialis); and T1, T2, T3, T4 and T5 (M. trossulus). SNP genotyping results (KASP and RAD) were obtainable at all 12 loci for each of the 40 samples (S5 Table) revealing identical results regardless of the genotying technique.

SNP assay validation
238 samples, including 150 samples from 3 additional populations, were genotyped with the 12 SNP assays (Table 3 and S6 Table). Additional populations were sourced in order to validate the markers with individuals from different origins, thereby reducing the risk of developing markers that would be population-or location-specific rather than species-specific. Where all diagnostic alleles were attributed to only one species, individuals were identified as "pure" species (M. edulis [Me], M. galloprovincialis [Mg] or M. trossulus [Mt]). Individuals heterozygote at all diagnostic loci for two species would be identified as F1 hybrids; however, none were found in any of the population sampled. All other individuals were identified as introgressed individuals. Principal Component Analysis (PCA) discriminated the three "pure" species while the introgressed individuals were intermediates with respect to their genetic background   (Fig 2B). The four populations used for RADseq were suitable pure populations. These models suggest that, despite the introgression observable with the SNP genotyping, these populations are mostly pure with some mixing and thus were suitable for diagnostic marker design.

Discussion
Historically, there has been much debate on the taxonomic status of species belonging to the Mytilus species complex (M. edulis, M. galloprovincialis and M. trossulus), and whether they are discrete species. Species diagnostic marker development could only take place if three discrete species were present; thus, assessing the phylogenetic relationships of the populations used for diagnostic marker development was crucial for this research. Indeed, Fig 1 shows a clear separation of individuals based on genotype and not population, multiple sites were used for M. edulis. As confirmed by the KASP assays on the same dataset (Fig 2), all diagnostic markers appeared fixed in the putative "pure" populations, despite the limitation caused by the small number of M. trossulus specimens (only 4 samples); However, both phylogenetic tree and PCA clearly distinct three groups, consistent with the three species. Furthermore, the identification of 6,220 informative RAD markers and the development of 12 diagnostics markers (chosen out of the 365-possible identified) is a unique opportunity to enhance understanding of the genotype for the application of basic physiological studies and to understand the biological differences between M. trossulus, M. edulis and M. galloprovincialis. The informative RAD markers (polymorphic marker present in at least 50% of the samples) and diagnostic markers (Informative RAD markers with fixed allele within species but presenting different allele between at least two of the three species) identified in this study can be used for many  purposed: bio-diversity evaluation, phylogenetic studies, species and population identification. This enhanced understanding of the genetic structure within and between populations, also provides opportunities to better clarify the relationship between genotype and phenotype, particularly in sympatric populations.
Within the aquaculture context, the availability of this new suit of diagnostic markers can allow the sourcing of seeds of known genotypes and the selection of seeds of desirable genotypes for broodstock for ongoing efforts into the hatchery production of seeds. This will significantly contribute to the future eradication of potentially economically damaging species. Improved stock management and potential selective breeding will result in superior resilience and increased productivity of the mussel's aquaculture industry. Indeed, with the application of multilocus genotyping on Scottish mussels, we have identified introgressed genotypes that were hitherto unrecognisable by using single locus (Me15/16) genotyping. By doing so we improved our understanding of the genetic diversity within and between populations currently present in farmed and wild populations along the Scottish coast. This quick and relatively in-expensive methodology can be extended to any species complex where the phenotype does not provide conclusive evidence for species assignment and where the genetic structure is unknown or poorly established.
Supporting information S1 Table. Sample and RAD barcodes. Details each sample used: sample ID, library number, Me15/16 Genotype, RAD barcode (P1 adapter), RAD barcode (P1 adapter), number of raw reads (paired-ended) and number of RAD-tags. (CSV) S2 Table. KASP assay primer sequences. List of the allele-specific primers and common primer designed for the allele-specific PCR genotyping assay. (CSV) S3 To be selected a SNP needs to be present in at least 50% of all samples and to be reported in a species, the SNP needs to be in at least 50% of the samples of this species (see Materials and Methods).