SLAF-seq: An Efficient Method of Large-Scale De Novo SNP Discovery and Genotyping Using High-Throughput Sequencing

Large-scale genotyping plays an important role in genetic association studies. It has provided new opportunities for gene discovery, especially when combined with high-throughput sequencing technologies. Here, we report an efficient solution for large-scale genotyping. We call it specific-locus amplified fragment sequencing (SLAF-seq). SLAF-seq technology has several distinguishing characteristics: i) deep sequencing to ensure genotyping accuracy; ii) reduced representation strategy to reduce sequencing costs; iii) pre-designed reduced representation scheme to optimize marker efficiency; and iv) double barcode system for large populations. In this study, we tested the efficiency of SLAF-seq on rice and soybean data. Both sets of results showed strong consistency between predicted and practical SLAFs and considerable genotyping accuracy. We also report the highest density genetic map yet created for any organism without a reference genome sequence, common carp in this case, using SLAF-seq data. We detected 50,530 high-quality SLAFs with 13,291 SNPs genotyped in 211 individual carp. The genetic map contained 5,885 markers with 0.68 cM intervals on average. A comparative genomics study between common carp genetic map and zebrafish genome sequence map showed high-quality SLAF-seq genotyping results. SLAF-seq provides a high-resolution strategy for large-scale genotyping and can be generally applicable to various species and populations.


Introduction
The development of genotyping technologies has been synchronous with the development of molecular markers. Traditional gelbased molecular markers have played important roles in genotyping assays during the past 20 years [1][2][3]. Most of these have been based on length polymorphisms. However, restrictions in throughput have rendered large-scale genotyping difficult.
The collection of genome sequences and discovery of thousands of single nucleotide polymorphisms (SNPs) have made large-scale microarray-based SNP genotyping possible [4][5][6]. Microarraybased genotyping has been used in many well-studied species [7][8][9][10]. Millions of pre-discovered SNPs can be used to design probes, and sequence variations between samples can be detected by hybridization. However, microarray quality relies on the SNP profiling quality of the species. The more known SNPs there are, the higher the microarray quality. The limited number of known SNPs limits the applicability of this technology in many cases.
High-throughput sequencing technologies can provide new strategies for sequence-based SNP genotyping. Whole-genome resequencing strategies can be used to genotype a large number of variations among samples [11][12][13][14]. Population size may range from 8 to 50. This strategy mainly focuses on SNP discovery and evolutionary studies on domestication. For large populations, whole-genome deep sequencing is still cost-prohibitive. However, population size greatly affects the accuracy of association studies. To reduce the costs of sequencing, two kinds of strategies have been developed within the past several years. One of these strategies involves reducing sequence depth. Low-depth wholegenome resequencing methods have mainly been used in the mapping of populations. Huang et al. genotyped a rice RILs with 0.02-fold coverage per line [15].A slide-window approach was developed to improve SNP accuracy. Xie et al. sequenced 238 RILs with 0.055-fold genome coverage. This showed some improvement in parent's sequence independence, which reduced the cost of sequencing the parents' genomes [16]. However, it relies on high-quality reference genome sequences, and the sequence library cost cannot be reduced when processing large populations. The other strategy involves reducing the complexity of the genome with reduced representation library (RRL) sequencing [17][18][19][20]. Many properties can be used to prepare reduced representations. One of the simplest methods is the separation and purification of restriction fragments within a given size range. The use of RRL for SNP discovery was first described using Sanger sequencing during the Human SNP Map Project [17,21]. In recent studies, RRL were used with high-throughput sequence technologies. Baird et al. genotyped a stickleback F2 population, and a barcode was developed to distinguish each individual [22]. The restriction-site-associated fragment end was sequenced using an Illumina single-end protocol. Hyten et al. performed RRL sequencing on a wild soybean strain for SNP discovery [18]. Tassell et al. studied 66 cattle representing three populations [20]. These studies based on RRL sequencing strategies all relied on the reference genome sequences. Single-end sequencing was pursued, but this limits the study in many ways and is somewhat prone to error.
Here we developed a strategy for the de novo SNP discovery and genotyping of large populations using an enhanced RRL sequencing method. Reference genome sequences and polymorphism information are not necessary when this strategy is used. With barcode multiplexed sequencing, large populations with large numbers of loci can be genotyped simultaneously. With predesigned RRL schemes, repetitive sequences can be avoided, and the selected fragment number can be decided for personalized research purposes to maintain the balance between marker density and population size. In this study, we genotyped a common carp (Cyprinuscarpio L.) F 1 population. The genetic map that we constructed had the highest density of any map made of any organism lacking reference genome sequences to date. A weightrelated QTL was located using genome-wide association studies. The quality of the genotyping process was validated. This strategy is suitable for many types of populations and species. i) Pre-design scheme for SLAF selection using training data. The reduced representation design must be decided based on marker efficiency characteristics, which include random distribution throughout the genome, uniqueness in the genome, and consistent amplification efficiency among selected markers. A pilot experiment was performed to evaluate the amplification efficiency based on the predesigned scheme. ii) SLAF-seq library construction. Genomic DNA was digested by groups of enzymesdesigned for individuals. Double barcodes were added to two round PCR reactions to discriminate each individual and to facilitate the pooling of samples for size selection, which maintained consistent fragment size among individuals. iii) Deep sequencing for the pooled RRLs with the Illumina paired-end sequencing protocol, and genotype definition and validation by software. doi:10.1371/journal.pone.0058700.g001

Ethics statement
All experimental procedures were conducted in conformity with institutional guidelines for the care and use of laboratory animals in Centre for Applied Aquatic Genomics of the Chinese Academy of Fishery Sciences. The protocol was approved by the Committee on the Ethics of Animal Experiments of the Centre for Applied Aquatic Genomics at Chinese Academy of Fishery Sciences (2011AA1004020012).

SLAF library construction and high-throughput sequencing
Specific-locus amplified fragment sequencing (SLAF-seq) is an efficient method of large-scale genotyping, which is based on RRL and high-throughput sequencing. The procedure is shown in Figure 1. First, we performed a SLAF pre-design experiment. The enzymes and sizes of restriction fragments were evaluated using training data. Three criteria were considered: i) The number of SLAFs must be suitable for the specific needs of the research project. ii) The SLAFs must be evenly distributed through the sequences to be examined. iii) Repeated SLAFs must be avoided. These considerations improved the efficiency of SLAF-seq. To maintain the sequence depth uniformity of different fragments, a tight length range was selected (about 30,50 bp) and a pilot PCR amplification was performed to check the RRL features in this target length range, which would ordinarily include fragments with similar amplification features on the gel. When non-specific amplified bands appeared on the gel, then we will repeat the predesign step to produce a new scheme.
Next, we constructed the SLAF library in accordance using the pre-designed scheme. For the common carp F1 population, genomic DNA was incubated at 37uC with MseI (New England Biolabs, NEB), T4 DNA ligase (NEB), ATP (NEB),and MseI adapter. Restriction-ligation reactions were heat-inactivated at 65uC, and then digested for additional restriction enzyme AluI at 37uC. The PCR reaction was performed using diluted restrictionligation samples, dNTP, Taq DNA polymerase (NEB) and MseIprimer containing barcode1. The PCR productions were purified using E.Z.N.A.H Cycle Pure Kit (Omega) and pooled. The pooled sample was incubated at 37uC with MseI, T4 DNA ligase, ATP and Solexa adapter. The sample was purified using a Quick Spin column (Qiagen), then run out on a 2% agarose gel. Fragments with 450,500 bp (with indexes and adaptors) in size were isolated using a Gel Extraction Kit (Qiagen). These fragment products were then subjected to PCR amplification with Phusion Master Mix (NEB) and Solexa Amplification primer mix to add barcode2. Phusion PCR settings were as listed in the Illumina sample preparation guide. Samples were gel purified, excising DNA 450,500 bp, which was diluted for sequencing.
Then, pair-end sequencing was performed upon the selected SLAFs using an Illumina high-throughput sequencing platform (Illumina, Inc; San Diego, CA, U.S.). SNP genotyping and evaluation were then performed.

SLAF-seq data grouping and genotype definition
Software was developed to deal with SLAF-seq data. Procedures are shown in Figure S1. All SLAF pair-end reads with clear index information were clustered based on sequence similarity. To reduce computing requirements, identical reads were merged together, and sequence similarity was detected using one-to-one alignment by BLAT [23] (-tileSize = 10 -stepSize = 5). Sequences with over 90% identity were grouped in one SLAF locus.
Alleles were defined in each SLAF using the MAF evaluation. To prevent false positive results, the sequence error rate was estimated using the rice data as a control. These were obtained using the same sequencing scheme as that used with common carp ( Figure S1B). True genotypes had markedly higher MAF values than genotypes containing sequence errors. Tags with sequence errors were corrected to the most similar genotype to improve data efficiency. In mapping populations of diploid species, one locus can contain at most 4 genotypes, so the groups containing more than 4 tags were filtered out as repetitive SLAFs. SLAFs with sequence depth less than 213 were defined as low-depth SLAFs and were filtered out of the following analysis. Only groups with suitable depth and fewer than 4 seed tags were identified as highquality SLAFs, and SLAFs with 2-4 tags were identified as polymorphic SLAFs.
To evaluate the accuracy of our genotyping objectively, a Bayesian approach was proposed. Using the coverage of each allele and the number of single-nucleotide polymorphism, we calculated a posteriori conditional probability that a given individual would have a specific genotype at a corresponding locus. We proceeded as follows. Supposing there were m alleles at any given locus, denoted as fA 1 ,A 2 ,:::,A m g. For a diploid species, the number of all possible genotypes was equal to m(mz1) 2 and m is less than five regardless of the type of segregation of the loci. We assign a priori probability to each genotype according to the theoretical frequencies with which these genotypes would occur in such a finite probability space. For a homozygous genotype, this priori probability would equal 1 m 2 , but it would be double that for a heterozygous genotype. Consider a pair of distinguished alleles A i and A j , the probability of sequencing one allele to another can be calculated using the following formula: Here is the average ratio of sequencing error. In our model it took on a value of 0.015 for the Illumina sequencing platform, and we used r l to represent the length of reads and s for number of singlenucleotide polymorphisms. Based on this, we obtained the probability of allele A i conditioned on the genotype.A i A j , denoted as P A i jA u A v ð Þ . The depth observation of allele A i was assumed to be d i , and the conditional probability of observation (d 1 ,d 2 ,::,d m ) of each genotype can be illustrated as follows: In this way, we determined the probability of assigned genotype conditioned on the following coverage observation: The probability was translated to a genotyping quality score finally using: Genotyping quality Score{ The final genotyping quality score value indicated the confidence with which the genotype had been called. In particular, when the difference in depth between both alleles exceeded 1:5, the score value could be modified directly using formula (1) due to systematic bias. The upper bound of the score is 30. This genotyping quality score was used to select qualified markers and individuals for subsequent analysis. This was a dynamic optimization process. Briefly, we counted low-quality markers for each SLAF marker and for each individual and deleted the worst markers or individuals. We repeated this process, deleting one individual or marker each time. We ceased when the average genotyping quality score of all SLAF markers reached the cutoff value, which was 13.

SLAF-seq assay design and pilot studies on rice and soybean
To determine the reliability of this method, we performed a pilot study on rice and soybeans. Specifically, we expected that 21,000 and 76,000 SLAFs would be selected by HaeIII and MseI digestion and purification restriction fragments. We expected these SLAFs to be about 380,450 bp and 370,440 bp for rice and soybeans, respectively, and to comprise about 0.33% and 0.49% of the rice and soybean genomes [5,6] (Table 1). Repetitive sequences were controlled within 17.78% and 25.32%, respectively, to exclude most of the repetitive regions in rice and soybeans. Using these schemes, we carried out SLAF library construction and sequencing. In total, 0.5 M and 1.3 M pair-end reads were generated using Illumina Genome Analyzer IIx for rice and soybeans, respectively, by mapping the reads to genomes. About 25,000 and 83,000 SLAFs were observed. The SLAF density and distribution characteristics are shown in Figure 2. The properties of the observed SLAFs were consistent with those predicted for both rice and soybeans. In rice, about 90.18% of the In the rice pilot case, the density was designed using 20 kb per SLAF. In soybeans, 40 kb per SLAF was used. Both rice and soybean pilot SLAF data were found to be consistent with theoretical predictions. doi:10.1371/journal.pone.0058700.g002 expected SLAFs were present in the observed data, which covered 82.48% of the total reads. In soybeans, 79.32% of the expected SLAFs were observed and 65.86% of the total reads were mapped to the expected SLAFs ( Table 1). The average SLAF sequence depth was increased by 19-fold and 12-fold to ensure the base qualities. Repetitive SLAFs were controlled within 19.49% and 25.34%, which were consistent with the expected number. The pilot data indicated that the reliability and quality of this SLAF-seq strategy were relatively high.
Large-scale genotyping in common carp F1 population using SLAF-seq We genotyped a common carp(Cyprinuscarpio L.) F1 population de novo using SLAF-seq. MseI and AluI were used for library construction. Fragment length was between 330 bp and 380 bp. About 65,000 SLAFS were predicted from the carp genome based on analysis of zebrafish training data. The expected SLAFs may be randomly distributed throughout the zebrafish chromosomes ( Figure S2) and the expected repetitive SLAFs may be controlled within 21.3%.
In total, 103.8 M pair-end reads were generated using an Illumina Genome Analyzer IIx for 2 parents and 211progeny simultaneously. On average, 0.4 M reads were obtained for each descendant (Table S1). Based on the sequence similarity, reads were clustered into SLAFs (described in methods). As shown in Table 2, after about 26.02 M reads in low-depth SLAFs and 29.59 M reads in repeat-suspicious SLAFs were excluded, 50,530 high-quality SLAFs were defined with 48.19 M reads. The average sequence depth of these SLAFs was 954-fold with 52.37-fold in parents and 4.99-fold in each individual. The extreme depth of the locus-specific sequences indicates the accuracy of de novo SNP discovery.
A total of 10,662 SLAFs were detected. They contained polymorphisms among which 13,291 SNPs and 1,483InDels were identified. On average, the sequences of parents were 57.33-fold and those of progeny were 4.28-fold. The SNP ratio was about 4.38 per kb, and the InDel ratio was 0.49 per kb in common carp. Based on the barcode system, all polymorphic loci were genotyped separately for parents and 211progeny. Sequencing depth alone cannot fully reflect the genotyping quality. Genotyping quality scores (see methods) were used to evaluate genotyping quality in populations, as shown in Figure 3A. The quality scores used in this approach were transformed from error rates using the expression in PHRED, but error rates were calculated using a new method. Using quality scores, low-quality individuals and low quality markers were excluded, and 7,559 markers in 166 individuals were shown to be high-quality. For genotyping quality in parents, 99.67% of SLAF loci showed genotyping quality scores over 20 with average sequence depth of 68.8-fold. Among individuals, 69% had genotyping scores over 10 with an average sequence depth of 6.2-fold ( Figure 3B). To further evaluate the accuracy of genotyping data, we randomly chose 10 SLAF loci in 2 parents and 19 individuals and performed independent traditional Sanger sequencing. Of these 205 genotypes were found to be consistent with the SLAF-seq genotyping information. Five cases of incorrect genotyping were indicated by low genotyping quality scores. Details are shown in Table S2. These types of quality validation all confirmed the genotyping accuracy of SLAF-seq.
High density and quality genetic map of common carp produced using SLAF-seq genotyping data A high density genetic map was constructed using the SLAF-seq genotyping data. All markers were grouped in 50 linkage groups with LOD thresholds ranging from 4,7. Of these, 5,885 markers were arranged on the map using JoinMap4.0 [24]. A male map with 2,158 markers and a female map with 3,106 markers were  also generated. The total genetic distance of the sex-averaged map was 3,960 cM. The average interval between markers was 0.68 cM, making this the highest density genetic map yet created for any organism lacking reference genome sequences [25][26][27][28][29]. Detailed statistical information for all linkage groups is shown in Table S3 and Figure S3.
To evaluate the accuracy of the genetic map, the recombination breakpoints were analyzed to identify questionable genotyping data. Most of the recombination blocks were clearly defined and only 1.51% of the genotyping data were found in very small, doubtful recombination blocks (Figure 4). We mapped SLAF markers to the zebrafish genome. About 25.6% of these markers were uniquely mapped, and 2 carp linkage groups were located in one zebrafish chromosome ( Figure 5). The linkage distance was well consistent with the zebrafish physical distance. These two independent approaches all indicated that the marker positions on the map were valid. They also indicated the accuracy of the genotyping results.

Discussion
The SLAF-seq strategy combines locus-specific amplification and high-throughput sequencing to effect de novo SNP discovery and large-scale genotyping. To improve the efficiency of fragment selection, a bioinformatics statistical model was developed for the amplification of specific loci fragments selected based on training data. These included randomly sequenced BAC sequences and genome draft sequences. The training data were taken from both the target organism and from other, evolutionarily related organisms, even from organisms with GC content similar to that of the target organism. A SLAF-efficient selection scheme can then be developed using this training data (Figure 1). We analyzed several training data sets, as shown in Table S4. The repetitive sequences (both mathematically-defined repeat and biologicallydefined repeat) can be avoided efficiently and the scheme provided us with a clear set of expectations suitable to the specific purpose of Figure 3. Genotyping quality in common F1 population. The genotyping quality score was used to select qualified markers and individuals for subsequent analysis. This is a dynamic optimization process. We counted low-quality markers for each SLAF marker and for each individual, and deleted the worst marker or individual. We repeated this process, deleting an individual or a marker each time until the average genotyping quality score of all SLAF markers reached the cutoff value, which was 13.  our research project, which improved the RRL efficiency more than in previous studies [18,20,22,29].
To maintain copy number uniformity among fragments, we performed several optimization procedures for amplification and size selection. During the PCR amplification process, the copy number frequency of fragments of different lengths often changed with amplification efficiency (Figure 1). In previous studies, researchers tended to select relatively long length ranges to identify as many enzyme sites as possible because the pre-design step is ignored, allowing the selection of fragments with different copy numbers. This reduces the sequence efficiency [22,29]. The selection of a tighter length range (about 30,50 bp) may help to obtain fragments with similar copy numbers and to ensure similar sequence depths among fragments to solve this problem. The amount of template and number of PCR cycles were also optimized to prevent non-specific amplifications and to reduce the frequency of changes in copy number between fragments with similar lengths. This renders research more cost-effective. Most other sequence-based genotyping methods cannot do this [18,22]. A double barcode system was developed to distinguish individuals in large populations. Two rounds of PCR were performed to add two barcode oligos to each sample (Figure 1). In this study of 211 common carp progeny, 50 individuals were pooled together and tagged with a first-class barcode. Then five groups were pooled together and distinguished with a second-class barcode. A total of only 55 barcode adapters were used to distinguish all these 211 individuals. We developed 96612 index combinations, which expanded the pool scale more efficiently than the previous single barcode system, making it possible for us to genotype about 10,000 samples in a single high-throughput sequencing run and to reduce the complexity and cost of sample preparation [14][15][16]18,30].We used a pair-end sequencing procedure. This allowed us to design primers using the ends of the fragment and to perform further experimental validation, which is especially important for species for which reference genome sequences are not available.
In the present study, the quality score algorithm was developed to evaluate the quality of SNP discovery and genotyping. This quality score can be used to exclude suspicious genotyping results and to improve the accuracy of association studies. This score is suitable for use as a quality standard and can help researchers balance accuracy and costs during heterozygote detection using high-throughput sequencing technology. Sequencing depth is an important consideration in sequencingbased genotyping. To estimate the minimal sequencing depth required for accurate genotyping results, we randomly simulated genotype subsets of a CP population with various sequencing depths using the Poison process, which was used to simulate the human X chromosome dataset by Bentley et al. [31]. In the present study, the genotype subsets included 200 individuals and 500 markers. The genotyping errors in simulation analysis only include those errors that occurred when one of the two alleles in a heterozygous individual was not successfully sampled and sequenced because of the randomness of sampling. In such cases, a heterozygote would be wrongly called a homozygote. The simulation analysis showed that the error ratio of genotype calling dropped greatly from 16 to 46, and that further increases in sequencing depth above 46 sequencing depth had relatively little influence on sequencing error rates ( Figure S4), relative to the depth changes from 16 to 46. During genotyping calling, sequencing reads with low depth were filtered away using the quality score, defined by a Bayesian approach. The final sequencing depth of the reads, which was used to generate the genotype data, was greater than the initial average sequencing depth. After balancing cost and quality, we chose to sequence individuals with 4.286 coverage. After filtering out low-depth data, the sequencing depth of the reads that actually generated the final genotype data reached 6.26. Based on the simulation study of a CP (F1) population ( Figure S4), raw data of 6.26 sequencing depth were estimated and used to produce a genotype calling error rate of only 6.5%. This estimation was greater than that discussed in the report by Bentley et al. (about 5.6%) [31]. Our data simulated a CP population, and CP populations have high rates of heterozygosity, which renders genotyping prone to error. The error rate should be smaller than 6.5% for most populations, such as inbred populations. We randomly selected 10 SLAF and validated their accuracy in 19 randomly chosen individuals using Sanger sequencing. We found only five inconsistencies in 190 SLAF data. Further analysis showed that these five SLAF, which were inconsistent with results obtained using Sanger sequencing, had low genotyping quality scores (samples 2, 3, 5, 7, and 9). In subsequent rounds of linkage mapping, we screened incorrect genotypes and deleted some erroneous data. In this way, the study generated a high-quality linkage map, which reflects the accuracy of genotype data. Taken together, weighing the sequencing cost, genotyping quality, and linkage map quality, we concluded that the 66 is an acceptable minimal sequencing depth when this approach is used.
Generally, after the DNA is isolated, there is no difference between genotyping a plant gene and an animal gene. All the procedures of DNA sequencing and of subsequent sequence analysis that apply to plant DNA also apply to animal DNA. However, plant genomes are richer in repetitive sequences than animal genomes. To better understand the efficiency with which the SLAF-seq method filters away repetitive sequences, we used plant genomes to evaluate its reliability. We chose rice because its genome has been sequenced using a BAC-to-BAC sequencing strategy based on the Sanger sequencing platform and because the quality of said genome has been well recognized. However, its genome size is only about 389 Mb. For this reason, we also evaluated the reliability of our method on soybeans, whose 1115 Mb genome is much larger.
Reference genome sequences and reference SNPs were not found to be necessary with this method. De novo genotype definition was performed using high-depth sequencing. This overcomes many of the limitations of previous methods [15,16,32]. A comparison of different genotyping methods is shown in Table S5. The RRL strategy allows costs to be controlled at the lowest level. Depth sequencing ensures the accurate genotyping. The pre-design strategy can be used to predict fragmentation efficiency reasonably and reliably. The double barcode system allows large populations to be genotyped simultaneously. SLAF-seq combines all these features. SLAF-seq is suitable for widespread use in large-scale genotyping and genome-wide association studies.  Figure S4 Effects of sequencing depth on genotype calling quality. Simulated genotype subsets with various sequencing depths,including 100 individuals and 500 markers,were generated randomly using the Poison process.The genotyping errors in simulation analysis only included those errors that occurred when one of the two alleles in a heterozygous individual was not successfully sampled and sequenced because of the randomness of sampling. In such cases, a heterozygote was wrongly called a homozygote.The genotyping missing denoted that both the two alleles in an individual fail to be sampled and sequenced due to sampling randomness. (TIF)