Rapid Identification of Candidate Genes for Seed Weight Using the SLAF-Seq Method in Brassica napus

Seed weight is a critical and direct trait for oilseed crop seed yield. Understanding its genetic mechanism is of great importance for yield improvement in Brassica napus breeding. Two hundred and fifty doubled haploid lines derived by microspore culture were developed from a cross between a large-seed line G-42 and a small-seed line 7–9. According to the 1000-seed weight (TSW) data, the individual DNA of the heaviest 46 lines and the lightest 47 lines were respectively selected to establish two bulked DNA pools. A new high-throughput sequencing technology, Specific Locus Amplified Fragment Sequencing (SLAF-seq), was used to identify candidate genes of TSW in association analysis combined with bulked segregant analysis (BSA). A total of 1,933 high quality polymorphic SLAF markers were developed and 4 associated markers of TSW were procured. A hot region of ~0.58 Mb at nucleotides 25,401,885–25,985,931 on ChrA09 containing 91 candidate genes was identified as tightly associated with the TSW trait. From annotation information, four genes (GSBRNA2T00037136001, GSBRNA2T00037157001, GSBRNA2T00037129001 and GSBRNA2T00069389001) might be interesting candidate genes that are highly related to seed weight.


Introduction
Brassica napus (B. napus) is one of the most important oil crops and also the third largest oilseed crop worldwide. It supplies more than 13% of the world's vegetable oil and is a major economic crop [1]. Breeding of high yield oilseed crops is always the target and primary mission of plant breeders. Seed weight (SW), siliques per plant (SPP) and seeds per silique (SPS) are three important and basic components to determine the seed yield per plant [2]. Seed weight is the most important component and is a direct trait for yield of oilseed crops. To increase the seed weight is a major approach to improve the yield of oilseed crops [3]. Therefore, understanding the genetic determinants of seed weight is of great significance for yield improvement in oilseed breeding [4].
Exploring new quantitative trait loci (QTLs) for seed weight with molecular-marker-assisted selection is always a hot topic to improve B. napus seed yield [5]. To date, several QTLs related A DH population with 250 lines was derived from a cross between two parents, a large-seed line G-42 and a small-seed line 7-9, through microspore culture and doubling technology [20]. All 250 DH lines along with both parent plants were grown in the field under standard conditions from October 2013 to May 2014 at the experimental farm of the Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Wuhan, China.

Phenotypic observation
Three plants per line of the DH population and two parent plants ( Fig 1A) were bagged at flowering time for harvesting pure seeds. After harvesting and drying, the fully dried seeds were collected to measure seed weight trait. The TSW was evaluated from the weight for 1000 seeds and the mean values of 1000-seed weights for three replicates of each line in this experiment (S1 Table).

Two extreme DNA bulks construction
Two segregating pools selection. Two DNA bulks for sequencing were first made by selecting extreme individuals from the 250 DH population plants with the basic statistics of the phenotypic data. The lightest 47 lines (G1-G47) were selected as the small-seed pool, and the heaviest 46 lines (G51-G96) were selected as the large-seed pool from 250 DH lines (Fig 1B, S1  Table). Genomic DNA extraction. Total genomic DNA was isolated from young healthy leaves of two parents and the selected 93 DH lines using the cetyltrimethylammonium bromide (CTAB) method with some modifications and then purified by RNase [21]. DNA concentration and quality were estimated with a Nanodrop 2000 UV-Vis spectrophotometer (NanoDrop, Wilmington, DE, USA), and adjustments were made to yield a final DNA concentration of 100 ng. μl -1 with a total DNA amount greater than 20 μg. The 46 individual genomic DNA of the large-seed group were equally mixed as a large-seed DNA bulk and meanwhile 47 individual genomic DNA of the small-seed group were equally mixed as a small-seed DNA bulk. Genomic DNA of two DNA bulks and both parents were prepared for following SLAF sequencing.

SLAF library construction
A pilot SLAF experiment was designed to determine conditions and appropriate restriction enzymes for digestion that optimize SLAF yield and maximize SLAF-seq efficiency. Then, the SLAF library was constructed based on the result of the pilot experiment for SLAF selection. The procedure was followed by Sun et al. [14] with minor modifications. We used the reference genome of B. napus, which has a size of 1.2 Gb (download link: http://www.genoscope.cns.fr/ brassicanapus/data/ [22]). Purified genomic DNA was digested into fragments of 314~344 bp in size with an appropriate restriction enzyme combination, HaeIII+RsaI (NEB, Ipswich, MA, USA). Subsequently, fragment ends reparation, index paired-end adapters' ligation and adapter -modified ends obtainment were performed step by step. We selected the objective size on a 2% agarose gel and amplified the fragments through PCR reaction. Finally, we executed highthroughput sequencing by Illumina HiseqTM 2500 (Illumina, Inc; San Diego, CA, USA) at Biomarker Technologies Corporation in Beijing. Real-time monitoring was performed for each cycle during sequencing and the ratio of high quality reads with quality scores greater than Q30 (indicates a quality score of 30, indicating a 0.1% chance of an error and thus 99.9% confidence) in the raw reads and guanine-cytosine (GC) content was calculated for quality control.

SLAF-seq data clustering, polymorphic analysis and associated markers identification
Dual-index software [23] was used to identify the SLAF-seq raw data and obtain the reads of each sample. Being digested by the same restriction enzyme, all SLAF pair-end reads of samples were clustered according to sequence similarity by the BLAF software [24]. Sequences with over 90% identity were clustered in one SLAF locus (or SLAF tag) and a large number of specific fragments were selected for specific molecular marker development. SLAF tags were developed and compared among different samples. Polymorphic SLAF tags showed polymorphism between the parents including two kinds of markers, SNP and Indel [25]. For the polymorphic screening, there were three kinds of SLAF tags: polymorphic SLAFs, no polymorphic SLAFs and repetitive SLAFs. Clusters with more than four tags were regarded as repetitive SLAFs and were filtered out. SLAFs with two, three, or four tags were considered to be polymorphic SLAFs and those with only one tag were considered to be no polymorphic SLAFs. In this study, polymorphic SLAFs with sequence depth of both parents less than 5X were defined as low-depth SLAFs and filtered out. Finally, the potential SLAFs with one genotype derived from the male parent (G-42) and the other from the female parent (7-9) were identified as SLAF markers, and were selected for further association analysis.

Association analysis
The relative marker abundance in bulked DNA pool 1 (the small-seed pool) was calculated as the number of reads of the maternal allele divided by the number of reads of the paternal allele, whereas in pool 2 (the large-seed pool), the relative marker abundance was calculated as the number of reads of the paternal allele divided by those of the maternal allele. It was expected that the larger the relative abundance, the greater the possibility that the marker was associated with TSW. SNP-index association analysis [26] and Euclidean distance association analysis [27] were used in this research.
In this study, P stands for the male parent (G-42), M stands for the female parent (7-9), aa represents the small-seed pool and ab represents the large-seed pool.
SNP-index association analysis. SNP_index association analysis was recently published and is a type of method used to calculate genotype frequency differences between two bulks that are satisfied by Δ (SNP_index). The closer marker is associated with trait while the closer Δ (SNP_index) is associated with 1.
Δ (SNP_index) was calculated as follows: Maa is the depth of the aa group derived from M while Paa indicates the depth of the aa group derived from P; Mab means the depth of the ab group derived from M while Pab stands for the depth of the ab group derived from P.
Euclidean distance association analysis. Euclidean distance (ED) association analysis is a type of method that calculates Euclidean distance (quadratic sum root of differences between bulks from the depth of four types of base) and is satisfied by ED. In theory, the higher the ED value is, the closer the object site.
ED was calculated as follows: A aa , C aa , T aa , and G aa respectively represent the depth of bases A, C, T and G on a site in the large seed bulk. A ab , C ab , T ab , and G ab represent the depth of bases A, C, T and G on a site in the small seed bulk, respectively.
In this study, we used SLAF-seq technology combined with BSA to detect polymorphic tags between the two bulked DNA pools and quickly identified marker intensive hot-regions for seed weight on the genome of B. napus.

SLAFs development
After SLAF library construction and high-throughput sequencing, a total of 24.18 M reads were developed to procure SLAFs ( Table 1). The Q30 ratio was 88.18% and the GC content was 43.68% (Table 1). Of these high-quality data, 3,380,481 reads were from the male parent and 4,134,256 reads were from the female parent. Read numbers for the small-seed pool and small-seed pool were 9,453,088 and 7,216,711, respectively.
The numbers of SLAFs in the male and female parents were 86,429 and 95,008, respectively. The total depth and average depth of male and female parents was 1,801,757 (18.96x) and 2,132,208 (24.67x), respectively. For the two bulked pools, the numbers of SLAFs in the smallseed pool and the large-seed pool were 90,719 and 111,205, respectively. The total depth and average depth of the small-seed bulk and the large-seed bulk was 4,752,291 (52.38x) and 3,640,746 (32.74x), respectively (Table 2). Totally, we ultimately selected 112,292 SLAFs for further analysis.
Among the 112,292 SLAFs that were detected in total, 7,536 SLAFs showed polymorphism between the two parents with a polymorphism rate of 6.71% ( Table 3). The number of nonpolymorphic and repetitive SLAFs was 104,270 and 486, respectively. SLAF tags were located on the reference B. napus genome through short oligonucleotide analysis package (SOAP) software [28]. Statistics of marker numbers on each chromosome according to the positioning result were shown in Table 4 and a distribution diagram of SLAF on each chromosome was shown in Fig 2A. The SLAF tags were distributed equally on each chromosome.
SLAF-seq is a newly developed, efficient and high-resolution strategy for large-scale de novo SNP and Indel markers discovery and genotyping of large population and bulked segregant [14] through sequencing the paired-ends of the sequence-specific restriction fragment length [16]. It has several advantages such as high efficiency for marker development, low cost, short cycle, high accuracy with less sequencing and a high capacity for large populations [14]. Compared with other inefficient, expensive, and time-consuming conventional methods for developing markers, such as next-generation sequencing, restriction-site associated DNA (RAD) sequencing, bar-coded multiplexed sequencing and etc. [29][30][31], SLAF-seq can develop large amounts of sequence information, enable its sequencing data to generate molecular markers directly, guarantee the efficiency, uniformity, quality and quantity of maker development and cover the whole genome [16]. Since the SLAF-seq methods were developed, they have been used in several studies, such as molecular markers development, major QTLs identification, candidate genes association analysis, high-density genetic mapping and etc..
In this study, we are the first to used SLAF-seq technology in B. napus combined with BSA to detect polymorphic markers between the two bulked DNA pools and parents. A total of 111,205 SLAF tags were developed as the basis for high-throughput sequencing and 7,536 polymorphic markers were identified between two parents. Finally, 1,933 high quality polymorphic SLAF markers were finally selected for further association analysis with quantity and quality meeting the requirements. The SLAF markers were well-distributed on each chromosome, and both the integrity and accuracy were very high (Fig 2B). In our study, we quickly identified a marker intensive hot-region for seed weight on ChrA09 through SLAF-seq technology combined with BSA. This method quickly detected major QTLs at a genome-wide level and delimited it to a narrower region.

Polymorphic SLAF markers screening
A total of 7,536 polymorphic SLAFs were selected to obtain high quality polymorphic SLAFs after two rounds of sequencing and exclusion of low-quality fragments ( Table 1). Tags with a depth less than 5X were excluded first. Then, with the reference genome sequence, potential SLAF tags with one genotype deriving from P and the other from M were identified as SLAF markers. Finally, 1,933 high-quality polymorphic SLAFs were selected as candidate SLAF markers for further association analysis. Statistics for high quality polymorphic SLAF marker numbers on each chromosome were shown in Table 4 and a distribution diagram of candidate markers on each chromosome was shown in Fig 2C.

SNP_index association analysis
A total of 1,933 candidate polymorphic SLAFs were used for association analysis through the SNP_index method. The association threshold was 0.3764 and 4 SLAF markers on ChrA09 significantly correlated with the seed weight trait. The result of the SNP_index association analysis was shown in Fig 3A. Statistics for the number of associated SLAF markers on the chromosome were shown in Table 5. Through analysis of the 4 associated SLAF markers, a trait related candidate region on ChrA09 was identified. The candidate regions had a size of 0.58 Mb at nucleotides 25,401,885-25,985,931 with approximately 91 candidate genes in the region. The result of the candidate region identification by the SNP_index method was shown in Table 6.

Euclidean distance association analysis
A total of 1,933 candidate polymorphic SLAFs were also used for association analysis through the Euclidean distance method. The association threshold was 0.5532 and 4 SLAF markers on ChrA09 were significantly correlated with the seed weight trait. The result of the Euclidean distance association analysis was shown in Fig 3B. Statistics for the number of associated SLAF markers on the chromosome were shown in Table 6. Through analysis of the 4 associated SLAF markers, a trait related candidate region on ChrA09 was identified. The candidate regions had a size of 0.58 Mb at nucleotides 25,401,885-25,985,931 with approximately 91 candidate genes in this region. The result of candidate region identification by the Euclidean distance method was shown in Table 6.

Euclidean distance combined SNP_index association analysis
Euclidean distance and the SNP_index combined method were used for association analysis of 1,933 candidate polymorphic SLAFs. Four SLAF markers on ChrA09 were significantly correlated with the seed weight trait. The statistics of the number of associated SLAFs on the chromosome, the candidate regions and genes were shown in Tables 5 and 6. From all the results of three types of association analysis (Tables 5 and 6), we could conclude that the seed weight trait related candidate regions were at the same place.  In summary, it was shown that the candidate genes of seed weight were all located on ChrA09. It might verify the accuracy of SLAF-Seq through comparing with a linkage map of ChrA09 or the major QTLs of seed weight on ChrA09 in B. napus. We previously constructed a linkage genetic map using a F 2 population derived from the same cross between rapeseed lines G-42 and 7-9 with 128 SSR markers and 100 SRAP markers and detected two major QTLs for SW. Two QTLs (QSW-X-A9-1 and qSW-W-A9-3) were both localized to ChrA09 [32]. They were located between two markers, Na14-B03 and CB10373-2, which were quite close to our candidate hot-region (25,401,885-25,985,931 bp) identified from SLAF-seq by blasting with B. napus reference genome [22]. In conclusion, compared to our previous QTL mapping, the candidate gene hot-region by SLAF-seq might be confirmed. To further validate the accuracy of these four associated SLAF markers, we chose 10 SLAF loci derived from 4 SLAF markers in 2 parents and 10 random individuals and performed independent traditional Sanger sequencing. Of these 120 genotypes, 117 were consistent and 3 were incorrect with the SLAF-seq genotyping information. Details are shown in S2 Table. The results compared by two types of sequencing ways confirmed the genotyping accuracy of SLAF-seq.
To deeply understand the differences and new findings in our research compared with other studies, we enumerated some similar work on B. napus seed weight QTLs. Li et al. [33] detected an association signals (position at 34, 653 kb) for seed weight on ChrA09 using association mapping which were consistent with some previous studies of quantitative trait loci mapping in B. napus. Li et al. [34] harbored two QTLs (their confidence intervals were on the position 30.68 Mb to 31.19 Mb for uq.A09-1and 29.02 Mb to 30.28 Mb for uq.A09-3) for both seed weight and silique length on ChrA09 by regional association analysis with a panel of 576 inbred lines in B. napus. Liu et al. [35] identified a major QTL on ChrA09 for both seed weight and silique length, which was confirmed to be the same one with Li et al. [34]. By fine mapping and association analysis, they finally uncovered a 165-bp deletion in the auxin-response factor 18 (ARF18) gene associated with increased SW and SL. Apparently, these QTLs or gene above for seed weight were totally different with ours. It is very likely that seed weight is quantitatively inherited, which is controlled by multiple QTLs [36].

Association regional gene annotation
Totally, we obtained 4 polymorphic SLAF markers which narrowed the candidate associated regions down into 0.58 Mb in size on ChrA09, with a total of 91 genes. Ninety-one candidate genes were blasted with Gene ontology (GO) [37], Cluster of Orthologous Groups of proteins (COG) [38], Kyoto Encyclopedia of Genes and Genomes (KEGG) [39], Swiss-Prot [40] and Non-redundant protein (NR) [41] databases by BLAST software [42] yielding 90 genes that were successfully annotated. All the annotated information was listed in Table 7 and S2 Table. Of these candidate genes, 90 could be annotated in NR database; 70 could be involved in Swiss-Prot database; 87 could be included in GO database; 25 could participate in KEGG pathway and 35 could find annotated information in COG database. Annotation of the 90 candidate genes contributed to the further study of map-based gene isolation. The details about 90 candidate genes annotation information from GO, COG and KEGG databases were showed in S1, S2 and S3 Figs. From genetic and molecular based research on rice yield, it is known that grain weight is controlled by cell division in the outer glumes and the grain filling rate [43]. For example, in rice, the genes of GS3 and qGL3 negatively regulate cell division in the outer glumes so that the loss of their functions increase grain yield [44][45][46]. Previous studies on rice and Arabidopsis concluded that IAA might play an important role in regulating cell number. For sink organs of rice, the tgw6 allele affects the timing of the transition from the syncytial to the cellular phase by controlling IAA supply and limiting cell number and grain length [47,48]. From the annotated information in our study, we found four interesting candidate genes, GSBRNA2T00037136001, GSBRNA2T00037157001, GSBRNA2T00037129001 and GSBRNA2T00069389001. GSBRNA2T00037136001 participates in cell division; GSBRNA2T00037157001 was involved in the process of seed development; GSBRNA2T00037129001 was involved in both seed development and cell division; and GSBRNA2T00069389001 participated in the process of IAA biosynthesis, all of which might be highly related to seed weight.

Conclusions
In this study, SLAF-seq technology combined with BSA was firstly and successfully used to detect candidate genes for seed weight in B. napus. A hot-region~0.58 Mb with 91 candidate genes on ChrA09 were identified to be tightly associated with the TSW trait. The four most likely candidate genes were selected from annotation information. Confirmation of the function of these candidate genes by transformation or assessment of mutation for gene mining represents worthwhile future studies.