Association Mapping of Seed Oil and Protein Content in Sesamum indicum L. Using SSR Markers

Sesame is an important oil crop for the high oil content and quality. The seed oil and protein contents are two important traits in sesame. To identify the molecular markers associated with the seed oil and protein contents in sesame, we systematically performed the association mapping among 369 worldwide germplasm accessions under 5 environments using 112 polymorphic SSR markers. The general linear model (GLM) was applied with the criteria of logP≥3.0 and high stability under all 5 environments. Among the 369 sesame accessions, the oil content ranged from 27.89%–58.73% and the protein content ranged from 16.72%–27.79%. A significant negative correlation of the oil content with the protein content was found in the population. A total of 19 markers for oil content were detected with a R2 value range from 4% to 29%; 24 markers for protein content were detected with a R2 value range from 3% to 29%, of which 19 markers were associated with both traits. Moreover, partial markers were confirmed using mixed linear model (MLM) method, which suggested that the oil and protein contents are controlled mostly by major genes. Allele effect analysis showed that the allele associated with high oil content was always associated with low protein content, and vice versa. Of the 19 markers associated with oil content, 17 presented near the locations of the plant lipid pathway genes and 2 were located just next to a fatty acid elongation gene and a gene encoding Stearoyl-ACP Desaturase, respectively. The findings provided a valuable foundation for oil synthesis gene identification and molecular marker assistant selection (MAS) breeding in sesame.


Introduction
Sesame (Sesamum indicum L.) is an ancient and important oilseed crop and is cultivated mainly in the tropical and subtropical regions of Asia, Africa and Southern America. The harvested area of world sesame reaches to 7.3610 7 hm 2 , and the total product per year is roughly 3.7610 7 ton from 2001 to 2010 (FAO data). Compared with other main oil crops, e.g., soybean (18% of average oil content) [1], oilseed rape (41%) [2], sunflower (40-44%) [3] and peanut (51%) [4], sesame is one of the few oil crops with the highest oil content and quality. Sesame seeds contain 55-58% oil and almost 18% proteins. Among the fatty acid compositions in sesame seeds, oleic acid (18:1) (39.6%) and linoleic acid (18:2) (46.0%) are the two main components with the ideal ratio of almost 1:1 [5,6]. Apart from the seed yield, the content of seed storage oil and proteins is the highlight agronomic trait in sesame breeding [7].
In the past two decades, in order to clarify the high quality of sesame oil and protein, many researchers focused on exploring seed development and fatty acid and storage protein synthesis processes, as well as identifying the lipid synthesis related genes and molecular markers in sesame [8][9][10][11][12]. Of all the three available cDNA libraries, two libraries are constructed for seed development analysis [13,14]. However, oil and protein contents are complex quantitative traits and always are affected by genotype and environment [15]. At present, the mechanism of high oil content and quality in sesame seeds is still unclear. No loci of oil and protein content traits have ever been found in the sesame linkage maps. Even though Wei et al. [12] performed the association analysis of seed oil and protein content and fatty acid composition within 216 Chinese sesame accessions using 79 molecular primer pairs (including SSRs, SRAPs and AFLPs), only one association marker (M15E10-3) was identified under two environments. Therefore, in order to precisely detect the genes or markers associated with oil and protein content traits and to improve the sesame breeding, more efficient markers and germplasm resources with larger phenotypic variation need to be applied [16,17].
Currently, linkage analysis (QTL mapping) and association mapping are two main and common analysis tools for dissecting complex phenotypic variation. Compared with the traditional linkage analysis based on mapping populations, association mapping offers higher precision for locating QTLs and selecting molecular markers [16,18]. Till now, association mapping has been extensively used for analyzing important agronomic and quantity traits in wheat, maize, cotton, oilseed rape and other crops [18][19][20][21][22]. In the past several years, vast simple sequence repeat (SSR) or microsatellite markers with the high polymorphism are developed in sesame [7,23,24]. Accordingly, the association mapping is getting reliable and powerful for detecting the genes or markers associated with key traits and improving the molecular marker-assisted selection (MAS) in sesame breeding programs.
The objectives of this study are: (1) to perform the association mapping of seed oil content (OC) and protein content (PC) traits in worldwide sesame accessions using the GLM and MLM models, (2) to reflect the characteristics of sesame oil and protein contents under various environments, and (3) to determine the key SSR markers associated with seed quality. In this report, a natural population covering 369 worldwide accessions from China and other 15 countries and 112 pairs of polymorphic SSR markers were applied. The results give a foundation for investigating the seed development-related genes and seed quality in sesame.

Seed oil and protein content variation in the natural population
Both the seed oil content (OC) and the protein content (PC) are often influenced by environment. To decrease the environmental effect, we collected the phenotypic data of the 369 sesame accessions under 5 environments of Pingyu and Yuanyang locations in 2011, and Pingyu, Yuanyang and Xinyang locations in 2012, The descriptive parameters under each environment were calculated ( Table 1). The results showed that the OC and PC significantly varied among the 369 accessions. In total, the OC of the natural population ranged from 27.89%-58.73%, with an average of 49.59%-53.14%; meanwhile, the PC ranged from 16.72%-27.79% with an average of 20.28%-22.51%. All the datasets showed a normal or nearly normal distribution. To determine the heritability of the phenotypes, we performed the variance analysis of oil and protein contents ( Table 2). Results indicated that the OC and PC traits were significantly influenced by genotype and environments (i.e., year and location). No significant interactions between variety and environmental factors (year and location) were detected. Moreover, the OC and PC traits presented the significant negative correlation under the 5 environments, as the correlation coefficient (r) between OC and PC varied from 20.66,20.72 (P,0.01) in 2011 and from 20.52,20.74 (P,0.01) in 2012, respectively (Data not listed).

Linkage disequilibrium
Linkage disequilibrium (LD) refers to the non-random association of alleles between the genetic loci. A total of 112 SSR markers were used for estimating the LD level among the 369 sesame germplasm accessions (Table S2). These SSR markers distributed in 33 contigs/scaffolds with a total length of 180.86 Mb, which approximately represented 67 percentage of the assembly genome size (270 Mb) and 50 percentage of the estimated genome size (360Mb). The average SSR density was 1 SSR per 1.6 Mb. To reflect the associations between the polymorphic loci of the 112 SSR markers, LD P-values were determined among the 6,216 locus pairs (i.e., 112*(112-1)/2) using Fisher's exact test and two indexes of D9 and r 2 ( Figure 1). The average values of D9 and r 2 for the 6,216 pairs were 0.1649 and 0.0173, respectively. Of the 6216 pairs, 2584 pairs (41.57%) showed a significant linkage disequilibrium (P,0.01), 363 pairs (5.84%) showed a higher linkage disequilibrium of D9.0.5, and 33 pairs (0.5%) gave a D9 value of 1.0 (i.e., complete linkage). The data indicated that linkage disequilibrium existed among the sesame accessions.

SSR marker diversity and population structure
Before analyzing the association, the polymorphism of the 112 SSR markers within the 369 germplasms was investigated (Table  S2). Results showed that the number of alleles ranged from 2-5, with an average of 2.47 per locus. The PIC values of the markers    Of the 369 accessions, 126 accessions were grouped into subgroup 1 (the green ones in Figure 3), 243 accessions into subgroup 2 (the red ones in Figure 3). Most (47 out of 51) of the foreign lines were located in subgroup 2. The two subgroups in the population were considered having the complex 'admixture' relationship, and no significant correlation of geographic origin with the subgroups in the Chinese lines were found.

Marker-trait associations
In this study, the association analysis was performed using general liner model (GLM) method. The stringent criterion of logP$3.0 under 5 environments was used for determining the association significance between the OC and PC traits. For the OC trait, 19 markers were detected and the R 2 values ranged from 4%-29% (Table 3 & Table S2). According to the primer locations in the sesame genome, 19 markers were mapped in 11 contigs/ scaffolds. Of the 19 markers, Hs485 and Hs586 are located in scaffold00001, Hs4381, Hs02 and Hs4082 are located in C01, and Hs4061, Hs345 and Hs19563 are in C04. The distance between the markers that located in the same contig/scaffold was short than 2 Mb (Table S2). For the PC trait, 24 markers were detected with the R 2 value range of 3%-29% and mapped in 15 contigs/ scaffolds (Table 4 & Table S2). Comparison results indicated that 19 of the above 24 markers were associated with OC trait at the same time; the other 5 markers of Hs425 (in C08), Hs560 (in C11), Hs4265 (in C12), and Hs4089 and Hs1514 (in C19) were unique to PC trait. Simultaneously, in order to assay the stability, we performed association mapping using mixed liner model (MLM) method, with the criterion of logP$3.0 under at least 3 environments.
For the OC trait, 9 markers were detected in MLM model and 8 markers of Hs345, Hs4381, Hs485, Hs1036, Hs4061, Hs635, Hs376 and Hs586 were confirmed using GLM and MLM methods. Especially, 4 markers of Hs345, Hs4381, Hs485 and Hs1036 had high R 2 values ($10%) under 5 environments (Table 3 and Table S3). For the PC trait, 9 markers were found and confirmed in both models, of which 7 markers had high R 2 values ($10%) under 5 environments (Table 4 and Table S3). These results suggested that the OC and PC traits are controlled mostly by major genes in sesame.

Marker effect on the phenotypic variation
To explore the association between the above markers and phenotypic variation and the utility potential in sesame MAS breeding program, we performed the allelic effects of the five markers associated with both traits (Table 5). For each marker, the effects estimated were in accordance with the allele character under the 5 environments. Hs345 had the largest effect on the variation of seed oil and protein content. The Hs345-1:1 and Hs345-2:2 of Hs345 showed the different variation effects on OC trait, as the average oil content in the accessions ranged from 52.05%-42.82%. In the genotypes carrying the Hs4381-1:1 allele, oil and protein contents were 43.89% and 23.25%, respectively, while the samples with the Hs485-2:2 contained 52.00% oil and 21.00% proteins. Furthermore, the specific allele indicated the negative or positive effect to a large extent on the OC or PC trait. For example, the allele effect of Hs345-2:2 on OC trait ranged from 28.13 to 212.18, whereas the effect on PC trait ranged from 2.05 to 3.82. Comparison results suggested that the alleles of all 5 markers give the opposite effects on OC and PC traits, respectively. The allele that increased the oil content certainly gave the negative effect on protein content, and vice versa. Therefore, these markers could be used for screening sesame lines with either high oil or protein content.

Comparative genome analysis
To clarify the distribution and more information of the above associated markers, we performed the comparative genome analysis of the 19 SSR markers associated with the OC trait ( Table 6). All genes closest to the markers were explored. Of the 19 markers, 11 are located in gene regions and 8 are in intergenic regions. The flanking genes had various functions, such as ligase (C01.560), transcription factor (C04.26, C13.438) and kinase (C14.22). Moreover, we found that the markers of Hs4082 and Hs345 were just located next to C01.883 (ABC transporter G family member 3 gene) and C04.786 (Stearoyl-ACP Desaturase gene), respectively, which were proved involved in plant lipid biosynthesis. We also screened the upstream and downstream sequences of 500 Kb far from each marker. 17 (out of 19) markers were proved close to plant lipid pathway genes. These result further confirmed our association mapping conclusions.

Discussion
To clarify the genetic mechanisms of fatty acid and protein synthesis in sesame seeds, we herein performed the association mapping analysis of the OC and PC traits among 369 sesame accessions using 112 genic-SSR markers. These accessions were collected from 19 provinces of China and 15 other countries, and represented the genetic diversity of sesame germplasm for association mapping study. These accessions included many released Chinese and foreign cultivars. Compared with the traditional linkage analysis (QTL mapping), the association analysis based on linkage disequilibrium (LD) has been applied for the quantitative trait loci (QTLs) detection and location in many crops. Meanwhile, GLM and MLM models are applied individually for evaluating the marker association. In this article, 19 SSR markers associated with the OC trait were detected under each 5 environments in GLM model, while 24 markers were determined and associated with the protein content.
Sesame genetic diversity and the population structure Sesame is a diploid and self-pollinated oil crop with the karyotype of 2n = 2x = 26. As all cultivars are originated from the sole cultivated species, Sesamum indicum L, the narrow genetic diversity presents in regional sesame resources to a large extent [24,[49][50][51]. In addition, many reports reflect that there is no clear association between genotype and geographical origin, as many sesame accessions from the different geographic locations are clustered into the same group in the dendrogram [51][52][53]. Apart from the natural history of predomesticated ancestors, the diversity pattern of domestic species could be influenced by the breeding practice, germplasm collection and human activity [53][54][55]. In this article, in order to guarantee the broad genetic variation, we selected the 369 worldwide accessions for seed nutrition genetic analysis according to the geographical origin, molecular clustering and the morphologic diversity (Table S1). The heterozygosity ranged from 0.27% (Hs373)-23.12% (Hs4325). The result proved that the natural population could be used as the core germplasm for association mapping (Table S2). Population structure analysis showed that many sesame accessions collected from the same geographic locations were not grouped together, which further proved the unclear association between genotype and geographical origin in sesame germplasm resource [51][52][53].
During performing the association mapping in a population, LD patterns between the functional loci and markers should be analyzed at first [56]. We analyzed the P-values of linkage disequilibrium between the polymorphisms of the 112 SSR marker loci using Fisher's exact test ( Figure 1). As 363 (5.84%) pairwise comparisons had the high LDs (D9.0.5), the linkage disequilibrium existed in 369 sesame accessions. Accordingly, we believed that the natural population is suitable for association analysis of the oil and protein contents.

Oil and protein content variation and associated SSR markers
Among large-scale sesame germplasm resources, the oil and protein contents varied significantly. Yermanos et al. [57] evaluated 721 sesame samples collected from more than 19 countries, and found that the oil content varied from 40.4-59.8% with the mean of 53.1%. The protein content ranged from 19-31% with an average of about 25% [58]. In our population, the oil content varied from 27.89%-58.73% with an average of 51.34%, while the protein content varied from 16.72%-27.79% with an average of 21.19% ( Table 1). The data reflected the great variation of sesame seed compositions in germplasm [57,59,60].
Comparison analysis proved that there is a strong negative correlation of the oil content with protein content. Interestingly, the stringency relationship was also exhibited in the association analysis. As shown in the GLM analysis in Table 3 and 5, all the 19 markers associated with OC were detected for PC trait; and the alleles exhibited the opposite effects on OC and PC at the same time. These phenomena also present in other oil crops, such as oilseed rape, cotton, soybean and peanut [1,[61][62][63]. Zhao et al. [62] found 6 QTLs with pleiotropic effects on both oil content and protein content in oilseed rape. In the cotton backcross inbred population, of 17 QTLs for oil content and 20 QTLs for protein content, 8 QTLs co-localized in the same chromosome regions and controlled oil and protein contents with opposite additive effects [63]. Hwang et al. (2014) detected 25 SNPs associated with seed oil in 13 different genomic regions through GWAS (genome wide association study), and 7 SNPs were significantly associated with both protein and oil traits. For the six of seven marker loci, a negative relationship existed between the protein effect versus that on oil [64]. Meanwhile, QTLs or markers associated with increased protein and oil contents were also found. For example,  found that 2 QTLs that controlled oil content were independent from protein content by conditional QTL mapping [65]; Hwang et al. (2014) found a SNP at the 4.92 Mb position on Chr 9 was associated with increased protein and oil contents [64].
In many crops, the seed oil content seems to be controlled mostly by major genes [66][67][68][69]. In this study, we detected 19 markers associated with OC trait in sesame using GLM method, and the R 2 values ranged from 4.0-29.0%. Chen et al. (2010) screened 27 QTLs related to oil content in oilseed rape and the individual explanation was high with the range of 4.20-30.20% [66]. In Arabidopsis thaliana, a single QTL or marker could give an explanation of 17-19% for the seed oil content [68]. In soybean, the explanation reached to 14.3-45.6% [69]. In this report, 112 polymorphic SSR markers were used for association mapping. Compared with other common molecular markers, such as SRAP and AFLP, SSR marker is more suitable for sesame diversity analysis due to the narrow genetic basis [24,[49][50][51]. The marker distribution and density were analyzed using the sesame genome assembly data. The contigs/scaffolds carrying the 112 markers approximately covered 180 Mb of the sesame genome and occupied ,67% of the assembly size (270 Mb) and 50% of the estimated genome size (,360 Mb, in which 90 Mb was believed to be repeat sequences) (Table S2) [70,71]. Therefore, the association mapping using 112 markers is meaningful and reliable, even though some QTLs might be missed. To detect more QTLs, new SSR or SNP markers could be applied in further research.   Chromatin remodeling factor [47] Candidate genes and oil components To clarify the marker location and more genome information for OC and PC traits, we screened the genes that were close to the associated markers using the sesame genome assembly data. As a result, 36 candidate genes related to lipid pathway were identified (Table 6) . We found that most candidate genes were involved in three pathways: (1) fatty acid and TAG (triacylglycerol) synthesis and elongation, e.g., C01.526, C01.548 and C01.928; (2) TAG degradation, e.g., C01.601; and (3) fatty acid dehydrogenation, e.g., C04.786 (Stearoyl-ACP desaturase, determining the ratio of saturated and unsaturated fatty acids) (Table S4). Therefore, the seed oil content in the sesame accessions could be regulated by three factors, i.e., oil synthesis ability, oil degradation ability and oil component ratio (e.g., 18:1 and 18:2 fatty acids). In various accessions, any alleles of the genes involved in fatty acid and TAG synthesis, TAG degradation or dehydrogenase genes could give effects on oil content. To confirm this hypothesis, further studies of seed oil synthesis should be performed in the future.

Conclusion
We systematically explored the association mapping of seed oil and protein content traits in 369 worldwide sesame accessions using the 112 SSR markers. A significant negative correlationof the oil content with the protein content existed in the population. 19 SSR markers were associated with the oil content trait with high phenotypic variation explanation, and 24 SSR markers were associated with the protein content trait using GLM method. This association results would provide an efficient platform for seed development research and MAS breeding in sesame.

Plant materials
A population of 369 core sesame germplasm resources was chosen according to the genetic diversity analysis results and phenotypic variation [51]. These core genotypes comprised 318 lines from 19 provinces of China and 51 worldwide lines from the 15 countries, which were reserved at the Sesame Germplasm Bank of Henan Sesame Research Center (HSRC), HAAS (Henan Academy of Agricultural Sciences) (

Oil and protein content analysis
After harvested, ,20 g of seeds were collected from five plants per line, and the seed OC and PC were measured on infrared determination equipment (Perten DA7250, Sweden) according to the manufactures' instructions. The standard curve for measuring the sesame oil and protein contents had been established according to the chemical analysis results of 300 sesame accessions (Unpublished data, HSRC). Three replications of each samples were assayed for phenotypic analysis. Mean value, broad-sense heritability and correlation coefficient of the phenotypic data were analyzed using the statistical analysis system software (SAS Institute Inc. 2002) [72].

SSR genotyping
The 112 polymorphic SSR pairs were selected from our SSR marker bank [24,51,73] (Table S2). DNA extraction, PCR amplification, electrophoresis and SSR genotyping analysis were performed according to the methods described by Zhang et al. [24]. The total number of polymorphic alleles at each SSR locus was calculated according to the results of all 369 lines. The polymorphic SSR alleles presented only within 4 (1%) or fewer accessions were recorded as rare alleles.

Linkage disequilibrium (LD)
As the population structure could result in the spurious associations between phenotypes and marker loci, we analyzed the extent and structure of LD within the population before selecting the appropriate association mapping strategy. To assay whether the 112 polymorphic SSR markers were segregated independently or not, LD analysis was conducted according to the dedicated procedure of the TASSEL software [49]. Both D9 and r 2 were used for quantifying LD values [74,75]. Significance (P values) of D9 for each SSR pairs was determined with 100,000 permutations.

Population structure and relatedness analysis
The population structure was determined using STRUCTURE 2.2 [76]. The mixture model and the independent allele frequency model were used to analyze the population dataset. Five runs of STRUCTURE were carried out for each number of populations (K) (from 1-10), and each run started with 10,000 burn-ins, followed by 100,000 iterations. While performing the STRUC-TURE, we assumed that the inferred population accord with Hardy Weinberg equilibrium (HWE) and the loci are unlinked. To correct the relatedness of individuals in further analyses, the relatedness between individuals (relative kinship) was evaluated using SPAGeDi 1.2 software [77]. The matrix with the relative Note: a Related genes are the genes containing or close to the screened markers. b Nearby lipid genes refer to the genes located in the upstream and downstream of 500 Kb far from the marker. c Only one of the homologous genes is listed in the table.
-refers to no known lipid genes in the location. doi:10.1371/journal.pone.0105757.t006 kinship coefficients (K matrix) was applied for association analysis using the Mixed Linear Model(MLM, Q+K)method.

Association mapping and marker distribution in sesame genome
Associations between the SSR markers and the oil and protein content traits were investigated using both methods of the general linear model (GLM, Q) and the mixed linear model (MLM, Q+K) in TASSEL 2.1 described by Bradbury et al. [49]. The mean value of the markers at P,0.005 was used for determining the significance of marker-trait associations.
To determine distributions of the associated markers in sesame genome, we carried out the alignment of SSR markers and transcripts with the updated sesame genome data [70,71]. In the present genome assembly, the number of N50 scaffold was 14, and the number of N90 scaffold was 64. 29,798 gene models were identified. Among related scaffolds or contigs, the sesame lipid synthesis related genes were identified according to the homologous comparison using the genes of A. thaliana from the Acyl Lipids pathway database [78] as queries.

Author Contributions
Conceived and designed the experiments: HZ. Performed the experiments: CL HM LW. Analyzed the data: CL HM LW. Contributed reagents/ materials/analysis tools: TZ XH. Wrote the paper: CL HM LW.