Identifying Loci Influencing 1,000-Kernel Weight in Wheat by Microsatellite Screening for Evidence of Selection during Breeding

Chinese wheat mini core collection (262 accessions) was genotyped at 531 microsatellite loci representing a mean marker density of 5.1 cM. One-thousand-kernel weights (TKW) of lines were measured in five trials (three environments in four growing seasons). Structure analysis based on 42 unlinked SSR loci indicated that the materials formed two sub-populations, viz., landraces and modern varieties. A large difference in TKW (7.08 g, P<0.001) was found between the two sub-groups. Therefore, TKW is a major yield component that was improved in the past 6 decades; it increased from a mean 31.5 g in the 1940s to 44.64 g in the 2000s, representing a 2.19 g increase in each decade. Analyses based on a mixed linear model (MLM), population structure (Q) and relative kinship (K) revealed 22 SSR loci that were significantly associated with mean TKW (MTKW) of the five trials estimated by the best linear unbiased predictor (BLUP) method. They were mainly distributed on chromosomes of homoeologous groups 1, 2, 3, 5 and 7. Six loci, cfa2234-3A, gwm156-3B, barc56-5A, gwm234-5B, wmc17-7A and cfa2257-7A individually explained more than 11.84% of the total phenotypic variation. Favored alleles for breeding at the 22 loci were inferred according to their estimated effects on MTKW based on mean difference of varieties grouped by genotypes. Statistical simulation showed that these favored alleles have additive genetic effects. Frequency changes of alleles at loci associated with TKW are much more dramatic than those at neutral loci between the sub-groups. The numbers of favored alleles in modern varieties indicate there is still considerable genetic potential for their use as markers for genome selection of TKW in wheat breeding. Alleles that can be used globally to increase TKW were inferred according to their distribution by latitude and frequency of changes between landraces and the modern varieties.


Introduction
China is the largest wheat producer and consumer in the world, with 23.6 million ha, a mean yield of 4,762 kg/ha, and a total production of 112 million tonnes in 2008. There is a long history of wheat cultivation in China extending over more than 2,000 years. Production extends from latitude 22°49′ to 48°03′, and much progress has been achieved in breeding and production in the last 60 years. Average wheat yields increased annually by 1.9% and production increased more than six-fold [1]. Thousand-kernel weight (TKW), one of the three major components of yield in wheat, has steadily increased over the period. Based on phenotyping of 1,800 cultivars released since the 1940s, TKW increased from a mean of 31.5 g in the 1940s to 44.64 g in the 2000s, a 2.19 g increase in each decade (Zhang et al. unpublished). Previous studies also showed that TKW was one of the three yield components with highest heritability, which varied from 59% to 80% [2]. Most genes affecting TKW have additive effects. Selection for TKW in the early generations of breeding is highly effective [2].
Crop domestication is an artificial evolutionary process of combining traits to meet human needs. During the domestication of cereals, for example, reduced plant height to avoid lodging, large spikes, increased grain size, and disease resistance were selected and conserved. Modern breeding involved further directional selection, which resulted in lower genetic diversity within the domesticated population than in the entire species. At the genome level, only a small number of genes (alleles) were positively selected and conserved [3]. Many other alleles at specific loci were gradually eliminated, leading to reduced genetic diversity at these loci compared with that present in the entire species. Diversity in genomic regions flanking the target genes was simultaneously reduced because of linkage. This phenomenon is referred to as linkage drag, hitchhiking, or selective sweep [4]. Hitchhiking generally leads to reduced diversity at target loci, linkage disequilibrium at loci surrounding the selected gene, and changed distribution patterns of alleles within the selected region [5,6]. These effects also provide the basis for association of neutral markers, such as SSR and DArT, with agronomic traits [6-10].
We established a Chinese common wheat core collection (CC) and a mini core collection (MCC) after genotyping 5,029 candidate accessions at 78 SSR loci [11]. Choice of candidate entries was based on documentary data in the national gene bank [12]. The MCC contains 231 accessions, or 1% of the basic collection (23,135 accessions) with an estimated 70% representation of the genetic variation in that collection [11,13]. The higher genetic diversity and artificial diminishment of dominant allelic frequencies in the MCC make it a suitable population for detection of major QTLs controlling yield traits. It was shown to be a good reference set for revealing geographic distribution and time changes of important functional genes [10,14-18]. In this study, we target loci associated with TKW to show the value of the MCC in dissecting complex yield traits in wheat. This association analysis provides useful information for marker-assisted selection in breeding wheat for increasing yield.

Phenotypic Assessment
TKWs of the Chinese mini core wheat collection were measured in five trials covering 4 growing seasons and 3 environments: Luoyang, Henan province, in 2002, 2005, and 2006; Shunyi, Beijing, in 2010; and Qingdao, Shandong, in 2010 (Table 1). Minor differences in mean TKW occurred among different planting environments, and there were major differences between landraces and modern varieties. The MTKW of modern varieties (39.23 g) calculated using BLUP methods based on multiple environments was significantly higher (P<0.001) than that of landraces (32.15 g), confirming that TKW was a yield trait improved by breeding. The maximum TKW was not in a modern variety, but in a landrace. This indicates that further genes for this trait are present in landraces and can be accessed for breeding. The total agronomic data were considered in whole genome association analysis.

Population Structure Analysis
Population structure can produce locus associations that are statistically significant but biologically invalid, because the markers are strongly correlated with the structure itself. However, if the population structure is properly dealt with, the likelihood of spurious associations can be minimised [7,19]. Forty-two loci distributed across every arm of the 21 wheat chromosomes were chosen to examine the population structure of entries in the mini core collection. We tested numbers of assumed groups (K) from 1 to 10. After 80 cycles of simulation, we found that K = 2 was the best separator, providing the highest ΔK value and showing that the MCC entries comprised two sub-populations. One group was mainly the landraces, and the other included modern varieties and introduced lines (Fig. 1). Overlapping occurs between the two groups because in the early breeding period (1940-1960s), most of the released varieties were derived from crosses between Chinese landraces and introduced European or American varieties [20]. This was consistent with results based on 512 SSR loci using a similar set of materials [13].
Among the 24 loci associated with MTKW in at least two trials, we found breeder-favored alleles with strong positive effects on MTKW at 22 loci; they were mapped to 11 chromosomes, viz. 1A, 1B, 1D, 2A, 3A, 3B, 5A, 5B, 5D, 6D, and 7A. The 7A effect spanned four loci: gwm471, wmc168, wmc17 and cfa2257. The genetic distance between wmc17 and cfa2257 is 2.72 cM. No strong linkage disequilibrium (LD) was found between the two loci (r² = 0.10, P>0.05), indicating they may not relate to a single yield gene, a result also suggested by previous QTL studies of TKW [22-26]. Three loci on each of chromosomes 1B and 2A were associated with MTKW. The allelic effect at each locus on MTKW was estimated by ANOVA (SPSS16). Significant or extremely significant differences in MTKW were detected between varieties with the favored allele and those with other alleles. Six loci with the strongest effects, each individually explaining more than 10% of the total variation (R² > 10%), were detected on chromosomes 3A, 3B, 5A, 5B and 7A (Table 2).

The Distribution of Favored Alleles at Associated Loci
We estimated the frequencies of favored alleles at each of the 22 loci in the landrace and modern variety groups in the mini core collection. Except at gwm403-1B, favored allele frequencies were much higher in modern varieties than in the landraces (Figure 3, Table S2). This reflects positive selection of those alleles in breeding programs.
Modern varieties usually have fewer allelic variations than landraces [13]. However, the major allele frequency is not always higher in modern varieties than in landraces (Table S3). At the 42 loci without obvious signs of selection, the average major allelic
Table 1. Comparison of 1,000-kernel weights between landraces and modern varieties in the Chinese wheat mini core collection in the 5 environments.
Among the four loci with favored allelic frequencies higher than 50% in modern varieties, cfa2234-3A, barc56-5A and wmc17-7A were among the six loci with the highest effects on phenotypic variation of TKW (Table 2). In addition, dramatic increases were also detected at wmc17 and cfa2257 on 7A (Table S2); these were also among the six loci (Table 2). The increased numbers and frequencies of favored alleles were accompanied by increased mean MTKW in modern varieties (Table 3). Therefore, we believe that the increase in favored allele frequencies at the 22 loci was mainly caused by selection for grain size over the five decades before 2000 (Table S2).

Accumulation of Favored Alleles from Breeding
Positive selection of favored alleles at key loci was also clearly implicated by changes in their number and frequency (Table 3). The best modern variety (44.01 g) had 15 favored alleles at 22 critical marker loci, whereas the best landrace (38.84 g) had 10. Almost 92% of the landraces had 0-5 favored alleles, whereas 85.2% of modern varieties had more than 5 favored alleles, ranging from 5-15. Modern breeding has significantly promoted the accumulation of favored alleles in varieties (Fig. 4). These results illustrate the reliability of identifying favored alleles. Importantly, no modern cultivar has favored alleles at all 22 marker loci (Table 3, Fig. 4), indicating further capacity for improvement of TKW by marker-assisted selection.

Geographic Distribution of Favored Alleles at the Six Loci with the Highest Contributions to TKW
The closely located loci cfa2257 and wmc17 on chromosome 7AL, which had the highest contributions to TKW, were chosen to analyze their distributions in different production regions in China (Figure 5). The favored alleles (182 bp and 184 bp) of wmc17 occurred in both landraces and modern varieties, but their frequencies were significantly higher in modern varieties than in landraces. Among landraces the highest frequency of the favored allele with high TKW was in region VI with region VII in second place. Both of the regions grow spring wheats with high TKW. For modern varieties, regions IV and VI had the highest frequency, with VII in third place. Other regions showed large variations in the frequencies of favored alleles. Regarding cfa2257, the highest frequency of the favored 129 bp allele was in region V with region VI in second place, a little lower than its frequency in landraces in region V. This allele was not present in landraces from 5 wheat regions (I, II, VII, VIII, and IX), a situation clearly different from the modern variety group where all modern lines, for example in region IX, carried the favored allele. This allele was also common in varieties from regions VI and VIII and occurred in the other regions. The geographic distributions of favored alleles at four other loci are included in Figure S1.

Genetically Additive Effects of Favored Alleles on TKW
To determine if additive effects occur among the favored alleles at the 22 loci, we estimated the mean TKW of varieties with different numbers of favored alleles. There was a high linear correlation (Y = 1.294X + 29.33, R² = 0.95) between MTKW and number of favored alleles (Figure 6), clearly indicating additive effects of favored alleles. However, an apparent negative interaction among loci once the number of favored alleles reached 10, which resulted in larger differences between observed and expected TKW, cannot be ignored (Fig. 6). A confounding factor was that some subgroups included only one or two varieties (Table 3).
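As an illustration only, the reported regression line can be used to compute the MTKW expected under purely additive effects and the deviation of an observed value from it. The sketch below uses the slope and intercept given in the text; the function names are hypothetical and not part of the original analysis.

```python
# Coefficients of the reported fit: MTKW = 1.294 * n + 29.33 (R² = 0.95).
SLOPE_G_PER_ALLELE = 1.294
INTERCEPT_G = 29.33


def expected_mtkw(n_favored: int) -> float:
    """MTKW (g) expected for n favored alleles under the additive linear fit."""
    return SLOPE_G_PER_ALLELE * n_favored + INTERCEPT_G


def deviation_from_additive(observed_mtkw: float, n_favored: int) -> float:
    """Observed minus expected MTKW (g); negative values fall below the line."""
    return observed_mtkw - expected_mtkw(n_favored)
```

For example, `deviation_from_additive(44.01, 15)` compares the best modern variety reported in the text (44.01 g, 15 favored alleles) with the fitted line. That is a single variety rather than a subgroup mean, so the comparison is indicative at most.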

SSR Loci Associated with TKW may Represent Major QTLs affecting Yield
According to Nordborg and Weigel [27], association mapping represents next-generation plant genetics. It uses ancestral gene associations and natural genetic diversity within a population to dissect quantitative traits, and is built upon the presence of linkage disequilibria. It offers a potentially powerful approach for mapping causal genes with modest effects [28,29]. The association results and allelic effects are influenced by population type and size, and the breeding system of the species. Core collections are very suitable for association analysis of highly heritable and domestication traits [8]. In the Chinese wheat mini core collection, the mean LD decay distance for landraces at the whole genome level was <5 cM compared to 5-10 cM in modern varieties. Only 0.05% of marker pairs in significant (P<0.001) LD reached threshold levels of r² = 0.2 [13]. The observed LD is much lower than for CIMMYT historical breeding materials, but is similar to that in a population of European varieties released since the beginning of the last century [9,30]. The overall population structure is very weak, but the two sub-populations, landraces and modern varieties, were clearly distinguished [11,13]. This separation makes the MCC population suitable for marker/trait association analysis. Earlier analyses revealed differences in regard to latitude distribution and changes over time in important genetic haplotypes, such as those of Pina and Pinb [14], Ppd-1 [15], GS2 (glutamine synthetase) [17], TaGW2 [18], and TaSus2 [10,16]. However, compared with the candidate core entries, the frequencies of predominant alleles were reduced to enable the maximum representation of allelic variation at each locus [6,11]. This likely reduced the association power, so that mainly the major QTLs were targeted [8,10,29]. This was supported by the data in Table 2, i.e. most of the associated loci were detected within QTL intervals controlling TKW.
Comparative analysis of modern varieties and landraces reveals major loci that have been almost fixed in modern varieties because of positive selection in breeding. For example, in wheat, two haplotypes coding an invertase gene on chromosome 5D were detected among 384 European wheat varieties released since the 1880s, with 382 being the same haplotype, and only two being the other. The latter would obviously have a very low chance of being detected in general association mapping populations. However, in our MCC, 58 accessions carried the above minority haplotype (Jiang YM and Zhang XY unpublished data).

Integration of Association Mapping and QTL Mapping Generates More Reliable Results
Artificial selection (domestication and breeding) leaves strong footprints in plant genomes [4,6,10,31]. Understanding the relationship between DNA sequence variation and variation in phenotypes for quantitative or complex traits will increase the speed of selection in breeding programs and help in predicting adaptive evolution [32]. Both linkage and association mapping aim to identify markers sufficiently closely linked to functional sequence variations (causal genes) encoding changes in phenotype, allowing breeders to select and manipulate these alleles routinely in diverse breeding populations [29].
Localization and interpretation of QTLs and associated loci provide confidence in results from association analysis [6,27,32]. In soybean, a high correlation (R² = 0.83) between the distribution of SSR markers and genes suggested close association of SSRs with genes [33]. This suggests that SSR markers are suitable for association analyses. Most of the associated markers were found in genomic regions where genes or quantitative trait loci (QTL) influencing the same traits were found previously. This provides an independent validation of the approach. Additionally, new chromosome regions for TKW were identified in the wheat genome through association analysis. Overall, 22 SSR loci on 11 chromosomes were associated with TKW with high confidence. This is much greater than the number of QTLs mapped in any biparental population, indicating the dissection power of this methodology in natural populations (Table 2) [34-36]. After genotyping 254 loci in 194 F7 recombinant inbred lines, Groos et al. [37] detected nine chromosome regions controlling TKW (chromosomes 1D, 2B, 2D, 3A, 5B, 6A, 6D, 7A, 7D), of which three QTLs, on chromosomes 2B (Xgwm148-Xgwm374-Xgwm388), 5B (Xgwm639-Xgwm271-Xgwm604) and 7A (Xcfa2049-Xbcd1930), were detected in six environments. These regions are largely consistent with our association results (Table 2). The QTL on 7A mapped to the middle to terminal region of 7AL, and partially overlapped the region wmc17-cfa2257 detected in the present study. QTL controlling TKW were also detected at a homoeologous region of 7DL [37]. Furthermore, the association mapping result for this region is much more precise than that from QTL mapping, the genetic distance between the two nearest markers being only 2.72 cM (Table 2, http://www.shigen.nig.ac.jp/wheat/komugi/maps/markerMap.jsp). This raises the question of whether a single causal gene is involved. The r² value between the two markers is about 0.1 in the MCC.
Thus there may be two linked causal genes, a possibility that is consistent with the obvious difference in geographic distribution of favored alleles at the two loci (Fig. 5). Similarly, gwm312 and gwm372 on chromosome 2A, which were in weak LD (r² = 0.23) in the MCC population, may also reflect effects of two causal genes. These examples illustrate how haplotype and LD analyses enable dissection of yield QTLs in practice [10].
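To make the r² statistic used in this argument concrete, a minimal sketch follows. It assumes inbred lines and recodes each multi-allelic SSR locus as presence (1) or absence (0) of one allele of interest, which is a simplification and not necessarily how the LD values reported here were computed; the function name is hypothetical.

```python
from typing import Sequence


def ld_r2(locus_a: Sequence[int], locus_b: Sequence[int]) -> float:
    """r² between two loci coded 1/0 (allele present/absent) per inbred line.

    Uses r² = D² / (pA (1 - pA) pB (1 - pB)) with D = pAB - pA * pB.
    """
    if len(locus_a) != len(locus_b) or len(locus_a) == 0:
        raise ValueError("loci must be scored on the same non-empty set of lines")
    n = len(locus_a)
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    # Frequency of lines carrying the allele of interest at both loci.
    p_ab = sum(a * b for a, b in zip(locus_a, locus_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r² is undefined when a locus is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denom
```

Values near 0, such as the 0.10 reported for wmc17 and cfa2257, indicate that carrying the allele at one locus says little about the allele carried at the other.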
In another comprehensive QTL mapping report based on 12 data sets obtained over three years of trials with 2-5 environments/year, Snape et al. [38] detected seven relatively stable QTLs controlling TKW in 11 DH populations. These QTLs were distributed on chromosomes 2A (gwm445), 2B (gwm148), 2D (wmc41), 3A (gwm428-psp3001), 5A (gwm293), and 6A (wmc32, gwm518). The gwm445-associated QTL was not detected in the MCC population, but was detected in the core collection (1,160 entries) with a 2.89 g increase in TKW (Zhang and You unpublished); gwm445 is very close to an almost orthologous region of chromosome 2D marked by wmc41. Both gwm148 on 2B and gwm275 on 2A mapped to orthologous regions detected in our study (Table 2). Loci gwm55-6D and gwm415-6B associated with TKW may be homoeologous to a QTL on 6A flanked by wmc32 and gwm518 in the pericentromeric region, in which TaGW2 is located [18] (http://wheat.pw.usda.gov/ggpages/SSRclub/GeneticPhysical/). In addition, distinct changes in frequencies of SSR alleles between the landraces and modern varieties at the 22 loci caused by hitchhiking effects provided positive
Table 3. Number, frequency and mean MTKW of landraces and modern varieties in the mini core collection.

Linear Correlation between TKW and Favored Alleles Showing the Practical Value of Genome Selection in Breeding
Compared with QTL mapping, another attribute of association analysis is the validation of favored alleles in germplasm collections [8]. For example, Röder et al. [39] mapped a major TKW QTL to the interval Xgwm295-Xgwm1002 located in the distal telomeric bin (7DS4-0.61-1.00) in the physical map of wheat chromosome 7DS. Zhang et al. [6] found that allele Xgwm130₁₃₂ underwent very strong positive selection during modern breeding. Xgwm130 maps between Xgwm295 and Xgwm1002, with a genetic distance of 1.1 cM from Xgwm295. Thus the identification of favored alleles will help in choosing parents for crossing programs, to ensure maximum levels of favored alleles across sets of loci targeted for selection, and to promote fixation at these loci [40].
Whereas linear correlations between TKW and favored alleles indicate the additive effects of QTLs or genes, the possibility of other genetic effects should not be ignored in practice. Higher standard errors when the number of favored alleles exceeds 10 (Figure 6) suggest the possibility of threshold effects with excessive numbers of favored alleles. Another cause of the higher standard errors was that far fewer varieties carried more than 10 favored alleles (Fig. 4).
The concept of genome-wide selection (GWS) was recently introduced in plant breeding; this method uses information from all markers, as opposed to significant markers, to evaluate the breeding value of each line [41,42]. Frisch et al. [43] used transcription data from 46,000 oligonucleotide arrays to develop a prediction model for the value of parental maize lines in relation to the grain yield performance of their hybrid progeny. They found that predictions based on 50 well chosen genes were as accurate as predictions based on 5,000 random genes. Therefore, the combination of GWA and GWS will in future enhance the practical application of GWS in crop improvement [29]. This work paves the way for further targeted diversity mining in landrace populations and wild relatives via comparative genomics analysis. The most interesting example is that genes on a Thinopyrum ponticum group 7L chromosome enhance grain yield by 13% in the genetic background of newly released varieties [44,45]. The 7L gene may be orthologous to the TKW chromatin block flanked by wmc17 and cfa2257 on 7AL (Table 2) [46]. These examples indicate that increased grain weight in wheat is feasible using genomic selection.

Frequency and Geographical Distribution of Favored Alleles Indicate Potential for Yield Increases by Selection of Loci Associated with TKW
In wheat, some genes or SSR loci associated with yield vary across latitudes, such as TaSus2 on chromosome 2B [16], TaGW2 on 6A [18] and gpw7596 on 7B (EST-SSR) [47]. Favored alleles usually occur at relatively lower latitudes. This might indicate that the functional genes at these loci, including mapped alleles and those linked with markers, might be responsive to sunlight and temperature during the growing season [48,49]. None of the 6 SSR loci with determination coefficients higher than 10% associated with favored MTKW alleles cfa2234₁₄₂ (3AL), gwm156₃₁₁ (3BS), barc56₁₁₉ (5AS), wmc17₁₈₂ and wmc17₁₈₄ (7AL) and cfa2257₁₂₉ (7AL) had obvious correlations with latitude (Fig. 5, Fig. S1). They can therefore be used globally for increasing TKW. None of the 88 genotyped modern varieties, and 17 introduced lines, carried favored alleles at all 22 loci, and only one variety had 15 favored alleles (Table 3, Fig. 4). Therefore, there are still opportunities for marker-assisted selection for TKW in wheat breeding.

Phenotypic Assessment
A Chinese wheat mini core collection [6,11,13] was chosen for genome-wide association of 1,000-kernel weight (TKW) using SSR markers. The MCC contained 262 wheat lines including 157 landraces, 88 modern varieties, and 17 introduced lines representing 1% of the national collection, but more than 70% of its genetic diversity [11]. The phenotype data were collected in five environments, viz. 2002, 2005 and 2006 in Luoyang, Henan province, and 2010 in both Shunyi, Beijing, and Qingdao, Shandong. The field planting design and methods of TKW measurement were described in Su et al. [18] and Jiang et al. [16]. Mean values of TKW and standard errors were calculated with SPSS 16.0 (http://www.brothersoft.com/downloads/spss-16.html). The mean TKW across trials (MTKW) was estimated by the best linear unbiased predictor (BLUP) method according to Bernardo [50-52].
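The exact BLUP model is not specified beyond the citation, so the following is only a minimal sketch of the idea under stated assumptions: a balanced one-way random model (line effects random, trial effects ignored), variance components estimated by the ANOVA method of moments, and each line mean shrunk toward the grand mean by the factor σg²/(σg² + σe²/n). The function name and data layout are hypothetical.

```python
from statistics import mean


def blup_line_means(tkw: dict[str, list[float]]) -> dict[str, float]:
    """Shrunken (BLUP-style) mean TKW per line from balanced multi-trial data.

    `tkw` maps a line name to its TKW values, one per trial. Trial effects
    are NOT modelled here; this is an illustrative simplification.
    """
    trial_counts = {len(values) for values in tkw.values()}
    if len(trial_counts) != 1:
        raise ValueError("balanced data required: same number of trials per line")
    n = trial_counts.pop()
    g = len(tkw)
    if n < 2 or g < 2:
        raise ValueError("need at least two lines and two trials")

    line_means = {name: mean(values) for name, values in tkw.items()}
    grand = mean(list(line_means.values()))

    # Mean squares between lines and within lines (error).
    ms_line = n * sum((m - grand) ** 2 for m in line_means.values()) / (g - 1)
    ss_error = sum((y - line_means[name]) ** 2
                   for name, values in tkw.items() for y in values)
    ms_error = ss_error / (g * (n - 1))

    # Method-of-moments variance components; negative estimates set to zero.
    var_line = max((ms_line - ms_error) / n, 0.0)
    denom = var_line + ms_error / n
    shrink = var_line / denom if denom > 0 else 0.0

    return {name: grand + shrink * (m - grand) for name, m in line_means.items()}
```

With consistent trials the shrinkage factor is close to 1 and the predictions stay near the raw line means; noisy trials pull every line toward the grand mean.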

SSR Genotyping
Genomic DNA was extracted from young leaves of 10 seedlings of each entry according to Sharp et al. [53] and fingerprinted by PCR amplifications that identified alleles at 531 SSR loci. Genetic map positions for most of the markers (512 loci) can be found in Hao et al. [13]. The loci were distributed evenly across all 21 wheat chromosomes. The primer sequences and genetic locations of the loci were obtained from http://www.shigen.nig.ac.jp and http://wheat.pw.usda.gov [54,55]. The annealing temperature for each primer pair was obtained from Röder et al. [54] and GrainGenes (http://wheat.pw.usda.gov). After purification, the amplified PCR products were separated on an ABI3730 DNA Analyzer (Applied Biosystems, Foster City, CA, USA). Fragment sizes were determined using an internal size standard (LIZ500, ABI, USA), and the outputs were analyzed using GeneMapper software (http://www.appliedbiosystems.com.cn/). A minor allele frequency (MAF) threshold of 0.05 was applied in the subsequent statistical analyses.

Association Analysis
To reduce the risk of false or spurious associations, population structure was estimated by STRUCTURE v2.2 software according to Pritchard and Rosenberg [56] and Pritchard et al. [57], based on 42 unlinked loci from both arms of each chromosome, with a burn-in period of 50,000 iterations followed by 500,000 Markov Chain Monte Carlo (MCMC) replications. A total of 80 independent runs were performed with the number of presumptive groups (K) varying from 1 to 10. To select the most appropriate number of sub-groups, the ΔK value, based on the average Ln probability of the data of each run, was calculated, allowing the internal population structure of the sample set to be determined [58]; Q data were then obtained for the corresponding K value.
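The ΔK criterion can be sketched as follows, assuming the commonly used form ΔK = |L(K+1) − 2L(K) + L(K−1)| / s[L(K)], where L(K) is the mean Ln probability over replicate runs at a given K and s is the standard deviation across those runs. The function name and input layout are hypothetical, and how the 80 runs were divided among K values is not stated in the text.

```python
from statistics import mean, stdev


def delta_k(ln_prob: dict[int, list[float]]) -> dict[int, float]:
    """ΔK per K from replicate Ln-probability values of STRUCTURE-like runs.

    `ln_prob` maps K to the Ln probability of the data from each replicate
    run (at least two runs per K are needed for a standard deviation).
    """
    result = {}
    for k in sorted(ln_prob):
        # ΔK needs both neighbours, so it is undefined at the smallest
        # and largest K tested.
        if k - 1 not in ln_prob or k + 1 not in ln_prob:
            continue
        sd = stdev(ln_prob[k])
        if sd == 0:
            continue  # ratio undefined when replicate runs are identical
        second_diff = abs(mean(ln_prob[k + 1])
                          - 2 * mean(ln_prob[k])
                          + mean(ln_prob[k - 1]))
        result[k] = second_diff / sd
    return result
```

The K with the largest value, for example `max(dk, key=dk.get)` on the returned dictionary `dk`, is taken as the most likely number of sub-groups. Because the statistic needs K−1, it cannot be evaluated at K = 1.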
In order to define the degree of genetic covariance between pairs of individuals, a kinship (K) analysis was conducted on the genotypic data with SPAGeDi software [59]. Pairwise kinship coefficients were calculated according to Loiselle et al. [60] with 10,000 permutation tests. Negative values between individual pairs were then set to 0, as these indicated that the individuals were less related than random individuals [21]. The mixed linear model (MLM) module with Q+K of the TASSEL 2.1 software package (http://www2.maizegenetics.net/) [61,62] was used for genome-wide association of MTKW and TKW in each trial. The relative value of the favored allele for TKW ($R^2$) was calculated according to the equation $R^2 = (SS_A - f_A \times MSE)/SS_T$, where $SS_A$ is the sum of squares between the group with favored alleles and the others, $f_A$ is the degrees of freedom of the group with the favored alleles, $MSE$ is the error mean square, and $SS_T$ is the total sum of squares [62,63].
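The two calculations described above (zeroing negative kinship coefficients and computing R² for a favored allele) can be sketched as follows. This is an illustration under assumptions, not the TASSEL or SPAGeDi implementation: lines are split into two groups (favored allele versus all others), MSE is the pooled within-group mean square, and the degrees-of-freedom term f_A is passed in explicitly because its exact definition is only briefly described in the text. All function names are hypothetical.

```python
def clamp_kinship(kinship: list[list[float]]) -> list[list[float]]:
    """Return a copy of the kinship matrix with negative coefficients set to 0."""
    return [[max(value, 0.0) for value in row] for row in kinship]


def favored_allele_r2(favored: list[float], others: list[float],
                      f_a: float) -> float:
    """R² = (SS_A - f_A * MSE) / SS_T for a two-group comparison of MTKW.

    `favored` and `others` hold MTKW values of lines with and without the
    favored allele; `f_a` is the degrees-of-freedom term (caller-supplied).
    """
    values = favored + others
    n = len(values)
    if len(favored) == 0 or len(others) == 0 or n < 3:
        raise ValueError("need two non-empty groups and at least three lines")
    grand = sum(values) / n
    ss_total = sum((y - grand) ** 2 for y in values)
    if ss_total == 0:
        raise ValueError("R² is undefined when all MTKW values are identical")
    # Between-group sum of squares.
    ss_a = sum(len(group) * (sum(group) / len(group) - grand) ** 2
               for group in (favored, others))
    # Pooled within-group (error) mean square with n - 2 degrees of freedom.
    mse = (ss_total - ss_a) / (n - 2)
    return (ss_a - f_a * mse) / ss_total
```

As a usage example, `favored_allele_r2([40, 42, 44], [30, 32, 34], f_a=1)` uses made-up MTKW values and takes f_A as the between-group degrees of freedom, which is an assumption about the intended definition.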
Because modern varieties usually have fewer alleles than landraces, the frequencies of most remaining alleles would be expected to increase in modern varieties [13]. To avoid circular reasoning in data interpretation, we randomly selected one locus on each arm of the 21 chromosomes, with PIC values higher than the global mean (0.65), for comparing changes in major allele frequencies between the two sub-populations at loci significantly associated with MTKW and at loci probably not affected by selection in domestication and breeding [64] (Table S2, Table S3). We used F-tests and t-tests in SPSS15.0 to estimate differences in allelic frequencies between the landrace and modern variety groups.