Identification of Associated SSR Markers for Yield Component and Fiber Quality Traits Based on Frame Map and Upland Cotton Collections

Detecting QTLs (quantitative trait loci) that enhance cotton yield and fiber quality traits and accelerate breeding has been the focus of many cotton breeders. In the present study, 359 SSR (simple sequence repeat) markers were used for the association mapping of 241 Upland cotton collections. A total of 333 markers, representing 733 polymorphic loci, were detected. The average linkage disequilibrium (LD) decay distances were 8.58 cM (r2 > 0.1) and 5.76 cM (r2 > 0.2). 241 collections were arranged into two subgroups using STRUCTURE software. Mixed linear modeling (MLM) methods (with population structure (Q) and relative kinship matrix (K)) were applied to analyze four phenotypic datasets obtained from four environments (two different locations and two years). Forty-six markers associated with the number of bolls per plant (NB), boll weight (BW), lint percentage (LP), fiber length (FL), fiber strength (FS) and fiber micornaire value (FM) were repeatedly detected in at least two environments. Of 46 associated markers, 32 were identified as new association markers, and 14 had been previously reported in the literature. Nine association markers were near QTLs (at a distance of less than 1–2 LD decay on the reference map) that had been previously described. These results provide new useful markers for marker-assisted selection in breeding programs and new insights for understanding the genetic basis of Upland cotton yields and fiber quality traits at the whole-genome level.


Introduction
Cotton is an important industrial crop in China. Many cotton breeders have focused on detecting and using marker-associated quantitative trait loci (QTLs) for marker-assisted selection (MAS) in breeding programs. Linkage analysis is a classic strategy for detecting QTLs in segregated populations derived from two inbred lines. Since Shappley [1] first reported QTLs associated with the agronomic and fiber traits of Upland cotton, thousands of QTLs have been identified through segregation analyses in cotton [2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Two population types have been used in these QTL mapping studies: populations derived from interspecies crosses between Gossypium hirsutum and Gossypium barbadense and populations derived from intraspecies crosses within G. hirsutum. Most QTLs and linkage markers detected based on these interspecies populations are difficult to directly utilize for MAS because the Upland cotton varieties or lines are major material resources in breeding programs. However, when QTL and linkage markers were detected in intraspecies populations, only a few genomic areas can be scanned because of the low number of polymorphisms between Upland cotton varieties, and these results are only suitable for breeding populations derived from QTL-detected populations. For better understand the genetic basis of interesting traits in different breeding materials, such as the QTL distributions, configurations, and the percentage contribution to phenotypic variation, cotton breeders need to employ new analysis method.
In recent years, association mapping based on disequilibrium analysis has been introduced into plant QTL mapping. The new mapping strategy provided a powerful method for QTL mapping of germplasm populations. Compared with QTL mapping based on linkage analysis, association mapping has many advantages, including a higher resolution, increased genome coverage, lower time and money consumption, and reduced risk. Abdurakhmonov et al. [16] first conducted association mapping in which association between SSR (simple sequence repeat) markers and fiber quality traits was detected based on a germplasm resource population comprising 208 landrace stocks and 77 photoperiodic variety accessions, and a core set of 95 microsatellite markers. Abdurakhmonov et al. [17] also conducted genome-wide linkage disequilibrium (LD) scanning and association mapping based on a panel consisting of 334 G. hirsutum variety accessions from Uzbek, Latin American, and Australian ecotypes. In two environments, an average of 20 SSR markers were found to be associated with the main fiber quality traits using a unified mixed liner model (MLM) incorporating population structure and kinship, and 12-22 SSR markers were associated with fiber length, fiber strength, fiber fineness and six other fiber quality traits. Approximately 25% to 54% of these markers had previously been detected in studies based on linkage analysis. Zeng et al. [18] identified associations between SSR markers and fiber traits using an exotic germplasm population derived from species polycrosses (SPs) among tetraploid Gossypium species. A total of 202 fragments were analyzed, and fifty-nine markers showed a significant association with six fiber quality traits. These studies confirmed the feasibility of applying association analysis to explore complex traits in Upland cotton collections.
Following system and cross selection, the Upland cotton varieties found in China were demonstrated to show distinct characteristics. Generally, Chinese Upland cotton varieties are typically classified into three ecotypes: the Yellow River valley type, the Yangtze River valley type and the interior land type, according to the areas in which cotton was planted and cultivated. The Yellow River valley type is characterized by high disease resistance and high yields, while the Yangtze River valley type exhibits a high lint percentage or large bolls. Additionally, the interior land type shows adaptation to long days and short growing seasons in high-latitude areas. Furthermore, a large number of germplasm resources, including high lint percent and fiber quality lines, have been developed through cotton breeding. These varieties and germplasm resource lines have provided important materials for improving the yields and fiber quality of Upland cotton varieties in China. Zhang et al. [19] performed general linear model (GLM) association mapping of 12 agronomic and fiber quality traits based on 121 SSR markers and 81 G. hirsutum L. collections, and detected 180 loci that were significantly associated with 12 traits in more than one environment. Mei et al. [20] conducted association mapping of yields and yield component traits using 356 representative Upland cotton cultivars and 145 polymorphism markers. Cai et al. [21] performed association mapping of fiber quality traits in 99 G. hirsutum L. collections with 97 polymorphic microsatellite marker primer pairs. Zhao et al. [22] carried out association mapping based on Verticillium Wilt Resistance using a collection of 329 cotton (G. hirsutum L.) accessions obtained from a Chinese cotton germplasm collection. The results of these studies indicated the feasibility of applying association analysis to explore complex traits in Upland cotton collections in China. To better understand the genetic foundation of the yield and fiber quality traits at the population level and identify associated SSR markers, we performed whole-genome association analyses using 359 SSR polymorphism markers well distributed in reference maps [23,24] and a panel of 241 varieties and germplasm resource lines in the present study.

Selection of accessions and determination of phenotypic data
A total of 241 Upland cotton accessions were selected for genotype screening and evaluation of yield components and fiber quality traits to identify loci associated with yield components and fiber quality QTLs. All of the collections were derived from four sources: ① elite varieties popularly cultivated in China; ② germplasm resource lines with outstanding yield components or fiber qualities; ③ parental lines that are typically used in breeding programs; and ④ historical varieties and germplasm resources lines from abroad, including 20 collections from the US, 6 from the Uzbek, 6 from the Sudan, one from Australia and one from Cuba (S1 Table). , and lint percentage (LP). Ten plants growing near to each other were selected to count the total number of bolls. The average number for ten plants was scored as NB. Twentyfive bolls from each plot were weighed to determine the BW and then ginned by roller gin to evaluate the LP. Fifteen grams of lint was sent to the Supervision and Testing Center of cotton quality, the Ministry of Agriculture to measure the fiber quality. The following fiber quality traits were evaluated using the High-Volume Index (HVI) spectrum: 2.5% fiber span length (FL, mm), fiber strength (FS, cN/tex), and the micronaire reading (FM).

SSR markers and genotyping
In 2010, the young, not yet fully expanded leaves were collected from five plants of each line. DNA was extracted from the leaves as previously described [25]. A total of 359 polymorphic SSRs were used to genotype the 241 Upland cotton collections. The 359 SSRs included three resources: ① 302 SSRs separated by a distance of approximately 10.0 cM on all 26 chromosomes (A1-A13, D1-D13), and covered 94.6% (3241.3 cM/3425.8 cM) of the reference map [23,24]; ② 27 markers separated by approximately 1-3 cM distance on the 50.0-80.0 cM area of A1 and D1 chromosomes; and ③ 30 markers linked to the QTLs of three yield components traits (NB, BW, and LP) and three fiber quality traits (FL, FS, and FM). The SSR primer sequences used in these analyses were obtained from the Cotton Microsatellite Database (CMD, http:// www.cottonmarker.org/). The marker nomenclature consisted of a letter that specified the origin of the marker followed by the primer number. As previously described [26], the SSR analysis was conducted by polymerase chain reaction (PCR) and 6% non-denaturing polyacrylamide gel electrophoresis (PAGE). PCR runs were performed for 30 cycles of 45 s at 94°C at the annealing temperature for 45 s and 72°C for 60 s, and a final extension step at 72°C for 5 min. For each SSR primer, the polymorphic bands were identified according to the fragment size. The presence of polymorphic DNA fragments was scored as 1, and the absence of fragments was scored as zero. Multiple polymorphic DNA fragments presented or absented together in panel were identified as same marker locus. For the STRUCTURE software, "1" indicates fragments present, "0" indicates absent, and "-1" indicates missing data. For the Tassel software, "2/2" indicates fragments present, "1/1" indicates absent, and "0/0" indicates missing data.

Data analysis
LD values (r 2 and p value) between marker fragments were calculated using TASSEL 3.0 software [27]. The genetic distances between marker pairs were calculated based on the position of these markers on the genetic map [23,24]. Minor loci with a frequency < 0.05 were filtered out to reduce problematic and biased LD estimations between pairs of loci [28,29]. The r 2 values for pairs of SSR loci were plotted as a function of map distances, and LD decay (r 2 < 0.1) was estimated using the average distances of marker pairs showing LD values lower than 0.1 [30].
Analysis of variance (ANOVA) for the phenotypic data was conducted using the Statistical Analysis System (SAS8.1, Cary, NC). The broad-sense heredity of the six traits was calculated using the following equation: where σ 2 e is the residual variance component, and σ 2 G is the genotypic variance component. The population structure was analyzed using STURCTURE 2.2 software [31,32,33], with a running time of 100,000 and 50,000 replications after burn-in. Models for admixture and correlated allele frequencies were employed in the population structure analysis. The pairwise kinship of all 241 collections was calculated using TASSEL 3.0 software [27]. The MLM association analysis of the yield components and fiber quality traits was performed with TAS-SEL 3.0 software, incorporating filtered marker data and the K and Q matrices. We also performed GLM association analyses using the same four datasets, incorporating pairwise kinship information as a covariate and 1,000 permutations for the correction of multiple testing. To make up for the deficiency of using p-values in association, significant MLM associations (p < 0.05) across more than two environments were ranked, and the significance of these markers (p < 0.05) in the permutation test was compared using GLM association tests. The p-values derived from the MLM and GLM analyses were also separately tested using the positive false discovery rate (pFDR) test [34] for multiple testing corrections. The minimum Bayes factor (BFmin) was calculated using following formula: BFmin = -e Ã p Ã ln(p) [17,35,36].

Amplification fragment polymorphisms of the SSR markers
A total of 359 SSR markers were used to genotype 241 collections, among which 26 (7.2%) of the markers presented homomorphisms, and 333 markers covering 86.6% reference map (2968 cM/3425.8 cM) produced 733 polymorphic loci, averaging 2.2 loci per marker. The observed locus frequencies ranged from 50.19% to 99.62%, averaging 78.17%. The average genetic diversity was 0.358 (ranging from 0.008 to 0.802). The average polymorphism information content (PIC) was 0.300 (ranging from 0.008 to 0.773).

Population structure and LD of the marker pairs
The population structure was determined using STRUCTURE software, with K values ranging from 1-10. The LnP(D) value increased continuously with no obvious inflexion point before the panel was divided into 9 subgroups. However, the Δk value decreased rapidly at K = 2 and K = 3, and the locus frequency divergence among the subpopulations (Net nucleotide distance) was significant at k = 2, but not at k = 3. Fig. 1 shows that Δk presented a second peak for K = 9, indicating that this panel could be continuously further divided until into 9 subgroups. Pritchard et al. [31] suggested focusing on values of K that capture most of the structure in the data and that seem biologically sensible when the model choice criterion continues to increase with increasing K. To avoid an overcorrected population structure that would lead to the disappearance of the association loci in the association analysis [37], we adopted K = 2, not 9. The first subgroup contained 120 collections, comprising the majority of the elite varieties and parental lines that are typically used in breeding programs from the Yangzi river valley. The second subgroup included 121 collections consisting of germplasm resources lines, historical varieties from abroad, and the majority of the elite varieties and parental lines from the Yellow river valley (S1 Table). Cluster analysis of 241 Upland cotton collections showed majority of subgroup 1, as well as majority of subgroup 2, was clustered together (S1 Fig.).
Approximately 9.36% of the marker pairs showed significant LD, with p values lower than 0.05 (S2 Table). Approximately 18.90% of the collinear marker pairs showed significant LD, and 40.50% of the obtained LD values (r 2 ) were greater than 0.1. Approximately 8.87% of the non-collinear marker pairs showed significant LD, and 3.45% of the LD values (r 2 ) were greater than 0.1. Most of the significant LD values were higher than 0.2 were obtained from collinear marker pairs (S2 Fig., S2 Table). The LD value (r 2 ) decreased rapidly at genetic distances of less than 10 cM. The longest genetic distance between markers was 108 cM. The average genetic distance between markers was 8.58 cM and 5.76 cM for r 2 > 0.1 and r 2 > 0.2 (Fig. 2).

Performance of phenotype and Broad-sense heritability
For all yield and fiber quality traits, the 241 collections presented a wide-range of phenotypic variation in the four different environments (S3 Table). For example, in environment 1 (E1), the NB, BW and LP ranged from 6. 49 (Table 1).
Among the six evaluated traits, LP showed the highest broad-sense heritability, ranging from 0.67 to 0.81. FS and FL exhibited heritabilities higher than 0.5 in three environments and lower than 0.5 in one environment. The broad-sense heritability of the FM was lower than 0.5 in three environments and higher than 0.5 in one environment. The NB and BW showed lower heritabilities compared with the other traits across the four environments, ranging from 0.33-0.42 (Table 2, S4 Table).

Association mapping
For all six traits, including the three yield component traits (NB, BW and LP) and the three fiber quality traits (FL, FS and FM), we applied an MLM (+ kinship + Q-matrix) model to analyze the four datasets derived from the 241 collections at two locations over two years. Only markers showing significance in more than one environment were used to further test Twenty markers tolerated the FDR test in one or more environments, including 12 for yield traits and 8 for fiber quality traits. Fifty one marker loci (25 for yield traits and 26 for fiber quality traits) presented moderate-to-strong or strong-to-very strong evidence for association in different environments. Sixteen markers for yield traits and nine for fiber quality traits passed the permutation test at the 0.05 level in the GLM analysis. In total, forty-six markers associated with different traits were accepted in our analysis (Table 3 and Table 4).
We compared the associated markers identified in the present study with SSR markers previously identified through linkage QTL and association mapping analyses [39][40][41][42][43]. Among 46 markers, 14 were found to be associated or linked with the same traits (LP, FL, FS and FM) identified in previous studies (Table 5). Of the 14 markers, five were associated with LP, four were associated with FL and FS respectively, and one was associated with FM. Because the different markers were used in different studies, only a few markers could be directly compared. Therefore, we also employed the reference map as a bridge to compare the results obtained in the present study with the results from previous studies [39][40][41][42][43]. Nine markers were found to be near the QTLs controlling the same traits with a distance of less than 1-2 LD decay on the reference map (Table 6).

Discussion
Genetic diversity and population structure To maintain relatively high levels of polymorphism and to take advantage of association mapping, different ecotypes from China, including lines from cotton germplasm resources, historical varieties from abroad (the Uzbek, the US, Australia, Cuba and Sudan), mutants lines derived from radiation breeding programs, and some progenies of intra-and interspecies  [20], respectively. A low genetic diversity was not only found in Chinese Upland cotton collections but also in American Upland cotton collections [38] and other country's collections [44]. Population structure is an important factor that typically leads to spurious associations. Although the genetic background of Upland cotton is narrow, recent studies have revealed the population structure in association panels for Upland cotton [20-22, 38, 44]. Of 241 collections, 127 came from the Yangzi River valley, and 76 came from the Yellow River valley. In the present study, 73.2% (93/127) of the germplasm resources, varieties and breeding lines from the Yangtze River valley were classified into the P1 sub-group. A total of 75.0% (57/76) of the  Table 3 doi:10.1371/journal.pone.0118073.t004 germplasm resources, varieties and breeding lines from the Yellow River valley were classified into the P2 sub-group. The results revealed that the major differences in this panel came from the different ecotypes. However, 25.0% (19/76) of the collections from the Yellow River valley and 26.8% (34/127) of the collections from the Yangzi River valley were not arranged into corresponding subgroups. The fact indicates that there is still frequent gene exchange between different ecotype collections in China (S1 Table). These results were consistent with the results of previous studies [45] and recent reports [20][21][22]. Evanno et al. [46] conducted population structure analyses using three classic models: the island model, the hierarchical island model and the contact zone model, and K = 2 corresponds to the uppermost structural level in the contact zone model. In this study, population structure was similar with that of the contact zone model. The result was consistent with the fact that China is not a native cotton growing area. Most cotton varieties planted in China are derived from only a few germplasm resources (e.g., DPL, Stoneville, King, Uganda, Foster, and Trice) introduced from abroad [47].  [20], the ratio of LD was low and similar to the findings of Zhao [22]. Among the collinear marker pairs, 29.2% showed LD values (r 2 ) greater than 0.2. For the non-collinear marker pairs, this ratio was 0.5% (S2 Table). Further examination of the LD data revealed that approximately 80.5% of moderate LD (0.2 < r 2 < 0.4) and 91.5% of strong LD (r 2 > 0.4) was caused by linkage. Our results also showed that approximately 43.6% of moderate LD (r 2 > 0.1) was caused by other factors in this panel. LD resulting from noncollinear marker pairs has been previously described [16,20,22,44]. Abdurakhmonov [16] provided several possible explanations for LD between non-collinear markers, including selection, co-selection of loci, population stratification, and relatedness, genetic drift or bottlenecks. These elements might also generate LD values leading to spurious marker-trait associations [48][49][50], indicating the necessity of seriously considering population structure (Q) and relatedness (K) when conducting population-based association mapping in cotton germplasm resources [16].
In the present study, the observed LD value (r 2 ) rapidly decreased when the genetic distance was less than 10 cM. The speed of population LD decay was 8.58 or 5.76 cM for r 2 > 0.1 or 0.2, respectively. The LD decay block was similar to that described in recent association analysis studies [20][21][22] but faster than that described in studies using landrace [16,17] and SP panels [18]. We selected markers that were spaced approximately 10 cM apart from the frame linkage map [23,24]. Because of the shortage of polymorphism markers, there were some gaps of more than 15 to 46.8 cM along the 26 chromosomes. Although more markers are needed to conduct genome-wide association analyses (GWAS) of complex traits, the size of the LD blocks would guarantee that the identified SSR markers would be sufficient for MAS in Upland cotton breeding programs because increasing the number of markers per chromosome does not necessarily result in a stronger response to selection, particularly at a shorter distance between markers, such as 10 cM for an F 2 population of 500 individuals [51].

QTLs obtained through association mapping
To avoid spurious associations, different methods have been developed to control population structure, such as structured association (SA) [48], genomic control (GC) [52], EIGENSTRAT [53], stepwise regression (SWR) [54] and mixed linear models (MLM) [55]. To generate more accurate correlations with less-inflated type I errors [55], the MLM (+K+Q) method was employed in the present study. Considering the history of Upland cotton cultivation and the relatively simple population structure in this panel, GLM (+K) was also employed in the present study, and the results derived from the GLM and MLM were compared. For all six traits, the GLM (p < 0.05) detected 216 associated markers, and 155 markers were detected in more than one environment. The MLM (p < 0.05) identified 195 associated markers, and 84 markers were detected in more than one environment. After the correction of population structure using Q-matrix information, approximately 50% of the markers were not repeatedly detected in the MLM compared with the GLM, suggesting that the population structure should be seriously considered in stratified populations [17]. However, comparing the results obtained from the GLM and MLM provides more information. In the present study, all of the associated markers detected through the MLM were associated with the same traits in the GLM analysis across two to four environments. Notably, for the same traits, we compared the map positions of the associated markers derived from the GLM and MLM analyses, and we found more than one associated markers from the GLM were close to (within one or two LD blocks) associated markers detected using the MLM. This observation provided more support for the validity of the MLM results [17].
Interestingly, out of the 46 associated markers detected, some nearby markers (map distance within 1 LD block) were associated with the same traits. For example, both of NAU3736 and CIR307 on D1 were associated with FS and NB, JESPR101 and NAU3700 on D3 were associated with FM (S3 Fig.). These nearby markers might associate with the same QTL allele with a high probability.
Comparing the results derived from different populations or using different analytical approaches for cotton QTL detection provides more information for interpreting the results of the present study. Among all 46 markers associated with yield and fiber quality traits, 14 markers associated with the same traits were identified in previous studies (Table 5). Thirteen markers were detected through linkage analysis, and three markers were detected via association analysis. When we employed the reference map as a bridge to compare the results of the present study with those from previous studies, the 9 associated markers identified were near the QTL-linked/associated markers controlling the same traits identified in other reports, at distances of less than 1-2 LD decay on the reference map (Table 6). Considering the different markers used in the prior studies and the precision of QTL detection, these nearby marker pairs should be linked to the same QTLs reported.
MLM analysis generates more accurate correlations with less-inflated type I errors. However, significant MLM-derived associations are subjected to multiple testing corrections. The results of correction for multiple testing could be misleading due to the unknown influence of pvalue adjustment methods applied under the MLM approach [17]. Perhaps a modified statistical approach should be applied to adjust MLM p-values, though answering this question will require further studies [17]. In the present study, to maintain low false positive results, we employed four environmental datasets and four different significance tests (p-value, BFmin, FDR and permutation testing). Although most of the associated markers did not tolerate multiple testing for the FDR, the results of the present study obtained using the MLM method were supported by the BFmin, FDR and permutation testing from the GLM analysis as well as the findings of previous studies. These results exhibited a relatively high confidence level and can be considered for use in MAS programs.
To date, few SSR markers have been efficiently employed in MAS programs in cotton because the majority of available marker information was derived from populations resulting from bi-parental crosses with limited genetic backgrounds, covering only a few meiotic events since experimental hybridization [17]. Recent association mapping of Upland cotton collections confirmed the feasibility of applying association analysis to explore complex traits in Upland cotton collections and provided useful markers for marker-assisted breeding programs [18][19][20][21][22]. Similar to linkage mapping, association mapping using different materials harboring different genes and different markers can provide more information for marker-assisted breeding programs as well as insight into the genetic basis of interesting traits in Upland cotton. The results of the present study provided new useful markers for marker-assisted selection in cotton breeding programs and clues for the fine mapping of yield and fiber quality traits. These results will also enhance our current understanding of the genetic basis of Upland cotton yield and fiber quality traits at the whole-genome level.