Association Mapping for Epistasis and Environmental Interaction of Yield Traits in 323 Cotton Cultivars under 9 Different Environments

Improving yield is a major objective for cotton breeding schemes, and lint yield and its three component traits (boll number, boll weight and lint percentage) are complex traits controlled by multiple genes and various environments. Association mapping was performed to detect markers associated with these four traits using 651 simple sequence repeats (SSRs). A mixed linear model including epistasis and environmental interaction was used to screen the loci associated with these four yield traits by 323 accessions of Gossypium hirsutum L. evaluated in nine different environments. 251 significant loci were detected to be associated with lint yield and its three components, including 69 loci with individual effects and all involved in epistasis interactions. These significant loci explain ∼ 62.05% of the phenotypic variance (ranging from 49.06% ∼ 72.29% for these four traits). It was indicated by high contribution of environmental interaction to the phenotypic variance for lint yield and boll numbers, that genetic effects of SSR loci were susceptible to environment factors. Shared loci were also observed among these four traits, which may be used for simultaneous improvement in cotton breeding for yield traits. Furthermore, consistent and elite loci were screened with −Log10 (P-value) >8.0 based on predicted effects of loci detected in different environments. There was one locus and 6 pairs of epistasis for lint yield, 4 loci and 10 epistasis for boll number, 15 loci and 2 epistasis for boll weight, and 2 loci and 5 epistasis for lint percentage, respectively. These results provided insights into the genetic basis of lint yield and its components and may be useful for marker-assisted breeding to improve cotton production.


Introduction
Cotton is one of the most important economic crops in the world and provides large amounts of raw materials for textile industry. High yield is always remained the primary focus of cotton breeding programs. Boll number, boll weight, and lint percentage are three yield component traits, which are complex traits controlled by multiple genes with gene 6 gene interaction and gene 6environment interaction [1][2]. Understanding genetic architecture of lint yield and its component traits at the molecular level is very important for efficiently improving cotton yield traits. Identifying and charactering the key QTLs and genes affecting yield traits in cotton has been a research frontier over the past few decades. On the basis of linkage analysis in the QTL mapping populations, several QTLs controlling lint yield and its component traits have been obtained [1][2][3][4][5][6][7][8]. However, these detected QTLs involved in large regions of the chromosome, where candidate genes or genetic variants are difficult to distinguish. In addition, the mapping results have only limit utilization because of lower genetic diversity in the special population used for QTL mapping and can not fully elucidate the genetic basis of lint yield and its component traits.
Genome-wide association studies (GWAS), which are based on the linkage disequilibrium (LD), provide the opportunity to systematically identify the genetic components of complex traits of crops in recent years [9][10][11][12][13][14]. It has the potential to detect single nucleotide polymorphisms (SNPs) or other types of molecular markers such as single sequence repeats (SSRs) within or nearby a gene. It can extensively exploit historical recombination and natural genetic diversity in natural population, and overcome the limitations of conventional linkage mapping for crop breeding. Recently, Zhang et al performed association mapping using 81 accessions of Upland cotton based on 121 SSRs to map 12 agronomical and fiber quality traits, and identified 180 loci only with additive effects significantly associated with the traits [14]. However, current GWAS predominately focus on examination of SNPs at a single environment and fails to detect the gene 6 gene interaction as well as gene 6 environment interaction across multiple environments. The genetic variation of complex traits is contributed in part by epistasis and environmental interaction effects. Therefore, searching for only major effects may miss other key genetic variants specific to environment factors and cannot provide reliable estimates for genetic effects [3].
Due to a heavy computational burden, current methods of GWAS have not been used for detecting gene 6 gene interaction and gene 6 environment interaction for breeding populations in multiple environments. In this study, we used mixed linear model approach and a newly developed mapping software of QTXNetwork based on GPU parallel computation to identify individual and epistasis loci and their environmental interactions for lint yield and its component traits. We also identified candidate genes related to markers of these target traits. These results and genetic architecture identified in this study could provide more precise and reliable information for marker-assisted selection in crop breeding.

Plant Materials
Mapping populations representing Upland cotton cultivars and genetic resources were mainly obtained from China and some from foreign countries including the United States of America, Russia, Australia etc (Table S1), which are available from the National Mid-term Genebank of the Institute of Cotton Research, Chinese Academy of Agricultural Sciences (ICR-CAAS) after signing the Material Transfer Agreement (MTA). These cultivars are suitable for association mapping since they have abundant genetic resources. 323 sampled cultivars of Upland cotton were planted and evaluated for lint yield and yield components during 2007 to 2009 in three locations: 1) Anyang of Henan province (annual frost free period = 180-230 days, average active accumulated temperature = 3800-4900uC, annual rainfall = 500-1000 mm) in the Yellow River valley cotton growing region; 2) Kuche of Xinjiang province (annual frost free period = 170-230 days, average active accumulated temperature = 3000-5400uC, annual rainfall = 15-380 mm) in the northwestward cotton growing region; 3) Nanjing of Jiangsu province (annual frost free period = 227-278 days, average active accumulated temperature = 3500-5500uC, annual rainfall = 1000-1600 mm) in the Yangtze River valley cotton growing region. This research is regular cotton regional trial test conducted in farm field of cotton research institute, and not related with protected area of land and protection of wildlife. It does not need specific official permission. We regarded combination of each location and each year as an individual environment, setting a total of nine environments. A randomized complete block design with three replications was employed in the filed trials, and each block was settled with two rows and every row was kept in a plot with 5 m long and 0.7 m wide. The field management utilized conventional field production management techniques adjusted to local practice. Lint yield and three yield components were evaluated: number of bolls per plant, boll weight, and lint percentage.

DNA Extraction and Marker Analysis
The genomic DNA was extracted from fresh, young leaves of the 323 cultivars using CTAB method. SSR primers sequences including BNL, JESPR, TMB, CIR, HAU, DPL, GH, NAU, MGHES etc. can be available from Cotton Microsatellite Database (CMD, http://www.cottonssr.org). We selected 20 cultivars of large morphological variations from the research natural population to screen for polymorphisms primers. In total, 5,600 SSR primers were chosen to survey the polymorphisms of the 323 cultivars. Only 651 polymorphisms SSR alleles were detected, which were used for association mapping (Table S2).

Genetic Models and Statistical Methods
Association mapping were performed using the mixed linear model, including environment (e) as fixed effects, SSR loci effects (a, aa) and loci by environment interaction (ae, aae) as random effects. The genetic model for phenotypic value of the k-th genotypes in the h-th environment (y hk ) can be expressed by the following mixed linear model, where m is the population mean; e h is the fixed effect of the h-th environment; a i is the i-th additive effect with coefficient u ik ; aa ij is the i-th additive by j-th additive epistasis effect with coefficient u ik u jk ; ae hi is the i-th additive by the h-th environment interaction effect with coefficient u hik ; aae hij is the aa ij by the h-th environment interaction effect with coefficient u hik u hjk ; and e hk is the random residual effect of the k-th breeding line in the h-th environment. The mixed linear model can be presented in matrix notation, where y is an n|1 column vector of phenotypic values and n is the sample size of observations; b is a column vector of u and environments in the experiment; X is the known incidence matrix relating to the fixed effects; U v is the known coefficient matrix relating to the v-th random vector e v ; e eM MVN(0, Is 2 e ) is an n|1 column vector of residual effects.
The phenotypic variance V p is considered as the sum of additive variance V A , epistasis variance V AA , additive by environment interaction variance V AE , epistasis by environment interaction variance V AAE , and residual variance V e .
Heritability is defined as the relative contribution of genetic variance to phenotypic variance with the following estimation model: where h 2 GzGE = total heritability; h 2 A = heritability contributed by sum of individual locus, h 2 AA = heritability contributed by sum of pair-wise epistasis, h 2 AE = additive by environment interaction heritability contributed by sum of individual additive by environment interaction effects, h 2 AAE = epistasis by environment interaction heritability contributed by sum of pair-wise epistasis by environment interaction effects; h 2 a = additive heritability of individual locus, h 2 aa = epistasis heritability of pair-wise loci, h 2 ae = heritability of additive by environment interaction effect, h 2 aae = heritability of epistasis by environment interaction effect.

Association Mapping
A GPU parallel computing software QTXNetwork (http:// ibi.zju.edu.cn/software/QTXNetwork/) was used to dissect the genetic architecture of lint yield and its component traits. There were 651 SSR markers used for association mapping for yield traits of 323 representative Upland cotton cultivars in nine different environments. Significant SSR markers were screened and the estimation of fixed effects (e) and prediction of random effects (a, aa, ae, and aae) of loci were obtained.

Gene Locations Based on the Associated Markers by Reference D Genome
Based on the experiment-wise type I error (a EW v0:05) setting by QTXNetwork, the SSR loci associated with the fiber yield traits were selected. The primer information was acquired from cotton marker database (http://www.cottonmarker.org/cmd_ downloads/ssr_project_data/CMD_PRIMER_ALL.xls).
Only one SSR location for each marker on the genome D (ftp://ftp. ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/plants/Gossypium_ raimondii/) was acquired by bowtie method which 2 bases not matched was permitted (bowtie -a -v 2-fr cottonPD -f -1 primerForward.fasta -2 primerReverse.fasta -best -strata bow-tieResult_PD.txt). The SSR motifs of the searched SSR sequences on the D genome were scanned by MISA. The location between SSR locus and the related gene was decided based the star location of the motif on the genome.

Results
In total, there were 94 significant SSRs (14 loci with individual effects and 93 loci involved in epistasis interactions) associated with lint yield, 95 SSRs (28 individual loci) associated with boll number, 91 SSRs (21 individual loci) associated with boll weight, and 88 SSRs (19 individual loci) associated with lint percentage. All identified loci associated with lint yield components were involved in epistasis interactions.
As to the lint yield, the total heritability of genotype 6 environment interaction effects (h 2 GE o38.26%) was mainly contributed due to epistasis 6 environment interaction (h 2 AAE o34.05%), and was much larger than the total genotype heritability (h 2 G o10.80%) ( Table 1). We listed highly significant genetic effects of loci (2Log 10 P.8.0 and h 2 GE .1.0%, h 2 G .0.5%) and consistent loci with environment interaction which were stable in a particular ecological area across at least two years for lint yield and three yield traits ( Figure 1, Table 2 -5). Among the loci associated with lint yield (Table 2), 1 locus (NAU3011-2) and 6 pairs of epistasis loci were highly significant with environmental interaction at Nanjing in 2007 and 2009. It was suggested that improving lint yield could be expected by selecting major-allele genotype QQ (NAU30112197) and QQ6QQ (MUCS1012 3306MGHES182228, NAU332522386TMB19892255, NAU360822456HAU7732170), as well as minor-allele genotype qq6qq (HAU179423206TMB102375, NAU13622 2256NAU37742248, and NAU332522386HAU4232175) at Nanjing (Table 2).
For boll number trait, the heritability of epistasis 6environment interaction effects (h 2 AAE o41.93%) was larger than other trait heritability (Table 1), which indicated that boll number was mainly controlled by environment-specific epistasis effects. 3 loci (HAU1385-3, NAU1102-3, MGHES41-6) with additive heritability were found to be more than 0.5%, and 10 pair epistasis loci with environment-specific epistasis heritability showed more than 1.0% genetic effects ( Figure 1, Table 3), indicating that the improvement of boll number in all 9 environments could be predictable (h 2 A o1.83%) by selecting genotype QQ for HAU13852150, and qq for other 2 loci (MGHES412333, NAU11022241). Further improving boll number could be expected (h 2 AE3 o11.31%) through selecting HAU10292197 6 NAU11252243 in both Anyang and Nanjing areas and other 9 pair epistasis loci only in Nanjing area, among these 7 epistasis loci with genotype of QQ6QQ showed positive effects, and 3 pair epistasis loci with genotype of qq6qq showed negative effects. The total heritability (h 2 GzGE o 62.85%) of boll weight was mostly due to additive effects (h 2 A o21.48%), epistasis effects (h 2 AA o22.01%) and environment-specific additive effects (h 2 AAE o18.73%). It was suggested that most genetic effects were fairly stable across environments (h 2 G o43.49%) for boll weight. There were 15 loci with heritability of additive effect larger than 0.5% and 2 pair epistasis loci with heritability of environment-specific epistasis effects larger than 1.0% for boll weight (Figure 1, Table 4). With regard to little impact of environmental factors on boll weight, relevant robust loci could be selected for improvement of lint yield across different ecological areas.
For lint percentage, the total heritability (h 2 GzGE o72.29%) was mostly attributed to epistasis effects (h 2 AA o57.78%). Thus, it was concluded that lint percentage was very stable across various environments and mainly controlled by epistasis effects. Compared with lint yield and other yield components, most genetic effects of individual loci and epistasis loci explained a small proportion of total phenotypic variance for lint percentage. Only 2 loci with additive effects and 5 pair epistasis loci with epistasis effects had heritability larger than 0.5% ( Figure 1 and Table 5). It was apparent that lint percentage was mostly controlled by many loci with smaller effects as compared with lint yield and other component traits. In this population, further genetic gain could be projected (h 2 AzAA o 6.3%) based on selecting 2 individual loci with additive effects (NAU33252238, NAU35192200) and 5 pair loci with epistasis effects for increasing lint percentage across different environments.
It has been observed that some loci were associated with more than one trait. 4 loci (DPL910-120, HAU639-175, NAU1155-205, NAU40422175) were found to be associated with all of the four traits, 22 loci were detected to be associated with three of the four traits, and 61 loci were associated with two of the four traits. Generally, these common loci made it possible for us to simultaneously improve multiple traits via breeding programs. However, 41 loci controlling lint yield were not detected in other component traits, which suggested that lint yield, as the result of complex biosynthesis pathways, may comprise some unknown components affected by polygenes.
It had been observed that epistasis is the genetic base of lint yield and yield components. In the results, most of the loci with individual effects were involved in epistasis interactions for lint yield and yield components. Altogether 293 digenic loci with epistasis effect (aa) and/or epistasis 6 environment interaction effect (aae) were identified to be associated with lint yield. There were 193 digenic epistasis interactions with their total heritability (h 2 AAzAAE o 30.71%) of epistasis effects and environment-specific epistasis effects involved in the loci without individual effects. We also observed this phenomenon in yield components. It was shown that many loci, even without controlling yield and yield components on their own, could affect the traits in combination with other loci.
It was observed that many loci could interact with several other loci for controlling phenotypic variation of lint yield and its component traits. For example, NAU2126-195 could affect lint yield in combination with other 4 loci (HAU7732170, NAU3377-180, TMB102375, GH501-265), respectively. Magnitude of epistasis effects of them differed from each other. NAU2126-195 may be regarded as core locus of epistasis interaction jointly with other casual genes for controlling target traits. This phenomenon might indicate that higher order interaction could also exist for association with lint yield and its components.

Discussion
Association mapping is a powerful method to detect loci underlying complex traits and more efficient compared to linkage analysis, because it exploited abundant genetic variation in diverse genetic backgrounds [15]. However, in most of the current GWAS, the epistasis and environmental interaction effects were not detectable due to absence of appropriate statistical methods and a heavy computational burden [16][17][18]. The models without epistasis and environmental interaction for GWAS may cause some problems: 1) It may result in biased estimation of effects on loci and decrease the precision and power of loci detection [18- Table 1. Estimated heritability for lint yield and yield components.

19];
2) The heritability of complex traits would be missed in the condition of reduced models [20][21].
Our proposed methodology is based on mixed linear model including epistasis and environmental interaction, but it can also be used for reduced models ignoring epistasis and environmental interaction. We also performed GWAS of various reduced models for lint yield and yield components. As compared with the full model (Table 1), the total heritability of lint yield and yield components decreased in the additive model (h 2 G o18.35% for lint yield, 6.53% for boll number, 28.65% for boll weight, and 22.95% for lint percentage), and also in the additive with environmental interaction model (h 2 GzGE o17.31% for lint yield, 23.76% for boll number, 32.25% for boll weight, and 25.08% for lint percentage). The results also showed steep decrease in numbers and effects of loci associated lint yield and yield components between reduced models and full model. This may explain the reasons for missing heritability reported recently [21].
Because of low LD associated with higher frequencies of crosspollination, cotton species are amenable for association mapping of agronomic traits with a relatively large numbers of markers. Previous studies involving nearly 1000 polymorphic markers would ensure the coverage of whole cotton genome [14]. A total of 651 SSR markers and 323 Upland cotton cultivars under 9 environments in this study can significantly improve the power of association mapping. However, with the rapid development of next generation sequencing, a substantial number of single nucleotide polymorphisms (SNPs) will facilitate the accurate and fine association analysis.
It is shown by the results of GWAS that lint yield, boll number and lint percentage are inherited in a mainly epistasis manner, whereas boll weight is predominately controlled by both additive and epistasis effects. In addition, lint yield and boll number are particularly susceptible to a broad range of environments, as compared to boll weight and lint percentage, which has also been observed in previous studies [5,[22][23]. Both lint yield and component traits were complex traits controlled by many genes and gene-network cascades interacting with each other and also with the environment factors ( Figure 1). Many loci (ten or more in the present study) associated with lint yield, boll number, and boll weight showed much larger effects and heritability (additive or epistasis effects), while few loci associated with lint percentage could have large effects. This was indicated that improving cotton lint yield could be difficult through increasing lint percentage. All these information obtained from GWAS could provide useful reference for MAS in cotton breeding programs.
Although the effects of many loci appeared to be consistent across various environments, the magnitude of effects and even direction of loci may vary depending on environmental factors because of interactions between loci and environment [3,18,24]. We can classify loci identified in the present study into three types. The first category of loci is called constituted loci with large major effects (a, aa) and can be consistently detected across different environments. This type of loci is very useful for improving cotton traits across different ecological locations. Another type of loci is called environment-specific loci with only environment-special effects identified in a special ecological region but stable in different years. For example, TMB3122212 6 GH132-180 was stably found in the Northwestward region and NAU30132245 6 NAU51632216 was stably scanned in the Yangtze River valley region. This type of loci could be used for maker-assisted breeding in a particular ecological area. The third type of loci is called environment-sensitive loci that could not be consistently detected in various grown regions. This kind of loci could only be used in specific conditions. SSRs widely used in major crop were highly reliable, polymorphic, simple and cheap. The SSRs associated with nearby genes were extremely useful for MAS. In the present study, we successfully detected 251 significant loci associated with lint yield and its three components, including 69 loci with individual effects. Most of these significant loci generally had small impacts on cotton Table 2. Predicted genetic effects with significance and heritability of lint yield for cotton regional trials in three locations and three years.  breeding due to small effects and easily affected by environment [23]. In order to develop effective MAS schemes, reliable and elite loci associated with lint yield and yield components have been screened. These loci, which consist of constituted loci and environment-specific loci, are reliable to predict corresponding phenotype. Constituted loci (such as NAU35882315 and NAU33252238) can be applied for MAS in various cottongrowing region, whereas application of environment-specific loci (examples are NAU11022241 and NAU13622225 6 NAU37742248) are limited to specific cotton-growing region. Both of them have substantial potential to improve the efficiency and precision of conventional cotton breeding. With the information from Cotton Microsatellite Database (CMD) and published report, several markers identified in our present study had been located within or nearby the reported QTL in previous linkage analysis and association mapping [1][2][3][4][5][6][7][8]14]. For example, Zhang et al [14] independently detected QTLs called qLP-A1-1 (between JESPR101 and BNL3590) using linkage method and a locus called JESPR101 using association method, which affected lint percentage. We also identified this marker (2Log 10 P-value = 15.09) associated with the same (Table 5). In addition, most of these consistent loci were involved in interaction in our present results (examples are NAU54332330 for numbers of boll in Table 3). It indicated that these loci were stably transferred and can be further used in MAS.
The traditional QTL was mainly located between two markers with the scale of cM (centi-Mogan). Based on the physical mapping of reference genome, the fine mapping of the markers and the genes related to the fiber yield traits become possible. In this study, candidate gene locations of yield fiber traits were identified based on the associated SSR markers analyzed by the QTXNetwork. For the lint yield, one major-allele genotype QQ (NAU30112197) related to zinc finger (GATA-type) gene (IPR000679) was identified. 3 epistasis loci with positive effects (QQ6QQ genotype) could be related to the interactions with the genes of cytochrome P450 (IPR001128), Cytochrome b245 (IPR000778), zinc finger with C2H2-type (IPR007087), and protein kinase (IPR000719) etc. 3 minor-allele with negative effects (qq6qq genotype) was linked to the interactions with the genes of glycoside hydrolase (IPR001764), lipid-binding START Table 3. Predicted genetic effects with significance and heritability of boll number for cotton regional trials in three locations and three years. Note: a = additive effect; ae 3 y 1 , ae 3 y 3 , aae 3 y 1 , aae 3 y 2 , aae 3 y 3 as defined in Table 2; ae 3 y 2 = additive by environment interaction effect at Nanjing in 2008, aae 1 y 1 = epistasis by environment interaction effect at Anyang in 2007, aae 1 y 2 = epistasis by environment interaction effect at Anyang in 2008, aae 1 y 3 = epistasis by environment interaction effect at Anyang in 2009; 2Log 10 P = minus log 10 (P-value), h 2 (%) = heritability (%). doi:10.1371/journal.pone.0095882.t003 (IPR002913), and peptidase aspartic (IPR021109) at Nanjing areas ( Table 2). It seems that the genes catalyzed the oxidation of organic substances will have positive effects on cotton lint yield, but the decompose genes of glycoside, lipid and protein maybe have negative effects for lint yield. For boll number, there were 3 additive SSR loci (HAU1385-150, NAU11022241, MGHES412333) related to genes of peptidyl-prolyl cis-trans isomerases (PPIase, IPR000297), plastocyanin-like (IPR003245), and acylneuraminate cytidylyltransferase (IPR003329). 7 epistasis loci with positive effects for boll number (QQ6QQ genotype) might be related to gene interactions of cytochrome P450 (IPR001128), UBX domain-containing protein (IPR001012), tubulin (IPR000217), ribosomal protein (IPR001865), major facilitator superfamily (IPR016196), formylmethionine deformylase (IPR000181), and SET domain of SUVH4 (IPR001214) etc. There were 3 epistasis loci with negative effects (qq6qq genotype) related to gene interactions of DDT domain superfamily (IPR004022), AUX/IAA protein (IPR003311), ribosomal protein (IPR001865), protein of unknown function (DUF1685), lipid-binding START (IPR002913) etc. It seems that the gene interactions for the metabolism and transport of cysteine, prolyl and formylmethionine could increase the boll number. But the regulation of the gene interactions of the lipidbinding DNA binding and auxin-binding protein could also affect the boll number. Table 4. Predicted genetic effects with significance and heritability of boll weight for cotton regional trials in three locations and three years.  Table 5. Predicted genetic effects with significance and heritability of lint percentage for cotton regional trials in three locations and three years. For boll weight, there were 15 additive SSR loci with positive effects relate to following genes: Type I phosphodiesterase (IPR002591), SET domain (IPR001214) of SUVH4, plastocyanin-like (IPR003245), oligopeptide transporter (OPT) superfamily (IPR004813). While the genes could have negative effects, such as glyceraldehyde 3-phosphate dehydrogenase (IPR020828), restriction endonuclease (IPR011335), leucine-rich repeat (IPR001611), STAR-related lipid-transfer (START) domain (IPR002913), brassinosteroid signalling positive regulator (BZR1) family protein(IPR008540), and RING-type zinc finger (IPR001841), respectively. Except the main additive effects, there are also 2 epistasis loci with negative effects (qq6qq genotype) which maybe resulted from the interactions of leucine-rich repeat (IPR001611), GTPbinding protein (IPR002917) etc. It was inferred that boll weight is mainly controlled by the genes with the phosphorus, lipid and signalling transport, which may be further regulated by the zinc finger, leucine-rich repeat, and the methylation protein.