A genome-wide association study identified loci for yield component traits in sugarcane (Saccharum spp.)

Fernanda Zatti Barreto; João Ricardo Bachega Feijó Rosa; Thiago Willian Almeida Balsalobre; Maria Marta Pastina; Renato Rodrigues Silva; Hermann Paulo Hoffmann; Anete Pereira de Souza; Antonio Augusto Franco Garcia; Monalisa Sampaio Carneiro

doi:10.1371/journal.pone.0219843

Abstract

Sugarcane (Saccharum spp.) has a complex genome with variable ploidy and frequent aneuploidy, which hampers the understanding of phenotype-genotype relationships. Despite this complexity, genome-wide association studies (GWAS) may be used to identify favorable alleles for target traits in core collections and then assist breeders in better managing crosses and selecting superior genotypes in breeding populations. Therefore, in the present study, we used a diversity panel of sugarcane, called the Brazilian Panel of Sugarcane Genotypes (BPSG), with the following objectives: (i) estimate, through a mixed model, the adjusted means and genetic parameters of the five yield traits evaluated over two harvest years; (ii) detect population structure, linkage disequilibrium (LD) and genetic diversity using simple sequence repeat (SSR) markers; (iii) perform GWAS analysis to identify marker-trait associations (MTAs); and (iv) annotate the sequences giving rise to SSR markers that had fragments associated with target traits to search for putative candidate genes. The phenotypic data analysis showed that the broad-sense heritability values were at least 0.48 and 0.49 for the first and second harvests, respectively. The set of 100 SSR markers produced 1,483 fragments, of which 99.5% were polymorphic. These SSR fragments were useful to estimate the most likely number of subpopulations, found to be four, and the LD in the BPSG, which was stronger in the first 15 cM and extended over a long distance (65 cM). Genetic diversity analysis showed that, in general, the clustering of accessions within the subpopulations was in accordance with the pedigree information. GWAS performed through a multilocus mixed model revealed 23 MTAs: six, three, seven, four and three for soluble solid content, stalk height, stalk number, stalk weight and cane yield, respectively. These MTAs may be validated in other populations to support sugarcane breeding programs with introgression of favorable alleles and marker-assisted selection.

Citation: Barreto FZ, Rosa JRBF, Balsalobre TWA, Pastina MM, Silva RR, Hoffmann HP, et al. (2019) A genome-wide association study identified loci for yield component traits in sugarcane (Saccharum spp.). PLoS ONE 14(7): e0219843. https://doi.org/10.1371/journal.pone.0219843

Editor: David D. Fang, USDA-ARS Southern Regional Research Center, UNITED STATES

Received: January 23, 2019; Accepted: July 2, 2019; Published: July 18, 2019

Copyright: © 2019 Barreto et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This research was supported by the following institutions and/or grant programs: FINEP (Financiadora de Estudos e Projetos), FAPESP (Fundação de Amparo à Pesquisa de São Paulo - 08/52197-4 to APdS) and INCT-Bioetanol (Instituto Nacional de Ciência e Tecnologia do Bioetanol - FAPESP 08/57908-6 to MSC, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) - 574002/2008-1 to MSC). FZB received a master’s fellowship from CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Finance Code 001). TWAB received a doctoral fellowship from FAPESP (10/50091-4). JRBFR received a master’s fellowship from FAPESP (10/06702-9). APS and AAFG received research fellowships from CNPq. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Sugarcane (Saccharum spp.) is an important industrial crop and a vital component for food and energy security, providing sucrose, bioethanol and bioelectricity [1,2]. Sugarcane is cultivated mainly in tropical and subtropical areas and has a very high photosynthetic efficiency and a complex genome due to its variable ploidy levels, frequent aneuploidy, and large genome size of approximately 10 gigabases (Gb) [3–8]. Modern sugarcane cultivars have chromosome numbers ranging from 100 to 130, are vegetatively propagated, and result from the selection of populations derived from outcrossing heterozygous parents [8–10]. Brazil is the world’s largest sugarcane producer, and its productivity increased 66% in tons of sugarcane per hectare from 1975 to 2010, partially due to the expansion of the growing area and improvements in agricultural practices [2,10,11].

Sugarcane breeding programs concentrate efforts on releasing cultivars adapted to different environments that have high yields in terms of biomass production and sucrose content as well as resistance to diseases. However, the breeding process is expensive and requires approximately 15 years of experimentation and selection to obtain one or a few cultivars. Briefly, every year, crosses between accessions generate hundreds of thousands of F1 progenies, and the individuals reaching the final stages of selection are commonly evaluated over several harvests in multienvironment trials (METs) to identify those with the potential to become new cultivars [10–13]. Even with the adoption of better agricultural practices and selection strategies in the early stages of breeding programs, which attempt to measure and separate environmental effects from genetic factors [13–16], the genetic gains for quantitative traits have declined in recent years for sugarcane and other crops [17–19].

Clearly, there is a need to complement the classical breeding of sugarcane with other tools, such as molecular approaches, which have been applied for other crops [20–22]. Quantitative trait locus (QTL) mapping and genome-wide association studies (GWAS) are strategies to understand the genetic architecture of complex traits and represent a first step toward marker-assisted selection (MAS) [4,6,19,22]. To employ these strategies in outcrossing heterozygous species, such as sugarcane, we need to consider that, for each segregating locus, different numbers of segregating alleles may exist, and the single-dose markers currently available for mapping studies capture only part of the genetic information [8,23]. This limitation is more evident in the traditional QTL mapping approach, which may identify genomic regions with low resolution, usually due to the small number of available markers, and is also limited to the genetic composition of the biparental population under study. Nevertheless, attempts to associate phenotype and genotype and the development of new data analysis strategies have advanced significantly [23–27].

On the other hand, GWAS has been widely used to identify marker-trait associations (MTAs) in genetically diverse populations of plants [20,21,28–32]. GWAS is based on linkage disequilibrium (LD) due to physical linkage, which is reportedly extensive in sugarcane [33–36]. This extensive LD is attributed to a recent breeding history, characterized by a strong foundation bottleneck followed by a small number of intercrossing cycles, which significantly reduces the frequency of recombination events. The high extent of LD in sugarcane indicates that a high density of markers may not be critical for performing GWAS [36–38] and that single-dose markers might be appropriate for this purpose; indeed, mapping models for loci with high allelic dosages are still under development [26]. Although high-throughput marker systems are available, mainly for single nucleotide polymorphism (SNP) genotyping, the lack of appropriate methods for analyzing complex species such as sugarcane hinders the applicability of new molecular breeding tools [23,26,27]. In this context, single-dose markers, such as simple sequence repeats (SSRs) and target region amplification polymorphisms (TRAPs), could be used to characterize genome variation, investigate population structure and genetic diversity and thus enable GWAS [37,39–41]. In addition, despite the potential for using LD-based association studies to identify MTAs, few studies on yield-related traits in sugarcane have been published [18,35–42].

For GWAS, several algorithms and software packages have been developed to improve statistical power, increase computational efficiency, and reduce spurious associations [43]. Among GWAS algorithms, FarmCPU [44], which uses a multilocus linear mixed model (MLMM), is considered an efficient alternative to control for spurious associations [45–47]. Indeed, combinations of various methods for multilocus GWAS have also been used to identify causal associations and control the false positive rate [43,47,48].

In the present study, our objectives were to (i) estimate, through a mixed model, the adjusted means and genetic parameters of the five yield traits evaluated over two harvest years in a diversity panel composed of ancestral and modern sugarcane accessions; (ii) detect population structure, LD and genetic diversity using SSR markers; (iii) perform GWAS analysis to identify MTAs; and (iv) annotate the sequences giving rise to SSR markers that had fragments associated with target traits to search for putative candidate genes.

Materials and methods

Plant material and phenotypic traits

In this study, 134 accessions (S1 Table) of the Brazilian Panel of Sugarcane Genotypes (BPSG) were used. The BPSG is a mini core collection from the germplasm bank of RIDESA (Inter-University Network for the Development of Sugarcane Industry), and the accessions were chosen according to the following criteria: i) relevant Brazilian cultivars; ii) main parents for Brazilian breeding programs; iii) cultivars from countries that grow sugarcane; iv) cultivars used as parents in mapping programs [25,49]; and v) representatives of the Saccharum species complex. The BPSG accessions represent an important genetic background in Brazilian breeding programs.

The 134 accessions of the BPSG were planted in a field experiment performed in 2013 at the Agricultural Science Center of the Federal University of São Carlos (UFSCar) in Araras City, São Paulo State, Brazil. Araras is located at 22°21’25”S, 47°23’3”W at an altitude of 611 m; the experimental area soil is Typic Eutroferric Red Latosol. The experiment followed a randomized complete block design with four replications. The plots consisted of two rows 3 m long and spaced 1.5 m apart. Each plot was composed of 12 presprouted seedlings at the planting of the experiment in 2013. The experimental plants were harvested when they were approximately 18 months of age during the plant cane and first ratoon. The BPSG was evaluated for five yield components: soluble solid content (BRIX, in °Brix), stalk height (SH, in m), stalk number (SN), stalk weight (SW, in kg), and cane yield (TCH, in t ha^–1). Phenotypic yield trait data were collected according to Balsalobre et al. [12]. Briefly, a 10-stalk sample per plot was taken for analysis of the BRIX and SH. The weight of the 10 stalks was added to the total weight of the plot (SW) to estimate the TCH, which was calculated as the product of the SW per linear meter and the number of linear meters in one ha (6667 linear meters compose one ha with a spacing of 1.5 m). The SN was estimated by directly counting the stalks in each plot.
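As a worked illustration of the TCH conversion described above, the following minimal Python sketch scales a plot weight to tonnes per hectare. It assumes that the plot weight covers both 3 m rows (6 linear meters per plot) and that kilograms are converted to tonnes by dividing by 1000; the function name `cane_yield_t_per_ha` and the example weight are hypothetical and not taken from the study.

```python
# Linear meters of row in one hectare at 1.5 m spacing, as stated in the text
# (10,000 m^2 / 1.5 m, rounded to 6667).
LINEAR_M_PER_HA = 6667


def cane_yield_t_per_ha(plot_weight_kg, rows=2, row_length_m=3.0):
    """Scale a plot stalk weight (kg) to cane yield in t/ha (illustrative helper)."""
    linear_m = rows * row_length_m               # assumed: whole plot = 2 rows x 3 m
    kg_per_linear_m = plot_weight_kg / linear_m  # SW per linear meter
    return kg_per_linear_m * LINEAR_M_PER_HA / 1000.0  # kg/ha -> t/ha


# Hypothetical plot weighing 60 kg: 10 kg per linear meter.
print(cane_yield_t_per_ha(60.0))  # 66.67
```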

Statistical analysis of phenotypic data

A multiharvest mixed model produced the joint adjusted means. The analysis was conducted for each trait using GenStat 19th edition [50] based on restricted maximum likelihood (REML) and the following linear mixed model:

y_imkuv = μ + h_m + b_km + g_imk + r_umk + c_vmk + e_imkuv

where y_imkuv is the phenotype of the i^th accession, evaluated in the m^th harvest, located in the u^th row and the v^th column inside the k^th replication; μ is the overall mean; h_m is the fixed effect of the m^th harvest (m = 1,…,M; M = 2); b_km is the fixed effect of the k^th replication (k = 1,…,K; K = 4) at the m^th harvest; g_imk is the random effect of the i^th accession (i = 1,…,I; I = 134) at the m^th harvest evaluated in the k^th replication; r_umk and c_vmk are the random effects of the u^th row and v^th column, both evaluated at the m^th harvest and k^th replication; and e_imkuv is the random residual error. In addition, for the SN, SW, and TCH traits, the number of clumps per plot was included in the mixed model as a fixed covariate. To model the accession effects, we assumed g ~ N(0, G) with the genetic variance–covariance (VCOV) matrix G = G_M ⊗ I_Ig, where M is the number of harvests and ⊗ represents the Kronecker product of the genetic matrix G_M and the identity matrix I_Ig, with respective dimensions 2 x 2 and 134 x 134. For the G_M matrix, four structures (identity, ID; diagonal, DIAG; first-order autoregressive homogeneous, AR1; and first-order autoregressive heterogeneous, AR1(het)) were examined and compared via the Akaike (AIC; [51]) and Bayesian (BIC; [52]) information criteria [53]. For the residuals, heterogeneous variances were assumed for the different harvests. For each trait, the fixed effects were tested using the Wald test and were retained in the model if statistically significant (P < 0.05). After the G_M matrix structure selection, the adjusted means for accessions and genetic parameters for each evaluated trait were obtained.
The phenotypic (σ^2_P) and genotypic (σ^2_G) variances were used for calculating heritability in the broad sense on an individual-plant basis (H^2 = σ^2_G / σ^2_P). The σ^2_P value was determined from σ^2_P = σ^2_G + σ^2_e + σ^2_r + σ^2_c, where σ^2_e was the residual variance, σ^2_r was the variance for row effects and σ^2_c was the variance for column effects [54].
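The heritability calculation can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the phenotypic variance is the sum of the genotypic, residual, row and column variance components, as the sentence above suggests; the function name and the numeric values are hypothetical and do not come from Table 1.

```python
def broad_sense_heritability(var_g, var_resid, var_row, var_col):
    """H^2 on an individual-plant basis from estimated variance components."""
    # Assumed decomposition: phenotypic = genotypic + residual + row + column.
    var_p = var_g + var_resid + var_row + var_col
    if var_p <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_g / var_p


# Made-up components for one trait in one harvest.
print(broad_sense_heritability(4.0, 3.0, 0.5, 0.5))  # 0.5
```

Because the residual variance was allowed to differ between harvests, such a calculation would be repeated per harvest with the corresponding components.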

DNA extraction, molecular markers and genotyping

Approximately 3.0 g of tissue from the leaf primordia of each accession was collected, and the genomic DNA was extracted according to methods described by Aljanabi et al. [55]. The SSR markers were amplified based on the procedures described by Oliveira et al. [56], and the amplified fragments were visualized as described by Creste et al. [57]. A total of 100 SSR primers were used, of which 86 were from expressed sequences (EST-SSR) [58,59] and 14 were of genomic origin [60]. These markers were selected because they met one or more of the following criteria: i) high polymorphic information content (PIC); ii) high discrimination power (DP); and iii) present in previously published sugarcane genetic maps.

Due to the polyploid and complex nature of sugarcane, the amplified SSR fragments, which cannot depict ploidy levels and allele dosages, were evaluated as dominant markers [61,62], i.e., the presence of fragments suggested that an allele for a given locus was present in at least one of the chromosomes that comprised a homologous group, while the absence of fragments suggested that this same allele was not present in any chromosome. Thus, the fragments were classified as binary, i.e., (1) indicated a fragment was present, and (0) indicated a fragment was absent. When amplification failed, NA (nonamplified) was used to indicate missing data. The polyacrylamide gels were manually evaluated with the support of a light box, and a binary matrix formed by the combination of the detected fragments with the analyzed accessions was constructed.

Population structure and genetic diversity

Population structure was analyzed by a discriminant analysis of principal components (DAPC) [63] using SSR data in the adegenet package [64], which is available in R software [65], as described by Jombart and Collins [66] and Deperi et al. [67]. Briefly, the find.clusters function was used to detect the number of clusters in the BPSG. This function uses K-means clustering, which decomposes the total variance of a variable into between-group and within-group components. The best number of subpopulations has the lowest associated BIC. A cross-validation function (xvalDapc) and the optimal α-score function (optim.a.score) were used to confirm the correct number of principal components (PCs) to be retained. The optimal number of PCs to retain is associated with the lowest root mean square error and with the highest optimized α-score. The subpopulations indicated by DAPC were plotted in a scatterplot considering the first and second linear discriminants. Additionally, a genetic dissimilarity matrix was calculated via a simple matching (SM) method using DARwin software [68] based on the SSR information. Then, the resulting matrix was plotted as a phylogram using the neighbor-joining (NJ) algorithm [69]. In addition, bootstrap analysis was performed as described by Efron [70] and Efron and Tibshirani [71] to verify whether the number of fragments evaluated was sufficient to distinguish the accessions. The coefficients of variation are graphically shown as boxplots for each sampling with different numbers of fragments.
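To make the simple matching (SM) dissimilarity concrete, the sketch below computes it for two presence/absence profiles in Python. It is an illustration only, not the routine of the software used in the study: it assumes that fragments missing in either accession (coded here as `None`) are skipped, and the two profiles are invented.

```python
def simple_matching_dissimilarity(profile_a, profile_b):
    """Share of scored fragments whose 0/1 state differs between two accessions."""
    # Keep only fragments scored in both accessions (assumed handling of NA).
    pairs = [(a, b) for a, b in zip(profile_a, profile_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no fragment is scored in both accessions")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


# Invented profiles over six fragments; the fifth is missing in accession_1.
accession_1 = [1, 0, 1, 1, None, 0]
accession_2 = [1, 1, 1, 0, 0, 0]
print(simple_matching_dissimilarity(accession_1, accession_2))  # 0.4
```

Applying the function to every pair of accessions would give a dissimilarity matrix of the kind passed to the NJ algorithm.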

Kinship matrix

The kinship coefficient was calculated between pairs of accessions using the kinship2 package [72] in R, considering the accessions of all generations and assigning the value 0 when the parents were unknown. Based on the estimated kinship coefficients, a kinship matrix (K) was generated.
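The pedigree-based kinship coefficient can be illustrated with the standard recursive definition. The Python sketch below is not the kinship2 implementation; it assumes that individuals with unknown parents are unrelated founders (kinship 0 between them) and that the pedigree lists parents before their offspring. The two full sibs and their parents are taken from the pedigree information reported in this paper, but treating the parents as founders is an assumption made for the example.

```python
from functools import lru_cache


def kinship_matrix(pedigree):
    """pedigree: dict id -> (parent_a, parent_b); None marks an unknown parent.

    Every named parent must itself be a key, listed before its offspring.
    """
    order = {ind: pos for pos, ind in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def phi(i, j):
        if i is None or j is None:
            return 0.0                      # unknown parent: kinship set to 0
        if i == j:
            a, b = pedigree[i]
            return 0.5 * (1.0 + phi(a, b))  # self-kinship
        if order[i] < order[j]:
            i, j = j, i                     # always expand the younger individual
        a, b = pedigree[i]
        return 0.5 * (phi(a, j) + phi(b, j))

    ids = list(pedigree)
    return {(i, j): phi(i, j) for i in ids for j in ids}


pedigree = {
    "RB72454": (None, None),    # treated as a founder for this example
    "SP70-1143": (None, None),  # treated as a founder for this example
    "RB845197": ("RB72454", "SP70-1143"),
    "RB845210": ("RB72454", "SP70-1143"),
}
K = kinship_matrix(pedigree)
print(K[("RB845197", "RB845210")])  # 0.25 for full sibs of unrelated parents
```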

Linkage disequilibrium analysis

Marker data were used to assess the level of LD in the BPSG as described by Raboin et al. [35]. Briefly, Fisher’s exact probability was used to test for associations between SSR fragments that were common to both the association mapping population and the SP80-180 and SP80-4966 integrated genetic map [56]. For each pair of markers, a contingency table (presence versus absence) was established, and the Fisher probability was computed using the exact2x2 package in R software [73]. To control for error due to multiple testing, we used the false discovery rate (FDR) procedure [74] with an initial threshold of 5%. A Bonferroni-corrected threshold was also verified. The Fisher (−LogP) logarithmic probabilities of the associations between only linked fragments were plotted with the respective genetic distances [75] in centimorgans (cM).
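The pairwise test and the multiple-testing adjustment can be sketched as follows. This Python example is a plain re-implementation for illustration, not the exact2x2 code: it uses the common two-sided definition of Fisher's exact test (summing all tables no more probable than the observed one) and assumes that the FDR procedure [74] is of the Benjamini-Hochberg step-up type. The tables and p-values are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))  # tolerance for float ties
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Step-up FDR-adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # from largest to smallest p-value
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Fragment A present/absent (rows) versus fragment B present/absent (columns).
print(fisher_exact_two_sided(3, 0, 0, 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Pairs with an adjusted p-value below the chosen threshold (5% in the text) would be declared in LD; a Bonferroni check would instead compare each raw p-value with the threshold divided by the number of tests.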

GWAS analysis

GWAS analysis was conducted using both the Genomic Association and Prediction Integrated Tool (GAPIT, [76]) and FarmCPU [44] methods in R software. To carry out GWAS analyses using the SSR data obtained in the BPSG, the fragments were reclassified, with (2) indicating the presence of a fragment and (0) indicating the absence of a fragment. The retained PCs obtained in the DAPC analysis were used as covariates in the FarmCPU procedure, while the kinship matrix and retained PCs were used in the GAPIT analysis. To control for type I errors due to multiple testing, an adjusted p-value below 1% following an FDR-controlling procedure [77] and a Bonferroni-corrected threshold of 1% were used to declare significant MTAs by GAPIT and FarmCPU, respectively. To determine which of the tested methods best fit the data, we drew quantile-quantile (QQ) plots, i.e., the negative log10-transformed observed p-values obtained for each MTA plotted against their expected distribution under the null hypothesis of no genetic association. For significant MTAs detected by FarmCPU, the phenotypic variance explained by each SSR fragment was estimated one at a time using a linear model with the lm function in R software.
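The last step, estimating the phenotypic variance explained by one fragment at a time, amounts to the R^2 of a single-predictor linear model. The Python sketch below illustrates this under the assumption that the model contains only the marker (coded 0/2 as described above) and an intercept; whether covariates were included in the original lm call is not stated here. The marker scores and adjusted means are invented.

```python
def variance_explained(marker, phenotype):
    """R^2 of a simple linear regression of phenotype on one marker."""
    n = len(marker)
    if n != len(phenotype) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    mean_x = sum(marker) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in marker)
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(marker, phenotype))
    if sxx == 0 or syy == 0:
        raise ValueError("marker and phenotype must both vary")
    return sxy ** 2 / (sxx * syy)  # squared correlation = R^2 for one predictor


# Invented data: fragment present (2) or absent (0) and adjusted trait means.
fragment = [2, 2, 0, 0]
adjusted_means = [10.0, 12.0, 6.0, 8.0]
print(variance_explained(fragment, adjusted_means))  # 0.8
```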

Sequence annotation

Functional annotation of the loci associated with traits was performed using the available sequences that gave rise to the SSR marker. These sequences were annotated using i) the nonredundant NCBI database with e-values ≤ 1 × 10⁻³ through BLASTX and ii) the Phytozome website [78], which was used to align the data against the Viridiplantae protein databases.

Results

Phenotypic data

The VCOV model for the G_M matrix was selected based on the AIC and BIC. AR1(het) had the lowest AIC and BIC values, which indicated that it was the best model for all evaluated traits (BRIX, SH, SN, SW and TCH) (S2 Table). This result supports heterogeneous genetic variances between harvests and correlations between successive harvests and provides a systematic explanation of the existing temporal dependence. The ranges, adjusted means and estimates of the components of variance, coefficients of variation, and broad-sense heritability on an individual-plant basis for the five traits evaluated for the BPSG over the two harvest years (plant cane and first ratoon) are summarized in Table 1. The TCH trait had the highest variation: the value for accession RB925268 (295.60 t ha^-1) was 7.6 times that for accession POJ2878 (38.90 t ha^-1). The SN trait also showed high variation: the value for accession IN84-58 (290.64 stalks) was 7.03 times that for accession POJ2878 (41.34 stalks). On the other hand, the BRIX trait had a relatively low variation: the value for accession TUC71-7 (22.55°Brix) was 1.48 times that for accession IN84-58 (15.14°Brix).
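To show what the selected AR1(het) structure implies, the Python sketch below builds a heterogeneous first-order autoregressive matrix for the harvests and expands it with the Kronecker product over accessions. It assumes the usual parameterization, in which the covariance between harvests m and m' is σ_m σ_m' ρ^|m − m'|; the standard deviations, the correlation and the panel size of three accessions are made up to keep the output small.

```python
def ar1_het(sds, rho):
    """Heterogeneous AR1 matrix: cov(m, m') = sd_m * sd_m' * rho ** |m - m'|."""
    m = len(sds)
    return [[sds[i] * sds[j] * rho ** abs(i - j) for j in range(m)]
            for i in range(m)]


def kron_identity(g_m, n):
    """Kronecker product of g_m with the n x n identity, as nested lists."""
    m = len(g_m)
    size = m * n
    big = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(m):
            for k in range(n):
                # Block (i, j) equals g_m[i][j] times the identity matrix.
                big[i * n + k][j * n + k] = g_m[i][j]
    return big


g_m = ar1_het([2.0, 3.0], 0.5)   # two harvests with different genetic SDs
print(g_m)                       # [[4.0, 3.0], [3.0, 9.0]]
G = kron_identity(g_m, 3)        # 3 accessions instead of 134, for display
print(len(G), len(G[0]))         # 6 6
```

In this structure each accession has its own genetic variance per harvest and a covariance between its two harvests, while different accessions are treated as independent.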


Table 1. Ranges, adjusted means, estimates of components of genetic variance (σ^2_G) and phenotypic variance (σ^2_P), coefficients of genetic variation (CV_G) and phenotypic variation (CV_P), and broad-sense heritability on an individual-plant basis (H^2) for BRIX, SH, SN, SW and TCH for the BPSG over two harvest years (plant cane (1) and first ratoon (2)).

https://doi.org/10.1371/journal.pone.0219843.t001

Estimates of broad-sense heritability (H^2) ranged from 0.48 (TCH) to 0.67 (SN) and from 0.49 (TCH) to 0.65 (SN) in the first and second harvests, respectively. For genetic (σ^2_G) and phenotypic (σ^2_P) variances, the highest and lowest values were observed for the TCH and SH traits, respectively. The lowest coefficients of genetic (CV_G) and phenotypic (CV_P) variation were for the BRIX trait, while higher values for CV_G and CV_P were observed for SN, SW and TCH.

Pairwise genotypic correlations among the five evaluated traits, considering both harvests (plant cane and first ratoon), are shown in Fig 1. In total, eight significant genotypic correlations (P < 0.05) were observed between the evaluated traits in the BPSG. According to the degree of correlation between traits, correlations were grouped into low (≤0.35), moderate (0.36–0.70) and strong (≥0.71) categories [12]. Thus, of the ten pairwise correlations, four were classified as low (BRIX–SH, BRIX–SW, BRIX–TCH and SH–SN), four were classified as moderate (BRIX–SN, SN–SW, SN–TCH and SH–TCH), and two were classified as strong (SH–SW and SW–TCH). The correlation of BRIX–SN was negative.
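The grouping rule can be written as a small function. The Python sketch below assumes that the categories are applied to the absolute value of the correlation rounded to two decimals, which is consistent with the negative BRIX–SN correlation being classified as moderate; the function name and the example values are hypothetical.

```python
def classify_correlation(r):
    """Label a correlation as low, moderate or strong by its magnitude."""
    magnitude = round(abs(r), 2)  # assumed: categories apply to |r| at 2 decimals
    if magnitude <= 0.35:
        return "low"
    if magnitude <= 0.70:
        return "moderate"
    return "strong"


for value in (0.20, -0.50, 0.85):  # invented correlations
    print(value, classify_correlation(value))
```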


Fig 1. Genotypic correlation between yield traits evaluated in the BPSG.

For each trait, the histograms of the adjusted means (diagonal), scatterplots (below diagonal), and values of the genotypic correlation (above diagonal) between pairs of traits are shown. *Significant at the 5% global level (P < 0.05).

https://doi.org/10.1371/journal.pone.0219843.g001

Polymorphisms of SSR markers

The 100 SSR markers generated 1483 fragments, 1476 of which were polymorphic (99.5%), in the 134 accessions of the BPSG. Considering all polymorphic fragments, 484 (32.8%) were produced by dinucleotide SSRs, 689 (46.7%) were produced by trinucleotide SSRs, and 303 (20.5%) were produced by tetranucleotide SSRs. The number of fragments ranged from four (ESTC52 and ESTC55) to 36 (ESTA31), with an average of 14.83 fragments per SSR. Species-specific fragments were observed for the ancestral accessions Badila (S. officinarum) at ESTB45 and SMC319; Ganda Cheni (S. barberi) at ESTB45, ESTB118, ESTA51, and ESTC17; and especially IN84-58 (S. spontaneum) at CIR23, ESTA26, ESTA61, CIR55, ESTB69, ESTA33, ESTB94, ESTA63, CIR18, ESTB63, CIR36, ESTB45, ESTA16, ESTC55, ESTA48, SMC222 and CIR25.
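The percentages and the average reported above follow directly from the fragment counts. The short Python check below recomputes them from the numbers in the text; the function name and the dictionary keys are only illustrative.

```python
def fragment_summary(total, polymorphic, by_motif, n_markers):
    """Recompute the polymorphism percentages and the mean fragments per SSR."""
    if sum(by_motif.values()) != polymorphic:
        raise ValueError("motif counts do not add up to the polymorphic total")
    return {
        "polymorphic_pct": round(100 * polymorphic / total, 1),
        "motif_pct": {motif: round(100 * count / polymorphic, 1)
                      for motif, count in by_motif.items()},
        "mean_per_marker": round(total / n_markers, 2),
    }


summary = fragment_summary(
    total=1483,
    polymorphic=1476,
    by_motif={"di": 484, "tri": 689, "tetra": 303},
    n_markers=100,
)
print(summary)
```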

Population structure and genetic diversity

Four subpopulations were detected according to the lowest BIC value derived by the find.clusters function (S1 Fig). DAPC analysis was performed using the detected number of subpopulations (Fig 2). The first seven PCs (25.5% of variance conserved) from principal component analysis (PCA) (S2 and S3 Figs) and three discriminant eigenvalues were retained. Every accession was assigned to its subpopulation with a membership coefficient equal to 1, suggesting that there were no admixtures and that the BPSG was structured (S4 Fig). A total of 42 fragments with the largest contribution to subpopulation identification were detected, with 24 fragments assigned to linear discriminant 1 and 18 fragments assigned to linear discriminant 2 (S3 Table and S5 Fig).


Fig 2. DAPC for the BPSG.

The axes represent the first two linear discriminants. The dots represent accessions grouped in subpopulations, each with a different color. The cumulative variance values, in percentages, of the PCs are shown in the lower left corner of the figure; the eigenvalues of the first seven PCs retained by PCA are in black.

https://doi.org/10.1371/journal.pone.0219843.g002

The phylogram using the SM genetic distance among accessions also suggested the presence of four subpopulations. A total of 99.25% of the group assignments made by the DAPC analysis were also made by the phylogram (Fig 3). Only accession SP70-1284 was assigned to different groups by the NJ phylogram and DAPC methods. The genetic dissimilarity ranged from 0.06 (between accessions IAC68-12 and IAC64-257, in subpopulation 3) to 0.45 (between accessions SP70-1005 and RB855589, in subpopulations 2 and 1, respectively), with an average value of 0.31 (S6 Fig). Overall, the clusters inside subpopulations were in accordance with the pedigree information. This result was verified by full-sib accessions within the subpopulations, as was the case for the accessions RB845197, RB845210 and RB845257 in subpopulation 3, which originated from the crossing between cultivars RB72454 and SP70-1143, and for the cultivars SP80-1816, SP80-1842 and SP80-3280 in subpopulation 2, which originated from the crossing between the cultivars SP71-1088 and H57-5028. In addition, the ancestral accessions Maneria (Saccharum sinense) and Ganda Cheni (S. barberi) were placed in subpopulation 2, the ancestral accessions Badila (S. officinarum) and IN84-58 (S. spontaneum) were positioned in subpopulation 1, and the ancestral accession White Transparent (S. officinarum) was positioned in subpopulation 4.


Fig 3. Neighbor-joining (NJ) tree for the BPSG using the SM method.

Accessions indicated with the same color belong to the same subpopulation according to DAPC.

https://doi.org/10.1371/journal.pone.0219843.g003