Quantile regression for genome-wide association study of flowering time-related traits in common bean

Flowering is an important agronomic trait. Quantile regression (QR) can be used to fit models for all portions of a probability distribution. In Genome-wide association studies (GWAS), QR can estimate SNP (Single Nucleotide Polymorphism) effects on each quantile of interest. The objectives of this study were to estimate genetic parameters and to use QR to identify genomic regions for phenological traits (Days to first flower—DFF; Days for flowering—DTF; Days to end of flowering—DEF) in common bean. A total of 80 genotypes of common beans, with 3 replicates were raised at 4 locations and seasons. Plants were genotyped for 384 SNPs. Traditional single-SNP and 9 QR models, ranging from equally spaced quantiles (τ) 0.1 to 0.9, were used to associate SNPs to phenotype. Heritabilities were moderate high, ranging from 0.32 to 0.58. Genetic and phenotypic correlations were all high, averaging 0.66 and 0.98, respectively. Traditional single-SNP GWAS model was not able to find any SNP-trait association. On the other hand, when using QR methodology considering one extreme quantile (τ = 0.1) we found, respectively 1 and 7, significant SNPs associated for DFF and DTF. Significant SNPs were found on Pv01, Pv02, Pv03, Pv07, Pv10 and Pv11 chromosomes. We investigated potential candidate genes in the region around these significant SNPs. Three genes involved in the flowering pathways were identified, including Phvul.001G214500, Phvul.007G229300 and Phvul.010G142900.1 on Pv01, Pv07 and Pv10, respectively. These results indicate that GWAS-based QR was able to enhance the understanding on genetic architecture of phenological traits (DFF and DTF) in common bean.


Introduction
The common bean (Phaseolus vulgaris L.) is the world's most important grain legume for human consumption [1]. Common bean is cultivated especially in African and Latin American countries, and is a key commodity for food security improvement because of the low-input agricultural systems required for production [2]. Phenological traits associated with the flowering time, such as days to flowering (DTF), days to first flower (DFF), and days to end of flowering (DEF), are used in selection programs of the common bean. Cultivars having early lifecycles can, reduce water consumption of irrigated crops, can quickly free the area for crop succession, can be beneficial in periods of less rain, and has lower time to be exposed to plagues and diseases [3].
Several genetic mapping studies have been performed for the common bean, improving the understanding the genetic architecture of relevant traits [4,5,6,7]. However, due to the limited number of markers and small population sizes, these studies resulted in low-resolution QTL identification [2]. Thus, the inferences on positional candidate genes associated with the identified QTL have been widely poor.
In human genetics, Single Nucleotide Polymorphism (SNP) effects were estimated to be related to particular quantiles (extreme tails) on the distribution of obesity-related traits [8].
The authors demonstrated, based on gene function, that this strategy resulted in the detection of true QTL in genome-wide association studies (GWAS). The sampling from the extremes has been successfully used [9,10] and can often to improve the resolution QTL identification [11]. The statistical power of association analyses can be improved by preferential selection of individuals who are most likely to be informative [9]. The main idea is to sample individuals with extreme phenotypes in the hopes that rare causal variants will be enriched among them. However, the same authors list some limitations of extreme phenotype sampling. Among them, the results might not be generalizable to the underlying population and might be sensitive to outliers, sampling bias, and the assumption of normality for the underlying traits [9].
An interesting approach to address this issue is the use of quantile regression (QR; [12]). Unlike traditional regression methods, which are based on conditional expectations, in other words E(X|Y), QR is based on conditional quantiles, Qτ(Y|X). Thus, QR approach can be used in all situations that traditional regression methods are applied, but gives a more complete picture of phenomenon in study. By using QR, it is possible to fit models to all portions of a probability distribution of the trait. In other words, QR enables to measure the impact of a SNP on specific quantiles of the trait. This feature takes into account the extreme tails of phenotypic distribution without sampling these parts specifically, that is, to sample individuals with extreme phenotypes. Therefore, QR uses the same principle used in sampling from the extremes of the trait distribution, except that with QR, estimation is performed directly on the extremes and a specific sampling is not needed, avoiding sampling bias [13].
QR have been successfully applied to GWAS in human genetics [14] and animal breeding [15]. In both, the principle of estimating markers effects under different quantiles provided more informative results than those based on traditional regression methods. However, to the best of our knowledge, there are no reports in literature approaching QR for plant breeding under a GWAS framework.
In this context, we aimed to (1) estimate genetic parameters for phenological traits in the common bean; (2) to enhance the understanding of the genetic architecture of these traits through quantile regression-based GWAS; and (3) to annotate candidate genes through the identification of the significant SNP positions at different quantiles.

Flowering traits
Eighty common bean cultivars widely studied between 1970 and 2013 by several research institutions in Brazil (Embrapa, IAC, UFV, IAPAR, Epamig, UFLA, Fepagro, Epagri, FT seeds) were evaluated. These cultivars were selected based on scientific records, as well as on reports of breeders participating in different breeding programs [16]. The list of cultivars and research institutions is presented in S1 Table. These cultivars were field planted at two experimental stations at the Federal University of Viçosa (Universidade Federal de Viçosa; UFV, Brazil), one located in Viçosa, MG, and the other in Coimbra, MG. Cultivars were planted at each location in the dry (February) and winter (July) seasons of 2013, following a randomized complete block design with three replicates.

Genotypic data
DNA samples were genotyped using the Vera Code1 BeadXpress platform (Illumina) at the Biotechnology Laboratory of Embrapa (Goiania, GO, Brazil). A set of 384 SNP markers, validated by a previously identified Prelim file (https://icom.illumina.com/Custom/UploadOpaPrelim/) for Phaseolus vulgaris, was selected to compose the Oligo Pool Assay (OPA) SNP marker panel. During the procedure for SNP detection, three oligonucleotides were used for each of the variants of the same SNP and the third specific-locus binding to the 3' region of the DNA fragment containing the target SNP, generating a unique allele-specific fragment. Genotype call was performed using Genome Studio software version 1.8.4 (Illumina, City, State, USA), with Call Rate values ranging from 0.80 to 0.90 and GenTrain ! 0.26 for SNP clustering. Analyses were performed to cluster the SNP alleles of each line, based on signal intensities of Cy3 and Cy5 fluorophores. The phenotypic and genotypic data sets are freely accessible at https://zenodo.org/ record/1120366#.WjqyZkxFzmQ.

Data analysis
The phenotypic data for each phenologycal traits (DFF, DTF and DEF), were analyzed according to the following statistical model: where y ijk is the observed phenotype; μ is the general mean; g i the random effect of genotype (cultivar) i, with i = 1 to 80; a j is the fixed effect of environment j, with j = 1 to 4; ga ij the random effect of the interaction of genotype i with environment j; b k(j) is the random effect of block (replicate) k (k = 1, 2, 3) within environment j; and ε ijk is the experimental error, Normally and Independently Distributed (NID).
Models with and without the interaction between genotypes and environment were compared using the Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to compare and select the best model. After selection of the model for each trait, genetic parameters (heritability and correlations) were estimated for lowering traits. In addition, adjusted phenotypes (Y Ã ) were calculated as the sum of random effects in the model and used for association analyses. Analyses were carried out in ASReml 3.0 [17].

Genome-wide association study
Prior to performing genome-wide association (GWA) analyses, the genetic structure of the populations used in this study was assessed by principal component analysis (PCA) using the princomp function in R [18]. PCA was applied to genotype data to obtain the top principal components (PC) that captured most of the genetic variation based on visual inspection of the scree plot. After that, the selected PC were used as covariates in the GWA models to detect SNPs associated with the phenologycal traits (DFF, DTF and and DEF). The general GWA model was given by: where Y ijk Ã is the adjusted phenotype; μ is the population mean; SNP i is the fixed effect of the i th SNP, fitted as a covariate (allele-substitution effect); PC kj is the fixed effect (covariate) of k th principal component for individual k; and ε ijk is the random error term associated with Y ijk Ã . The vector θ = [μ, SNP i , PC jk ] T represents the unknown parameters.
The SNP effects were estimated at different portions of a probability distribution of the traits by quantile regression [12]. This method consists in estimating the parameters at quantile τ(θ τ ) by solving the following optimization problem: where τ indicates the quantile of interest. In the current study, we evaluated 9 quantiles (τ): 0.1 to 0.9, every 0.1. The parameter ρ τ (Á) is denoted as a check function [12], and is defined by: The markers effects were also obtained using the traditional GWA approach, by fitting one SNP at a time (allele-substitution effect). Thresholds (-log 10 P) of 3.88 and 4.58 were calculated based on the Bonferroni method for multiple testing to identify SNPs significant at 5% and 1%, respectively.

Functional annotation
We exploited the potential biological mechanism of flowering traits by identifying genes close to the significant SNPs identified in this study using the Jbrowse on Phytozome v11.0 [19] for the common bean genome v1.0 [20]. Lists of genes located nearest to the significant SNP were extracted, allowing for a maximum distance of 3.5 Mb between SNP and annotated genes. For genes without clear functional annotation related to flowering traits, the genomic sequence data from Phytozome v11.0 was used in a search against NCBI database using BLASTn [21] with the objective of to finding functional annotations considering similar nucleotide sequences available on GenBank.

Results
A summary including the means, standard deviations (SD) and ranges of the phenological traits are presented in Table 1.

Model selection and genetic parameters
The goodness-of-fit for models with and without genotype-by-environmental interaction was assessed by AIC, BIC, and likelihood ratio test (LRT; Table 2). Overall, the full model (with the interaction effect) showed lower AIC and BIC values compared to the reduced model (without the interaction). For DFF, AIC and BIC were 2206 and 2225, respectively, for the full model and 2233 and 2248, respectively, for the reduced model. For DTF, these were 2318 (AIC) and 2338 (BIC), and 2333 (AIC) and 2348 (BIC), for the full and reduced models, respectively. For DEF, we obtained 2351 (AIC) and 2370 (BIC), and 2438 (AIC) and 2453 (BIC), for the full and reduced models, respectively. Finally, the LR test between full and reduced models were significantly different (P<0.01) for all 3 traits (Table 2).

Population structure
The scree plot showing the explained variability for each principal component (PC), and the genotype dispersion based on the first two PCs are presented in S1 Fig. Analysis of population structure with PCA revealed that seven PCs accounted for~82% of the genotypic variability (S1A Fig). These seven PCs were used as covariates in the GWA analyses to account for the population structure of the data. Although there was a tendency for some genotypes to cluster together in the upper left quadrant of the PCA plot using the first 2 PC (S1B Fig), there was no clear strong population structure observed.

Genome-wide association
Association between SNPs and (adjusted) phenotypes were performed using traditional single-SNP and quantile regression (QR) methodologies. The traditional single-SNP approach did not result in any significant (P>0.1x10 -4 ) associations between SNPs and phenological traits (DFF, DTF and DEF). The Manhattan plots for DFF, DTF and DEF are showed in S2-S4 Figs, respectively. Using the QR approach, which allows for the estimation of SNP effects at different portions of the probability distribution of the trait, associations were only found (P<0.1x10 -4 ) using the lowest quantile (τ = 0.1) evaluated in this study (Table 4). We identified 1, 7, and no significant association (P<0.1x10 -4 ) for DFF (Fig 1), DTF (Fig 2), and DEF (Fig 3), respectively. Manhattan plots using all quantiles (0.1 to 0.9) are presented in S5-S7 Figs, for DFF, DTF, and DEF, respectively.
After that, we investigated potential candidate genes in the region around these significant SNPs (Table 4). Three genes involved in the flowering pathways were identified, including Phvul.001G214500, Phvul.007G229300 and Phvul.010G142900.1 on Pv01, Pv07 and Pv10, respectively.

Genetic parameters
Heritability estimates for DFF (0.58), DTF (0.49) and DEF (0.32) were consistent with those reported in the literature. Specifically, for DFF and DTF, estimates were close to those reported by [22], with 0.54 and 0.35 for DTF and DEF, respectively, and by [23], with 0.59 for DFF. As expected, the estimates of genetic and phenotypic correlations between the flowering time traits were positive and high (ranging from 0.66 to 0.98), which are similar those reported by [23] and [24], where values ranged from 0.82 to 0.90.

Genome-wide association study
In this study, we aimed to enhance the understanding of the genetic architecture of phenological traits (DFF, DTF and DEF) in the common bean under a QR approach. The proposed method allows investigating the relationship between explanatory variables (i.e. SNPs in our study) and phenotype at different portions of a probability distribution of the traits, which resulted in more informative results by taking into account the extreme tails of trait distributions. QR uses the same principle used in sampling from the extremes of the trait distribution [8,9,10], except that estimation is performed directly on the extremes of the response variable, avoiding possible sampling biases [13].
Although the main objective of this manuscript was just to introduce the QR for plant breeders under a GWAS approach, we believe that this concept can be better exploited under a small sample size framework. The fact of QR take into account any quantile of phenotypic distribution can be considered as a robust tool to make inference in the presence of small population size. Although selective samples (sampling from the extremes) have been effective for QTL detection [8], under a small number of individuals, this method can implies in a drastic reduction of sample information because only the individuals belonging to the extremes will be considered for the analysis. Since QR uses all available information to infer on specific quantiles, this tool can be characterized as a robust method for small sample sizes. Tarr [25] and Ismail [26] studied the behavior of QR with small sample sizes and concluded about the robustness of this method under the mentioned situation. In terms of the small number of markers used here, it is interesting to highlight that the probability to identify significant markers is directly dependent of the genome coverage. In this context, the QR is a practical and efficient way to increase this probability since the same marker can be tested under different quantiles.  The number of reports including GWAS for economically relevant traits in the common bean is still very limited. Kamfwa et al. [2], using 237 genotypes of common beans genotyped for 5398 SNPs, found two significant SNPs for phenological traits DTF and days to maturity. These authors suggested the positional gene Phvul.001G221100, located on Pv01, as a candidate gene for controlling photoperiod sensitivity and flowering in common bean. Moghaddam et al. [22] observed significant SNP association on Pv01 for DTF, and on Pv04 and Pv11 for DTM, using over 150K SNPs in the analysis of 280 common bean genotypes. In terms of candidate genes, these authors listed 6 genes, for example, Phvul.001G064600, Phvul.001G192200, Phvul.001G192300, and Phvul.001G087900 with potential role on days to flowering, because of their function on regulates Flowering Time [27]; change the expression periodicity of genes leading to earlier flowering [28]; suppressing the gibberellic acid pathway and directly interacts with GIGANTEA (GI), a gene within the photoperiod pathway [29]; regulates the photoperiod and vernalization pathway by directly interacting with and suppressing FLC [30], respectively.
In the present study, no associations were found when analyses were performed using traditional GWA (single-SNP GWAS) methodology. The lack of associations using this methodology was expected because of the limited power of this study, which is limited due to the small number of markers and population sizes available [2]. However, using the same principle used in sampling from the extremes by QR approach, that is, performing the estimation directly on the extremes, we found a total of 7 associations between SNPs and phenological traits using QR methodology: 6 for DTF and 1 for DFF. These associations were found considering an extreme quantile, more specifically, τ = 0.1. QTL reported here were previously identified on Pv01 [6,31,32,33], Pv02 [33,34], Pv07 [6,31,34] and Pv11 [31,34]. Since previous studies have reported QTLs on the same regions, potential positional candidate genes for flowering traits around significant SNPs were investigated in our study.
Among the significant SNPs, 2 located on Pv01 and Pv07 are closest (<3.5Mb) to 2 genes involved in the flowering pathways. The first one, Phvul.001G214500, located on Pv01 (~50Mb). The functional annotation on Phytozome indicated that Phvul.001G214500 is a flowering time control protein FCA-related. This protein plays a role in the regulation of flowering time in the autonomous flowering pathway by decreasing expression levels of flowering locus C [35]. According to Phytozome, Phvul.007G229300, which was located on Pv07 (~47Mb), is classified as protein terminal flower 1. A BLASTn search on NCBI data resulted in a best score to a gene TFL1z. Flowering time (FT) acts in part downstream of transcription factor gene, Constans (CO), and mediates signals for flowering in an antagonistic manner with the TFL1 gene. Phvul.010G142900.1 that is mapping near the Pv10 (~41Mb) is a protein early flowering 3 [36]. This protein participates in the initiation of flowering [37]. The representative map of the genome positions of markers associated with DFF and DTF and potential positional candidate genes for flowering in the region around significant SNPs are showed in Fig 4. For the other four associated markers, no genes related to flowering time-related traits were identified based on Phytozome.
In general, these results indicate that using quantile regression to perform GWAS for flowering traits in the common bean resulted in the identification of QTLs. In contrast, this was not the case for when we used traditional single-SNP analyses. Indeed, these results are reasonable, once QR allows the estimation of effects for all portions of a probability distribution of the trait, using of the same principle used in sampling from the extremes of the trait distribution, increase the statistical power of the analysis [8,9,10]. This improved power compared to traditional GWAS should be even greater with the increase of sample size and SNP density. Despite QR approach seems interesting in situations as the limited number of markers and small population sizes, more studies using simulated and real data sets with different genetic architectures and data set sizes (individuals and markers) are needed to confirm the efficiency of QR compared to traditional genome-wide association approaches.

Conclusion
Quantile regression-based GWAS was able to detect true QTL for phenological traits (DFF and DTF) in the common bean that were missed when using traditinal single-SNP. The associations were identified using one extreme quantile (τ = 0.1), using the same principle used in sampling from the extremes. Finally, genomic regions identified in this study were enriched with candidate genes involved in the flowering pathways.