A Ranking Approach to Genomic Selection

Background Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual’s breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. Contributions In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. Results We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.


Introduction
Traditional selective breeding, based on phenotypic or pedigree information, has led to much genetic improvement. Genomic selection (GS) [1] is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Compared to traditional marker-assisted selection methods, the key benefit of GS is that it uses markers covering the whole genome, thus making it possible to predict polygenic traits. With the constantly decreasing cost of marker technology, genotyping is currently less costly than phenotyping in applied plant breeding programs [2]. GS has already been adopted by dairy industries worldwide and is expected to double genetic gains for milk production and other traits [3]. GS can also accelerate selection cycles, since markers can be genotyped at birth or even before [4]. The effectiveness of GS has been confirmed in numerous studies, both for plant and animal breeding [2,[4][5][6][7][8][9][10].
Until now, GS has traditionally been formulated as the problem of predicting an individual's breeding value for a given trait of interest; for instance, grain yield or or milk production. Therefore, GS was fundamentally formulated as a regression problem. To estimate a regression model, many different parametric and non-parametric methods were proposed in the literature including BayesA, BayesB, best linear unbiased prediction (BLUP) in the original work of [1], Bayesian lasso [7,11,12], a fast EM algorithm for the BayesB model called wBSR [13] and reproducing kernel Hilbert space (RKHS) regression [10,14]. Recently, [15] compared popular methods from the GS literature with other machine learning methods including support vector regression, random forests and neural networks. Their results suggested that GS could be based on a reduced set of models such as Bayesian lasso, wBSR and random forests.
In this paper, we propose to formulate GS as a ranking problem. This is motivated by the fact that in order to select the most favorable individuals, we do not necessarily need to accurately predict breeding values. Instead, it is often sufficient to correctly rank individuals from most favorable to least favorable. As an example, consider the problem of selecting wheat lines according to their grain yield. In existing studies, this would be formulated as the problem of predicting grain yield from genotypes. In our approach, this is formulated instead as the problem of correctly ranking wheat lines in order of decreasing grain yield. Our proposed framework allows us to employ machine learning methods for ranking which had previously never been considered in the GS literature. Until now, the predictive accuracy of a model was typically assessed using the Pearson correlation between observed trait values and the predicted trait values (a.k.a. genomic estimated breeding values, GEBV). However, our experiments show that Pearson correlation may correlate poorly with ranking accuracy. In this paper, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG) [16]. NDCG rewards more strongly models which assign high rank to individuals with high breeding value. In addition, NDCG focuses on the top individuals in the ranking, while Pearson correlation treats all individuals uniformly. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of the top individuals with highest breeding value.

General approach
In this section, we first review the two-phase approach usually taken by traditional regressionbased GS methods.
In the model estimation phase (also known as training phase), a reference population, which has been genotyped and whose trait values are known, is used to estimate a statistical model of the relationship between genotypes and the trait. For a reference population of size n, we denote the genotypes by x 1 , . . ., x n , where x i 2 X R p , and the associated trait values by y 1 , . . ., y n , where y i 2 Y R. The number p indicates the number of molecular markers (e.g., DArT, SNP, . . .) used for genotyping. For simplicity, in the remainder of this paper, we use x i to refer to both individual i and its vector representation after genotyping. Likewise, phenotypic values or estimated breeding values will simply be referred to as "trait values". Trait values of the reference population are typically obtained by field testings, which are expensive and timeconsuming. Therefore, n is usually small. On the other hand, the number of markers p is usually large for genome-wide genotyping. Therefore, n is usually much smaller than p. This is known as the n ( p problem. Existing GS approaches estimate a regression model h: R p ! R such that h(x i ) % y i for all i 2 {1, . . ., n}. We briefly review common approaches for estimating h further below. In the following, we denote by X the n × p matrix which gathers individuals x 1 , . . ., x n , by y the n-dimensional vector which gathers their true trait values y 1 , . . ., y n and byŷ the n-dimensional vector which gathers the predicted valuesŷ 1 ; . . . ;ŷ n , whereŷ i ¼ hðx i Þ.
In the candidate selection phase, predicted trait values are computed using the fitted model for candidate individuals to be selected. We denote the genotypes of m candidates by x 1 ; . . . ; x m . Contrary to the reference population, the true trait values of the candidates are not known. Throughout this paper, we assume that the individuals from the reference population x 1 , . . ., x n and candidate individuals x 1 ; . . . ; x m are sampled i.i.d. (independent and identically distributed) from the same (unknown) distribution.

Traditional evaluation measures
Model evaluation is the task of evaluating how good a model is and is crucial to choose the best model among several possible choices. In the GS literature, model evaluation has traditionally been carried out using mainly two measures: mean squared error (MSE) and Pearson correlation. MSE is defined by The model is better when MSEðy;ŷÞ is lower and perfect fit is achieved when MSEðy;ŷÞ ¼ 0. Obviously, MSEðy;ŷÞ ¼ 0 ifŷ i ¼ y i for all i. In other words, a model achieves zero error if it predicts perfectly all trait values. Note that this can only happen when heritability is one and all genetic variance is explained by the markers. A more commonly used measure in the GS literature is the Pearson correlation between observed and predicted trait values. It is defined by where m y ¼ 1 n P n i¼1 y i and mŷ ¼ 1 n P n i¼1ŷ i . The model is better when rðy;ŷÞ is higher and perfect correlation is achieved when rðy;ŷÞ ¼ 1. Contrary to MSE, correlation does not require to accurately predict trait values. Indeed, it can be seen that rðy;ŷÞ ¼ 1 if there exists a > 0 and b such thatŷ i ¼ ay i þ b for all i. In other words, the set of points fðy i ;ŷ i Þg n i¼1 must be collinear in order to achieve perfect correlation. Again, perfect correlation can only happen when heritability is one and all genetic variance is explained by the markers. When heritability is less than one, correlation has an upper limit which is equal to the square root of heritability. If the proportion of genetic variance that markers can explain is less than one, the upper limit decreases from the square root of heritability.

Overview of popular regression models
In this section, we briefly describe various regression models, focusing on the most popular ones in the GS literature.
Ridge and RKHS regression. One of the first methods proposed for genomic selection was ridge regression, which is equivalent to best linear unbiased prediction (BLUP) in the context of mixed models [1]. Let I p be the identity matrix of size p × p. The basic model is y = X β + , where β $ Normalð0; s 2 u I p Þ and $ Normalð0; s 2 e Þ. The solution for the marker effects can be obtained by β = (X T X+λI p ) −1 X T y, where l ¼ s 2 e =s 2 u is the ratio between the residual and marker variances. Predictions can be computed by h(x) = β T x. The representer theorem [17,18] guarantees that β can be written as a linear combination of the data, i.e., β = X T α for some α 2 R n . This has two important implications. The first is that we can equivalently compute β by β = X T α = X T (XX T +λI n ) −1 y. The main difference is that we now need to invert a n × n matrix instead of a p × p one. This is advantageous in GS because we usually have n ( p. The second implication is that ridge regression can now be "kernelized" by using the identitŷ In practice, K can be replaced by any positive semidefinite kernel matrix with elements K ij = κ(x i , x j ), where κ is the corresponding kernel function. Predictions can then be computed by hðxÞ ¼ P n i¼1 a i κðx; x i Þ. The result is known as kernel ridge regression in the machine learning literature and as RKHS regression in the GS literature. The elements K ij correspond to inner products in a high-dimensional (possibly infinite) space called reproducing kernel Hilbert space (RKHS). This allows to model nonlinear relationships between X and y. RKHS regression is equivalent to ridge regression when using a linear kernel.
Bayesian lasso (BL). Bayesian lasso [11] is the Bayesian counterpart of the lasso [19]. Following the parameterization of [20], the effect of marker j, β j , was assumed to follow a hierarchical prior distribution, where Normal and InvGamma indicate the normal and inverse gamma distributions, respectively, t 2 j determines the shrinkage magnitude for β j , 1=t 2 0 is the residual variance and l 2 B is a hyper-parameter that defines the distribution of t 2 j . We modified the method of [20] such that marker effects were conditional on the residual variance (precision), as in [11]. The prior distribution of l 2 B is the gamma distribution Gamma(ϕ, ω), where ϕ and ω are the shape and rate parameters, respectively.
Extended Bayesian lasso (EBL). In the EBL [21], l 2 B is replaced with d 2 Z 2 j , where δ 2 is a global shrinkage factor and Z 2 j is a shrinkage factor for marker j. The priors used are Gamma(ϕ, ω) for δ 2 and Gamma(ψ, θ) for Z 2 j . Weighted Bayesian shrinkage regression (wBSR). wBSR [13] uses the indicator variable γ j to determine whether the marker effect β j is included in the regression model (γ j = 1) or not (γ j = 0). A prior Bernoulli distribution, Bernoulli(γ j jπ), is assumed for γ j . The marker effect β j is assumed to follow a hierarchical prior distribution, Normalðb j j0; s 2 j ÞInvChi2ðs 2 j jn; S 2 Þ; where InvChi2 indicates a scaled inverse chi-squared distribution, ν is the degree of freedom and S 2 is the scaling parameter.
BayesC. In BayesC [22], β j is assumed to follow a spike and slab prior distribution, where ρ j is an indicator variable. The prior distributions used for ρ j and σ 2 are Bernoulli(ρ j jπ) and InvChi2(σ 2 jν, S 2 ), respectively. Stochastic search variable selection (SSVS). SSVS [23] assumes the following prior distribution for β j , where c < 1 determines the relative magnitude of the variances of the two normal distributions.
Bayesian mixture regression model (MIX). For the prior distribution of β j , MIX [24] assumes a mixture of two normal distributions with variances independent of one another: In [24], the prior distributions of s 2 1 and s 2 0 were InvChi2ðs 2 1 j n; S 2 Þ and InvChi2ðs 2 0 j n; S 2 Þ. We modified the prior of s 2 0 to InvChi2ðs 2 0 j n; cS 2 Þ so as to encourage clustering of markers according to the magnitude of their effects.
Random forests (RF). RF [25] are an ensemble algorithm based on randomized regression trees. In RF, each tree is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. The final prediction is computed by averaging the predictions of all trees in the forest. This procedure is known to improve the bias-variance trade-off of regression trees and leads to highly accurate predictions. To further reduce overfitting, two heuristics are typically applied in RF. The first heuristic consists, when splitting a node during the construction of a tree, in selecting the best split from a random subset of the features ("max_features"). This both improves accuracy and reduces training time. The second heuristic consists in limiting the maximum depth of the regression trees ("max_depth"). This ensures that trees are not too complicated. Although careful tuning of these two parameters can improve accuracy, we find that RF are pretty robust to their choice.
Gradient boosted regression trees (GBRT). In gradient boosting [26], an ensemble of regression models is built in a stage-wise fashion so as to minimize a differentiable loss function. GBRT refers to gradient boosting when the models are regression trees. GBRT starts with a base model h 0 . For regression with squared loss, a common choice is the base model which always outputs the training set's target mean irrespective of x: h 0 ðxÞ ¼ 1 n P n i¼1 y i . Subsequently, GBRT incrementally adds new trees h 1 , . . ., h M to obtain an ensemble hðxÞ ¼ P M s¼0 a s h s ðxÞ. For the squared loss, the tree h s at stage s is fitted against the residuals of the ensemble so far e 1 , . . ., e n , where e i ¼ y i À P sÀ1 r¼1 a r h r ðxÞ. For other differentiable loss functions, residuals are replaced with the negative gradient, which [26] calls "pseudo-responses". At each stage s, GBRT finds α s by line search and multiply the result by a small learning rate, typically between 10 −3 and 1, to avoid overfitting. To further reduce overfitting, the same heuristics as RF can be applied ("max_features" and "max_depth"). Although past GS works did not consider GBRT, we include it in our comparison since it achieved among the leading results in the Yahoo! learning to rank challenge [27].

Ranking-based genomic selection General approach
Similarly to existing regression-based approaches, our approach is broken down into a model estimation phase and a candidate selection phase. The main difference is that, in our approach, we do not impose that the model h satisfy h(x i ) % y i . Instead, in our approach, h is a scoring function: it assigns a score to each candidate. The scores are then used to determine a ranking of the candidates. In this ranking framework, GS can be summarized by the following two phases: 1. Model estimation. Using the reference population, estimate a scoring function h which can be used for ranking. Intuitively, a perfect scoring function would satisfy h( 2. Candidate selection. Using h, rank m candidates x m by decreasing scores. We denote the ranked candidates by x ðkÞ for further field testing. Typically, the number of selected candidates k is chosen much smaller than the total number of candidates m, i.e., k ( m. Our ranking-based formulation naturally captures a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. In this section, since we do not impose that h(x i ) % y i ,ŷ does not necessarily represent predicted trait values. Instead,ŷ is the n-dimensional vector which gathers predicted scoresŷ 1 ; . . . ;ŷ n , whereŷ i ¼ hðx i Þ, from which individuals can be sorted in decreasing order.

Evaluation measures for global ranking
As we explained previously, the Pearson correlation does not require to accurately predict trait values. It only requires the set of points fðy i ;ŷ i Þg n i¼1 to be collinear. In that sense, the Pearson correlation can be seen as a ranking measure. However, the collinearity requirement may sometimes be too strict. To illustrate the problem, consider the case when y = [3.5, 2.8, 1.2] and y ¼ ½10:3; 3:7; 0:1. In this example, the model achieves perfect ranking since y 1 ! y 2 ! y 3 and y 1 !ŷ 2 !ŷ 3 . However, because the true and predicted trait values are not perfectly collinear, the correlation is only equal to rðy;ŷÞ ¼ 0:92. This example shows that it is not necessary to achieve perfect correlation to achieve perfect ranking. In this section, we present two related measures for global ranking evaluation that do not assume collinearity: pairwise accuracy and Kendall's τ.
Given the reference trait values y, we define the preference set as P(y) = {(i, j):y i > y j }. Intuitively, if (i, j) 2 P(y), then x i is preferred to x j (e.g., x i has higher grain yield than x j ). Given the predicted scoresŷ, we define the set of concordant pairs as Cðy;ŷÞ ¼ fði; jÞ 2 PðyÞ :ŷ i >ŷ j g and the set of discordant pairs as Dðy;ŷÞ ¼ fði; jÞ 2 PðyÞ :ŷ i <ŷ j g. Pairs in the set Tðy;ŷÞ ¼ fði; jÞ 2 PðyÞ :ŷ i ¼ŷ j g are neither concordant nor discordant.
Pairwise accuracy (c.f., e.g., [28]) is simply defined as the proportion of concordant pairs: pairwise accuracyðy;ŷÞ ¼ jCðy;ŷÞj jPðyÞj ; where jSj is the cardinality of the set S. Pairwise accuracy is 0 when not a single pair was concordant and is 1 when all pairs were concordant. For binary trait values, pairwise accuracy is exactly equivalent to the area under the ROC curve (AUC) and is closely related to the Mann-Whitney-Wilcoxon statistic [29]. Pairwise ranking algorithms such as RankSVM [30], Rank-Boost [31] and RankNet [32] maximize an upper-bound on pairwise accuracy. Another commonly used measure is Kendall's τ [33]: tðy;ŷÞ ¼ jCðy;ŷÞj À jDðy;ŷÞj jPðyÞj Intuitively, Kendall's τ is the difference between the ratio of concordant pairs and the ratio of discordant pairs. Kendall's τ is always between −1 and 1. Assuming Tðy;ŷÞ is an empty set (which is likely to be the case, sinceŷ is a continuous vector), Kendall's τ and pairwise accuracy are directly related by the following formula: tðy;ŷÞ ¼ 2jCðy;ŷÞj jPðyÞj À 1 ¼ 2 Â pairwise accuracyðy;ŷÞ À 1: Therefore, algorithms which are designed to maximize pairwise accuracy will also maximize Kendall's τ.

Evaluation measures for top-k ranking
One issue with pairwise accuracy and Kendall's τ is that they treat all pairs equally. Intuitively, a good ranking evaluation measure should fulfill two requirements. First, it should reward more strongly assigning a high rank to individuals with high breeding value. Second, it should focus on the top k individuals in the ranking. Oftentimes, it does not matter if a model cannot correctly order individuals with low breeding value. Instead, it is sufficient to rank as many individuals with high breeding value as possible at the top. In this section, we introduce two measures that fulfill the above two requirements: discounted cumulative gain (DCG) [16] and its normalized version (NDCG). In the information retrieval (IR) literature, NDCG has been popularly used to measure the ability of search engines to retrieve highly relevant documents in the top search results. In this paper, we use NDCG to measure the ability of GS models to select the top k individuals with highest breeding value.
To introduce discounted cumulative gain (DCG), we first note that any predicted score vectorŷ ¼ ½ŷ 1 ; . . . ;ŷ n induces a permutation π = [π 1 , . . ., π n ] of [1, . . ., n] such that the scores are sorted in decreasing order:ŷ p 1 !ŷ p 2 ! . . . !ŷ p n . Given the reference trait values y = [y 1 , . . ., y n ] and any such permutation π, the DCG at position k (we assume k n) is defined by Here, g(y) is a monotonically increasing gain function and d(i) is a monotonically decreasing discount function. Common choices for the gain function are g(y) = y (linear gains) and g(y) = 2 y −1 (exponential gains). For the discount function, a common choice is dðiÞ ¼ 1 . Intuitively, a model obtains the highest possible DCG@k when the order of the predicted scoresŷ agrees with the order of the true observed traits y.
DCG can be difficult to interpret because its values are unbounded. In practice, the normalized DCG (NDCG) is often used instead. If we define by π(y) = [π(y) 1 , . . ., π(y) n ] a permutation of [1, . . ., n] for sorting y in decreasing order, i.e., y π(y) 1 ! . . . ! y π(y) n , then NDCG at position k is defined by NDCG@kðy;ŷÞ ¼ DCG@kðy; pðŷÞÞ DCG@kðy; pðyÞÞ : Intuitively, NDCG is simply the ratio between the DCG score of the predicted ranking and the DCG score of the ideal ranking. NDCG is easier to interpret than DCG because its values are always between 0 and 1, assuming y 2 R n þ . In our experiments, we also report results of Mean NDCG@K, which is simply the mean of NDCG scores from k = 1 to k = K: NDCG@kðy;ŷÞ: We discuss the choices of the position k, gain function g(y) and discount function d(i) in the "Discussion" section.
Background on "learning to rank" Estimating a ranking model, commonly known as "learning to rank", has attracted a great deal of research in the machine learning community. Intuitively, when formulating GS as a ranking problem, we should estimate a model so as to directly maximize a ranking accuracy measure of interest, such as NDCG. Unfortunately, this turns out to be a non-convex problem that can be NP-hard [34]. To solve this problem, learning to rank approaches replace the true loss function by an easier to optimize one called surrogate loss function. Learning to rank approaches are typically divided into three categories depending on the type of surrogate loss function used.
Pointwise approaches involve a surrogate loss function on individual samples x i . They typically reduce ranking to either regression, classification or ordinal regression/classification. It was shown that DCG errors are bounded by regression [34] and classification [35] errors.
Pairwise approaches involve a surrogate loss function on pairs of samples (x i ,x j ). Their main idea is that if y i > y j , then the model h does not need to predict y i and y j accurately: it only needs to respect the relative order h(x i ) > h(x j ). RankSVM [30], RankBoost [31] and RankNet [32] are based on pairwise versions of the hinge, exponential and logistic surrogate loss functions, respectively.
Listwise approaches involve a surrogate loss function or algorithm based on a list of samples. Some listwise methods such as LambdaMART [36] can optimize top-k ranking accuracy directly.
For more details on "learning to rank", we refer the reader to [37,38].

Overview of ranking models
Our ranking-based formulation allows us to employ machine learning methods for ranking which had never been considered in the GS literature before. For our experiments, we chose three representative methods of the pointwise, pairwise and listwise categories.
McRank (pointwise). McRank [35] is a method for indirectly optimizing NDCG through multiple classification. This is motivated by the fact that classification can be an easier task than regression. Suppose that the traits can only take on a finite number B of values, i.e., Y = {b 1 , b 2 , . . ., b B }, where b r 2 R. If that is not the case, we can always discretize the trait values (of the training set only), as explained in our experiments. The main idea of McRank is to rank candidates according to their expected trait value, which can be computed by McRank exists in two variants: multiclass and ordinal McRank. The variants differ in how they compute the Pr(y = b r jx) probabilities.
Multiclass McRank models the Pr(y = b r jx) class probabilities using a probabilistic classifier. Any probabilistic classifier (e.g., logistic regression) can in theory be used. Although the original McRank paper used gradient boosting as classifier, in this paper, we used random forests, since they worked better in our experiments. Unfortunately, multiclass McRank completely ignores the natural ordering Ordinal McRank addresses this problem as follows. Notice that Pr(y = b r jx) = Pr(y b r jx) −Pr(y b r−1 jx). In other words, we can model class probabilities Pr(y = b r jx) using cumulative probabilities Pr(y b r jx) and Pr(y b r−1 jx). The advantage is that this takes into account the natural ordering b 1 b 2 . . . b B . Cumulative probabilities Pr(y b r jx) can easily be modeled as follows. First, we partition the training data into two groups {y i b r } (positive class) and {y i ! b r+1 } (negative class). Using this partition, we can then train a two-class probabilistic classifier. The probability of the positive class gives Pr(y b r jx). Again, although any probabilistic classifier can be used, we used random forests since they performed best in our experiments.
RankSVM (pairwise). Recall that we previously defined the preference set as P(y) = {(i, j): y i > y j }. The main idea of RankSVM [30] is to use the support vector machine (SVM) framework in order to find a model h(x) such that h(x i ) > h(x j ) for all (i, j) 2 P(y). In this paper, we consider the kernelized version of RankSVM [39], which minimizes the objective where λ is a regularization parameter, K 2 R n × n is a kernel matrix with elements K ij = κ(x i ,x j ), K i 2 R n is the i th column of K and r 2 {1, 2}. Once α has been obtained, the model hðxÞ ¼ P n i¼1 a i κðx; x i Þ can be used to sort candidates in decreasing order. Using the kernelized version of RankSVM has the same advantages as RKHS regression. First, the model is non-linear if we use a non-linear kernel. This allows to model non-linear relationships. Second, the optimization problem is n-dimensional instead of p-dimensional, which is advantageous in GS. When r = 2, the RankSVM objective is differentiable and can thus be solved by gradient methods such as the conjugate gradient method or limited-memory BFGS (L-BFGS) [40]. The gradient expression necessary to run these methods is given by The global solution, which is unique, is guaranteed to be found, since RankSVM has a strictly convex objective [39]. When r = 1, the conjugate gradient and L-BFGS methods cannot be used, since the RankSVM objective is not differentiable everywhere. Instead, it is possible to solve the objective using the subgradient method.
LambdaMART (listwise). LambdaMART [36,41] is a method which builds upon GBRT (c.f. overview of regression models) to optimize NDCG@k. It achieved the leading results in the Yahoo! learning to rank challenge [27]. As explained previously, GBRT incrementally builds an ensemble of M regression trees hðxÞ ¼ P M s¼0 a s h s ðxÞ by adding on each stage a regression tree h s fitted against "pseudo-responses", the negative gradient of the objective function. To deal with the discontinuity of the NDCG objective, LambdaMART uses an approximation of the negative gradient called λ-gradient. GBRT is also known as MART (multiple additive regression trees), hence the name LambdaMART. Let us define the following expressions on stage s: a r h r ðx i Þ prediction score up to stage s À 1: Then, the λ-gradient on stage s is an n-dimensional vector whose elements are defined by l i ¼ P n j¼1 l ij . The λ ij can be interpreted as follows. If x i has higher breeding value than x j , then x j will get a push downwards of strength jλ ij j. Otherwise, x j will get a push upwards of strength jλ ij j. However, if x i and x j are both not ranked in the top-k elements, then, by the definition of NDCG, there will be no gain from swapping them. Therefore, ΔNDCG ij = λ ij = 0 in this case. For a detailed derivation and discussion of the λ-gradient, see [41]. To reduce overfitting, the same techniques as RF and GBRT can be used.
Complexity comparison. We now briefly compare the computational complexity of training algorithms for ranking and regression. We assume that the number of samples n is much smaller than the number of markers p, as is usually the case in GS. For RKHS regression (a.k.a. kernel ridge regression), we pre-compute the kernel matrix K, which takes O(n 2 p). Once this is done, a closed form solution can be obtained by solving a system of linear equations, which takes O(n 3 ). Alternatively, an approximate solution can be obtained by any gradient solver, such as the conjugate gradient method or limited-memory BFGS (L-BFGS). In this case, the main cost per iteration comes from computing the gradient, which takes O(n 2 ). For kernel RankSVM, we also pre-compute the kernel matrix. Since a closed form solution is not available, we use a gradient method. The main cost per iteration stems from computing the gradient, which takes O(n 2 ), the same as for RKHS regression. For random forests, GBRT, McRank and LambdaMART, the computational cost is proportional to the number of trees in the ensemble. Tree induction takes O(p 0 n log 2 n) in average and O(p 0 n 2 logn) in worst case, where p 0 p is the number of markers considered for node splitting [42]. For LambdaMART, each tree needs to be fitted against the λ-gradient. Computing the λ-gradient takes O(kn), where k is the parameter used for NDCG@k.

Datasets
We evaluated the validity of our ranking approach using the following 6 datasets.
Arabidopsis. This dataset comprises 422 lines of Arabidopsis thaliana developed by INRA [43]. We chose 3 traits, flowering time in short days (FLOSD), shoot dry matter in non-limiting nitrogen conditions (DM10) and shoot dry matter in limiting nitrogen conditions (DM3), which were also selected in [15,44]. We excluded 5 lines which were not evaluated for these traits. Marker genotypes were available for 69 SSRs. We imputed the missing genotypes using the R package qtl [45]. The dataset is available at http://publiclines.versailles.inra.fr/page/33.
Barley. The Barley-CAP project evaluated the grain yield of 432 barley lines in Aberdeen, Idaho (trial name: NSGC_2012_NormN_Irr_Aberdeen). Among them, 381 lines were genotyped using 3945 SNPs. We imputed missing genotypes using BEAGLE [46]. The dataset is available at http://triticeaetoolbox.org.
Rice. This dataset consists of 395 lines genotyped with 1311 SNPs [48]. We imputed missing genotypes using BEAGLE [46]. Among these lines, phenotypic values of a total of 34 traits were available for 383 lines, with some missing records [49]. We chose 335 lines without missing records in 14 traits. The traits were flowering time, flag leaf length, flag leaf width, number of panicles per plant, plant height, panicle length, primary panicle branch number, number of seeds per panicle, number of florets per panicle, seed length, seed width, seed volume, seed surface area and amylose content. The dataset is available at http://www.ricediversity.org/data/ index.cfm.
Wheat (CIMMYT). This dataset comprises 599 wheat lines developed by the CIMMYT Global Wheat Breeding program [10]. Trait values correspond to grain yield evaluated in 4 different environments. Wheat lines were genotyped using 1447 DArT (Diversity Array Technology) markers. Markers may take on the values 1 or 0, indicating their presence or absence. Markers with allele frequency less than 0.05 or greater than 0.95 were removed. The total number of markers retained after this processing was 1279. The dataset is provided as part of the R package BLR.
Wheat (Pérez-Rodríguez). This dataset, used in the study of [50], consists of 306 lines genotyped with 1695 DArT markers. Phenotypic values for two traits, grain yield and days to heading, were available for these lines. The dataset is available at the same URL as for Maize data.
For SSR and SNP markers, genotypes were encoded by 0 (AA), 1 (AB), and 2 (BB). For DArT markers, the presence or absence of an allele was encoded by 0 and 1 for Wheat (CIM-MYT) and by 0 and 2 for Wheat (Pérez-Rodríguez).
For datasets comprised of several traits, we build one model per trait and report the averaged evaluation scores.
All traits described above are inherently non-negative quantities. In the case of the Wheat (CIMMYT) dataset, grain yield values were centered so as to have zero mean prior to public release of the dataset. As a result, the dataset contains negative yield values. In order to avoid negative NDCG scores, we converted the yield values back to positive values. To do so, since the original centering was not known to us, we simply shifted the yield values such that the smallest value in the entire dataset is zero.

Experimental setup
Cross-validation setup. To estimate the generalization performance of different models, i.e., the ranking accuracy on new candidates, we used a randomized cross-validation scheme. For each cross-validation iteration, the dataset was split into 80% for model estimation and 20% for evaluation. To ensure fair comparison, we made sure that all methods use the same splits. Evaluation scores were computed for 10 cross-validation iterations and averaged. We report results for 6 evaluation measures: Pearson correlation, Kendall's τ, NDCG@1, NDCG@5, NDCG@10 and Mean NDCG@10. For models which need hyper-parameter tuning, we further used 5-fold cross-validation within the train split. To obtain the best possible results, we always selected the hyper-parameters which maximize the same measure as used for evaluation. For example, for NDCG@10 results, the hyper-parameters were selected to maximize NDCG@10.
Parameter inference for Bayesian regression methods. Parameters of the Bayesian regression methods were estimated using variational Bayesian approaches. The algorithms for BL and EBL were proposed by [20]. The wBSR algorithm was introduced by [13]. [51] introduced a variational Bayesian algorithm for linear regression with a spike and slab prior. A difference between the algorithm for BayesC in this study and that in [51] is that the authors used importance sampling to infer the posterior distribution of σ 2 , t 2 0 and π, whereas we inferred the factorized posteriors of σ 2 and t 2 0 , and used a fixed value for π as described further below. The algorithms of SSVS and MIX were implemented by the second author and will be published elsewhere. Phenotypic values were standardized prior to training. All Bayesian regression methods were performed with a program written in C.
Model estimatation for other methods. For random forests and GBRT, we used implementations provided in the scikit-learn Python package [52,53]. For ridge and RKHS regression, we used the R package rrBLUP [54]. For McRank and LambdaMART, we used the source code available at https://github.com/mblondel/ivalice. For RankSVM, we solved the objective function with r = 2 by L-BFGS [40]. We set the maximum number of iterations to 500.
Hyper-parameter tuning. For BL, we tested five values, 0.1, 1, 10, 30 and 100, for ϕ. For ω, we tested six log-spaced values from ϕ/20p to 5ϕ, where p is the number of markers. Because , and the expectation of l 2 B is ϕ/ω, these values of ϕ and ω correspond to the grids of 1=t 2 j from 1/10p to 10, which are obtained by replacing 2=l 2 B with ϕ/ω. In total, this corresponds to 30 possible parameter combinations. For EBL, we used the same values for ψ and θ and tested three values, 0.1, 1 and 10. For ϕ and ω, we tried the same values as for BL. Consequently, this corresponds to a total of 90 parameter combinations. For wBSR and BayesC, we fixed ν to 4. For π, we tested 5 log-spaced values from 1/p to 1. For S 2 , we tested 10 log-spaced values from 1/20p to 5. Because the expectation of InvChi2(ν, S 2 ) is νS 2 /(ν−2), these values of ν and S 2 correspond to the grids of s 2 j from 1/10p to 10. This corresponds to a total of 50 possible parameter combinations for wBSR and BayesC. For SSVS and MIX, we fixed ν and π to 4 and 0.01, respectively. For c, we tested 5 values from 10 −5 to 10 −1 . The values tested for S 2 were the same as that of wBSR and BayesC. This corresponds to a total of 50 possible parameter combinations tested for SSVS and MIX.
For tree-based ensemble methods including random forests (RF), gradient boosting regression trees (GBRT), McRank and LambdaMART, we used 300 trees in the ensemble. The parameter max_features was set to 0.6, meaning that only 60% of the features are considered when searching for the best split during tree induction. This both speeds up tree induction and reduces overfitting. For ridge and RKHS regression, we used the R package rrBLUP [54], which can estimate the regularization and kernel parameters automatically by (restricted) maximum likelihood (i.e., without cross-validation). This approach is closely related to gaussian processes in the machine learning community [55].
For RankSVM, we set l ¼j P jl, where jPj is the number of preference pairs, and chosel from 15 log-spaced values between 10 −6 and 10 6 . We use the RBF kernel κ(x i ,x j ) = exp (−γkx i −x j k 2 ). Following [54], we set g ¼ 1 4ps 2 , where p is the number of markers. We then chose σ from 10 linearly-spaced values between 0 and 1.

Cross-validation results
The general method ranking across six datasets is given in Table 1. Overall, the five best methods for each evaluation measure were as follows:  Tables 2-7. For datasets comprising several traits, we report the average scores only, for the purpose of clarity. Bold numbers in parentheses indicate the ranking of the five best methods with respect to the corresponding evaluation measure. We do not include results of LambdaMART with respect to Pearson correlation and Kendall's τ, since LambdaMART is a method designed to optimize NDCG.

Comparison of RKHS regression and RankSVM
We compared RKHS regression and RankSVM when varying the regularization and RBF kernel hyper-parameters. Heatmaps indicating Mean NDCG@10 for various hyper-parameter combinations are shown in Fig 1. Overall, our results reveal an interesting trend: RankSVM achieves better Mean NDCG@10 than RKHS regression for many different hyper-parameter settings. That is, RankSVM appears to be much more robust than RKHS regression to hyperparameter choice.

Experiments with tree-based ensemble methods
Effect of the learning rate and number of trees on GBRT. An important parameter to prevent overfitting in GBRT is the learning rate parameter. This parameter controls the importance given to the trees added at each stage of the algorithm. We compared GBRT with learning rates {1.0, 0.1, 0.01, 0.001} when varying the number of trees. Results for Mean NDCG@10 are shown in Fig 2. Our results show that, in order to obtain optimal accuracy, the learning rate must be set neither too large nor too small. This is because large values overfit the training set while small values underfit it. Except on the Maize dataset, the best learning rate was 0.1. This is value is therefore a good rule of thumb for a practical application of GBRT to GS.
Effect of the number of bins on McRank. Since McRank assumes that Y is a finite set, we need to discretize the continuous training trait values. To do so, we divided the training trait values into B equal-width bins and computed their means b 1 , b 2 , . . ., b B . Then, we replaced training trait values y i by the mean value of the bin they belong to. Although this discretization might sound like a loss of information, it can be beneficial when the goal is to maximize NDCG on unseen data, i.e., data that does not belong to the training set. We compared the Mean NDCG@10 of multiclass and ordinal McRank when varying the number of bins. The number of trees used was fixed to 300. Results are shown in Fig 3. For comparison, we also indicate the results of RF regression. Our results clearly show the superiority of ordinal McRank over multiclass McRank. Therefore, modeling the cumulative probabilities, as done in ordinal McRank, seems clearly beneficial. Comparison of evaluation measures Table 1 shows that the best methods tend to vary depending on the evaluation measure used.
To quantify the similarity (or lack thereof) between evaluation measures, we computed their Spearman's correlation coefficient, also known as Spearman's ρ, which is a measure of how well their rankings agree. The results, given in Table 8, indicate several interesting trends. First, Pearson correlation and Kendall's τ correlate poorly with NDCG@k when k is small. This is not surprising since Pearson correlation and Kendall's τ are evaluation measures for global ranking which treat all individuals equally, regardless of their importance. This confirms that one should not use these measures for evaluation and hyper-parameter selection when one is mostly concerned with ranking accuracy at the top. Second, and more surprisingly, Pearson correlation was more correlated with NDCG@k than Kendall's τ. This suggests that a method designed to maximize Pearson correlation might perform better than a pairwise ranking method.
Finally, the most correlated measure with NDCG@1, NDCG@5 and NDCG@10 was Mean NDCG@10. This suggests that choosing a model which maximizes Mean NDCG@k is a good compromise, if we want a model which works reasonably well (without retraining) at various positions k.

Discussion
The choices of the position k, gain function g(y) and discount function d(i) used in the DCG and NDCG definitions naturally depend on the usecase. For long-lived perennial plants, such as fruit and forest trees, it is important to select good candidates at their juvenile stage (or even at their small seedling stage) for further field testing. At the same time, it is also important to select a small number k of candidates because selected candidates usually become parents for   the next generation. If too many candidates are selected, selection intensity becomes low and it is not possible to obtain good improvement of the target trait in the next generation. For these reasons, and since field testing is typically expensive, we chose to evaluate models for small values of k: k 2 {1, 5, 10}. In the IR literature, the exponential gain function g(y) = 2 y −1 is frequently used. This is because it is often assumed that more relevant documents are exponentially more useful than irrelevant documents. However, in IR, y is usually a small number (say, between 1 and 5) which assesses the relevance of a document to some query. In contrast, in GS, y can take on much larger values depending on the trait. For this reason, we choose the linear gain function g(y) = y. For the discount function, a possible choice is to not use any discount at all, i.e., d(i) = 1. This amounts to completely ignore order in the top-k candidates. This choice is only reasonable if we are sure to conduct field testing for all k selected candidates. Another possible choice is dðiÞ ¼ 1 log 2 ðiþ1Þ , which is also the most common choice in the IR literature. This discount function assigns a monotonically decreasing weight to candidates as a function of their rank. It thus takes into account the fact that a candidate is more likely to be examined if it is placed higher in the ranking. This choice is more reasonable if we need to prioritize field testing of candidates with high breeding value, for example due to budget or time constraints. For this reason, we chose dðiÞ ¼ 1 log 2 ðiþ1Þ in our experiments. Trees are a non-parametric method which can approximate complex functions. However, they tend to severely overfit the training data when used alone. For this reason, trees are usually used as part of an ensemble method. Overall, we found that tree-based ensemble methods perform very well for ranking. This confirms a trend which was also observed during the Yahoo! learning to rank challenge [27]. With respect to Mean NDCG@10, Ordinal McRank, RF and GBRT achieved overall 1st, 2nd and 5th places. Tree-based ensemble methods have a number of other advantages including their ability to handle categorical variables, handle missing values without prior imputation and estimate variable importances and interactions [25,26]. Therefore, while the GS community has until now mainly focused on ridge regression and Bayesian regression methods, we believe that tree-based ensemble methods should be considered a top contender in the context of GS.
RKHS regression achieved 1st place with respect to Pearson correlation on four out of six datasets. Therefore, our results confirm the good results of RKHS regression previously reported in the literature [10]. With respect to Mean NDCG@10, RKHS regression proved to be a very good method, achieving an overall 3rd place. However, this shows that the best method with respect to Pearson correlation is not necessarily the same as the best method with respect to NDCG or Mean NDCG.
RankSVM achieved an overall 4th place with respect to Mean NDCG@10. However, with respect to NDCG@1 and NDCG@5, RankSVM outperformed RKHS regression. On the Barley LambdaMART achieved a disappointing overall 6th place with respect to Mean NDCG@10. This is worse than regular GBRT, of which LambdaMART is an extension. This is surprising, since LambdaMART achieved the leading results in the Yahoo! learning to rank challenge [27]. However, in the Yahoo! learning to rank challenge, the number of samples n was much greater than the number of features p, i.e., n ) p. This contrasts with GS, where typically n ( p. We thus hypothesize that LambdaMART does not work well in the n ( p setting. Traditional (Bayesian) regression methods overall did not perform well with respect to NDCG and Mean NDCG. For example, although they were suggested as some of the best methods in the recent study of [15], BL and wBSR only achieved overall 7th and 13rd places, with respect to Mean NDCG@10. One exception where traditional regression methods performed well is the Arabidopsis thaliana dataset, with RKHS regression, SSVS and MIX achieving 1st, 2nd and 3rd places, respectively, with respect to Mean NDCG@10. In the Arabidopsis dataset, genotypes are all recombinant inbred lines (RILs) derived from two homozygous parents. Therefore, quantitative trait loci (QTL) harbored by lines (i.e., RILs) are completely bi-allelic. In contrast, in other datasets, QTL harbored by lines may have allelic variation. In fact, important agronomic traits have multiple alleles in candidate genes [56]. In this case, allelic effects cannot be represented by a single bi-allelic marker. Therefore, there may exist complex relationships between causal polymorphisms and SNPs linked to the polymorphisms. In the Arabidopsis dataset, the extent of linkage disequilibrium (LD) between a marker and QTL is simply related to the recombination rate between the marker and QTL. In contrast, in other datasets, the extent of LD between a marker and QTL may be affected by various other factors such as demographic history [57]. In this case, the relationship between markers and QTL becomes complex and may be difficult to model via linear regression models.
On the rice dataset, the difference between the best (RF) and worst (wBSR) methods with respect to NDCG@10 appears quite small (0.925 vs. 0.946), while it appears larger with respect to NDCG@1 (0.930 vs. 0.830). Similar trends are shown in other datasets, such as barley and wheat. We observed that the NDCG@k score typically increases as a function of k (although not monotonically). That is, if k < r then NDCG@k < NDCG@r will usually hold. Since NDCG@k is always between 0 and 1, this means that NDCG@k gets closer to 1 as k increases. This also means that the range of possible values taken by NDCG@k will typically decrease as a function of k. This however does not necessarily mean that small values of k are better for discriminating between methods. In general, k should be set to the number of candidates one wants to select when applying the model. Sometimes, both ranking and regression accuracies are important. This is for example the case when predicted trait values are used to determine a selling price (e.g., determine crop price in terms of the predicted grain yield). Because of their overall good ranking accuracy, RF and RKHS regression seem like a good choice in this case. Unfortunately, RankSVM and Lambda-MART cannot be used in this case, since the loss function they minimize only guarantees ranking. On the other hand, McRank can be used, since it can compute the expected trait values.
For simplicity, we assumed throughout this paper that our goal is to select candidates with a high trait value. Of course, it is easy to adapt our framework if the goal is to select candidates with low trait value instead. However, sometimes it may be necessary to select candidates with an appropriate value, rather than highest or lowest value. For example, a certain level of acidity is necessary for fruits, but excessive or insufficient acidity is not preferred. In this case, model evaluation based on mean squared error (MSE) may be more suitable.
Pearson correlation is a commonly used measure in GS because it enables to predict the response to selection (provided there is no non-genetic cause of resemblance between offspring and parents) [58]. Because the response to selection corresponds to the change of population mean, Pearson correlation is important for breeding schemes that focus on whole population improvement. This is typical in animal breeding such as dairy cattle, where a single head of cattle contributes only little to the total production gain. However, particularly in breeding of plants that can be clonally reproduced (e.g., inbred crops or graftable fruit trees), the aim is often to produce a few excellent lines, rather than improving an entire population. In this case, Pearson correlation may not always be a good choice. Our results using 4 plant species show indeed that Pearson correlation often correlates poorly with NDCG. On the other hand, we found that Mean NDCG was the most correlated with NDCG at various positions.
Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG and Mean NDCG can be useful evaluation measures for GS, especially in plant breeding.