Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models

Genotype-to-phenotype prediction is a central problem of human genetics. In recent years, it has become possible to construct complex predictive models for phenotypes, thanks to the availability of large genome data sets as well as efficient and scalable machine learning tools. In this paper, we make a threefold contribution to this problem. First, we ask if state-of-the-art nonlinear predictive models, such as boosted decision trees, can be more efficient for phenotype prediction than conventional linear models. We find that this is indeed the case if model features include a sufficiently rich set of covariates, but probably not otherwise. Second, we ask if the conventional selection of single nucleotide polymorphisms (SNPs) by genome wide association studies (GWAS) can be replaced by a more efficient procedure, taking into account information in previously selected SNPs. We propose such a procedure, based on a sequential feature importance estimation with decision trees, and show that this approach indeed produced informative SNP sets that are much more compact than when selected with GWAS. Finally, we show that the highest prediction accuracy can ultimately be achieved by ensembling individual linear and nonlinear models. To the best of our knowledge, for some of the phenotypes that we consider (asthma, hypothyroidism), our results are a new state-of-the-art.


S1.1.1 A subsampling-based estimate
Let β be one of the accuracy metrics (i.e., r 2 or ROC AUC). Let β * be its true population value, and let β N be its estimate obtained using a random N -element sample. The estimate β N is a random variable converging to the deterministic value β * in the limit N → ∞. We will assume that the estimate β N is unbiased up to O( 1 N ): and its variance scales as ∼ σ 2 N with some σ 2 > 0: Under these assumptions, for large N , we can take the standard error for β N to be where β N = 1 M M k=1 β N ,k . In practice, we form the samples D 1 , . . . D M by randomly dividing the size-N sample into M equal parts, so that, in particular, N = N/M. Combining Eq. (3) with Eq. (4), we conclude that we can estimate the standard error for a size-N sample by Though this formula depends on M , the resulting estimate σ N should be approximately independent of M . In Figure S1 we plot σ N computed for all our phenotypes at various values of M , and indeed observe an approximate independence of M , confirming our methodology.

S1.1.2 Standard error for ROC AUC
Suppose that a "soft" binary classifier produces a numerical value x k ∈ R for the k'th sample of class 0, and a value y l ∈ R for the l'th sample of class 1. The ROC AUC of such a classifier is defined by where n 0 , n 1 are the numbers of samples in the classes 0 and 1.
Assuming that the samples x k and y l are randomly drawn from populations with a continuous distribution function, the variance of this classifier is (see [1,2]) where β * = Eβ is the population mean.
Assuming that n 0 and n 1 are large, the leading term of σ 2 AU C is In particular, if x and y have the same distribution, then β * = 1 2 and P(x < min(y 1 , y 2 )) = P(max(x 1 , x 2 ) < y) = 1 3 , so that σ 2 AU C ≈ n 0 + n 1 12n 0 n 1 .
In the general case, when x and y have different distributions, we can estimate the probabilities P(x < min(y 1 , y 2 )) and P(max(x 1 , x 2 ) < y) by sampling. Using additionally the estimated value β * of ROC AUC, we can then obtain an estimate for σ 2 AU C . In Table S1 we provide the standard errors estimated for our phenotypes by the two methods described in this and the previous section. In the case of ROC AUC the results are quite close, confirming the validity of our estimates.

S1.2 Simulated phenotypes: simulation details
A simulated phenotype is modelled as a weighted linear combination of linear effects y l , pairwise epistatic effects y ep and random noise y : y = w l y l + w ep y ep + w y .
Here, y, y l , y ep , y are vectors with N components corresponding to different simulated samples. The weights w l , w ep , w are sample-independent and sum to 1. The values of the weights depend on the simulation run.
For simulation experiments that are presented in this paper we set w = 0.2 and vary w ep from 0 to 0.8. Linear effects y l are generated by y l = Xf l , where X is the N ×M genotype matrix, with M the number of SNPs and f l the M -dimensional column vector of individual SNP effects. In all our experiments M = 1000 and N = 100000. Importantly, f l has only 0.1M = 100 non-zero entries which are distributed normally with zero mean and unit variance. This is done to mimic the sparsity of real-world genotype-phenotype interaction. Linear effects are scaled inversely by MAF (minor allele frequency) of the corresponding SNP to reflect the fact that rarer SNPs typically have larger effect sizes [3]. Non-zero entries of f l are clipped to the interval [−5, 5].
To generate pairwise epistatic effects, we randomly select 25 pairs of SNPs from the SNPs already associated linearly with the phenotype, and the same number from the SNPs not associated with the phenotype. For each pair i = (i 1 , i 2 ) of SNPs we generate a random epistatic effect size f ep,i ∼ N (0, 1) and randomly select a pair of SNP minor allele counts c 1 , c 2 , i.e. one value from the set {0, 1, 2} for each SNP in the pair. Then, the epistatic effect for pair i and sample j is calculated by Here, X j,i1 and X j,i2 are the minor allele counts of the first and second SNP in the i'th pair, respectively. The total value of epistatic effects for sample j is In contrast to the linear effects, the epistatic effects are not scaled inversely with MAF.