The impact of age on genetic risk for common diseases

Inherited genetic variation contributes to individual risk for many complex diseases and is increasingly being used for predictive patient stratification. Previous work has shown that genetic factors are not equally relevant to human traits across age and other contexts, though the reasons for such variation are not clear. Here, we introduce methods to infer the form of the longitudinal relationship between genetic relative risk for disease and age and to test whether all genetic risk factors behave similarly. We use a proportional hazards model within an interval-based censoring methodology to estimate age-varying individual variant contributions to genetic relative risk for 24 common diseases within the British ancestry subset of UK Biobank, applying a Bayesian clustering approach to group variants by their relative risk profile over age and permutation tests for age dependency and multiplicity of profiles. We find evidence for age-varying relative risk profiles in nine diseases, including hypertension, skin cancer, atherosclerotic heart disease, hypothyroidism and calculus of gallbladder, several of which show evidence, albeit weak, for multiple distinct profiles of genetic relative risk. The predominant pattern shows genetic risk factors having the greatest relative impact on risk of early disease, with a monotonic decrease over time, at least for the majority of variants, although the magnitude and form of the decrease varies among diseases. As a consequence, for diseases where genetic relative risk decreases over age, genetic risk factors have stronger explanatory power among younger populations, compared to older ones. We show that these patterns cannot be explained by a simple model involving the presence of unobserved covariates such as environmental factors. We discuss possible models that can explain our observations and the implications for genetic risk prediction.

1 Bayesian clustering of curves 1

.1 Generative model specification
We use a mixture of curves model to cluster genetic risk profiles to latent curves fitted with a spline basis. The graphic representation of the model is shown in SSFig 1. Our analysis is performed independently for each disease, therefore without specific clarification, we omit the disease index.
Our model assumes risk profiles of the S SNPs are assigned to K clusters, where we denote the mean of risk profile for the j th SNP (j = 1, 2, ..., S) asβ j ∈ R M and the standard error ofβ j to be j ∈ R M . M refers to the number of age intervals we have. The mean and standard error of risk effect sizes within M age intervals are the summary statistics which are our model inputs.
The model infers latent curves that generate risk profiles for each variant. We use a linear combination of cubic spline bases to construct smooth latent curves, where Xθ i (∈ SSFig 1. Schematic representation of the model. i = 1, 2, .., K is the index of the mixture components; j = 1, 2, .., S is the index of the SNPs.β j and j are the summary statistics for SNP j that are estimated using an interval-censored proportional hazard model (See S1 Supplemental Methods). z j is the latent variable, which assigns SNP j to cluster i. (1) Here m = 1, 2, ..., M , and s = 1, 2, ...P . P are the number of bases, which controls the degrees of freedom of the cubic spline. (·) + denotes the positive part of a function and ξs are even split points across the M age intervals. θ i ∈ R P is the linear coefficients of P bases. Other variables shown in SSFig 1 specify the generative process for the risk profiles, using a mixture model. Σ s is a vertical translation matrix with all elements equal, which allowsβ j to be vertically translated from the latent curve Xθ i that generates it. π i is the mixture weight of the i th latent curve and z j ∈ R K is latent one-hot vectors assigning j th SNP to one of K latent curves. We use Θ = (π, , θ, X, Σ s ) to denote the set of parameters in the mixture model, then we summarise our generative model as follows: Here Σ j ∈ R M ×M is a diagonal matrix with diagonal elements 2 j . Intuitively, the model is a mixture of K latent curves, each fitted with a spline Xθ i . The profile for the j th SNP is generated from one of the latent curves with a variance Σ j , then vertically translated with Σ s . Note Σ j is the standard error forβ j (summary statistics; see S1 Supplemental Methods) and the vertical translation is controlled by a hyper-parameter Σ s .
We further specify a prior distribution for θ i , which is a Gaussian distribution with zero mean and a fixed covariance matrix: In following sections we will derive the equations used in the inference of the latent curve Xθ i for each cluster. Note the spline bases X are shared across clusters, so we infer the posterior of θ i .
Two hyper-parameters we have in the model are set as:

Inference of cluster distribution: EM
In our model, the variables for different clusters are exchangeable [2], which will cause the so-called "label switching" problem if we apply a sampling method. Therefore, we derive an EM algorithm to maximize the following log likelihood: To derive the inference equations we first write down the complete data log likelihood function with the latent variable Z: Classic mixture models use "cluster specific" covariance matrices, which are inferred for each mixture component. However, it is worth noting that we use "SNP specific" covariance matrices Σ j for the j th variant. The rationale is that when we estimate the effect sizeβ j for a specific locus, we obtain its standard error j . Therefore, we use different Σ j for each variant to capture the uncertainty differences across loci.
To keep the notation simple, from here on we will use Σ j to denote Σ s + Σ j . By maximizing the expected value of the complete data likelihood E z [ln{P (β, Z|Θ)P (θ i )}] from Equation 5 and 3, we obtained the M-step update as follows: γ(z ji ) is the expectation of z ji , which will be obtained in the E-step. By Bayes' theorem, we could write down the posterior distribution of the latent variable Z: Since the posterior factorizes over j, z j follows an Multinoulli distribution for each SNP. From this we can derive the expectation of z ji , which is the E-step: The inference is performed by randomly initialising θ i s, then alternating between the Mstep and E-step until likelihood function converges. For each setting, we initialize multiple runs (see S1 Supplemental Methods) and keep the sequence with the highest converged likelihood.

Approximating the confidence interval of latent curves: Variational inferences
The EM algorithm provides an efficient way to maximize the likelihood, but we only get a point estimate of θ i without uncertainty quantification. Since we are interested in how genetic risk changes over age, we would ideally want to get the posterior of the latent risk curves over age. As discussed in previous sections, any sampling method will cause the "label switching"" problem, which will be hard to deal with, so we choose an approximation method. Here we apply "one step" of a Variational Bayes update to approximate the posterior distribution of the latent curves, the validity of which is discussed later in this section. We assume a factorized Variational distribution q(z, θ) = q(z)q(θ). The full data distribution is as follows: In order to get the posterior distribution of the latent profiles, we derive the update of the variational distribution q * (θ i ) as follows: Since θ i is Gaussian, the latent profile Xθ i is also Gaussian (cubic spline is a linear transformation of θ i ) with covariance matrix XA −1 i X T . We note that full Variational Bayes inference could potentially be performed by estimating q * (z ji ) and π i by maximizing the Variational lower bound of the marginal likelihood [3, p. 484]. However, for Variational inference we use a factorized distribution q(z, θ) = q(z)q(θ) to approximate the true posterior distribution of P (Z, θ|β), which can not be factorized. So, directly using Variational Bayes to estimate the mean of θ that maximizes Equation 5 is not as good as EM if we want an unbiased estimate. We therefore use EM to estimate the mean and the Variational Bayes approach to estimate the credible interval.
It is interesting to notice in Equation 11, the only expectation we took was over z ji , which is also estimated in the EM algorithm's E-step, as in Equation 9. We now show how directly plugging in the expectation from EM will give a good approximation for the posterior distribution of latent curves.
Recall that the EM algorithm maximizes the marginal likelihood by iteratively maximizing the lower bound L EM (q, θ) with respect to q and θ. The estimated value after converging can be expressed as: The variational Bayes method also maximizes a lower bound for the marginal likelihood L V B (q), which can be written as: In Equation 12, if we plug in theq EM (Z) estimated via EM, we see that the lower bound L EM (q EM (Z), θ) = g(θ) is a function of θ, which is maximized atθ EM . We note that the Variational lower bound in Equation 13 is the negative KL(q||g), which is maximized when q = g. Comparing Equation 11 and Equation 12, we can see that g(θ) is estimated as q * (θ) on the right hand side of Equation 11, with respect toq EM (Z). Therefore, one step of Variational approximation of q * (θ) is sufficient to provide an approximate posterior distribution for the latent curve for each cluster, which has the mode that maximizes the marginal likelihood (as in the EM algorithm).

The effect of unobserved factors on risk effect estimation 2.1 Independent unobserved effects: frailty
Disease is caused by many risk factors, including those that are not observed. The unobserved confounders can cause biases for effect size estimation [4]. Here, we describe how unmeasured confounders affect risk effect size estimation, even when the confounders are independent of the focal variant of interest. We also provide methods to account for these effects through estimating frailty parameters from incidence rates of disease over age.

Parametric modelling of effect size in the presence of frailty
Assuming we have a proportional hazards risk for disease incidence, with an unobserved risk background u, the hazard rate is given by where h(t|x, u) is the conditional hazard rate, x is the focus genetic variant of interest. u is the unobserved background risk effect (called frailty in epidemiology). Since u is unobserved, instead of estimating the effect size β and baseline risk h 0 (t) under the true model specification, we can only estimate the marginal effect size β * and baseline hazard h * 0 (t) under a mis-specified model: Suppose we have a parametric form for u, we can then work out the marginal distribution for β * and h * 0 (t) directly. For simplicity, here we assume that u has a Gamma distribution [5]: u ∼ F (u) = Gamma(u|shape = a, scale = θ). The marginal hazard ratio h(t|x) can be computed as follows: where Λ(t) = Since most of the diseases have a low incidence rate in the population and the genetic risk sizes are relatively small, we can assume both Λ(t) and β to be small too. This leads to an approximation for β * : which indicates that as age increases, we tend to increasingly underestimate the true risk.
To account for the frailty effect in the clustering of curves, we change the intercept base in Equation 1 to be (1 − θΛ(t)), and set the degrees of freedom of the model to be 1, which will infer a latent curve with an age dependency computed from the frailty model. The likelihood is then computed by plugging in the latent curve in Equation 4.

Inferring frailty parameters from the population incidence rate
We describe an approach to fitting a parametric distribution F (u)) for frailty, with the constraint that the expectation, E[u], equals one [5]. We use u to represent the disease hazard heterogeneity, and assume that the baseline risk h 0 (t) increases monotonically for each individual. The net impact of frailty is to cause individuals with larger frailty u to acquire the disease earlier, causing the incidence over the population to bend or even decrease for older age groups. We can fit a parametric hazard model with frailty to empirical incidence rates over age by assuming following parametric form of h 0 (t) and F (u): Here we adopted a polynomial baseline hazard which represents the multi-stage nature of disease and a Gamma distribution for F (u) [5]. We then compute the survival function: Considering E[u] = aθ = 1, the hazard over the population is then computed as: .
We then computed the empirical hazard rateĥ(t) from age 45 to age 70 in the UK Biobank cohort and subtracted the intercept to match the parametric form. The disease incidence after 70 years old is low and its analysis is complicated by competing health related events, giving an empirical hazard rate rate in the population that drops rapidly to zero. Therefore, we decided to only fit the model until age 70. The parameter k, b, γ is then computed through the following least squares optimization: Examples of the fitted hazard rates using the frailty model are shown in S2 Fig, we then plug the fitted values k, b, γ into Equation 16 to obtain the frailty correction term. We can then fit a genetic profile with frailty, as described in the S1 Supplemental Methods.

Dependent unobserved effects: GxE and GxG interactions
Here, we show how interactions can induce the decreasing profile of genetic risk that we observe. Interactions between genetic risk factor and other risk factors, either observed or unobserved, can be modeled as an individual-level risk effect that is centered on the population level marginal effect size. Intuitively, if a risk allele affects individuals differently (because of the interactions), those individuals with the higher risk will tend to get the disease earlier, while those with lower risk will tend to have a later onset. When estimating risk effects at a particular age, we are essentially estimating the effect size from individuals whose disease onset is within that age group. Thus, we observe the decreasing risk over age.
To illustrate how such interactions can lead to a decreasing effect size (over age) for genetic risk factors, we consider a proportional hazard model with a single covariate. Following the model specification in Section 2.1, we use β * to represent the marginal effect size under a mis-specified model: where h 0 (t) is a positive baseline hazard and x is the covariate. Assuming the effect size β interacts with the environment or other genetic factors, we further assume that the effect size for each individual is generated from a positive defined probability distribution: The survival function F(t) can be computed by integrating out β.
We first discuss the initial condition when t → 0: .
. Comparing Equations 18, 20, 21, and 22, we conclude that the estimated marginal effect size β * will be underestimated as t increases. This result could be extended to a general case of almost any probability distribution for G(β). Many different scenario could be fitted into this framework, all of which will lead to a decreasing genetic risk profile with age.