Probing the aggregated effects of purifying selection per individual on 1,250 medical phenotypes in the UK biobank

Understanding the relationship between natural selection and phenotypic variation has been a long-standing challenge in human population genetics. With the emergence of biobank-scale datasets, along with new statistical metrics to approximate strength of purifying selection at the variant level, it is now possible to correlate a proxy of individual relative fitness with a range of medical phenotypes. We calculated a per-individual deleterious load score by summing the total number of derived alleles per individual after incorporating a weight that approximates strength of purifying selection. We assessed four methods for the weight, including GERP, phyloP, CADD, and fitcons. By quantitatively tracking each of these scores with the site frequency spectrum, we identified phyloP as the most appropriate weight. The phyloP-weighted load score was then calculated across 15,129,142 variants in 335,161 individuals from the UK Biobank and tested for association on 1,250 medical phenotypes. After accounting for multiple test correction, we observed a strong association of the load score amongst coding sites only on 28 traits including body mass, adiposity and metabolic rate. We further observed that the association signals were driven by common variants (derived allele frequency > 5%) with high phyloP score (phyloP > 2). Finally, through permutation analyses, we showed that the deleterious load score had an excess of nominally significant associations on many medical phenotypes. These results suggest a broad impact of deleterious load on medical phenotypes and highlight the deleterious load score as a new tool to disentangle the complex relationship between natural selection and medical phenotypes. Author summary This study aims to augment our understanding between the complex relation between natural selection and human phenotypic variation. We developed a load score to approximate the relative fitness of an individual and correlate it with a set of medical phenotypes. Association tests between the load score amongst coding sites and 1,250 phenotypes in a sample of 335,161 individuals from the UK Biobank showed a strong association with 28 traits including body mass, adiposity and metabolic rate. Furthermore, an excess of nominal associations at suggestive levels was observed between the load score and medical phenotypes than would be expected under a null model. These results suggest that the aggregate effect of deleterious mutations as measured by the load score has a broad effect on human phenotypes.


59
One of the primary questions of interest in the study of human population 60 genetics is the relation between natural selection and the evolution of human 61 phenotypes, from quantitative traits to complex disease. With the emergence of 62 biobank-scale datasets, along with new statistical metrics to approximate strength of 63 purifying selection at the variant level, it is now possible to both estimate the net impact 64 of deleterious mutations for each individual in a large population sample and correlate it 65 to a range of medical phenotypes exhibited by that individual. This provides an 66 opportunity to simultaneously study the genetics of individuals within a relatively 67 homogenous population and the potential impact of natural selection on annotated 68 phenotypes. 69 [18], GenoCanyon [19], Eigen and EigenPC [20]). Although these scores are formulated 116 as tests for strong selection or for molecular function, rather than as estimates of the 117 strength of selection, they are also correlated to the strength of selection, and are often 118 used as proxies for strength of selection [4,21,22]. In this study, we compared the 119 predicted deleteriousness of alleles for four widely used scoring methods that 120 approximate deleteriousness of a variant--GERP++, phyloP, CADD, and fitCons--with 121 their effects on allelic frequency in human population genetic data to select the most 122 appropriate measure for computation of the additive load. 123 Under negative or purifying selection, natural selection acts to reduce the 124 population frequency of deleterious mutations. This effect is more able to overcome 125 genetic drift as the strength of selection increases. As a result, we expect to observe a 126 higher number of rare alleles and a lower number of common alleles in regions of the 127 genome that are under negative selection, relative to putatively neutral regions. This 128 can be seen as a shift of the allele frequency spectrum (AFS) towards rare alleles, with 129 a steeper slope of the AFS indicating stronger purifying selection. We evaluated the 130 extent to which each scoring method captures the deleteriousness of an allele by 131 grouping alleles by the scores provided by each method and measuring the slope of the 132 resulting AFS. The more strongly a score is related to the strength of selection, the 133 more marked the increase in slope will be for high-scoring alleles relative to lower 134 scoring alleles. 135

138
For each scoring method, polymorphic sites are grouped into score intervals by the value of the score 139 annotated at the sites. Each solid line represents derived allele frequency spectrum of polymorphic sites 140 belonging to one score interval and three dashed lines represent derived allele frequency spectra of three 141 control categories: synonymous (syn), missense (mis), and loss of function (LOF) variants.

143
We evaluated this correlation using whole genome sequencing data from a non-144 Finnish European population in the Genome Aggregation Database (gnomAD) [23]. For 145 each scoring method, we grouped alleles by score and compared the non-normalized 146 scoring coding variants than noncoding variants, while the phyloP method is identical for 170 coding and noncoding sites. For this reason, we concluded that phyloP has the most 171 consistent relationship between score and strength of negative selection, and selected 172 phyloP as our weight for our load score computation. 173 174

Per-individual load scores in UK Biobank 175
We calculated a per-individual deleterious load score by summing the total 176 number of derived alleles per individual, weighting each derived allele by its phyloP 177 score to account for the strength of purifying selection. We considered three load 178 scores: a genome-wide load score, a coding-specific load score, and a non-coding-179 specific load score. Each score was computed for 335,161 unrelated, white-British

194
Histogram of three load scores computed from three sets of variants: coding variants (coding load score), 195 non-coding variants (non-coding load score), and both coding and non-coding variants (genome-wide 196 load score). Each load score was computed for 335,161 unrelated, white-British ancestry individuals.

198
Significant association between load score of coding variants and 199

anthropometric and metabolic traits 200
To explore the overall effect of deleterious mutations on specific clinically 201 measured phenotypes, we tested the association of each of the three load scores 202 (genome-wide, coding and non-coding) with 681 traits and 569 clinical phenotypes, after 203 adjusting for age, sex, genotyping chip, and assessment center. To account for potential 204 confounders, we further included a set of geographical and socioeconomic variables 205 available in the UK Biobank data as additional covariates (S2 Table). We note that 206 many of these variables are significantly associated with the load score but the effects 207 are small. Nonetheless, careful consideration was taken to add these as covariates in 208 our association tests (S2 Table). 209 We discovered no phenotype significantly associated with either the genome-210 wide load score or non-coding load score (Bonferroni P value threshold = 1.3x10 -5 ). However, 28 traits were significantly associated with the load score calculated from 212 coding SNPs; these included body mass, metabolic rate, and several adiposity traits 213 such as body mass index and waist circumference (Table 1). Some of these traits have 214 been found under directional selection in contemporary populations [24][25][26]. 215 216 217 218 Stratification by derived allele frequency showed that these association signals 222 are more pronounced when limiting to variants that are common (DAF > 5%) but not 223 close to fixation (DAF < 70%), while stratification by phyloP score shows that they are 224 more pronounced when limiting to variants with higher phyloP scores (phyloP>2, S3 and 225 S4 Tables). We therefore performed an additional stratification analysis by both DAF 226 and phyloP score (S5 Table). We observed that the signals are mostly driven by 227 common variants (5<=DAF<70%) with higher phyloP score (phyloP>2). This class of 228 variants notably contributes a large fraction (mean: 0.38 and sd: 0.005 per individual) 229 towards the per individual coding load score. This analysis necessarily excludes 230 extremely rare alleles, which are not well captured by the process of genotyping and 231 imputation. It is not clear how significant the aggregate contribution of these alleles to 232 the per individual load score would be. 233 To assess the effect of our weighting procedure, we calculated an unweighted 234 load score, the per-individual mutation burden, that simply counts derived alleles with no 235 reference to phyloP or other measures of selection. When using this score, all 236 significant association signals observed for the coding load score disappeared and no 237 significant association for genome-wide and non-coding unweighted score was detected 238 (S2 Fig). We further tested the associations with burden scores while restricting to only 239 rare variants (DAF < 5%) or only common variants (5% < DAF < 70%, S2 Fig), however 240 no significant association was observed. This is likely due to the domination of the 241 mutation burden by alleles under effectively no purifying selection, highlighting the need 242 for a weighting scheme to identify correlations to the relative per-individual fitness.
To assess whether the observed significant associations are sensitive to 244 reference bias, we included as a covariate the number of non-reference sites per 245 individual in our association testing for the top results in Table 2 (S6 Table). Association 246 results were very consistent, suggesting that reference bias is not likely a confounder. 247 Similarly, associations between the phenotypes and load score remain significant when 248 restricted to variants at which reference alleles are the same as predicted ancestral 249 alleles (S7 Table). We also re-computed load scores using phyloPNH scores, which are 250 phyloP scores calculated without human reference genome [4], and obtained similar but 251 slightly less significant results, with all the 28 phenotypes yielded p-value < 6.13x10 -4 252 (S8 Table). Phenome wide association test results showed that no single disease is significantly 256 associated with the load score (all P > 0.05 after accounting for multiple tests using 257 Bonferroni correction). However, rather than the load score having a strong effect on a 258 single disease, we hypothesized that the load score may have subtle effects on many 259 diseases, leading to an excess of weak associations that do not individually reach 260 statistical significance. To test this hypothesis, we compared the number of phenotypes 261 nominally associated with the load score (p-value < 0.05 without multiple test correction) 262 to a null distribution generated by random permutation of individual load score values 263 (Methods). For this analysis, we restricted to associations with clinical phenotypes 264 defined by ICD (International Statistical Classification of Diseases) codes. Out of 364 265 ICD-10 codes, 28, 25, and 26 ICD codes (S9 Table) were found to be nominally associated with coding load score, non-coding load score, and genome-wide load score 267 respectively. The number of nominally significant associations for each load score is 268 significantly larger than the expected number under the null model, supporting this 269 hypothesis (Fig 3). This analysis was statistically significant for both the coding load 270 score (P=0.02) and the genome-wide load score (P=0.04) but not the non-coding load

285
In this study, we have described a polygenic load score that estimates the deleterious 286 load carried by an individual, and applied this score to 335,161 white British individuals 287 from the UK Biobank. Our analysis produced two major results: First, while we found no 288 significant associations between individual medical phenotypes and the genome-wide 289 load score, we found that more phenotypes are nominally associated with the load 290 score than would be expected under a null model (Figure 3). This suggests that the 291 deleterious load has a broad effect on the human phenome, rather than being 292 specifically associated with a small number of phenotypes. This is consistent with 293 Fisher's Geometric Model of fitness, which proposes that the fitness of a population is 294 determined by overall phenotypic distance from a theoretical optimal point in a 295 phenotype space that potentially encompasses the organism's entire phenome [27,28]. 296 Second, by restricting to protein coding variation, we found significant associations 297 between the coding-only deleterious load score and a variety of adiposity phenotypes, 298 along with other anthropometric phenotypes and phenotypes related to metabolic rate. 299 This suggests that adiposity may be under polygenic selection driven by a large number 300 of coding variants in humans. This is consistent with previous results obtained from the 301 UK Biobank using an unrelated methodology [24]. We found no similar associations with 302 the noncoding deleterious load score, which is in contrast to numerous studies finding significant genetic associations in noncoding regions, including associations with the 304 same adiposity traits we found associated with our coding load score. Since our derived 305 allele frequency spectrum analysis (Fig. 1) suggests that sites with higher phyloP scores 306 are under purifying selection in noncoding regions as well as coding, the lack of 307 significance in non-coding regions cannot be interpreted as a lack of purifying selection 308 in these regions or poor sensitivity to selection in these regions. It may instead indicate 309 that selection acts on phenotype associations in noncoding regions in a different way 310 from how it acts in coding regions, possibly due to the small effect size of individual 311 noncoding variants. 312

313
There are several limitations to our method. We computed the load score from 314 imputed genotypes rather than sequenced whole genomes, which gives us little 315 information about extremely rare variants in the population, masking potentially large 316 contributions to the load from variants under the strongest selection. As a future topic of 317 research, the same methodology can be applied to include rare variants, which would 318 shed light on the relative contribution of common and rare variation to the phenotypic 319 associations of load. Previous studies have shown that rare variation contributes 320 substantially to differences in deleterious load between human populations, so we may 321 expect it to have a significant impact on individual load in this context as well [1,2]. 322 Furthermore, the phyloP score used to estimate the deleteriousness of alleles measures 323 only the likelihood that a site is evolving under constraint in vertebrates, and is not a 324 direct estimate of the selective effect of a variant in humans. It is possible that the use of 325 vertebrate-level conservation has reduced our ability to identify recent selection on