The power of pathway-based polygenic risk scores


 Polygenic risk scores (PRSs) have been among the leading advances in biomedicine in recent years. As a proxy of genetic liability, PRSs are utilised across multiple fields and applications. While numerous statistical and machine learning methods have been developed to optimise their predictive accuracy, all of these distil genetic liability to a single number based on aggregation of an individual’s genome-wide alleles. This results in a key loss of information about an individual’s genetic profile, which could be critical given the functional sub-structure of the genome and the heterogeneity of complex disease. Here we evaluate the performance of pathway-based PRSs, in which polygenic scores are calculated across genomic pathways for each individual, and we introduce a software, PRSet, for computing and analysing pathway PRSs. We find that pathway PRSs have similar power for evaluating pathway enrichment of GWAS signal as the leading methods, with the distinct advantage of providing estimates of pathway genetic liability at the individual-level. Exemplifying their utility, we demonstrate that pathway PRSs can stratify diseases into subtypes in the UK Biobank with substantially greater power than genome-wide PRSs. Compared to genome-wide PRSs, we expect pathway-based PRSs to offer greater insights into the heterogeneity of complex disease and treatment response, generate more biologically tractable therapeutic targets, and provide a more powerful path to precision medicine.

while the linear trend of enrichment is also assessed and referred to as the 'linear' test model (see 160 Extended Data Fig. 1, Methods). 161 Here we perform these analyses in the same data and traits used in the previous section. In the 162 absence of well-established roles for individual tissue and cell types in these outcomes, we sought 163 a priori candidates from two domain experts for each outcome to provide an agnostic way to 164 evaluate the performance of the different methods in this setting (Methods). We observed 165 significant associations between expert opinion and the tissue-type specificity results, although 166 results varied substantially depending on the pathway method and test model used (Fig. 3a). 167 Associations were strongest for schizophrenia ( Fig. 3b-c, Extended Data Fig. 2) and BMI 168 (Extended Data Fig. 3), in which MAGMA and LDSC had a higher correlation with expert's Data Fig. 5). The associations relating to the cell-type specific 171 analyses were relatively weak (Fig. 3d), with significant results only observed for MAGMA and 172

Fig. 4) and CAD (Extended
PRSet in relation to schizophrenia. For AD, the strongest and only significant result was that of 173 PRSet implicating microglia using the top quantile test model (Fig. 3e), which is notable since 174 microglia has been extensively linked to AD etiology in the literature 37 . However, individual results 175 reported here should be treated with caution, since they appear highly sensitive to the test model 176 (top quantile / linear) and the number of quantiles used (Extended Data Fig. 2-7)   presently the most powerful for PRS analyses (Methods). We compared the power of PRSet for 219 sub-typing with two genome-wide PRS approaches, lassosum and PRSice (C+T), using two 220 strategies. First, in the absence of well-established subtypes for these outcomes we produced 221 "pseudo subtypes" by combining the 5 outcomes into pairs. We then meta-analysed the two 222 GWAS of each pair for use as base data and applied each of the PRS methods to stratify the set 223 of pooled UK Biobank patients (pooled across each pair, removing comorbid individuals), 224 comparing inferred subtypes to the original patient groups (Methods). Supervised stratification 225 was performed with 5-fold cross-validation, using lasso regression applied to the most enriched k 226 pathway PRS predictors from PRSet (competitive P-value < 0.05), or standard regression with 227 genome-wide PRS predictors generated by lassosum or PRSice (Methods). 228 PRSet had the highest subtype classification performance across all ten comparisons (Fig. 4a). 229 Most of these showed strikingly higher discriminatory power for PRSet, notable since different 230 PRS methods typically have similar predictive power [39][40][41] . We performed several sensitivity 231 analyses to explore the causes of the relatively high power of PRSet here, repeating the analyses: these scenarios, indicating that the improvement in PRSet performance is partially due to (i) the enrichment of signal in genes and/or pathways, (ii) the application to sub-typing, rather than 238 case/control prediction, and (iii) the use of a regularisation framework (see Discussion). 239 Next, we defined subtypes of CAD, HC, T2D and obesity as the presence/absence of comorbidity 240 within each pair of these four diseases (Fig. 4b). PRSs were calculated based on one of the 241 disease GWASs before performing sub-typing. For example, PRSs based on CAD GWAS were 242 used to discriminate between CAD patients with or without T2D, with or without HC, and with or 243 without obesity. Classification performance estimates for these "real subtype" analyses were 244 lower than for the pseudo subtypes, with R 2 estimates < 0.016 (Fig. 4b). In 11 of the 12 245 comparisons PRSet outperformed lassosum and PRSice, while for T2D with/without obesity, 246 lassosum and PRSice outperformed PRSet. 247 Finally, we tested stratification performance in a disease with a relatively small number of patients 248 in the UK Biobank but with well-established subtypes: inflammatory bowel disease (IBD) (N = 249 6,008) and its subtypes Crohn's disease (N = 2,101) and ulcerative colitis (N = 3,681). Supervised 250 sub-typing was performed as above, with PRSet again showing higher discriminatory power (R 2 251 = 0.007, P-value = 1.2x10 -6 ) than PRSice (R 2 = 0.005, P-value = 0.01). Given the availability of a 252 known number of subtypes for validation, we also performed unsupervised stratification using K-253 means clustering (K = 2; Methods) and found that neither PRSet (P-value = 0.69) or PRSice (P-254 value = 0.98) produced clusters that were associated with the two true subtypes.

265
Here we have introduced a novel, pathway-based, polygenic risk score approach and software 266 tool, PRSet, for performing pathway PRS analyses. We demonstrated that pathway PRSs can 267 capture genetic signal across pathways with similar power to MAGMA and LDSC, with the distinct 268 advantage of providing individual-level estimates of pathway liability. We do not presently 269 recommend PRSet as an enrichment tool over these established methods, given its lower power 270 under simulation in small target sample sizes (Fig. 2a). However, its capacity to capture genetic 271 signal over pathways highlights the promise of pathway PRSs as higher-resolution, more 272 biologically interpretable, alternatives to genome-wide PRSs. 273 Next, we directly compared the performance of the two PRS approaches in an application for 274 which much hope is placed in genome-wide PRSs: disease stratification. In a series of supervised 275 stratification analyses we found that PRSet outperformed genome-wide PRS methods, lassosum 276 and PRSice, in almost all scenarios (22 out of 23) and often by striking margins. While we 277 observed no significant unsupervised stratification of known IBD subtypes, we expect higher 278 power when the corresponding pathway PRSs are trained in large case/control data. In contrast, 279 genome-wide PRSs may be limited-by-design in their application to disease stratification because 280 they are dominated by variants that affect multiple disease subtypes, while pathway PRSs allow 281 for the independent influence of pathways that differentiate between subtypes. 282 Pathway PRSs, as outlined here, have two major limitations: (i) pathways are not well-defined 283 and so are likely a poor proxy of biological function, (ii) variants are linked to pathways by location 284 (i.e. inside genes) only (Methods). However, the rapid advances being made in functional 285 genomics, via the integration of rich resources of multi-omics data now available, can help to 286 address both issues. For example, future pathway PRSs could be enhanced so that pathways 287 are defined according to robust differential gene co-expression or protein-protein interaction networks and annotated by regulatory elements active in tissue and cell types relevant to the 289 disease under study. 290 We believe that pathway-based PRSs may ultimately offer greater promise of delivering stratified 291 medicine for complex diseases than genome-wide PRSs, which aggregate disparate forms of risk 292 into a single number. However, despite promising early results for pathway PRSs reported here, 293 they have several limitations that need addressing, some of which rely on field-level advances, 294 before their potential can be realised. A better understanding of how genetics leads to biological 295 function, and the role of pivotal genes in signalling and mechanistic cascades, will contribute to 296 more accurate and powerful modelling of the multiple genetic liabilities underlying complex 297

disease. 298
Our new method, PRSet, provides a novel approach to computing and analysing polygenic risk 299 scores, motivated by the functional sub-structure of the genome and the heterogeneity of disease. In order to optimise statistical power for benchmarking the performance of the methods tested in 329 the study, we selected complex phenotypes with high SNP-heritability estimates, with publicly 330 available summary statistics from large GWASs and that were measured in the UK Biobank or 331 the Sweden-Schizophrenia Population-Based cohort (Extended Data Table 1  conservatively counting the observed gene set as 1 potential null set 55 . One consideration of this permutation procedure is that the smallest achievable competitive P-value is 1/( + 1), which 374 can lead to difficulties in ranking highly significant gene sets. 375

MAGMA 376
To directly compare the performance of PRSet vs MAGMA (v1.07b) given identical input data, we 377 removed all ambiguous SNPs and non-overlapping SNPs prior to MAGMA analyses. It is 378 important to note that this step is unnecessary for MAGMA and might negatively impact its 379 performance. After filtering, gene-based analyses were performed on summary statistics and 380 genotype data independently. Same as for PRSet analyses, a 35kb window upstream and a 10kb 381 window downstream were added to gene coordinates, the MHC region was excluded for all traits, 382 and the APOE region was excluded for AD. Gene-based results were meta-analysed using the 383 inbuilt `--meta` function and were subsequently used as input to the pathway analysis. 384

LDSC 385
Partitioned LD scores were calculated using the 1000 Genomes European genotype data 56 . 386 Similar to PRSet, SNPs were annotated to genes and pathway with 35kb upstream and 10kb 387 downstream extension prior to calculation of LD scores. Ambiguous SNPs and non-overlapping 388 SNPs were removed prior to LDSC analyses to allow for direct comparison between PRSet and 389 LDSC. GWAS were performed on the target genotype data using PLINK v1.90b6.7 57 , and were 390 meta-analysed with the external GWAS summary statistics using METAL (2011-03-25) 58 . 391 Partitioned LD score regression was then performed using LDSC v1.01 7,29 , with the MHC (all 392 traits) and APOE (AD only) regions excluded.

Generation of causal pathways 396
Out of 4,079 empirical pathways extracted from six publicly available collections (see "definition 397 of pathways" section), we randomly select 50 pathways. These pathways were simulated with 398 different levels of enrichment, ranging from 5-50% randomly assigned causal SNPs per pathway, 399 with step size of 5%. The empirical enrichment level for all pathways included in our studies were 400 calculated as the proportion of causal SNPs included. The simulation process was repeated 20 401 times and results over the 20 sets of simulations provided. 402

Phenotype simulation and sample selection 403
Simulation was performed using UKB genotype data to assess the performance of PRSet in

409
For each trait, 250,000 individuals from European ancestry were randomly selected to generate 410 the GWAS summary statistics using PLINK v1.90.b6.7. An independent set of either 1k, 10k or 411 100k individuals were then randomly selected as the target samples. Pathway analyses were 412 performed as described in the previous sections.

Assessment of pathway enrichment using MalaCards relevance scores 415
To assess whether pathway enrichment results were in line with previous biological knowledge 416 on the phenotypes of interest, disease-associated relevance scores for each pathway were 417 constructed using information from the MalaCards database 59 . The MalaCards database 418 provides a disease relevance score for each gene based on experimental evidence and co-419 citation in the literature. For the six diseases included in this analysis (schizophrenia, AD, alcohol 420 consumption, LDL, CAD and BMI), we downloaded the MalaCards disease-associated relevance 421 scores (Accessed on 2020-11-27, see Extended Data Table 3 for disease terms used and 422 number of genes). Next, we performed a rank normalization of the scores where, assuming that 423 a disease has n genes with MalaCards scores, a score of (r+1)/(n+1) were assigned to each gene, 424 with r being the inverse ranking of the gene with MalaCards score. Genes without a MalaCards 425 score are assigned a score of 0. MalaCards provide gene information as gene symbols, which 426 were transformed to ENSEMBL gene names. 427 To obtain disease-associated relevance scores for each pathway, we calculated the sum of the 428 rank transformed MalaCards scores for the genes included in a pathway and divided by the 429 number of genes in the pathway to account for pathway size. which includes gene expression specificity information for 24 brain cell types obtained from single 452 cell RNA-sequencing data. Again, expression specificity of each brain cell type was divided into 453 11 quantiles with the first quantile containing all non-expressed genes in a given cell type. Genes 454 within each quantile were grouped into a single pathway. 455

Ranking the importance of cell type / tissue 456
To provide an objective estimate of tissue / cell-type importance for each phenotype, we invited 457 two experts (per phenotype) who were blind to our experiment and algorithm design to provide 458 their opinion on what cell-type(s) and tissue(s) are expected to be implicated for each disease 459 context (Extended Data Table 4). The expert response was coded as "none" (both experts think 460 tissue/cell type is not implicated), "single" (only one expert thinks a tissue/cell type is important) 461 and "both" (both experts agree about the importance of a tissue/cell type).

Cell type and tissue specificity analyses 463
We used two testing strategies to assess the relationship between disease GWAS signals and The concurrence of the methods' ranking of the tissues / cell types with that of the experts within 473 each disease for both the top quantile and linear enrichment strategies was measured by 474 regressing the inverse normalized -Log10 P-value for the top quantile / linear enrichment 475 strategies for each cell type / tissue against the expert opinion, coded as factor. 476 MAGMA has a specific model which accounts for expression specificity (`--gene-covar`). However, 477 in favour of a more consistent analysis between the three software methods, this model was not 478 used. It is thus possible that MAGMA can provide more powerful results using the dedicated 479

model. 480
Results from the regressions against the expert confidence score assessed the association of the 481 gene expression specificity and GWAS signal with the expert opinion for each pathway

Pseudo subtype analysis 485
Composite phenotypes were generated by combining two distinct phenotypes, including T2D, 486 CAD, Obesity (defined as BMI > 30), extreme height (defined as top 5% each sex) and 487 hypercholesterolemia (defined as LDL > 4.9 mmol/L) from the UK Biobank. To simplify the 488 analysis, we removed controls and co-morbid cases so that the analysis was only performed in 489 participants presenting either case outcome. Phenotypes were encoded mimicking sub-490 phenotypes of a given disease, for example, for the phenotype CAD-Obesity, samples with CAD 491 (and not Obesity) were coded as 0 and those with Obesity (and not CAD) were coded as 1. 492

Comorbid subtype analysis 493
For the analysis of subtypes with presence/absence of comorbid diseases, we used T2D, CAD, 494 Obesity and hypercholesterolemia, as these diseases present high comorbidity between them. 495 For each disease, three traits were derived mimicking the subtypes of the disease as the 496 presence/absence of the other disorders. 497 For computational expediency, sex, age, age of diagnosis (for CAD and T2D), genotyping batch, 498 recruitment centre and first 15 principal components were adjusted for by using pseudo residuals 499 from logistic regression analyses as the outcome variable in the PRS analyses. 500

Meta-analysis of GWASs 501
Previously published GWASs from each pair of traits were combined into a meta-analysis to 502 obtain joint summary statistics for each composite phenotype. Meta-analyses were performed 503 using METAL 58 with the sample-size weighted fixed-effects algorithm. To truly mimic a composite 504 phenotype GWAS, only variants included in both GWAS summary statistics were retained.

Calculation of PRSs and evaluation of model performance 506
PRS analyses were performed using a 5-fold cross validation. For each iteration, samples were 507 split into 80:20 training:validation subsets. Summary statistics of the "composite" phenotype were 508 used as base sample, and UK Biobank samples with the residualised phenotypes were used as 509 target sample (Extended Data Tables 5 and 6). 510 Pathway-specific PRSs for 4,079 pathways (see definition of pathways) were calculated using 511 PRSet. Competitive P-values were calculated using 10,000 permutations and pathways with 512 competitive P-value < 0.05 were defined as enriched. PRSs for the enriched pathways were 513 recalculated using P-value thresholding, such that the predictive power of each PRS was 514 maximized. The PRSs with the "best" predictive performance for each enriched pathway were 515 then included in a generalized linear model with lasso regularization using the `cv.glmnet` function 516 from the glmnet package (v4.0-2) in R with 5-fold cross-validation to select the lambda parameter 517 that generates the smallest out-of-sample mean squared error (MSE). The resultant best fitting 518 model was applied to the validation sample to calculate the model R 2 . 519 We also performed genome-wide PRS analyses using lassosum and PRSice-2. Parameters (P-520 value thresholds for PRSice; penalty factor λ and soft-thresholding parameter s for lassosum) 521 were optimised in the training samples and were applied to the validation samples to calculate 522 the model R 2 . 523

Stratification of inflammatory bowel disease subtypes 524
For the stratification of inflammatory bowel disease (IBD) subtypes, pathway and genome-wide 525 PRS were calculated using IBD summary statistics as base sample, and cases of Crohn's disease 526 vs cases of ulcerative colitis as target sample. Comorbid cases were excluded from the analysis. and pathways with non-zero coefficients after the lasso regularisation were used in the 536 unsupervised algorithm step. 537 PRS for the pathways selected in the training step were calculated for the validation sample using 538 PRSet (for pathway PRS) and PRSice (for genome-wide PRS). The silhouette method was 539 applied to select the optimal number of centroids for K-mean clustering, and we used the optimal 540 number of centroids and also 2 centroids (when the optimal is not 2) for the down-stream k-mean 541 clustering. K-mean cluster groups were regressed against the subtypes to assess the 542 classification performance of the K-means algorithm. death records (Field IDs 40001 and 40002) and primary care data, participants who were adopted 859 (Field ID 1767), as well as participants whose parents were aged under 60 years (Field IDs 1845 860 and 2946), dead before age 60 years, or without age information (Field IDs 3526 and 1807). After 861 merging with the genetic data, 41,164 participants had at least one parent who was affected by 862 AD and 223,253 participants with parents non-affected by AD. Parental AD status was then 863 defined as the number of parents affected by AD. Residuals of the parental AD status were 864 calculated by adjusting for the sex of the participant, paternal age, maternal age, recruitment Height: Standing height information was extracted from Field ID 50. Residuals were calculated 867 by adjusting for sex, age, recruitment centre, genotyping batch and 15 principal components using 868 linear regression. 869 Type 2 diabetes: Type 2 diabetes cases were ascertained as individuals diagnosed with non-870 insulin-dependent diabetes mellitus as indicated by their hospital records and primary care data 871 pathway relevance scores include part of the disease signal captured by the three pathway 925 enrichment tools. The effect was stronger for PRSet and LDSC than for MAGMA. For PRSet and 926 LDSC correlations were no longer significant (Supplementary Fig. 3b), and correlation 927 coefficients across the six diseases decreased from t = 0.077 to t = -0.0015 for PRSet, and from 928 t = 0.043 to t = 0.0076 for LDSC (Supplementary Fig. 4); whereas for MAGMA, results were still 929 correlated with the original pathway relevance scores, albeit with reduced correlation and 930 significance (t = 0.029; P-value = 7.35x10 -11 ). 931