Abstract
Colocalisation is a powerful approach to assess whether two genetic association signals are likely to share a causal variant. However, association analyses in large biobanks and molecular quantitative trait loci (molQTL) studies now routinely identify millions of association signals across thousands of traits, making it infeasible to test for colocalisation between all pairs of signals. Here we introduce gpu-coloc, a GPU-accelerated re-implementation of the coloc algorithm that combines efficient data storage with parallelisation to achieve a 1000-fold speed increase while maintaining near-identical results. As a result, the run time of gpu-coloc now approaches that of the colocalisation posterior probability (CLPP) method, a competing method that only uses information from fine mapped credible sets to detect colocalisations. Using summary statistics from UK Biobank, FinnGen, and eQTL Catalogue, we demonstrate that gpu-coloc and CLPP detect highly concordant results, especially when restricting the analysis to confidently fine mapped signals. We introduce the colocalisation collider metric to quantify spurious colocalisations in large-scale colocalisation graphs and use it to choose decision thresholds that provide a reasonable trade-off between sensitivity and specificity. Finally, we demonstrate how gpu-coloc can also be applied to marginal GWAS summary statistics from studies that lack fine mapping, where it is still able to recover molQTL colocalisations for ~80% of the GWAS loci. Our efficient software and comprehensive analyses provide practical guidelines for future large-scale colocalisation analyses.
Author summary
Over 90% of human genetic variants associated with human traits and diseases lie in non-coding regions of the genome, making it difficult to interpret the mechanisms by which these variants act. Genetic colocalisation is a powerful approach to disentangle these mechanisms by testing if two traits share a causal variant at a genetic locus. The traits of interest can cover complex diseases, metabolic measurements, or molecular traits such as gene expression levels profiled from inside the cells. However, as the size of genetic association datasets increases, there is a pressing need for more efficient computational methods. Here we present gpu-coloc, a 1000-fold faster re-implementation of the popular coloc algorithm designed for testing genetic colocalisations. Our approach yields nearly identical results compared to the original coloc implementation while easily scaling to millions of association signals detected in large biobanks. Finally, we perform extensive benchmarking to define optimal thresholds for colocalisation testing that minimise spurious overlaps and support biological interpretability.
Citation: Jesse M, Riet A-E, Alasoo K (2026) Ultra-fast genetic colocalisation across millions of association signals. PLoS Genet 22(6): e1012209. https://doi.org/10.1371/journal.pgen.1012209
Editor: Lin S. Chen, The University of Chicago, UNITED STATES OF AMERICA
Received: January 27, 2026; Accepted: June 7, 2026; Published: June 17, 2026
Copyright: © 2026 Jesse et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data used for this article is available publicly: eQTL Catalogue release 7 at https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/susie/, 56 metabolic traits by Rahu et al. 2025 summary statistics at https://doi.org/10.5281/zenodo.20100264, fine mapping results at https://zenodo.org/records/13821038, and FinnGen release 12 at https://www.finngen.fi/en/access_results. Colocalisation results generated in this study are available at https://doi.org/10.5281/zenodo.15878809. The pre-formatted eQTL Catalogue input files for gpu-coloc are available from https://github.com/mjesse-github/gpu-coloc.
Funding: This work was funded by the following grants from the Estonian Research Council (https://etag.ee/en/): PSG415 to K.A., salary to K.A. and M.J.; MOB3ERC115 to K.A., salary to K.A. and M.J.; PRG2531 to A-E.R., salary to A-E.R. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: We have read the journal’s policy and the authors have no competing interests to declare.
Introduction
Genome-wide association studies (GWAS) have identified millions of associations linking genetic variants to thousands of human traits and diseases. However, over 90% of these variants are non-coding and can often have cell type and context-specific effects, complicating efforts to identify their functional roles [1,2]. Genetic colocalisation methods can help to interpret GWAS by identifying molecular traits and biomarkers that share causal variants with disease or trait GWAS signals [3–7]. For example, colocalisation with gene expression quantitative trait loci (QTLs) has helped to prioritise effector genes and relevant cell types for various diseases [8,9]. Thus, large-scale genetic colocalisation has the potential to greatly improve GWAS interpretation.
One of the most widely used colocalisation methods is coloc [3], which only requires marginal association summary statistics from both studies. However, coloc makes the restrictive assumption of at most one causal variant per locus. This limitation can be overcome with statistical fine mapping, which distinguishes multiple conditionally distinct signals at the same locus [10,11]. The colocalisation posterior probability (CLPP) method, first presented in the eCAVIAR paper, performs colocalisation at the level of credible sets, the minimal sets of variants for a fine mapped signal expected to contain the conditionally distinct causal variant with a given probability (e.g., 95%) [4]. However, eCAVIAR’s exhaustive search fine mapping algorithm made it too slow for most practical datasets [4,10]. Large-scale fine mapping became feasible with the development of the highly scalable FINEMAP [12] and Sum of Single Effects (SuSiE) [13] algorithms. Importantly, both the coloc and CLPP methods have been adapted to support SuSiE output [6,14], making it possible to use signal-specific Bayes factors (BFs) and credible sets from SuSiE directly in colocalisation.
Although a large proportion of GWAS summary statistics in the GWAS Catalog now follow a standard format [15,16], sharing of fine mapping results is much more fragmented. For example, the Million Veterans Program [17] and the Open Targets Platform [18,19] have only released posterior inclusion probabilities (PIPs) of the credible set variants, which can be used by the CLPP method. In contrast, FinnGen [20], eQTL Catalogue [21,22] and a few other studies [14,23] have published the logarithms of Bayes factors (LBFs) for all tested variants, as required by coloc [6]. Moreover, while CLPP is defined at the variant level, coloc’s posterior probability that two association signals share a causal variant (PP.H4) is defined at the locus level, making it difficult to compare results from the two methods. Thus, rigorous empirical benchmarking is needed to understand the relative performance of the CLPP and coloc methods in identifying colocalising signals, and whether the ~1000-fold additional space required to store signal-specific LBFs is justified by potentially increased sensitivity of coloc.
A key challenge in benchmarking colocalisation is computational efficiency. While CLPP can be calculated almost instantaneously, and has been used previously to perform biobank-scale colocalisations [14,20], the current R implementation of coloc does not scale well to millions of colocalisation tests. Under the single causal variant assumption, coloc first converts marginal summary statistics to approximate Bayes factors (ABFs) [3]. In the presence of multiple causal variants, this step is replaced by fine mapping with SuSiE to obtain signal-specific conditionally distinct LBFs. In the next step, both implementations (hereinafter coloc.abf and coloc.susie) use the same algorithm (coloc.bf_bf, see Methods) to calculate coloc posterior probabilities between the two signals of interest. While the ABFs and LBFs can be pre-computed once and cached for all future colocalisation tests, the calculation of posterior probabilities by coloc.bf_bf needs to be repeated for all pairs of traits and is prohibitively expensive when performing millions of tests.
We developed gpu-coloc, a re-implementation of the coloc algorithm that combines caching of pre-computed Bayes factors (either ABFs or LBFs) with ultra-fast parallel calculation of coloc posterior probabilities to achieve approximately 1000-fold speed-up over the original R implementation while yielding nearly identical results. We then applied gpu-coloc to fine mapped signals from the eQTL Catalogue [21,22], FinnGen [20] and Rahu et al., 2025 [23] (Fig 1). The fast re-implementation allowed us to systematically compare the colocalisation results from CLPP and gpu-coloc to empirically identify thresholds at which CLPP and gpu-coloc yielded comparable results. We found that when restricting the analysis to confidently fine mapped signals, > 90% of the colocalisations identified by CLPP and gpu-coloc were shared, demonstrating good concordance between the two methods. Finally, we demonstrate that gpu-coloc can also be used in the absence of fine mapping, in which case it is able to identify colocalising molecular quantitative trait loci (molQTLs) for ~80% of the GWAS loci (compared to when full fine mapping results are available).
gpu-coloc extracts all independent associations from molecular QTL (molQTL) compendia or GWAS databases (1), sorts the signals by genomic position (2), divides the signals into chunks of ~1000 signals (3), masks missing variants (4), and performs GPU-assisted colocalisation between all overlapping chunks (5). We apply gpu-coloc to fine mapped association signals from eQTL Catalogue [21,22], FinnGen [20] and Rahu et al., 2025 [23], and compare the results to the colocalisation posterior probability (CLPP) approach.
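Steps (2), (3) and (5) of this workflow can be illustrated with a minimal pure-Python sketch. It is not the gpu-coloc implementation: the tuple layout `(signal_id, chrom, start, end)` and the function names `chunk_signals` and `overlapping_chunk_pairs` are assumptions made for illustration only, and the sketch simply assumes that chunks never span two chromosomes.

```python
from itertools import groupby

def chunk_signals(signals, chunk_size=1000):
    """Sort signals by position and split them into chunks of up to chunk_size.

    signals: list of (signal_id, chrom, start, end) tuples (hypothetical layout).
    """
    ordered = sorted(signals, key=lambda s: (s[1], s[2]))
    chunks = []
    # Assumption: a chunk never mixes signals from different chromosomes.
    for chrom, group in groupby(ordered, key=lambda s: s[1]):
        members = list(group)
        for i in range(0, len(members), chunk_size):
            part = members[i:i + chunk_size]
            chunks.append({
                "chrom": chrom,
                "start": min(s[2] for s in part),
                "end": max(s[3] for s in part),
                "signals": [s[0] for s in part],
            })
    return chunks

def overlapping_chunk_pairs(chunks_a, chunks_b):
    """Return index pairs of chunks that share a chromosome and overlap in position."""
    return [
        (i, j)
        for i, a in enumerate(chunks_a)
        for j, b in enumerate(chunks_b)
        if a["chrom"] == b["chrom"] and a["start"] <= b["end"] and b["start"] <= a["end"]
    ]
```

Only the chunk pairs returned by `overlapping_chunk_pairs` would need to be passed on to the colocalisation step, which is what keeps the number of tested signal pairs manageable in a sketch like this.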
Results
Implementation of the gpu-coloc method
Our method, gpu-coloc, is an efficient reimplementation of the coloc.bf_bf [3,6] algorithm in Python with GPU support. Instead of performing fine mapping at each GWAS locus, gpu-coloc uses SuSiE fine mapped LBFs from publicly available sources such as eQTL Catalogue [21,22], FinnGen [20] and Rahu et al., 2025 [23]. These LBF vectors are first converted into matrices in parquet format for fast retrieval (Fig 1, see Methods). Alternatively, gpu-coloc can also leverage pre-calculated ABFs from studies that lack reliable fine mapping information. Similarly to coloc, gpu-coloc then estimates the posterior probability that two signals share a causal variant in the region (PP.H4, see Methods). Whereas the original coloc.bf_bf implementation in R tests for colocalisation between one pair of signals at a time, gpu-coloc can test thousands of pairs concurrently (Fig 1). These large matrix calculations are made possible by chunking signals by genomic position, where the matrix rows are signals, the columns are variants, and the matrix entries are LBFs or ABFs. Our gpu-coloc implementation uses the same prior probability parameters as coloc.bf_bf: the priors p1 and p2 specifying the prior probabilities that any variant in the shared region is causal for the first or second signal, respectively; and the joint prior p12 specifying the prior probability that a variant is causal for both signals [3].
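To make the roles of p1, p2 and p12 concrete, the following is a minimal pure-Python sketch of the five-hypothesis posterior calculation as published for the original coloc method [3]. It is an illustration under stated assumptions rather than the gpu-coloc code: the function names are hypothetical, both LBF vectors are assumed to cover the same variants in the same order, and the nested loop in `all_pairs_pp_h4` only stands in for the batched matrix operations that gpu-coloc performs on the GPU.

```python
import math

def logsum(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-scale values."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def logdiff(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    d = -math.expm1(b - a)  # equals 1 - exp(b - a)
    if d <= 0.0:
        return -math.inf
    return a + math.log(d)

def coloc_posteriors(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-6):
    """Posterior probabilities of H0..H4 for two signals over the same variants."""
    s1, s2 = logsum(lbf1), logsum(lbf2)
    s12 = logsum([a + b for a, b in zip(lbf1, lbf2)])
    log_h = [
        0.0,                                                  # H0: no association
        math.log(p1) + s1,                                    # H1: signal 1 only
        math.log(p2) + s2,                                    # H2: signal 2 only
        math.log(p1) + math.log(p2) + logdiff(s1 + s2, s12),  # H3: distinct variants
        math.log(p12) + s12,                                  # H4: shared variant
    ]
    denom = logsum(log_h)
    return [math.exp(h - denom) for h in log_h]

def all_pairs_pp_h4(rows_a, rows_b, **priors):
    """PP.H4 for every pair of rows; a loop stand-in for the batched GPU version."""
    return [[coloc_posteriors(a, b, **priors)[4] for b in rows_b] for a in rows_a]
```

In this sketch, two signals peaking at the same variant receive a PP.H4 close to 1, whereas two signals peaking at different variants put almost all posterior mass on H3.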
An important consideration in any colocalisation analysis is how to handle associations with variants that are present in one dataset but missing in the other. By default, the pairwise comparisons performed by coloc.abf and coloc.susie use the intersection of variants present in both studies. This ensures that information from missing variants is appropriately treated as missing, rather than evidence for or against colocalisation. However, this is not feasible when calculating coloc posterior probabilities as matrix operations in gpu-coloc (Fig 1). As a workaround, within each collection of association summary statistics (e.g., different traits within FinnGen), gpu-coloc replaces all missing LBF values with a very small value (-10⁶), effectively assuming that these variants are not associated with the trait of interest (see Methods). This mask was chosen via sensitivity analysis across values of 0, -2, -10, -10⁶, and -10⁸. The mean difference from coloc.bf_bf plateaued at masks ≤ -10, but -10⁶ was preferred over -10 as it did not induce artefact results (S1 Fig). Next, when performing colocalisations between two collections of summary statistics or fine mapping results (e.g. eQTL Catalogue vs FinnGen), gpu-coloc, similarly to coloc.bf_bf, only keeps the intersection of variants that are present in both collections. This approach achieves high computational efficiency while ensuring that variants systematically missing between two different collections of summary statistics (e.g., due to differences in allele frequency thresholds or genotype imputation panels) are still appropriately treated as missing.
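The two-level treatment of missing variants described above can be sketched as follows. This is a hedged illustration, not the gpu-coloc code: the dictionary layout and the names `build_lbf_matrix` and `restrict_to_shared` are assumptions, and only the fill value of -10⁶ is taken from the text.

```python
import math

MASK_LBF = -1e6  # fill value reported in the text for variants missing within a collection

def logsum(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-scale values."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def build_lbf_matrix(signals):
    """Build dense rows over the union of variants within one collection.

    signals: dict signal_id -> dict variant_id -> LBF (hypothetical layout).
    Variants missing for a signal are filled with MASK_LBF.
    """
    variants = sorted({v for lbfs in signals.values() for v in lbfs})
    rows = {sid: [lbfs.get(v, MASK_LBF) for v in variants] for sid, lbfs in signals.items()}
    return variants, rows

def restrict_to_shared(variants, rows, other_variants):
    """Keep only the columns whose variant is also present in the other collection."""
    keep = set(other_variants)
    idx = [i for i, v in enumerate(variants) if v in keep]
    shared = [variants[i] for i in idx]
    return shared, {sid: [row[i] for i in idx] for sid, row in rows.items()}
```

Because exp(-10⁶) underflows to zero, a masked entry contributes nothing to the sums over variants in this sketch, which matches the stated interpretation that masked variants are treated as not associated with the trait.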
We believe that our masking approach is well suited for collections of summary statistics where the proportion of missing variants is low, such as in biobank-wide GWAS studies [17,20,23,24], where the same set of variants is tested against all traits. Similarly, in the eQTL Catalogue, the genotype data from all studies have been imputed against the same 1000 Genomes reference panel [25], thus reducing the number of missing variants to a minimum. However, our masking strategy might not be appropriate when assembling a heterogeneous set of summary statistics from, e.g., GWAS Catalog [16] into a single large collection.
gpu-coloc is a faster version of coloc.bf_bf with comparable accuracy
First, we validated gpu-coloc by comparing it to the official R implementation of coloc [6]. We compared gene expression QTLs (eQTLs) from eQTL Catalogue GTEx [26] datasets and fine mapped GWAS signals for 56 metabolic traits from the UK Biobank analysed by Rahu et al., 2025 [23] on chromosome 19. Using conservative prior probabilities (p1 = p2 = 1 × 10−4, p12 = 1 × 10−6), we found that gpu-coloc and coloc.bf_bf produced almost identical PP.H4 values when focussing on high-confidence colocalisations (PP.H4 ≥ 0.8) (standard deviation of 0.0012, maximum absolute error 0.0087, mean absolute error = 0.0003) (Fig 2A). Across the full range of PP.H4 values, most values were still highly concordant (mean absolute error = 0.0007), but there were 60 outliers (0.018% of all tested pairs) where the absolute error exceeded 0.02 (Fig 2A).
(A) Comparison of the PP.H4 values between gpu-coloc and coloc.bf_bf. (B) Time spent in milliseconds per colocalisation test between coloc.bf_bf, and gpu-coloc with CPU and Metal backends. (C) Comparison of PP.H4 values while including or excluding indels. (D) Posterior probability comparison between gpu-coloc and CLPP across varying prior parameterisations (p12 = 5 × 10−6, 1 × 10−6, and 1 × 10−7) in the FinnGen versus eQTL Catalogue colocalisation.
We next compared the time spent per colocalisation test between gpu-coloc and coloc.bf_bf while varying the number of colocalisation tests from 230 (single GWAS vs a single eQTL dataset) to 359,698 (all GTEx eQTLs vs all metabolic trait associations) (Fig 2B). We ran all benchmarks single-threaded and removed all signals with maximum LBF < 5, as we have found these do not result in colocalisations. We tested gpu-coloc with all three Torch backends: CUDA GPU, Mac Metal GPU, and CPU. We found that while the time spent per colocalisation test stayed flat or even increased for coloc.bf_bf, gpu-coloc became more efficient as the number of parallel tests increased (Fig 2B). When benchmarking just the colocalisation step of the analysis on the University of Tartu HPC system (thus excluding data input-output (IO) and various pre-processing steps), gpu-coloc with the CUDA GPU backend was 795-fold faster than the R implementation and gpu-coloc with the CPU backend was 190-fold faster on the same node (see Methods). Total speed-ups (including IO and pre-processing steps) of the two backends were 1658-fold and 516-fold, respectively (S2 Table). As the GPU backend's speed-up cannot be disentangled from the more powerful hardware (see Methods), we also benchmarked coloc.bf_bf against gpu-coloc on an M2 Max 14-inch MacBook Pro with 32 GB RAM. Compared to R, gpu-coloc with the Metal GPU backend yielded a 369-fold speed-up in the colocalisation step and, surprisingly, the CPU backend a 416-fold speed-up in the colocalisation step, with a total speed-up of 1,170-fold (S3 Table). This demonstrates that gpu-coloc can also be successfully run on machines without GPUs, while still maintaining the large performance gains over a naive R implementation.
Formatting data for gpu-coloc, performed once prior to computation, took an additional 5 minutes 23 seconds on the MacBook Pro, reducing the total effective runtime advantage to approximately 122-fold. However, the formatting step is computationally inexpensive, linear-time, and reusable across multiple analyses, thus imposing minimal overhead on subsequent studies. Furthermore, we have publicly released properly formatted LBF files for all eQTL Catalogue signals (250 GB total, ~ 50 GB per molQTL type) and will keep these up-to-date as new versions of the eQTL Catalogue are released (see Data availability).
Finally, to test large-scale colocalisation on a personal computer, we ran gpu-coloc between 306 eQTL Catalogue gene expression datasets (~65 GB) and FinnGen r12 (~13 GB) on an M2 Max 14-inch MacBook Pro with 32 GB RAM. The run took 1 hour, 29 minutes and 51 seconds, during which we tested colocalisation between 15,654,823 unique signal pairs. Thus, gpu-coloc enables researchers to efficiently test for colocalisation between their GWAS results and all of the eQTL Catalogue, and other large-scale molQTL and GWAS studies, on their personal computers.
Effect of missingness on colocalisation between datasets
As demonstrated above (Fig 2B), gpu-coloc achieves the largest speed-up when performing many-to-many colocalisation tests. However, an important limitation of this approach is that if some association signals contain missing genetic variants, then these need to be masked globally (with coloc.bf_bf, missing variants are handled separately for each pairwise colocalisation test). To quantify the impact of masking missing values, we performed colocalisation between GTEx eQTLs and Rahu et al., 2025 metabolic trait signals twice: once with indels included for all metabolic traits, and once after indels were removed from every other metabolic trait (so that they would be treated as missing variants and masked). We then restricted both analyses to the metabolic traits where indels had been excluded in one version of the analysis and compared the resulting colocalisation PP.H4 values to each other (Fig 2C). Of the 2,923 colocalising pairs (PP.H4 ≥ 0.8) detected when indels were included throughout, 113 (3.9%) no longer passed the trim_posterior step after indels were removed, and a further 18 (0.6%) fell below the PP.H4 ≥ 0.8 threshold. Conversely, of the 2,968 pairs detected in the analysis with indels removed, 150 (5.1%) had been below the PP.H4 threshold in the original analysis and 26 had not passed trim_posterior, suggesting these are false positives induced by masking of missing values (Fig 2C). Interestingly, the impact on colocalisation was very similar when indels were removed from all metabolic trait signals (and thus completely excluded from colocalisation testing) (S2 Fig), suggesting that missingness itself has a larger impact on colocalisation than masking.
Comparison of posterior probabilities between gpu-coloc and CLPP
An even faster alternative to coloc is the CLPP method introduced with eCAVIAR [4]. On the same benchmark where we compared gpu-coloc and coloc.bf_bf, gpu-coloc was ~200 times slower (37.7 seconds) than CLPP (0.18 seconds). Furthermore, storing the LBFs for all tested variants and fine mapped signals required ~1000-fold more space (1.6 GB) than simply storing the PIPs for the credible set variants (20.3 MB). However, while coloc’s PP.H4 has an intuitive interpretation at the locus level (probabilities close to 1 indicate strong evidence for colocalisation), CLPP is a simple sum of variant-level joint probabilities (see Methods) and values as small as 0.1 or even 0.01 are often used as evidence for colocalisation [4]. As a result, it has been challenging to directly compare the colocalisation results from the two methods [27,28]. To explore this question in more detail, we calculated both the CLPP and gpu-coloc PP.H4 values for approximately 200 million molQTL and GWAS signal pairs across the eQTL Catalogue and FinnGen datasets. We found that although the relationship between the PP.H4 and CLPP values was not linear, there was a continuous lower bound allowing us to identify CLPP values above which the PP.H4 always exceeded a certain threshold (Fig 2D). For example, using the coloc.bf_bf default prior probabilities of p1 = p2 = 1 × 10−4 and p12 = 5 × 10−6, and filtering for pairs where CLPP ≥ 0.01, revealed that all these signal pairs had PP.H4 ≥ 0.8 (Fig 2D). The converse was not always true, as some signal pairs with PP.H4 ≥ 0.8 had lower (or missing) CLPP values (Fig 2D). Decreasing the p12 prior probability to 1 × 10−6 increased the CLPP threshold corresponding to PP.H4 ≥ 0.8 to 0.04. Finally, when the p12 prior probability was set to 1 × 10−7, the CLPP value corresponding to PP.H4 ≥ 0.8 was ~0.3 (Fig 2D). Colocalisation between Rahu et al., 2025 metabolic traits and eQTL Catalogue revealed similar results (S2A Fig).
This demonstrates that it is possible to define thresholds where PP.H4 and CLPP are directly comparable, but leaves the question of choosing the optimal threshold and corresponding prior probabilities.
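For comparison with PP.H4, the CLPP statistic referred to above can be sketched in a few lines. The sketch assumes that the variant-level joint probability is the product of the two PIPs and that the sum runs over variants present in both credible sets; the function name `clpp` and the variant identifiers used with it are hypothetical, and the exact definition used in the study is given in its Methods.

```python
def clpp(pips_a, pips_b):
    """Sum of PIP products over variants shared by two credible sets.

    pips_a, pips_b: dict variant_id -> posterior inclusion probability (PIP).
    Returns 0.0 when the two credible sets share no variants.
    """
    shared = pips_a.keys() & pips_b.keys()
    return float(sum(pips_a[v] * pips_b[v] for v in shared))
```

Because each term is a product of two probabilities, the statistic in this sketch stays well below 1 unless both credible sets concentrate their PIP on the same variant, which is one way to see why small CLPP values are already taken as evidence for colocalisation.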
Choosing the colocalisation thresholds for gpu-coloc and CLPP
If we decide to keep the PP.H4 threshold fixed (e.g., PP.H4 ≥ 0.8), then how should we select the optimal p12 prior probability and the corresponding CLPP threshold? For coloc, a recent analysis recommends setting p12 = 5 × 10−6 as a good starting point [29]. Interestingly, at the PP.H4 ≥ 0.8 threshold, this corresponds to CLPP ≥ 0.01, which was independently shown in the CLPP method simulations to have a low false positive rate [4]. However, these thresholds have been derived in the context of comparing a single GWAS locus to a single molQTL locus and it is unclear if they are still suitable when performing millions of colocalisation tests across hundreds of thousands of fine mapped loci. Furthermore, in cases of limited statistical power, SuSiE might sometimes merge the associations driven by two independent causal variants into a single signal. This is most obvious in eQTL studies where the same gene might have two or more conditionally distinct signals in a well-powered study but only one signal in a study with a smaller number of samples [28]. This can then lead to colocalisation colliders, where two fine mapped signals for gene X in dataset A do not directly colocalise with each other (since they are conditionally distinct), but do colocalise with a non-fine mapped signal for the same gene in dataset B (Fig 3A). The simplest collider events contain three signals, but they could also be part of longer chains. While these colocalisation colliders are technically not false positives (there is still at least one shared causal variant in both studies, see example in S3 Fig), they complicate the interpretation of conditionally distinct signals which might perturb the target gene via different mechanisms. Thus, when the goal is to characterise the pleiotropic impact of individual causal variants [18,30], it might be desirable to reduce colocalisation colliders as much as possible.
(A) Example of a colocalisation collider. Two fine mapped eQTL signals for gene X in dataset A (signals 1 and 2, top row) do not colocalise with each other (because they are conditionally distinct), but both separately colocalise with an eQTL signal for the same gene in dataset B. (B) Effect of p12 prior probability and CLPP threshold on the rate of colocalisation colliders observed in the eQTL Catalogue all-against-all eQTL colocalisation graph. For all p12 values, the PP.H4 threshold was set to PP.H4 ≥ 0.8. (C) Number of method-specific colocalisation results for gpu-coloc and CLPP in the eQTL Catalogue all-against-all colocalisation at comparable decision thresholds (CLPP ≥ 0.04 and p12 = 1 × 10−6, PP.H4 ≥ 0.8), and the proportion of gpu-coloc-specific results where either signal did not have a credible set (CS). (D) Percentage of clusters containing at least one colocalisation collider for all gpu-coloc colocalisation results (CLPP ≥ 0.04 and p12 = 1 × 10−6, PP.H4 ≥ 0.8), for gpu-coloc results where both signals have credible sets, for colocalisation results shared by gpu-coloc and CLPP, and for all CLPP colocalisation results.
To identify colocalisation colliders, we used gpu-coloc and CLPP to perform all-against-all colocalisation between 636,057 fine mapped eQTL signals in the eQTL Catalogue [21,22]. We constructed a colocalisation graph, where the nodes are the signals and the edges are all pairwise significant colocalisations between the signals (PP.H4 ≥ 0.8). We then extracted all connected components (clusters) of that graph and counted the number of clusters that contained at least one colocalisation collider. At comparable thresholds, the rate of colliders was always higher for gpu-coloc relative to CLPP and decreased as the thresholds became more stringent (Fig 3B). For example, with p12 = 5 × 10−6 and PP.H4 ≥ 0.8, 7.31% of gpu-coloc clusters contained at least one colocalisation collider. In contrast, 3.03% of the clusters with CLPP ≥ 0.01 contained at least one collider. With p12 = 1 × 10−6 and the corresponding CLPP ≥ 0.04, the rate of colliders decreased to 2.52% and 1.12%, respectively (Fig 3B). Since the rate of decrease in collider events was lower for more stringent CLPP values (S4 Fig), we decided to use CLPP ≥ 0.04 (and the corresponding p12 = 1 × 10−6) as our default thresholds for the remaining analyses.
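The graph construction and collider count described above can be sketched with a small union-find routine. This is an illustrative simplification rather than the study's code: the function names are hypothetical, and a collider is assumed here to be any cluster containing two signals for the same molecular trait in the same dataset that are not directly connected by a colocalisation edge.

```python
from collections import defaultdict
from itertools import combinations

def connected_components(edges):
    """Return the clusters (sets of nodes) of an undirected graph via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())

def has_collider(cluster, signal_source, edge_set):
    """True if two signals from the same (dataset, trait) lack a direct edge."""
    by_source = defaultdict(list)
    for signal in cluster:
        by_source[signal_source[signal]].append(signal)
    for members in by_source.values():
        for a, b in combinations(members, 2):
            if frozenset((a, b)) not in edge_set:
                return True
    return False

def collider_rate(edges, signal_source):
    """Fraction of clusters that contain at least one collider (0.0 if no clusters)."""
    edge_set = {frozenset(e) for e in edges}
    clusters = connected_components(edges)
    if not clusters:
        return 0.0
    n_collider = sum(has_collider(c, signal_source, edge_set) for c in clusters)
    return n_collider / len(clusters)
```

Applied to the three-signal example of Fig 3A, the two dataset A signals end up in one cluster through their shared dataset B neighbour without being directly connected, so that cluster is counted as containing a collider.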
As an alternative strategy to detect colocalisation colliders, we quantified the number of colocalisation clusters that contained credible sets with two or more confidently fine mapped genetic variants (PIP > 0.9). We observed a broadly similar trend where the proportion of clusters containing at least one collider event decreased at more stringent colocalisation thresholds (S5 Fig). However, the overall collider rate was higher with this metric, perhaps because variant-level PIPs might not always be perfectly calibrated [31].
Colocalisation colliders are caused by fine mapping uncertainty
Next, we wanted to see what leads to the increased rate of colocalisation colliders in the gpu-coloc analysis compared to CLPP. In the eQTL Catalogue all-against-all colocalisation, we detected a total of 10,334,443 colocalisation events (PP.H4 ≥ 0.8, p12 = 1 × 10−6), of which 8,726,442 (84.4%) were shared between gpu-coloc and CLPP, 1,052,332 (10.2%) were detected only by gpu-coloc, and 555,669 (5.4%) were detected only by CLPP (Fig 3C). These proportions varied slightly in the eQTL Catalogue vs FinnGen and eQTL Catalogue vs Rahu et al., 2025 analyses (S6 Fig). Notably, the majority (68%) of the additional colocalisations detected by gpu-coloc were between signals where at least one of them was missing a credible set (Fig 3C). The primary reason why SuSiE might not report a credible set in the presence of large LBFs (so that the colocalisation could be detected by gpu-coloc and not by CLPP) is that the variants to be included in the credible set would fail SuSiE’s purity filter. This filter is implemented in SuSiE to reduce the probability that a credible set would contain more than one independent causal variant. By default, this means that the minimal LD between any two variants included in the credible set has to be r2 > 0.5. Thus, we hypothesised that these SuSiE signals that failed purity filtering might have caused the increased rate of colocalisation colliders that we observed for gpu-coloc.
To test this, we excluded the 718,378 edges involving at least one SuSiE signal without a credible set (Fig 3C) from the gpu-coloc eQTL Catalogue all-against-all colocalisation graph and re-calculated the colocalisation collider rate. The collider rate decreased from 2.52% to 1.18% (Fig 3D). In contrast, when we excluded a random set of 718,378 edges 1000 times, the rate of clusters containing a collider decreased only marginally (new mean collider rate = 2.51%, p < 0.001). Finally, when we focussed on colocalisations that were shared between gpu-coloc and CLPP, their collider rate was 1.118%, which was almost the same as the collider rate of CLPP alone (1.124%) (Fig 3D). We observed broadly concordant results in the eQTL Catalogue vs FinnGen and eQTL Catalogue vs UK Biobank analyses (S6 Fig). Thus, these results indicate that gpu-coloc colocalisations involving a SuSiE signal that failed purity filtering should be treated with caution. When all such signals were excluded from gpu-coloc, the proportion of colocalisations shared with CLPP increased from 84% to 91%, demonstrating that gpu-coloc and CLPP produce largely concordant results.
Running gpu-coloc without fine mapping
By default, gpu-coloc expects SuSiE fine mapped signal-specific LBFs for each trait used in colocalisation. However, most publicly available GWASs (except for FinnGen and eQTL Catalogue) have not been fine mapped. Furthermore, accurate fine mapping might not be feasible for large-scale GWAS meta-analyses combining summary statistics from heterogeneous studies [32]. In that scenario, it is possible to rely on the single causal variant assumption of the coloc.abf method [3] and convert GWAS summary statistics into ABFs.
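The conversion from marginal summary statistics to ABFs can be sketched with Wakefield's approximation, which is the form used by the original coloc method [3]. The sketch below is illustrative only: the function name `log_abf` is hypothetical, and the default prior standard deviation of 0.2 is an assumption (a value commonly used for case-control traits) rather than a setting confirmed by the text.

```python
import math

def log_abf(beta, se, prior_sd=0.2):
    """Natural-log approximate Bayes factor for one variant (Wakefield's approximation).

    beta: estimated effect size; se: its standard error;
    prior_sd: assumed standard deviation of the effect-size prior.
    """
    z = beta / se
    v = se ** 2
    r = prior_sd ** 2 / (prior_sd ** 2 + v)  # shrinkage factor between 0 and 1
    return 0.5 * (math.log(1.0 - r) + r * z ** 2)
```

Applying such a function to every variant in a locus yields a single vector of log ABFs per trait, which under the single causal variant assumption can take the place of the signal-specific LBF vectors in the colocalisation step.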
To test how the restrictive single causal variant assumption influences colocalisation, we obtained the marginal GWAS summary statistics from FinnGen r12 [20] and Rahu et al., 2025 [23], converted these to ABFs and then performed colocalisation against the fine mapped LBFs from the eQTL Catalogue (using the same prior p12 = 1 × 10−6). We then compared the ABF colocalisations to those found using the fine mapped LBFs available from both studies. We found that in both cases, fine mapping increased the number of colocalising GWAS loci by ~18% and the number of conditionally distinct colocalising signals by >50% (Fig 4A). Nevertheless, for the majority of GWAS loci, fine mapping was not necessary to identify colocalisation with molQTLs, demonstrating that gpu-coloc can also be reliably used in scenarios where fine mapping results are not available.
(A) The number of GWAS loci and signals from Rahu et al., 2025 and FinnGen r12 that colocalise with at least one molecular QTL from the eQTL Catalogue (p12 = 1 × 10−6, PP.H4 ≥ 0.8). The results have been stratified by GWAS fine mapping status. (B) Example of a colocalisation detected without fine mapping. GWAS signal for bacterial pneumonia at the CRP locus (1-159713648-C-G, rs1800947) colocalises with a CRP sQTL in the GTEx liver tissue dataset (QTD000270) from the eQTL Catalogue. (C-D) Two examples of colocalisations detected only after fine mapping. Secondary GWAS signal (panel C) for total branched-chain amino acids (Total_BCAA) (19-48806519-G-C, rs117048185) at the BCAT2 locus colocalises with an sQTL for BCAT2 in the GEUVADIS LeafCutter dataset (QTD000114) of the eQTL Catalogue. Fourth signal for Total_BCAA (19-48797174-G-A, rs35230038) at the BCAT2 locus (panel D) colocalises with a BCAT2 eQTL in the FUSION adipose tissue eQTL dataset (QTD000090) from the eQTL Catalogue.
For example, a GWAS hit for bacterial pneumoniae at the C reactive protein (CRP) locus (lead variant rs1800947) in FinnGen r12 colocalised with a splicing QTL for the CRP gene in the GTEx v8 liver dataset (QTD000270) (Figs 4B and S7). This colocalisation was detected both with fine mapped LBFs (PP.H4 = 0.998) and with ABFs calculated from marginal summary statistics without fine mapping (PP.H4 = 0.988) (Fig 4B). Although CRP is a pattern recognition receptor that can bind to bacterial polysaccharides [33], it is also widely used as a diagnostic criterion to distinguish between bacterial and viral infections [34,35]. Thus, it is highly likely that the FinnGen GWAS signal is better explained by diagnostic bias (i.e., if infected, variant carriers are more likely to receive a diagnosis) rather than a true biological effect on infection risk.
In contrast, at the branched-chain amino acid aminotransferase 2 (BCAT2) locus for total branched-chain amino acids (Total_BCAA) from Rahu et al., 2025 [23], there were five fine-mapped conditionally distinct signals that interfered with each other. Thus, using ABFs calculated from marginal summary statistics did not reveal any colocalisations. However, using fine mapped LBFs detected 22 colocalisations with four conditionally distinct signals (S1 Table), including the two highlighted here: GWAS signal 2 (lead variant rs117048185) colocalised with a splicing QTL for BCAT2 in the GEUVADIS [36] lymphoblastoid cell line dataset (PP.H4 = 1) (Figs 4C and S8) and GWAS signal 4 (lead variant rs35230038) colocalised with an eQTL for BCAT2 in the FUSION [37] adipose tissue dataset (PP.H4 = 0.93) (Figs 4D and S9). Although rs117048185 is also annotated as a missense variant, it is located 2 bp from the splice donor site and is predicted by SpliceAI [38] to lead to splice donor loss (delta score = 0.41), which is consistent with the reduced rate of exon 3 inclusion in the RNA-seq data (S8 Fig). This highlights how fine mapping can reveal colocalisations with multiple conditionally distinct signals that perturb the same target gene via distinct mechanisms.
Recommendations for large-scale colocalisation analysis
Based on our analysis, we make the following recommendations for future large-scale colocalisation analyses:
- If PIPs from fine mapped credible sets are available for all traits of interest, then relying on the CLPP method alone is likely to be the most efficient approach with only minimal loss in sensitivity. Furthermore, fine mapped credible sets for many large-scale GWAS are publicly available from the Open Targets Platform (https://platform.opentargets.org/downloads) [18].
- We recommend using gpu-coloc as a computationally efficient alternative to coloc.bf_bf when either some or all traits have not been fine mapped, or when there is a need to include low-confidence fine mapped signals that fail SuSiE’s purity filtering.
- Although our method is called gpu-coloc, it achieves competitive speedup even when only CPUs are available (Fig 2B). Thus, when taking the higher cost of GPU hardware into account, using the CPU backend might be the best option for most applications.
- We believe that setting the p12 prior to 1 × 10−6 and using a PP.H4 ≥ 0.8 threshold for gpu-coloc represents a reasonable trade-off between sensitivity and limiting colocalisation colliders for most applications. For the CLPP method, this corresponds to retaining results with CLPP ≥ 0.04.
As a practical example, we have used gpu-coloc in a parallel effort [39] to identify over 932,000 colocalisation events between 85,821 metabolic trait GWAS signals and over 1 million GWAS and molecular trait associations across three large biobanks [17,20,24] and three gene expression, splicing and protein QTL datasets [21,40,41]. We have made these results publicly available at https://elixir.ut.ee/eqtl/nmr-coloc. While perhaps technically feasible, this analysis would have been prohibitively expensive using the original R implementation of coloc.
Discussion
We introduce a novel implementation of the coloc algorithm, gpu-coloc, which enables biobank-scale colocalisation. Our approach achieves up to 1000-fold increase in computational speed compared to coloc.bf_bf while maintaining comparable accuracy. We performed colocalisation between all fine mapped molecular traits from eQTL Catalogue release 7, fine mapped GWAS signals from FinnGen release 12, and fine mapped metabolic traits from Rahu et al., 2025 [23]. We identified thresholds at which gpu-coloc and CLPP yielded comparable results and introduced the concept of colocalisation colliders to quantify spurious colocalisations in large-scale colocalisation graphs. We found that although gpu-coloc had ~20% increased sensitivity compared to CLPP, it also had a ~2.5-fold higher rate of colocalisation colliders. This difference was driven by low-confidence fine mapped signals that did not pass SuSiE’s purity filter and were thus excluded by CLPP. After these low-confidence fine mapped signals were excluded from gpu-coloc, both methods yielded highly concordant (~90% shared) results. Finally, we demonstrate that gpu-coloc can still be used when fine mapping is not feasible for one (or both) sets of traits, in which case it is expected to find colocalisation for approximately 80% of the GWAS loci.
The gpu-coloc approach is related to several other methods. The tensorQTL [42] software package also provides a GPU-enabled implementation of the coloc algorithm, but their implementation does not currently support SuSiE LBFs as input and is limited to testing colocalisation separately for each pair, thus missing a large proportion of the efficiency gains that we demonstrate. HyPrColoc [27] can efficiently test the colocalisation between multiple signals at the same time, but is restricted by the single causal variant assumption and does not provide pairwise colocalisation estimates for all traits. Flanders only stores the LBFs for variants within the credible sets and imputes the others [43]. Finally, gwas-pw [44] and fastENLOC [5,7] use enrichment analysis to identify prior probabilities for colocalisation instead of defining these values a priori.
Our approach also has several limitations. The GPU-enabled parallelisation of gpu-coloc assumes that there are multiple association signals in the same genetic locus. Thus, gpu-coloc performs best when testing for colocalisations between hundreds or even thousands of traits at the same time, or alternatively, when testing a single GWAS study against a large uniform collection of summary statistics. When testing for colocalisation between a single molQTL study and a single GWAS, there is unlikely to be a significant speed-up compared to the tensorQTL implementation. Furthermore, for one collection of summary statistics, gpu-coloc assumes that all variants were tested for all traits and that there are no missing values. This assumption is likely to be satisfied in the eQTL Catalogue and large, uniformly processed biobanks (e.g., FinnGen [20], Pan-UK Biobank [24], Million Veterans Program [17]), but might be problematic when harmonising heterogeneous summary statistics from sources such as the GWAS Catalog [16] or the IEU OpenGWAS database [45].
We believe our comprehensive analysis provides practical guidelines for future large-scale colocalisation studies. We strongly encourage public sharing of fine mapping results whenever possible as this can help to distinguish between multiple conditionally distinct signals with different molecular mechanisms at the same locus (as illustrated by the BCAT2 example in Fig 4). While sharing fine mapped LBFs for all tested variants does provide some advantages, our results indicate that ~95% of high-confidence gpu-coloc colocalisations can already be identified with the CLPP method that requires only the PIPs from the credible set variants. Thus, we hope that studies that are currently unable to openly share full genome-wide summary statistics (e.g., due to re-identification risk), will be able to publicly release PIPs for the variants in fine mapped credible sets.
Methods
Datasets used for colocalisation analysis
The eQTL Catalogue is an open repository of uniformly processed human gene expression and splicing QTLs [21,22]. Release 7 of the eQTL Catalogue contains 758 datasets from 42 studies, spanning 99 distinct tissues and/or cell types. The eQTL Catalogue contains over 3 million fine mapped signals, 2,142,311 of which have a credible set. The eQTL Catalogue is accessible at https://www.ebi.ac.uk/eqtl/.
The UK Biobank is a longitudinal biomedical study with approximately half a million voluntary participants aged 38–71 from the United Kingdom (at the time of recruitment between 2006 and 2010) [46]. In our colocalisation analyses we used GWAS summary statistics and fine mapping results from 246,683 UK Biobank participants analysed by Rahu et al., 2025 [23]. The study by Rahu et al. focused on 56 metabolomic biomarkers, which were measured from EDTA plasma samples between 2019 and 2024 using the nuclear magnetic resonance platform from Nightingale Health. The fine mapped GWAS summary statistics contained 15,210 signals, of which 7,911 had credible sets [23]. The dataset is available at https://doi.org/10.5281/zenodo.13821038.
The FinnGen release 12 is based on digitalized health data for 500,348 Finnish donors, mostly from hospital biobanks [20]. The median age of the participants at the time of donation was 53, with no age restrictions. The participants’ genotypes have been imputed using the Finnish population-specific SISu v.3 imputation reference panel [20]. FinnGen release 12 had 13,944 genome-wide significant loci for 2,502 traits, 1,302 of which had been fine mapped. These loci contained 21,764 fine mapped signals, of which 17,844 had a credible set. The FinnGen dataset is accessible at https://www.finngen.fi/en/access_results.
Implementation of gpu-coloc
gpu-coloc is a reimplementation of the coloc.bf_bf [6] method in Python with support for GPU acceleration using the Torch library. The original R implementation enables running coloc on two traits with multiple signals, but the colocalisation tests are run consecutively on a CPU. We leverage the same idea, but instead of testing one pair of signals at a time with vector-based calculations, we use matrix-based calculations to leverage GPUs. Usually up to 10 signals are given for a locus. However, a 10 x 10 comparison is too small to use a GPU effectively, so we chunk different signals in a region by their genomic position to create matrices with up to 1000 different signals. By default, colocalisation is then further divided into batches of 100 x 100 signals, but this can be further scaled with GPU memory (see below). We also implemented the trim_by_posterior function from coloc.bf_bf to omit results with low overlap of strongly associated SNPs. Notably, gpu-coloc supports both GPU and CPU backends of Torch and can thus also be run on CPUs with comparable performance.
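The matrix-based idea can be sketched as follows. This is not the gpu-coloc source code: it uses NumPy as a stand-in for Torch, the function names are hypothetical, and it assumes that both sets of signals have already been aligned to the same variants. The key step is that the sum over variants of the product of Bayes factors, needed for every pair of signals, becomes a single matrix product.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

def batch_pp_h4(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-6):
    """PP.H4 for every pair of signals.

    lbf1: (a, n) array, one row of per-variant LBFs per signal in set 1
    lbf2: (b, n) array for set 2, columns aligned to the same n variants
    Returns an (a, b) matrix of PP.H4 values.
    """
    m1 = lbf1.max(axis=1, keepdims=True)
    m2 = lbf2.max(axis=1, keepdims=True)
    l1 = logsumexp(lbf1, axis=1)[:, None]   # log sum_i BF1_i, shape (a, 1)
    l2 = logsumexp(lbf2, axis=1)[None, :]   # log sum_i BF2_i, shape (1, b)
    with np.errstate(divide="ignore"):
        # log sum_i BF1_i * BF2_i for all pairs via one matrix product
        # of the max-scaled Bayes factors.
        log_s12 = np.log(np.exp(lbf1 - m1) @ np.exp(lbf2 - m2).T) + m1 + m2.T
        # log of sum over i != j: (sum BF1)(sum BF2) - sum_i BF1_i BF2_i
        d = np.minimum(log_s12 - (l1 + l2), 0.0)
        log_diff = (l1 + l2) + np.log1p(-np.exp(d))
    shape = log_s12.shape
    log_h = np.stack([
        np.zeros(shape),                           # H0
        np.broadcast_to(np.log(p1) + l1, shape),   # H1
        np.broadcast_to(np.log(p2) + l2, shape),   # H2
        np.log(p1) + np.log(p2) + log_diff,        # H3
        np.log(p12) + log_s12,                     # H4
    ])
    return np.exp(log_h[4] - logsumexp(log_h, axis=0))

# Toy example: two signals in set 1, one signal in set 2, four variants.
lbf1 = np.array([[12.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 12.0, 0.0]])
lbf2 = np.array([[10.0, 0.0, 0.0, 0.0]])
pp = batch_pp_h4(lbf1, lbf2)   # shape (2, 1)
```

In this toy input, only the first signal of set 1 peaks at the same variant as the signal in set 2, so only that pair receives a high PP.H4.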
Data pre-formatting
GPU acceleration is enabled by pre-formatting signals into matrices. This first requires separating and summarising each signal (start and end of region, chromosome, highest LBF value). Here we discard all signals with maximum LBF < 5 (about ¾ of all signals), as they are unlikely to colocalise with other signals. Signals are sorted by the start of their region on every chromosome, then grouped into matrices based on overlap of regions. Since the signal regions are often defined around the lead variant or around a certain gene, nearby signals overlap only partially. To format the regions into a matrix, we replace all missing LBF values with a placeholder value of -1 × 10⁶. On the Bayes factor scale this is numerically indistinguishable from BF = 0, thus assuming no association. Importantly, we assume that inside a collection of summary statistics (e.g., FinnGen or the eQTL Catalogue), all variants have been tested for all traits. Our approach is related to the idea of masking used to avoid confounding by multiple independent causal variants without fine mapping [29]. Matrices are stored in parquet format, which provides both small file size and fast reading speed. Each matrix currently accommodates up to 1,000 signals.
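A minimal sketch of these pre-formatting steps for a single chromosome is given below. The input structure (a list of dictionaries with `id`, `start`, `end`, and a position-to-LBF mapping) and the function names are hypothetical and chosen for readability; writing the matrices to parquet is omitted.

```python
MASK_LBF = -1e6     # placeholder for variants not covered by a signal
MIN_MAX_LBF = 5.0   # signals whose strongest LBF is below this are dropped

def group_signals(signals, max_signals=1000):
    """Filter weak signals, sort by region start, group overlapping regions."""
    kept = [s for s in signals if max(s["lbf"].values()) >= MIN_MAX_LBF]
    kept.sort(key=lambda s: s["start"])
    groups, current, current_end = [], [], None
    for s in kept:
        # Start a new group when the region no longer overlaps the current
        # group, or when the current group is full.
        if current and (s["start"] > current_end or len(current) >= max_signals):
            groups.append(current)
            current = []
        current.append(s)
        current_end = s["end"] if len(current) == 1 else max(current_end, s["end"])
    if current:
        groups.append(current)
    return groups

def to_matrix(group):
    """Align the signals of one group on the union of their variants."""
    variants = sorted({v for s in group for v in s["lbf"]})
    rows = [[s["lbf"].get(v, MASK_LBF) for v in variants] for s in group]
    return variants, rows

# Hypothetical signals; "d" is discarded because its maximum LBF is below 5.
signals = [
    {"id": "a", "start": 100, "end": 300, "lbf": {100: 1.0, 200: 8.0, 300: 2.0}},
    {"id": "b", "start": 250, "end": 450, "lbf": {250: 0.5, 300: 6.0, 400: 1.0}},
    {"id": "c", "start": 1000, "end": 1200, "lbf": {1000: 7.0, 1100: 0.0}},
    {"id": "d", "start": 150, "end": 350, "lbf": {150: 1.0, 250: 2.0}},
]
groups = group_signals(signals)
variants, rows = to_matrix(groups[0])
```

Here signals "a" and "b" end up in one matrix with five columns, and positions covered by only one of them are filled with the placeholder value.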
Calculating posterior probability of colocalisation (PP.H4)
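Bayesian colocalisation methods such as coloc [3] and gpu-coloc analyze a shared genetic locus consisting of n variants. Each trait under investigation is represented by an n-dimensional vector of LBFs (or ABFs), with each coordinate corresponding to a specific variant. For trait 1, we denote this vector by $\mathbf{l}^{(1)} = (l^{(1)}_1, \dots, l^{(1)}_n)$, and for trait 2 by $\mathbf{l}^{(2)} = (l^{(2)}_1, \dots, l^{(2)}_n)$. As Bayesian approaches, these methods require specifying prior probabilities: $p_1$ for the probability that any given variant in the locus is associated with the first trait, $p_2$ for the second trait, and $p_{12}$ for association with both traits. To compute the posterior probability $\mathrm{PP.H4}$, we first determine the unnormalised support for each hypothesis $H_0$ through $H_4$:

$$S_0 = 1$$

$$S_1 = p_1 \sum_{i=1}^{n} e^{l^{(1)}_i}$$

$$S_2 = p_2 \sum_{i=1}^{n} e^{l^{(2)}_i}$$

$$S_3 = p_1 p_2 \sum_{i \neq j} e^{l^{(1)}_i + l^{(2)}_j}$$

$$S_4 = p_{12} \sum_{i=1}^{n} e^{l^{(1)}_i + l^{(2)}_i}$$

Once these supports are obtained, posterior probabilities can be calculated using:

$$\mathrm{PP.H}k = \frac{S_k}{\sum_{m=0}^{4} S_m}, \quad k = 0, \dots, 4$$

Most importantly,

$$\mathrm{PP.H4} = \frac{S_4}{S_0 + S_1 + S_2 + S_3 + S_4}$$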
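For a single pair of signals, the calculation described above can be sketched in plain Python as follows. The function names are illustrative, and working on the log scale is an implementation choice made here to avoid overflow with large LBFs.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def logdiffexp(a, b):
    """log(exp(a) - exp(b)), assuming a >= b."""
    if b >= a:
        return -math.inf
    return a + math.log1p(-math.exp(b - a))

def coloc_posteriors(lbf1, lbf2, p1=1e-4, p2=1e-4, p12=1e-6):
    """Posterior probabilities of H0..H4 for two aligned LBF vectors."""
    l1 = logsumexp(lbf1)                                   # log sum_i BF1_i
    l2 = logsumexp(lbf2)                                   # log sum_i BF2_i
    l12 = logsumexp([a + b for a, b in zip(lbf1, lbf2)])   # log sum_i BF1_i*BF2_i
    log_support = [
        0.0,                                                    # H0
        math.log(p1) + l1,                                      # H1
        math.log(p2) + l2,                                      # H2
        math.log(p1) + math.log(p2) + logdiffexp(l1 + l2, l12), # H3
        math.log(p12) + l12,                                    # H4
    ]
    total = logsumexp(log_support)
    return [math.exp(s - total) for s in log_support]

# Same lead variant in both traits: support concentrates on H4.
shared = coloc_posteriors([12.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0])
# Different lead variants: support shifts to H3.
distinct = coloc_posteriors([12.0, 0.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0])
```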
Calculating colocalisation posterior probability (CLPP)
Similarly to coloc, CLPP [4] assumes a shared locus consisting of n variants. For each trait, there is an n-dimensional vector where each coordinate corresponds to a variant and is assigned a PIP between 0 and 1. The sum of the PIPs for a vector equals 1. The PIP for a variant of a trait in a region can be approximated quite precisely by dividing its BF with the sum of all BFs for the trait in the region [47]. Thus, while the LBFs used as input to coloc.bf_bf can be trivially converted to PIPs required by CLPP, CLPP as commonly implemented only uses PIPs for variants that are shared between two credible sets of interest. For trait 1, we denote this vector by $\boldsymbol{\pi}^{(1)}$, and for trait 2 by $\boldsymbol{\pi}^{(2)}$.
CLPP is calculated as:

$$\mathrm{CLPP} = \sum_{i=1}^{n} \pi^{(1)}_i \, \pi^{(2)}_i$$
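A small sketch of both steps, the approximate conversion of LBFs to PIPs and the CLPP sum over shared credible set variants, is shown below. The function names, variant identifiers, and PIP values are hypothetical.

```python
import math

def pips_from_lbfs(lbfs):
    """Approximate PIPs as BF_i / sum_j BF_j, computed stably from LBFs."""
    m = max(lbfs)
    weights = [math.exp(x - m) for x in lbfs]
    total = sum(weights)
    return [w / total for w in weights]

def clpp(pip1, pip2):
    """CLPP from two credible sets given as dicts of variant -> PIP.

    Only variants present in both credible sets contribute to the sum.
    """
    shared = pip1.keys() & pip2.keys()
    return sum(pip1[v] * pip2[v] for v in shared)

# Two hypothetical credible sets that share a single variant (rs1).
cs1 = {"rs1": 0.6, "rs2": 0.3, "rs3": 0.1}
cs2 = {"rs1": 0.5, "rs4": 0.5}
clpp_value = clpp(cs1, cs2)
uniform_pips = pips_from_lbfs([0.0, 0.0])
```

Because only the shared variant contributes, the CLPP here is the product of the two PIPs for rs1.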
Benchmarking
To benchmark the performance of gpu-coloc, we compared it with the original coloc.bf_bf implementation in R. Both implementations were applied to chromosome 19 in three batches of increasing size between the GTEx [22] datasets included in the eQTL Catalogue and the Rahu et al., 2025 [23] metabolic traits GWAS signals:
- GTEx blood gene expression (560 fine mapped signals) against the Albumin GWAS (19 signals);
- All GTEx gene expression datasets (17,598 signals) against the Albumin GWAS;
- All GTEx gene expression datasets against all metabolic traits GWAS from Rahu et al., 2025 [23] (1,127 signals).
We used prior probabilities p1 = p2 = 1 × 10−4 and p12 = 1 × 10−6 throughout. Benchmarks were performed in three hardware configurations, all on a single thread. The R implementation of coloc and the gpu-coloc CPU backend were run on an Ares CPU node of the University of Tartu HPC cluster (AMD EPYC 7702 64-core processor, 1 core and 30 GB RAM allocated, 8 TB SSD, HDR InfiniBand at 100 Gbps). The gpu-coloc GPU backend was run on a Firefly GPU node of the same cluster (AMD EPYC 9575F 64-core processor, 1 core and 30 GB RAM allocated, 1 NVIDIA H200 GPU with 141 GB VRAM, HDR InfiniBand at 100 Gbps). For a same-hardware comparison, we also ran the R implementation, the gpu-coloc CPU backend, and the gpu-coloc Metal GPU backend on a MacBook Pro with an Apple M2 Max and 32 GB RAM.
Colocalisations between eQTL Catalogue and FinnGen
We performed colocalisation between the whole of eQTL Catalogue release 7 [22] and all traits from FinnGen r12 [20]. For the eQTL Catalogue, we used fine mapped LBFs available from the eQTL Catalogue FTP server. For FinnGen, we performed two separate analyses. First, we used fine mapped LBFs directly available from FinnGen. Secondly, to quantify the importance of fine mapping for detecting colocalisations, we also downloaded the marginal GWAS summary statistics from FinnGen and converted those to ABFs [48] using the approx.bf.estimates (https://github.com/chr1swallace/coloc/blob/main/R/claudia.R#L96) function from the coloc R package [3]. For colocalisation with fine mapped signals, we kept the priors p1 = p2 constant at 1 × 10−4 and varied p12 across p12 = 5 × 10−6, 1 × 10−6 and 1 × 10−7. For colocalisation with the non-finemapped approximate Bayes factors, we used the priors p1 = p2 = 1 × 10−4 and p12 = 1 × 10−6. For colocalisation with CLPP, we only included credible sets that passed SuSiE’s purity filter (low_purity == FALSE in FinnGen summary statistics). For comparison between the methods, we selected the CLPP thresholds as the lowest CLPP value rounded to two decimals, where PP.H4 for the selected prior was always at least 0.8.
Colocalisations between eQTL Catalogue and metabolic traits
We performed colocalisation between the whole of eQTL Catalogue release 7 [22] and 56 metabolic traits from Rahu et al., 2025 [23]. For gpu-coloc, we used priors p1 = p2 = 1 × 10−4 and varied p12 across 5 × 10−6, 1 × 10−6 and 1 × 10−7. Fine mapping results released by Rahu et al. only contained credible sets that had already passed SuSiE’s purity filter.
All-against-all colocalisation within the eQTL Catalogue
We used a subset of 306 eQTL Catalogue gene expression and protein QTL datasets, where the quantification method in the dataset metadata table (https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_metadata.tsv) was either ‘ge’, ‘microarray’, or ‘aptamer’. These datasets contained a total of 2,748,413 conditionally distinct fine mapped signals; after discarding signals with maximum LBF < 5, 636,057 signals remained. We then tested for colocalisation between all pairs of signals, varying the p12 prior probability across p12 = 5 × 10−6, 1 × 10−6, and 1 × 10−7 while maintaining p1 = p2 = 1 × 10−4.
Detecting colocalisation colliders
We define a colocalisation cluster as a connected component in the colocalisation graph built from colocalising signal pairs with PP.H4 ≥ 0.8. Nodes are fine mapped signals (dataset, trait, signal ID) and edges indicate a colocalisation (PP.H4 ≥ 0.8). A colocalisation collider is a cluster that contains at least two distinct signals of any single molecular trait (dataset, trait). For example, if signal 1 and signal 2 of trait A both colocalise with signal 2 of trait B, the set {A1, B2, A2} is a colocalisation collider (Fig 3A). The detection algorithm filters pairs to PP.H4 ≥ 0.8, builds the undirected graph on the remaining signals, finds connected components, and flags any component where any trait appears more than once. Signals that do not colocalise with any other signal are excluded from the analysis.
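The detection procedure can be sketched with a small union-find structure, as below. This is an illustrative reimplementation under assumed inputs, not the code used in the analysis: each signal is represented as a hypothetical (dataset, trait, signal ID) tuple and each colocalisation result as a (signal, signal, PP.H4) triple.

```python
from collections import defaultdict

def find_colliders(pairs, threshold=0.8):
    """Return (clusters, colliders) from pairwise colocalisation results.

    pairs: iterable of (signal_a, signal_b, pp_h4), where each signal is a
    (dataset, trait, signal_id) tuple.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Keep only confident colocalisations and merge their endpoints.
    for a, b, pp in pairs:
        if pp >= threshold:
            parent[find(a)] = find(b)

    # Connected components of the remaining graph.
    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    clusters = list(components.values())

    # A collider is a cluster in which some (dataset, trait) occurs twice.
    colliders = []
    for members in clusters:
        traits = [(dataset, trait) for dataset, trait, _ in members]
        if len(traits) != len(set(traits)):
            colliders.append(members)
    return clusters, colliders

A1, A2 = ("ds1", "A", "L1"), ("ds1", "A", "L2")
B2 = ("ds2", "B", "L2")
C1, D1 = ("ds3", "C", "L1"), ("ds4", "D", "L1")
pairs = [(A1, B2, 0.95), (A2, B2, 0.90), (C1, D1, 0.85), (A1, C1, 0.10)]
clusters, colliders = find_colliders(pairs)
```

In this example the cluster {A1, B2, A2} is flagged as a collider, while {C1, D1} is not; the pair below the threshold does not connect the two clusters.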
Supporting information
S1 Table. Colocalisations with Total_BCAA at the BCAT2 locus.
https://doi.org/10.1371/journal.pgen.1012209.s001
(XLSX)
S2 Table. Results of the gpu-coloc benchmark against coloc.bf_bf in the UniTartu HPC.
https://doi.org/10.1371/journal.pgen.1012209.s002
(XLSX)
S3 Table. Results of the gpu-coloc benchmark against coloc.bf_bf on a MacBook Pro.
https://doi.org/10.1371/journal.pgen.1012209.s003
(XLSX)
S1 Fig. Sensitivity of gpu-coloc PP.H4 to the LBF mask value.
(A–C) Pairwise comparison of PP.H4 between gpu-coloc and coloc.bf_bf at LBF mask values of the gpu-coloc default −10⁶ (A), 0 (B), and −10 (C). (D) Mean PP.H4 difference between coloc.bf_bf and gpu-coloc as a function of LBF mask value (y-axis on log scale).
https://doi.org/10.1371/journal.pgen.1012209.s004
(TIFF)
S2 Fig. Additional benchmarking of gpu-coloc.
(A) Posterior probability comparison between gpu-coloc and CLPP across varying prior parameterisations (p12 = 5 × 10−6, 1 × 10−6, and 1 × 10−7) in the Rahu et al., 2025 metabolic trait GWAS versus eQTL Catalogue colocalisation. (B) Comparison of PP.H4 values while including or excluding all indel variants from the metabolic trait GWAS results.
https://doi.org/10.1371/journal.pgen.1012209.s005
(TIFF)
S3 Fig. Example of a colocalisation collider identified in the eQTL Catalogue and metabolic trait GWAS colocalisation analysis.
(A) Fine-mapped association signals for Alanine (Ala) and two independent eQTLs for SLC43A2 in the FUSION adipose eQTL dataset. The two lead eQTL variants (chr17_1590678_T_G and chr17_1605228_G_A) are in low LD with each other (r2 = 0.02). (B) Pairwise PP.H4 colocalisation posterior probabilities between the three association signals. The two eQTL signals both colocalise with the (unsuccessfully) fine mapped GWAS signal for Ala (PP.H4 = 0.85 and PP.H4 = 0.99, respectively), but they do not colocalise with each other (PP.H4 = 0.0004).
https://doi.org/10.1371/journal.pgen.1012209.s006
(TIFF)
S4 Fig. Effect of CLPP threshold on the rate of colocalisation colliders.
The red dots illustrate the CLPP thresholds and p12 prior probabilities at which CLPP and gpu-coloc (PP.H4 > 0.8) produce comparable results.
https://doi.org/10.1371/journal.pgen.1012209.s007
(TIFF)
S5 Fig. Effect of CLPP threshold and gpu-coloc priors on the percentage of clusters containing at least two credible sets with distinct fine mapped variants (PIP ≥ 0.9).
(A) Effect of the CLPP threshold on the percentage of clusters containing at least two distinct fine mapped variants (PIP ≥ 0.9). (B) Effect of p12 prior probability and CLPP threshold on the percentage of clusters containing at least two distinct fine mapped variants (PIP ≥ 0.9). For all p12 values, PP.H4 threshold was set to PP.H4 ≥ 0.8.
https://doi.org/10.1371/journal.pgen.1012209.s008
(TIFF)
S6 Fig. Sources of differences in colocalisation results between gpu-coloc and CLPP at comparable decision thresholds (CLPP ≥ 0.04 and p12 = 1 × 10−6, PP.H4 ≥ 0.8).
(A) Sources of difference from colocalisation between FinnGen versus eQTL Catalogue. (B) Sources of difference from colocalisation between 56 UK Biobank metabolic traits from Rahu et al., 2025 versus eQTL Catalogue. (C-D) Rate of colocalisation colliders in the eQTL Catalogue vs FinnGen r12 (Panel C) and eQTL Catalogue vs Rahu et al., 2025 (Panel D) analyses. Barplots show the percentage of clusters containing at least one colocalisation collider for all gpu-coloc colocalisation results (CLPP ≥ 0.04 and p12 = 1 × 10−6, PP.H4 ≥ 0.8), for gpu-coloc results where both signals have credible sets, for colocalisation results shared by gpu-coloc and CLPP, and all CLPP colocalisation results.
https://doi.org/10.1371/journal.pgen.1012209.s009
(TIFF)
S7 Fig. Visualisation of the CRP splicing QTL signal.
RNA-seq read coverage across the CRP gene in the GTEx liver dataset (QTD000270) is stratified by the genotype of the lead sQTL variant (rs1800947). The interactive plot can be viewed in the ELIXIR-Estonia eQTL Catalogue Browser (https://elixir.ut.ee/eqtl/?credible_set=QTD000270_1%3A159713640%3A159714007%3Aclu_11689_-_L1).
https://doi.org/10.1371/journal.pgen.1012209.s010
(TIFF)
S8 Fig. Visualisation of the BCAT2 splicing QTL signal.
RNA-seq read coverage across the BCAT2 gene in the GEUVADIS lymphoblastoid cell line dataset (QTD000114) stratified by the genotype of the lead sQTL variant (rs117048185). The interactive plot can be viewed in the ELIXIR-Estonia eQTL Catalogue Browser (https://elixir.ut.ee/eqtl/?credible_set=QTD000114_19%3A48800297%3A48807000%3Aclu_9896_-_L1).
https://doi.org/10.1371/journal.pgen.1012209.s011
(TIFF)
S9 Fig. Visualisation of the BCAT2 expression QTL signal.
RNA-seq read coverage across the BCAT2 gene in the FUSION adipose tissue dataset (QTD000090) stratified by the genotype of the lead eQTL variant (rs35230038). The interactive plot can be viewed in the ELIXIR-Estonia eQTL Catalogue Browser (https://elixir.ut.ee/eqtl/?credible_set=QTD000090_ENSG00000105552_L1).
https://doi.org/10.1371/journal.pgen.1012209.s012
(TIFF)
Acknowledgments
We thank Ralf Tambets and Urmo Võsa for helpful comments on the manuscript. Some analyses presented in this paper were performed at the High Performance Computing Center, University of Tartu.
References
- 1. Farh KK-H, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2015;518(7539):337–43. pmid:25363779
- 2. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337(6099):1190–5. pmid:22955828
- 3. Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 2014;10(5):e1004383. pmid:24830394
- 4. Hormozdiari F, van de Bunt M, Segrè AV, Li X, Joo JWJ, Bilow M, et al. Colocalization of GWAS and eQTL signals detects target genes. Am J Hum Genet. 2016;99(6):1245–60. pmid:27866706
- 5. Wen X, Pique-Regi R, Luca F. Integrating molecular QTL data into genome-wide genetic association analysis: probabilistic assessment of enrichment and colocalization. PLoS Genet. 2017;13(3):e1006646. pmid:28278150
- 6. Wallace C. A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genet. 2021;17(9):e1009440. pmid:34587156
- 7. Pividori M, Rajagopal PS, Barbeira A, Liang Y, Melia O, Bastarache L, et al. PhenomeXcan: mapping the genome to the phenome through the transcriptome. Sci Adv. 2020;6(37):eaba2083. pmid:32917697
- 8. Stankey CT, Bourges C, Haag LM, Turner-Stokes T, Piedade AP, Palmer-Jones C, et al. A disease-associated gene desert directs macrophage inflammation through ETS2. Nature. 2024;630(8016):447–56. pmid:38839969
- 9. Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, Peat G, Miranda A, et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet. 2021;53(11):1527–33. pmid:34711957
- 10. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics. 2014;198(2):497–508. pmid:25104515
- 11. Li Z, Zhou X. Towards improved fine-mapping of candidate causal variants. Nat Rev Genet. 2025;26(12):847–61. pmid:40721533
- 12. Benner C, Spencer CCA, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32(10):1493–501. pmid:26773131
- 13. Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Series B Stat Methodol. 2020;82(5):1273–300. pmid:37220626
- 14. Kanai M, Ulirsch JC, Karjalainen J, Kurki M, Karczewski KJ, Fauman E. Insights from complex trait fine-mapping across diverse populations. bioRxiv. 2021:2021.09.03.21262975.
- 15. Hayhurst J, Buniello A, Harris L, Mosaku A, Chang C, Gignoux CR, et al. A community driven GWAS summary statistics standard. bioRxiv. 2022:2022.07.15.500230.
- 16. Cerezo M, Sollis E, Ji Y, Lewis E, Abid A, Bircan KO, et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res. 2025;53(D1):D998–1005. pmid:39530240
- 17. Verma A, Huffman JE, Rodriguez A, Conery M, Liu M, Ho Y-L, et al. Diversity and scale: genetic architecture of 2068 traits in the VA Million Veteran Program. Science. 2024;385(6706):eadj1182. pmid:39024449
- 18. Tsepilov YA, Suveges D, Considine D, Szyszkowski S, Ge XJ, Santiago IL, et al. The human pleiotropic map of GWAS associations and therapeutic implications. bioRxiv. 2026:2026.04.28.721048. pmid:42094581
- 19. Buniello A, Suveges D, Cruz-Castillo C, Llinares MB, Cornu H, Lopez I, et al. Open Targets Platform: facilitating therapeutic hypotheses building in drug discovery. Nucleic Acids Res. 2025;53(D1):D1467–75. pmid:39657122
- 20. Kurki MI, Karjalainen J, Palta P, Sipilä TP, Kristiansson K, Donner KM, et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023;613(7944):508–18. pmid:36653562
- 21. Kerimov N, Tambets R, Hayhurst JD, Rahu I, Kolberg P, Raudvere U, et al. eQTL Catalogue 2023: New datasets, X chromosome QTLs, and improved detection and visualisation of transcript-level QTLs. PLoS Genet. 2023;19(9):e1010932. pmid:37721944
- 22. Kerimov N, Hayhurst JD, Peikova K, Manning JR, Walter P, Kolberg L, et al. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat Genet. 2021;53(9):1290–9. pmid:34493866
- 23. Rahu I, Tambets R, Fauman EB, Alasoo K. Mendelian randomization with proxy exposures: challenges and opportunities. Genetics. 2025;231(4):iyaf210. pmid:41004399
- 24. Karczewski KJ, Gupta R, Kanai M, Lu W, Tsuo K, Wang Y, et al. Pan-UK Biobank genome-wide association analyses enhance discovery and resolution of ancestry-enriched effects. Nat Genet. 2025;57(10):2408–17. pmid:40968291
- 25. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022;185(18):3426-3440.e19. pmid:36055201
- 26. The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–30.
- 27. Foley CN, Staley JR, Breen PG, Sun BB, Kirk PDW, Burgess S, et al. A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits. Nat Commun. 2021;12(1):764. pmid:33536417
- 28. Tambets R, Kolde A, Kolberg P, Love MI, Alasoo K. Extensive co-regulation of neighboring genes complicates the use of eQTLs in target gene prioritization. HGG Adv. 2024;5(4):100348. pmid:39210598
- 29. Wallace C. Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses. PLoS Genet. 2020;16(4):e1008720. pmid:32310995
- 30. Elmore AR, Hanson AL, Leyden GM, Johnson J, Davey Smith G, Paternoster L, et al. Building the human genotype-phenotype map to harness pleiotropy and refine disease mechanisms. medRxiv. 2026:2026.02.19.26346618.
- 31. Wu Y, Zheng Z, Thibaut L, Lin T, Feng Q, Cheng H, et al. Genome-wide fine-mapping improves identification of causal variants. Nat Genet. 2026;58(4):940–51. pmid:41912930
- 32. Kanai M, Elzur R, Zhou W, Global Biobank Meta-analysis Initiative, Daly MJ, Finucane HK. Meta-analysis fine-mapping is often miscalibrated at single-variant resolution. Cell Genom. 2022;2(12):100210. pmid:36643910
- 33. Mold C, Nakayama S, Holzer TJ, Gewurz H, Du Clos TW. C-reactive protein is protective against Streptococcus pneumoniae infection in mice. J Exp Med. 1981;154(5):1703–8. pmid:7299351
- 34. Póvoa P. C-reactive protein: a valuable marker of sepsis. Intensive Care Med. 2002;28(3):235–43. pmid:11904651
- 35. Póvoa P, Coelho L, Almeida E, Fernandes A, Mealha R, Moreira P, et al. C-reactive protein as a marker of infection in critically ill patients. Clin Microbiol Infect. 2005;11(2):101–8. pmid:15679483
- 36. Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11. pmid:24037378
- 37. Taylor DL, Jackson AU, Narisu N, Hemani G, Erdos MR, Chines PS, et al. Integrative analysis of gene expression, DNA methylation, physiological traits, and genetic variation in human skeletal muscle. Proc Natl Acad Sci U S A. 2019;116(22):10883–8. pmid:31076557
- 38. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535-548.e24. pmid:30661751
- 39. Tambets R, Jesse M, Kronberg J, van der Graaf A, Abner E, Võsa U, et al. Genetic analysis of circulating metabolic traits in 619,372 individuals. Nature. 2026. doi:10.1038/s41586-026-10532-5. pmid:42162431
- 40. Tokolyi A, Persyn E, Nath AP, Burnham KL, Marten J, Vanderstichele T, et al. The contribution of genetic determinants of blood gene expression and splicing to molecular phenotypes and health outcomes. Nat Genet. 2025;57(3):616–25. pmid:40038547
- 41. Sun BB, Chiou J, Traylor M, Benner C, Hsu Y-H, Richardson TG, et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature. 2023;622(7982):329–38. pmid:37794186
- 42. Taylor-Weiner A, Aguet F, Haradhvala NJ, Gosai S, Anand S, Kim J, et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 2019;20(1):228. pmid:31675989
- 43. Biostatistics-Unit-HT/Flanders. GitHub [Internet]. [cited 2025 Jun 30]. Available from: https://github.com/Biostatistics-Unit-HT/Flanders
- 44. Pickrell JK, Berisa T, Liu JZ, Ségurel L, Tung JY, Hinds DA. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016;48(7):709–17. pmid:27182965
- 45. Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallett J. The MRC IEU OpenGWAS data infrastructure. bioRxiv. 2020.
- 46. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
- 47. Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet. 2012;44(12):1294–301. pmid:23104008
- 48. Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol. 2009;33(1):79–86. pmid:18642345