This is an uncorrected proof.
Abstract
Identifying causal genetic variants in a computational manner remains an open problem. Training end-to-end prediction models is not possible without large ground-truth datasets, while results of genome-wide association studies (GWAS) are entangled by linkage disequilibrium (LD), and gene expression datasets do not contain individual-level genetic variation. Here, we propose Multiple Instance Fine-mapping (MIFM), a multiple instance learning (MIL) objective that overcomes the lack of strong labels by grouping putatively causal variants together based on their LD scores. Using MIFM, we trained a deep classifier on a dataset aggregating over 13,000 GWAS to predict causal variants based on their underlying DNA sequences. We validated variants prioritized by MIFM by constructing polygenic risk scores, which transferred better to different target ancestries. Furthermore, we demonstrated how MIFM can be used to disentangle effect sizes of highly-correlated variants to better fine-map GWAS results.
Author summary
Genome-wide association studies have identified tens of thousands of genetic variants associated with traits or diseases. However, the majority of identified variants are only spuriously correlated with the phenotype of interest, having no causal effect on it. Instead, these variants are often inherited together with nearby biologically causal variants, thus creating the spurious associations. Fine-mapping, i.e., predicting which variants are causal, is crucial for downstream tasks, such as uncovering the biological mechanisms affecting the phenotype or robustly identifying individuals with high genetic risk of a disease. While most fine-mapping methods are based on the available association statistics or functional annotations of genetic regions, it should be possible to identify causal variants based on their neighboring DNA sequences. However, training a standard machine learning classifier for that task is obstructed by the scarcity of strong, ground-truth labels. Here, we propose a method to train sequence models predicting variant causality using weakly-labeled data. We trained a model on a large set of associated variants, and demonstrated its utility by improving cross-ancestry predictions of genetic risk and by disentangling the effect sizes of highly correlated variants.
Citation: Rakowski A, Lippert C (2026) Multiple instance fine-mapping: Predicting causal regulatory variants with a deep sequence model. PLoS Genet 22(6): e1012208. https://doi.org/10.1371/journal.pgen.1012208
Editor: Heather J. Cordell, Newcastle University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: July 11, 2025; Accepted: June 5, 2026; Published: June 29, 2026
Copyright: © 2026 Rakowski, Lippert. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: https://github.com/HealthML/multiple-instance-fine-mapping.
Funding: CL has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016775. https://www.interveneproject.eu/ The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. AR has received a salary from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016775.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Genome-wide association studies (GWAS) remain a powerful tool for identifying genetic variants associated with phenotypes or diseases, with recent studies detecting up to thousands of associations per trait. However, while one usually assumes a clear genotype → phenotype causal direction, only a small fraction of variants significant in a GWAS are expected to be truly causal [1]. Due to linkage disequilibrium (LD), single nucleotide polymorphisms (SNPs) in proximity to causal variants become associated with the trait and can even have lower p-values than the causal SNPs. Identifying the true causal variants is important for understanding the underlying mechanisms such as transcription factor (TF) binding and for making robust predictions in populations with different LD structures than the one where the GWAS was performed.
Without in-vivo experimental validation, one can employ computational fine-mapping methods, which aim to narrow down the set of putative causal variants from GWAS summary statistics. The simplest approach is to test the significance of all candidate SNPs in a joint model. While yielding unbiased estimates of the true effect sizes, it is not feasible in scenarios with large numbers of variants or strong LD. More advanced methods utilize the Bayesian framework to estimate credible sets of variants given prior knowledge on the distribution of effect sizes [2–4], optionally incorporating functional annotations as additional priors [5–7]. While powerful, the above methods require selecting hyperparameters, such as the assumed number of causal variants, rely on availability of functional annotations for the regions of interest, and are sensitive to LD patterns, yielding large credible sets for strongly correlated variants.
Another approach is to use machine learning (ML) prediction models to assign a score to each variant as a proxy of its likelihood of being causal. A common choice is a deep neural network (DNN) trained to predict functional genomic annotations or gene expression values from DNA, with an architecture typically based on a convolutional neural network (CNN) [8–10] or a transformer backbone [11]. The difference in predictions for a sequence with the reference allele and a sequence with the alternative allele is then taken as a measure of the potential causality of a variant. As opposed to the statistical methods, the ML-based approaches are independent of GWAS results or the LD structure. However, while they take DNA data at base pair level resolution as inputs, they are typically trained on reference genome data, which limits the SNP-level variability of DNA motifs to those present at the population level. Furthermore, the performance of such models can be hindered by coarseness and noise of the labels [12,13].
Here, we introduce Multiple Instance Fine-mapping (MIFM), a framework for training deep learning models to predict the causality of non-coding variants directly from the underlying DNA sequence, without the need for summary statistics or functional annotations at test time. We circumvent the lack of ground-truth labels at SNP resolution by formulating the training objective as a multiple instance learning (MIL) problem, where putatively causal variants in LD with each other are grouped together to form a single, weakly-labeled positive example, and fit the model on a dataset of more than 2 million associated variants from over 13,000 studies. We demonstrate the robustness of MIFM-prioritized variants by creating polygenic scores (PGS) of 20 traits, which transferred better from European to non-European ancestries, compared to variants selected using existing fine-mapping methods. Furthermore, we show how MIFM can be used to detect additional signals in analyses of GWAS results by prioritizing variants for joint tests, even in strongly-correlated cases. Finally, we report the results of model analysis which revealed enrichment of regulatory elements, existing and putatively novel TF motifs, as well as context-dependent mechanisms. The corresponding code as well as the trained model used in our experiments are available at github.com/HealthML/multiple-instance-fine-mapping
2 Description of the method
2.1 Overview of the method
We proposed Multiple Instance Fine-mapping (MIFM), a framework for training models prioritizing causal genetic variants with the multiple instance learning (MIL) paradigm, using GWAS associations as training data (Fig 1). Our goal was to obtain a classifier predicting the probability of a variant being causal, given the DNA sequence around it. To train such a model in the standard supervised manner one would need ground-truth labels for each individual instance (variant). However, variants discovered with GWAS typically contain a large number of false positives due to LD between SNPs. To circumvent the lack of ground-truth labels, we trained a model to classify bags of instances (LD blocks of variants), instead of individual SNPs. We constructed the training dataset by selecting significant variants from a large set of GWAS results and grouping SNPs in LD with each other into positive bags. Conversely, we created negative examples by selecting common variants from the human reference genome which were not significantly associated in any of the GWAS. During training, the model makes instance-level predictions for each SNP within an LD block, which are then pooled by selecting the highest score as the bag-level prediction. Once trained, the pooling operation is discarded, and the instance-level model can be used to make predictions for individual variants. See Sect 2.2 for a detailed description of the method, and Sect 2.3 regarding dataset construction.
We frame the task of identifying causal variants as a multiple instance learning (MIL) problem, where loci of GWAS-associated SNPs are grouped to form positive “bags” for the MIL algorithm, to overcome the lack of instance-level (per-variant) labels. We assume that each positive bag contains at least one causal variant, while negative bags contain none. Dataset creation: We construct the training dataset using a large set of GWAS results, by selecting SNPs significantly associated in any study (marked in red). Since we do not have variant-level labels, we treat whole LD-blocks of associated SNPs as positive bags. Conversely, we construct the set of negative examples (marked in blue) by selecting the remaining, non-significant variants, to form single-element negative bags. Model training: We train a prediction model, e.g., a deep neural network, to classify the MIL bags, based on the underlying DNA sequences of the variants. The model first makes separate predictions for each element in the bag, which are then aggregated using the max operator to yield a single bag-level prediction. After the model is trained, we can discard the max operation and predict the causality of single variants.
2.2 Fine-mapping as a multiple instance learning problem
We begin by introducing the MIL paradigm for a binary classification task. We assume that the data consists of pairs $(x, y)$, where $x$ are input features and $y \in \{0, 1\}$ are binary labels, and the input instances are grouped into bags of examples $X = \{x_1, \ldots, x_m\}$, where $m$ can differ across bags. We use $i$ to index along the bag-level (e.g., $X_i$) and $j$ to index individual instances within a bag (e.g., $x_{ij}$). At training time, we only have access to bag-level labels, which indicate whether at least one instance in a bag is positive, and are defined as:

$$Y_i = \max_j y_{ij}. \tag{1}$$

This can be interpreted as a form of weak labeling, since we do not know which particular instance(s) in a positive bag are positive, as opposed to strong, instance-level labels. Our goal is then to learn an instance-level classifier $f$, given the bag-level data $\{(X_i, Y_i)\}_{i=1}^{n}$. A common approach to train $f$ is to define a bag-level classifier $F$:

$$F(X_i) = \max_j f(x_{ij}) \tag{2}$$

and train it to minimize the cross-entropy loss $\mathcal{L}$:

$$\mathcal{L} = -\sum_{i=1}^{n} \left[ Y_i \log F(X_i) + (1 - Y_i) \log\left(1 - F(X_i)\right) \right] \tag{3}$$

with respect to the bag-level examples $(X_i, Y_i)$.
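As an illustration of the bag-level objective, the following minimal Python sketch computes the bag-level prediction F as the maximum of the instance-level scores and sums the binary cross-entropy over bags. The toy scores stand in for the outputs of an instance-level classifier f, and the function names and the clipping constant `eps` are assumptions made for this example, not part of the MIFM implementation.

```python
import math

def bag_prediction(instance_scores):
    """Bag-level prediction F(X): the maximum instance-level score f(x) in the bag."""
    return max(instance_scores)

def bag_cross_entropy(bags, labels, eps=1e-12):
    """Binary cross-entropy between bag-level predictions and bag labels, summed over bags."""
    total = 0.0
    for scores, y in zip(bags, labels):
        # Clip the prediction away from 0 and 1 so that the logarithms stay finite
        p = min(max(bag_prediction(scores), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Toy instance-level scores: one positive bag (an LD block) and one
# single-element negative bag
bags = [[0.1, 0.8, 0.3], [0.2]]
labels = [1, 0]
loss = bag_cross_entropy(bags, labels)
```

Only the highest-scoring instance in each bag contributes to the loss, so the remaining variants of a positive block are not pushed towards a positive prediction.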
As GWAS estimate the marginal effect size of each SNP, the significantly associated variants typically comprise groups of SNPs in LD with each other, out of which only a small fraction is truly causal, and most SNPs are spuriously associated with the trait of interest through their correlations with causal variants. On the other hand, we assume that the causal signals are driven by DNA patterns around the variants, e.g., by TF binding motifs, and could be identified given enough data. Databases such as GWAS Catalog [14] or CAUSALdb [15] aggregate the results of thousands of GWAS, providing a large set of putatively causal variants of the human genome across a range of traits and populations. In order to use these data to infer causal variants, we propose the following MIL scenario: let each $x_{ij}$ represent the DNA sequence centered at a SNP and $y_{ij}$ be unobserved ground-truth labels indicating causal variants. We construct positive bags $X_i^{+}$ as independent LD-blocks of putatively causal variants, e.g., surpassing a given significance threshold in any study and grouped using a clumping procedure, while the negative ones are single-element bags $X_i^{-}$ of variants without any significant associations. If at least one causal variant is present in each positive bag, this is a valid MIL objective. Assuming that the causal variants share DNA patterns with each other, we can successfully train a sequence model $f$, e.g., a neural network, to predict causal variants.
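Formally, given $s$ GWAS over $v$ SNPs, let $P^{(i)}$ be the summary statistics of the $i$-th GWAS and $P^{(i)}_j$ be the p-value for the $j$-th variant in that study. Furthermore, let $b$ be the total number of independent LD-blocks, and $B_1, \ldots, B_b \subseteq \{1, \ldots, v\}$ be pair-wise disjoint sets of integers indicating which variants belong to which LD block. Given a significance threshold $T$, the positive bags are then defined as:

$$X_k^{+} = \left\{ x_j : j \in B_k,\ \min_{i} P^{(i)}_j < T \right\}, \quad k = 1, \ldots, b,$$

while the negative bags are defined as:

$$X_j^{-} = \{ x_j \} \quad \text{for each } j \text{ with } \min_{i} P^{(i)}_j \geq T.$$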
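The bag construction described above can be sketched in a few lines of Python. This is a simplified illustration: the p-values, the LD blocks and the threshold of 5e-8 are made-up example values, `build_bags` is a hypothetical helper, and the additional filters applied to negative variants in Sect 2.3 are omitted.

```python
def build_bags(pvalues, blocks, threshold):
    """Group variants into MIL bags.

    pvalues: list of per-study lists, pvalues[i][j] is the p-value of variant j in study i.
    blocks: list of pair-wise disjoint sets of variant indices (LD blocks).
    Returns (positive_bags, negative_bags) as lists of lists of variant indices.
    """
    n_variants = len(pvalues[0])
    # A variant counts as significant if it passes the threshold in any study
    min_p = [min(study[j] for study in pvalues) for j in range(n_variants)]
    significant = {j for j in range(n_variants) if min_p[j] < threshold}

    # Positive bags: the significant variants of each LD block (empty blocks are skipped)
    positive_bags = []
    for block in blocks:
        members = sorted(block & significant)
        if members:
            positive_bags.append(members)

    # Negative bags: one single-element bag per non-significant variant
    negative_bags = [[j] for j in range(n_variants) if j not in significant]
    return positive_bags, negative_bags

# Two toy studies over five variants, grouped into three LD blocks
pvalues = [
    [1e-9, 0.3, 1e-10, 0.5, 0.9],
    [0.2, 1e-8, 0.4, 0.6, 0.7],
]
blocks = [{0, 1}, {2, 3}, {4}]
positive_bags, negative_bags = build_bags(pvalues, blocks, threshold=5e-8)
```

In this toy example, variants 0, 1 and 2 are significant in at least one study, so the first two blocks yield positive bags and variants 3 and 4 become single-element negative bags.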
2.3 Dataset and model training
We constructed the training dataset using data from the CAUSALdb2 database [16], which aggregates 13,709 GWAS summary statistics, resulting in over 2,618,834 putatively causal variants from the GRCh37 human genome, which are grouped into 2,772 independent LD-blocks. We further divided the blocks into primary and secondary signals, using labels provided by CAUSALdb2, yielding 4,790 smaller blocks, and excluded blocks with fewer than 10 variants, to reduce the chance of a small number of false positive SNPs driving the training signal. We treated the resulting blocks as separate bag-level positive examples $X_i^{+}$. Since CAUSALdb2 only contains associated variants, we created the negative bags $X_i^{-}$ from common GRCh37 variants which were not present in CAUSALdb2, excluding SNPs further than 128 base pairs away from any CAUSALdb2 variant, yielding a total of 1,045,506 training examples, each forming a separate single-element negative bag.
For model training, we employed a modified, smaller version of the Basenji2 CNN architecture [9], with 4 blocks in the first model stage, 2 blocks in the second stage, a final number of 64 filters, and a single output. We one-hot encoded the 512 base pair DNA sequences around each variant to serve as the inputs $x_{ij}$. As per Eq 2, for each block $X_i$ the CNN makes a separate prediction $f(x_{ij})$ for each individual input $x_{ij}$, and the maximum value across $j$ is taken as the bag-level prediction. We trained the model for 100 epochs using the Adam optimizer [17] with a learning rate that decayed exponentially after each epoch, and a batch size of 32 bags. For regularization, we used dropout on the residual layer connections of the model with a rate of 0.5, and randomly shifted the input sequences by up to 8 base pairs. We used a modified DNA one-hot encoding by introducing a sixth special token “V”, with which we replaced the nucleotide in the middle of the sequence, i.e., at the SNP position (A = [1,0,0,0,0], C = [0,1,0,0,0], G = [0,0,1,0,0], T = [0,0,0,1,0], N = [0,0,0,0,0], V = [0,0,0,0,1]). The use of this additional token allowed us to employ random shift augmentations by signaling the position of the variant of interest to the network, thus potentially increasing its precision to the base pair level. Using the standard 5-token encoding (e.g., as in [11]) with random shifts would instead force the model to make the same predictions for any neighboring SNPs within the shift range of the variant of interest. To further increase robustness, we employed model ensembling [18] by repeating the training with 5 different random seeds, and training a student model [19] to predict the averaged output of the 5 models. We used the same network architecture for the student model, with the exception of using the standard 5-token encoding and not using data augmentations. We implemented the models using the PyTorch [20] and PyTorch Lightning [21] software libraries, and trained them using a single NVIDIA A40 48GB GPU and 8 CPU cores per model, with an average training time of 34 hours. We did not observe a benefit of using larger versions of the model, other network architectures (Enformer [11], BPNet [10], a pretrained MFD [22]), or differentiable alternatives to the max operator in the formulation of F in Eq 2 (log-sum-exponential [23], generalized mean [24], attention-based [25]).
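The modified encoding and the random shift augmentation can be illustrated with the following Python sketch. The channel order follows the encoding listed above. The helper names, the way the window is cut from a longer context, and the toy sequence are assumptions for this example and may differ from the released implementation.

```python
import random

# Channel index of each token; "N" (and any other symbol) maps to the all-zero vector
CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3, "V": 4}

def encode_with_variant_token(sequence, snp_index):
    """One-hot encode a DNA string, replacing the base at snp_index with the "V" token."""
    encoded = []
    for position, base in enumerate(sequence):
        token = "V" if position == snp_index else base
        row = [0, 0, 0, 0, 0]
        if token in CHANNELS:
            row[CHANNELS[token]] = 1
        encoded.append(row)
    return encoded

def random_shift_window(context, snp_index, window, max_shift, rng):
    """Cut a window of length `window` around the SNP, offset by a random shift.

    Returns the window and the position of the SNP inside it. The context is
    assumed to be long enough for every allowed shift.
    """
    shift = rng.randint(-max_shift, max_shift)
    start = snp_index - window // 2 + shift
    return context[start:start + window], snp_index - start

rng = random.Random(0)
context = "ACGT" * 10  # toy 40 bp context with the SNP at index 20
window_seq, snp_pos = random_shift_window(context, snp_index=20, window=16, max_shift=4, rng=rng)
encoded = encode_with_variant_token(window_seq, snp_pos)
```

Because the "V" channel moves together with the SNP, the network can still locate the variant of interest after the window has been shifted.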
Finally, we note that since only “weak” labels are used for model fitting, the network can be further retrained, or fine-tuned, “at test time”, i.e., whenever one obtains results from a new GWAS, by updating the positive bags with the set of new putative variants.
2.4 Functional annotation of the training data
We utilized GenoSTAN [26] and silencerDB [27] data to annotate CAUSALdb2 variants in terms of their regulatory functions. For GenoSTAN data, we mapped all the variants from CAUSALdb2 into hg38 coordinates using the UCSC genome browser LiftOver tool [28] and annotated them with chromatin state annotations of 127 cell lines for the hg38 genome which we downloaded from https://www.cmm.in.tum.de/public/paper/GenoSTAN/. We assigned to each variant the chromatin states which were repeated in at least 5 different cell lines. A single variant could thus have multiple assigned states due to heterogeneity of the cell line experiments. For example, it could be marked as an enhancer-like element in one experiment and marked as a repressed region in a different cell type. For silencerDB, we downloaded annotations for the GRCh37 genome from http://health.tsinghua.edu.cn/SilencerDB/download/Species/Homo_sapiens.bed, and marked all CAUSALdb2 variants within each silencerDB region as silencers. We performed enrichment analyses for each annotation class by computing the odds ratios of SNPs with the given label compared to a subset of variants passing a given MIFM score threshold, and computed the p-values for enrichment using the Fisher exact test [29].
2.5 Motif discovery
We used Transcription-Factor Motif Discovery from Importance Scores (TF-MoDISco) [10,30] to identify motifs contributing to MIFM predictions. We computed the attribution scores for all training sequences using DeepLIFT [31] and ran TF-MoDISco with 200,000 positive and 200,000 negative seqlets (motif occurrences). Finally, we matched the resulting patterns to known human TF binding models from HOCOMOCO 11 [32] using TOMTOM [33]. For the in silico mutagenesis analysis of the contributions of positive and negative patterns, we selected sequences containing the 3 top positive and 3 top negative TF-MoDISco patterns matching the ARID3A binding motif. For each positive-negative pattern pair, we did the following:
1. We computed the offset with respect to the ARID3A motif for the positive and negative pattern. As we had access to the starting positions of the TF-MoDISco patterns for each example, we also computed the start position of the ARID3A motif in them.
2. We obtained MIFM predictions for all positive and negative examples.
3. We selected the positions in the negative pattern where the probability of the top nucleotide exceeded 0.4.
4. For each positive example, we replaced the nucleotides at the positions matching the nucleotides selected from the negative pattern.
5. We computed MIFM predictions for the modified positive examples.
6. We repeated steps 3–5 by modifying the negative examples with the positive pattern.
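The pattern replacement at the core of this procedure (selecting confident positions of one pattern and writing them into sequences containing the other) can be sketched as follows. The pattern is represented as per-position probabilities over A, C, G and T, and `pattern_start` is assumed to have been computed already from the motif offsets. All names and toy values are hypothetical.

```python
BASES = "ACGT"

def confident_positions(pattern, min_prob=0.4):
    """Return {offset: base} for pattern positions whose top base probability exceeds min_prob.

    pattern: list of rows, each holding the probabilities of A, C, G and T at one position.
    """
    selected = {}
    for offset, row in enumerate(pattern):
        best = max(range(4), key=lambda k: row[k])
        if row[best] > min_prob:
            selected[offset] = BASES[best]
    return selected

def implant_pattern(sequence, pattern_start, pattern, min_prob=0.4):
    """Overwrite the bases of `sequence` at the confident positions of `pattern`,
    aligned so that the pattern begins at index `pattern_start`."""
    bases = list(sequence)
    for offset, base in confident_positions(pattern, min_prob).items():
        position = pattern_start + offset
        if 0 <= position < len(bases):
            bases[position] = base
    return "".join(bases)

# Toy pattern: confident "A" at offset 0, uninformative offset 1, confident "T" at offset 2
toy_pattern = [
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.10, 0.10, 0.70],
]
modified = implant_pattern("GGGGGGG", pattern_start=2, pattern=toy_pattern)
```

The modified sequences would then be scored by the trained model and compared with the scores of the original sequences.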
2.6 Construction and evaluation of polygenic risk scores
We selected GWAS of 10 continuous and 10 binary traits from the CAUSALdb2 database [16] which were performed in populations of European ancestry and had matching traits in UK Biobank (UKB). For each study, we divided the corresponding variants into LD-blocks according to the CAUSALdb2 labels, treating the primary and secondary signals within a single block as separate blocks. We then constructed PGS by selecting the variant(s) with the highest fine-mapping annotation score from each block and using the effect sizes estimated in the corresponding GWAS. This resulted in 13 PGS per GWAS, using annotation scores from: MIFM, “raw” p-values, CADD v1.7 scores [34], pretrained DNN models: Basenji2 [35], DeepSEA-SEI [36], and Enformer [11], and each of the 7 fine-mapping tools included in the CAUSALdb2 annotations: ABF [37], CAVIARBF [6], FINEMAP [3], PAINTOR [5], SuSiE [4], PolyFun FINEMAP [3,7], and PolyFun SuSiE [4,7]. For the raw p-values, we calculated the annotation scores as 1 minus the p-value of a variant. For Basenji2 and Enformer, we calculated the annotation scores as the maximum differences in predictions for the alternative versus reference allele over all model outputs. To account for differences in ranges for different outputs, we first obtained predictions for common variants in the 1,000 Genomes dataset [38], and used these to normalize the outputs of each prediction track. We evaluated the scores on the African (AFR), Admixed American (AMR), Central/South Asian (CSA), East Asian (EAS) and Middle Eastern (MID) ancestry subsets of UKB, which we defined using the ancestry analysis functionality of pgs-calc [39], with a 1,000 Genomes LD reference panel [38]. This resulted in a total of 100 scenarios (20 traits × 5 ancestries). Within each scenario, we divided the samples into 5 folds and fitted 5 linear models of the PGS and covariates (age, sex, UKB assessment center, genotyping batch, and the first 10 genetic principal components). 
Each time we selected a different set of 4 folds for model training and the remaining fold for evaluation, and averaged the final outcome. For each of the 100 scenarios, we assessed the significance of the difference between the performances in terms of the R2 score (we used the McFadden pseudo-R2 [40] for binary traits) of MIFM and each baseline using a permutation test with 108 permutations. We performed this evaluation in 3 settings, each time selecting the top 1, 5 or 10 variants with the highest annotation score per-block, dividing the effect sizes by the number of variants per-block.
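The PGS construction step can be illustrated with a short Python sketch: within each block, the k variants with the highest annotation score are kept and their GWAS effect sizes are divided by the number of variants selected in that block. Interpreting "the number of variants per-block" as the number of selected variants is an assumption, and the variant identifiers, scores and effect sizes below are invented for the example.

```python
def build_pgs_weights(blocks, scores, effect_sizes, k=1):
    """Select the k highest-scoring variants per block and divide their GWAS
    effect sizes by the number of variants selected in that block."""
    weights = {}
    for block in blocks:
        top = sorted(block, key=lambda variant: scores[variant], reverse=True)[:k]
        for variant in top:
            weights[variant] = effect_sizes[variant] / len(top)
    return weights

def polygenic_score(genotype, weights):
    """Weighted sum of allele dosages (0, 1 or 2); missing variants count as 0."""
    return sum(weight * genotype.get(variant, 0) for variant, weight in weights.items())

# Hypothetical variants grouped into two LD blocks
blocks = [["rs1", "rs2", "rs3"], ["rs4"]]
scores = {"rs1": 0.2, "rs2": 0.9, "rs3": 0.5, "rs4": 0.7}
effect_sizes = {"rs1": 0.1, "rs2": 0.4, "rs3": -0.2, "rs4": 0.3}

weights_top1 = build_pgs_weights(blocks, scores, effect_sizes, k=1)
weights_top2 = build_pgs_weights(blocks, scores, effect_sizes, k=2)
individual_score = polygenic_score({"rs2": 2, "rs4": 1}, weights_top1)
```

Swapping the `scores` dictionary for the output of another fine-mapping method yields the corresponding baseline PGS, with the effect sizes unchanged.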
2.7 GWAS and conditional analyses
We performed GWAS of 4 traits (height, red blood cell count, systolic blood pressure, and heel bone mineral density) on a sample of N = 40,000 unrelated individuals from UKB using the standard linear regression functionality of the BOLT-LMM software [41]. We filtered the SNPs on minor allele frequency (MAF) and on a Hardy-Weinberg equilibrium test, and included imputed variants passing an INFO score filter, which resulted in 9,637,426 SNPs in total. We transformed the phenotypes using the rank-based inverse normal transformation [42] and adjusted them for confounders using age, sex, the identifiers of the genotyping array and UKB assessment center, and the first 10 genetic principal components. For each trait, we constructed a set of independent loci using the clumping functionality of the PLINK software [43], with separate significance thresholds for the lead SNPs (variants with the lowest p-value per locus) and for secondary variants associated with the lead SNPs. Since we focus our analysis on disentangling the effect sizes of highly correlated variants, we used an R2 threshold of 0.9 and a physical distance threshold of 1,000 kb. For each lead SNP and its secondary variants, we fitted three joint models of SNPs and confounders: a baseline model with all variants in the clump, and two models which only included the lead SNP and variants filtered based on either their p-values or their MIFM scores. For the filtered models we retained secondary variants with p-values below the 30th percentile of p-values of all GWAS-associated SNPs, or with MIFM scores above the 70th percentile of scores for the MIFM model. For each locus we additionally fitted a full joint model of all variants on a second, independent sample, and tested for differences in effect size estimates of all significant variants from the previous step between the filtered and full models. This was to exclude cases where a non-causal variant becomes significant as a “substitute”, due to the true causal variant being removed.
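As an illustration of the phenotype preprocessing, the following sketch implements a rank-based inverse normal transformation with the Python standard library. The Blom offset of 3/8 and the averaging of tied ranks are common conventions assumed here. The exact variant used in the analysis is not stated above.

```python
from statistics import NormalDist

def rank_inverse_normal(values, offset=3 / 8):
    """Map values to normal quantiles based on their ranks (ties receive the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda index: values[index])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((rank - offset) / (n - 2 * offset + 1)) for rank in ranks]

transformed = rank_inverse_normal([3.0, 1.0, 2.0])
```

The transformation preserves the ordering of the phenotype values while removing the influence of outliers and skew on the linear regression.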
3 Verification and comparison
3.1 MIFM variants are enriched for enhancer, repressed, and silencer chromatin signatures
To characterize variants prioritized by MIFM, we analyzed whether they enrich for regulatory elements compared to all putatively causal variants from CAUSALdb2. We computed the enrichment for 20 GenoSTAN-defined states [26], which we divided into 6 groups: enhancers, promoters, repressed regions, transcriptional elongations, repressed-enhancer regions, and low-signal regions (Fig 2). MIFM-prioritized variants are significantly enriched for repressed regions (odds ratio (OR) from 1.02 for the lowest quantile to 1.13 for the highest quantile), repressed-enhancer regions (OR from 1.01 to 1.13), and enhancer elements (OR from 1.01 to 1.10), with the highest enrichment for the “strongly” defined enhancer subgroup (“Enh.6”, up to an OR of 1.16). Conversely, we observed a significant depletion of low-signal regions at higher model score quantiles (OR from 0.98 at the 0.7 quantile to 0.97 at the 0.9 quantile), and of transcription elongations (OR from 0.99 for the 0.3 quantile to 0.98 for the 0.9 quantile), with the exception of “Gen5’.13”, which was enriched with an OR of up to 1.12. We did not observe significant deviations from 1 in the odds ratios for promoter regions. To understand why MIFM enriches for repressed regions, we further analyzed the repressed and repressed-enhancer variants prioritized by MIFM, and observed a significant enrichment for enhancer regions compared to all repressed and repressed-enhancer variants of CAUSALdb2 (up to 1.08 for repressed regions and up to 1.10 for repressed enhancer-like regions) (Tables A–B in S1 Appendix), indicating that repressed regions prioritized by MIFM are often active enhancers in other cell lines. Additionally, we analyzed repressed, repressed-enhancer, and enhancer regions in terms of silencers, and observed a significant enrichment in MIFM-prioritized repressed regions and in enhancer regions (up to 1.07 in both subsets), and no significant change in OR for repressed-enhancer regions (Tables C–F in S1 Appendix). The presence of silencers is associated with the H3K27me3 histone modification [44], which also characterizes the repressed GenoSTAN states, further suggesting that MIFM enriches for repressed regions with functional elements. Silencers can act as enhancers depending on the cellular context [45–47], and their enrichment in the enhancer subset can indicate a preference of the model for such “dual” elements.
We plot the odds ratios (y-axis) of variants with MIFM scores above each quantile (x-axis) versus all variants in the CAUSALdb2 data. Each plot represents a different regulatory group based on GenoSTAN [26] chromatin state annotations (plots a to f) and silencerDB [27] validated and predicted silencer elements (plot g).
3.2 Syntax analysis of a MIFM trained model
In order to analyze the syntax and patterns relevant for a trained MIFM model, we employed TF-MoDISco and identified 161 unique DNA patterns in total, consisting of 67 patterns with positive attribution scores for the model's predictions and 94 patterns with negative attributions. The negative patterns had a smaller support in terms of corresponding seqlets (instances of similar patterns), with a median of 798 seqlets compared to a median of 1,257 seqlets for a positive pattern. In total, 60% of the positive patterns and 40% of the negative ones had a support of at least 1,000 corresponding seqlet instances in the data. 20 positive and 49 negative patterns were significantly matched to at least one known human TF binding motif using TOMTOM [33]. TF motifs with the highest number of matched patterns are shown in Table F in S1 Appendix. Overall, we observed several motifs that were matched to both positive and negative patterns at the same time.
We further analyzed how predictions of MIFM are influenced by the positive and the negative patterns by performing in silico mutagenesis of sequences with patterns matching the binding motif of the ARID3A protein, a TF with reported interactions with other regulatory elements [48–50]. We modified sequences containing positive patterns upstream of the SNP position by replacing the positive pattern with the negative one, and vice versa, and compared MIFM predictions for the original and modified sequences (Fig 3). Adding negative patterns to sequences with a positive pattern shifted the distributions of scores towards lower values (from an average mean score of 0.54 to 0.48) and increased their spread (average standard deviation from 0.03 to 0.08). Adding positive patterns to “negative” sequences slightly shifted their scores towards higher values (average mean from 0.17 to 0.20, average standard deviation from 0.08 to 0.09). This suggests that the context of a variant is necessary, but not sufficient, for it to be predicted as causal by MIFM: one can “disable” the function of a variant by modifying its context, but modifying the context alone is not enough to make a variant be predicted as causal.
We selected 3 positive patterns (rows) and 3 negative patterns (columns) matching the ARID3A motif which were identified by TF-MoDISco as influencing MIFM predictions. For each positive-negative pattern pair, we computed MIFM scores for sequences containing the positive pattern (odd subcolumns) and sequences with the negative pattern (even subcolumns) and plot the density functions of the scores in blue. We modified the sequences by adding the negative pattern to the positive sequences (odd subcolumns) and vice versa (even subcolumns), scored the modified sequences with MIFM, and plot the density functions of the modified scores in red. The vertical lines denote the means of MIFM scores of the original (blue) and modified sequences (red).
4 Applications
4.1 Polygenic risk scores created with MIFM transfer better to non-European ancestries
The predictive performance of PGS can decrease when applied to populations different from the one where the GWAS summary statistics were obtained from [51–53]. Due to varying LD patterns across ancestries, SNPs associated with the phenotype in the GWAS population might not be tagging the causal variants in the target population. As most studies are biased towards European populations [54–56], this can increase health disparities, e.g., by failing to identify individuals at risk in minority ancestries [54]. On the other hand, there is evidence for causal variants and their effect sizes being consistent across ancestries [57–59]. Thus, identifying causal variants should improve the cross-ancestry transferability of PGS.
We created PGS by prioritizing variants using MIFM and 12 baseline methods, and evaluated their performance on non-European ancestries (Sect 2.6). Each method was evaluated for 20 traits and 5 ancestries, yielding a total of 100 scenarios per model. Across all scenarios, the MIFM PGS explained the most phenotypic variance, with an average R2 = 0.042, followed by R2 = 0.04 for DeepSEA-SEI and CADD (Fig 4). Within each scenario, we compared the performance of PGS created with MIFM and each baseline, and counted the total number of scenarios where MIFM performed significantly better or worse than a baseline (Fig 5). Overall, MIFM performed better in 15% and worse in 3% of all scenarios. The net number of scenarios where MIFM performed better was positive regardless of the baseline, ranging from a net difference of 2% for Enformer to 15% for PAINTOR. The smallest improvements were obtained for the AMR ancestry (9% of scenarios better, 5% worse), while the largest improvements were for the AFR ancestry (19% better, 1% worse) (Fig A in S1 Appendix). With respect to individual traits, MIFM performed the worst for inflammatory bowel disease (5% better, 25% worse) and glucose levels (2% better, 12% worse), while the most consistent improvements were for serum urate levels (72% better) and HDL cholesterol levels (42% better) (Fig B in S1 Appendix). All individual R2 scores for each model-ancestry-trait combination are included in S1 Table. As certain baseline methods, e.g., SUSIE, are designed to output a set of multiple putatively causal variants instead of a single most likely one, we repeated the above evaluation in two additional settings, selecting the top 5 and top 10 prioritized variants per block (Figs C–F in S1 Appendix, S2–S3 Tables). This improved the R2 scores of the 7 fine-mapping tools, and in the top 5 setting, PAINTOR and PolyFun FINEMAP achieved a higher score than MIFM (R2 = 0.43 vs. R2 = 0.42 for MIFM, Fig E in S1 Appendix). In the top 10 setting, Enformer was significantly better than MIFM in a larger number of scenarios (6 better, 4 worse, Fig D in S1 Appendix).
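The per-block selection and scoring described above can be illustrated with a minimal sketch. It assumes that each variant carries an LD-block identifier, a prioritization score, and a GWAS effect size; the field names, the toy data, and the use of the squared Pearson correlation as R2 are illustrative assumptions and not the exact pipeline of Sect 2.6.

```python
from collections import defaultdict
from statistics import mean


def top_k_per_block(variants, k=1):
    """Keep the k highest-scoring variants in each LD block."""
    blocks = defaultdict(list)
    for v in variants:
        blocks[v["block"]].append(v)
    selected = []
    for block_variants in blocks.values():
        ranked = sorted(block_variants, key=lambda v: v["score"], reverse=True)
        selected.extend(ranked[:k])
    return selected


def polygenic_score(genotype, selected):
    """Weighted sum of allele counts (0/1/2) over the selected variants."""
    return sum(v["beta"] * genotype[v["id"]] for v in selected)


def r_squared(x, y):
    """Squared Pearson correlation between scores and phenotypes."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)


# Hypothetical toy data: two LD blocks with two variants each.
variants = [
    {"id": "rs1", "block": 0, "score": 0.9, "beta": 0.5},
    {"id": "rs2", "block": 0, "score": 0.2, "beta": 0.4},
    {"id": "rs3", "block": 1, "score": 0.1, "beta": -0.3},
    {"id": "rs4", "block": 1, "score": 0.7, "beta": -0.2},
]
genotypes = [
    {"rs1": 0, "rs2": 0, "rs3": 2, "rs4": 2},
    {"rs1": 1, "rs2": 1, "rs3": 1, "rs4": 1},
    {"rs1": 2, "rs2": 2, "rs3": 0, "rs4": 0},
    {"rs1": 1, "rs2": 0, "rs3": 0, "rs4": 1},
]
phenotypes = [-0.5, 0.2, 1.1, 0.4]

selected = top_k_per_block(variants, k=1)  # top-1 setting; k=5 or k=10 for the others
scores = [polygenic_score(g, selected) for g in genotypes]
r2 = r_squared(scores, phenotypes)
```

Changing `k` switches between the top-1, top-5, and top-10 settings without touching the scoring step, which is why the three evaluations can share one pipeline.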
Fig 4. We created PGS using results from 20 GWAS performed on European samples and evaluated them on 5 non-European samples, yielding 100 test scenarios per model.
Fig 5. We created PGS using results from 20 GWAS performed on European samples and evaluated them on 5 non-European samples, yielding 100 test scenarios per model. For each baseline, we counted the number of scenarios where MIFM performed better than the baseline (in green), worse (in red), or not significantly different (in gray).
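The scenario counts reported against each baseline can be tallied as in the following sketch. It assumes that every scenario is summarized by an R2 difference (MIFM minus baseline) and a p-value from some paired test; the significance level of 0.05, the function names, and the toy values are hypothetical and stand in for the procedure of Sect 2.6.

```python
from collections import Counter


def classify(delta_r2, p_value, alpha=0.05):
    """Label one scenario as better, worse, or not significantly different."""
    if p_value >= alpha:
        return "not significant"
    return "better" if delta_r2 > 0 else "worse"


def summarize(scenarios, alpha=0.05):
    """Return the fractions of better and worse scenarios and their net difference."""
    counts = Counter(classify(d, p, alpha) for d, p in scenarios)
    n = len(scenarios)
    return {
        "better": counts["better"] / n,
        "worse": counts["worse"] / n,
        "net": (counts["better"] - counts["worse"]) / n,
    }


# Hypothetical (delta R2, p-value) pairs for five scenarios against one baseline.
scenarios = [(0.010, 0.001), (0.004, 0.20), (-0.006, 0.03), (0.002, 0.04), (-0.001, 0.60)]
summary = summarize(scenarios)
```

The `net` entry corresponds to the net difference (#Better - #Worse) used to rank baselines and traits.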
4.2 MIFM enables discovery of additional GWAS signals
Joint regression models of multiple variants can estimate the causal effect sizes instead of the marginal ones [60]. However, in the presence of a large number of highly correlated SNPs, large sample sizes are needed to disentangle the signals. Thus, it is often infeasible to test all the putative variants jointly, especially for studies of modest size. We used MIFM to prioritize variants for conditional testing of highly-correlated () SNPs and compared the results with a naive approach of selecting all variants in high LD in 4 moderately sized GWAS (Table 1). In each GWAS, the joint regression of MIFM-prioritized SNPs yielded a larger number of significant variants, with 47 variants in total compared to 32 variants from the baseline models. Of the 47 variants, 2 had significantly different effect size estimates in the larger model and might be false positives. One of the variants identified by the MIFM joint model had a p-value above the significance threshold in the GWAS and would otherwise have remained undetected by the marginal effect size estimates. We also counted secondary signals, i.e., cases where more than 1 SNP was significant in a joint model; here MIFM identified two more cases than the baseline models (Table 2).
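The difference between marginal and joint effect size estimates can be seen in a small simulation. The sketch below is illustrative only: the sample size, allele frequency, LD level (correlation of about 0.9), and effect size are arbitrary assumptions, and the two-predictor least-squares solution is written out by hand rather than taken from the analysis software used in this study.

```python
import random


def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def marginal_effect(x, y):
    """Slope of a single-SNP regression of y on x."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) / dot(xc, xc)


def joint_effects(x1, x2, y):
    """Slopes of a two-SNP regression, solved from the 2x2 normal equations."""
    a, b, c = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def draw_genotype(rng, maf):
    """Allele count (0, 1, or 2) under Hardy-Weinberg proportions."""
    return (rng.random() < maf) + (rng.random() < maf)


rng = random.Random(0)
n = 20000
x1 = [draw_genotype(rng, 0.3) for _ in range(n)]  # causal SNP
# Non-causal SNP in high LD: copies x1 with probability 0.9.
x2 = [g if rng.random() < 0.9 else draw_genotype(rng, 0.3) for g in x1]
y = [0.5 * g + rng.gauss(0.0, 1.0) for g in x1]   # only x1 affects the phenotype

marginal_x2 = marginal_effect(x2, y)      # inflated by LD with x1
joint_x1, joint_x2 = joint_effects(x1, x2, y)  # x2 estimate shrinks towards zero
```

In this setup the marginal estimate for the non-causal SNP is clearly non-zero, while its joint estimate is close to zero. The price is a wider standard error under high LD, which is why the number of jointly tested variants has to be kept small at modest sample sizes.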
5 Discussion
Identifying causal non-coding variants is typically done with fine-mapping methods which rely on summary statistics and population LD structure, without directly using the underlying DNA sequences. Alternatively, one can employ sequence models of gene expression, which, however, are trained on reference genome data and do not observe individual-level DNA variation. To this end, we proposed a problem formulation of fine-mapping using the MIL objective, where we predict the presence of GWAS associations within LD-blocks containing multiple variants. By using the underlying DNA sequences as input features we can exploit similarities in DNA patterns between causal variants, while by constructing the labels using GWAS summary statistics we indirectly incorporate the individual-level genetic variation which drives SNP associations.
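To make the MIL formulation concrete, the following sketch treats each LD block as a bag of per-variant probabilities and scores the bag against a binary label that indicates whether the block contains a GWAS association. The max-pooling aggregator, the binary cross-entropy loss, and the toy numbers are assumptions chosen for illustration; they are not necessarily the exact objective used to train MIFM.

```python
import math


def bag_probability(instance_probs):
    """Aggregate per-variant probabilities into one block-level probability.

    Max pooling is assumed here: a block is predicted positive if at least
    one of its variants is predicted to be causal.
    """
    return max(instance_probs)


def binary_cross_entropy(p, label, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mil_loss(bags):
    """Mean block-level loss; each bag is (per-variant probabilities, block label)."""
    losses = [binary_cross_entropy(bag_probability(probs), label) for probs, label in bags]
    return sum(losses) / len(losses)


# Hypothetical model outputs for two LD blocks.
bags = [
    ([0.05, 0.80, 0.10], 1),  # block with a GWAS association
    ([0.02, 0.03], 0),        # block without one
]
loss = mil_loss(bags)
```

Because only the block-level label enters the loss, no variant needs an individual ground-truth label; the per-variant probabilities are what is later used for prioritization.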
Using this approach, we trained a DNN model which predicts the probability of a SNP being causal given its neighboring DNA sequence, allowing us to prioritize variants of interest. One of the motivations for identifying causal variants is to robustly predict genetic liability of a phenotype across different populations, especially those which are under-represented in GWAS. By evaluating MIFM-prioritized variants across a range of traits and ancestries, we were able to increase the robustness of polygenic score predictions compared to a wide range of baselines. Furthermore, we showed that utilizing sequence information can be useful for disentangling highly-correlated GWAS variants, a task otherwise statistically infeasible with typical sample sizes.
We note that our goal was to propose a framework for training variant-prioritization models, rather than to develop a new DNN architecture. We employed a relatively lightweight model for our experiments (fewer than 25,000 parameters), and while we did not observe improvements with more complex architectures, we did not conduct an exhaustive comparison of possible DNN models. Besides improving the predictive performance, a valuable extension would be to increase the interpretability of the model with interpretable-by-design architectures [61,62]. We further note that, as MIFM utilizes a database of GWAS results, it can be continuously fine-tuned whenever new summary statistics are available, each time further narrowing down the MIL objective.
We showed how one can utilize the vast amount of available GWAS results to train machine learning models for variant prioritization, overcoming the problem of inaccurate labels due to confounding from LD. Such models can complement traditional fine-mapping methods by reducing the number of putative variants to be analyzed, even without access to the corresponding test statistics. Finally, by introducing base-pair level variations in the training data, this paradigm can be used to increase the robustness of existing DNA sequence models.
Supporting information
S1 Appendix. Document containing supplementary Figs A-F and supplementary Tables A-F.
Fig A in S1 Appendix Ancestry-stratified performance comparison of polygenic scores (PGS) created with MIFM and baseline methods on 5 non-European ancestries and 20 phenotypes. We counted the number of scenarios where MIFM would perform better than a baseline (in green), worse (in red), or not significantly different (in gray).
Fig B in S1 Appendix Per-trait performance comparison of PGS created with MIFM and baseline methods on 5 non-European ancestries and 20 phenotypes. We counted the number of scenarios where MIFM would perform better than a baseline (in green), worse (in red), or not significantly different (in gray). Traits are sorted by the net difference in scenarios where MIFM was better, i.e., #Better - #Worse.
Fig C in S1 Appendix Performance comparison of top-5 variants-per-block PGS created with MIFM and 12 baseline methods on 5 non-European ancestries and 20 traits. We created PGS using results from 20 genome-wide association studies (GWASs) performed on European samples and evaluated them on 5 non-European samples, yielding 100 test scenarios per model. For each baseline, we counted the number of scenarios where MIFM would perform better than the baseline (in green), worse (in red), or not significantly different (in gray).
Fig D in S1 Appendix Performance comparison of top-10 variants-per-block PGS created with MIFM and 12 baseline methods on 5 non-European ancestries and 20 traits. We created PGS using results from 20 GWASs performed on European samples and evaluated them on 5 non-European samples, yielding 100 test scenarios per model. For each baseline, we counted the number of scenarios where MIFM would perform better than the baseline (in green), worse (in red), or not significantly different (in gray).
Fig E in S1 Appendix Mean performance measured by R2 of top-5 variants-per-block PGS created with MIFM and 12 baseline methods on 5 non-European ancestries and 20 traits. We created PGS using results from 20 GWASs performed on European samples and evaluated them on 5 non-European samples, yielding 100 test scenarios per model.
Fig F in S1 Appendix Mean performance measured by R2 of top-10 variants-per-block PGS created with MIFM and 12 baseline methods on 5 non-European ancestries and 20 traits. We created PGS using results from 20 GWASs performed on European samples and evaluated them on 5 non-European samples, yielding 100 test scenarios per model.
Table A in S1 Appendix Enrichment of enhancer regions in repressed-enhancer regions prioritized by MIFM.
Table B in S1 Appendix Enrichment of enhancer regions in repressed regions prioritized by MIFM.
Table C in S1 Appendix Enrichment of silencer elements in repressed-enhancer regions prioritized by MIFM.
Table D in S1 Appendix Enrichment of silencers in repressed regions prioritized by MIFM.
Table E in S1 Appendix Enrichment of silencers in enhancer regions prioritized by MIFM.
Table F in S1 Appendix Transcription factor motifs matched to patterns identified in MIFM using Transcription-Factor Motif Discovery from Importance Scores (TF-MoDISco). Pattern type denotes whether a TF-MoDISco pattern contributes positively or negatively to MIFM predictions. TF motif denotes the name of the transcription factor. No. seqlets – the total number of TF-MoDISco seqlets matching the given transcription factor (TF) motif. No. patterns – the total number of different TF-MoDISco patterns matching the given TF motif.
https://doi.org/10.1371/journal.pgen.1012208.s001
(PDF)
S1 Table. R2 scores for each model-ancestry-trait combination of the cross-ancestry PGS evaluation for the top-1 variant-per-block setting.
https://doi.org/10.1371/journal.pgen.1012208.s002
(TSV)
S2 Table. R2 scores for each model-ancestry-trait combination of the cross-ancestry PGS evaluation for the top-5 variants-per-block setting.
https://doi.org/10.1371/journal.pgen.1012208.s003
(TSV)
S3 Table. R2 scores for each model-ancestry-trait combination of the cross-ancestry PGS evaluation for the top-10 variants-per-block setting.
https://doi.org/10.1371/journal.pgen.1012208.s004
(TSV)
Acknowledgments
This research has been conducted using the UK Biobank Resource under Application Number 77717.
References
- 1. Sinnott-Armstrong N, Naqvi S, Rivas M, Pritchard JK. GWAS of three molecular traits highlights core genes and pathways alongside a highly polygenic background. Elife. 2021;10:e58615.
- 2. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. In: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 2014. p. 610–1. https://doi.org/10.1145/2649387.2660800
- 3. Benner C, Spencer CCA, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32(10):1493–501. pmid:26773131
- 4. Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Series B Stat Methodol. 2020;82(5):1273–300. pmid:37220626
- 5. Kichaev G, Yang W-Y, Lindstrom S, Hormozdiari F, Eskin E, Price AL, et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet. 2014;10(10):e1004722. pmid:25357204
- 6. Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, Haralambieva IH, Poland GA, et al. Fine mapping causal variants with an approximate Bayesian method using marginal test statistics. Genetics. 2015;200(3):719–36.
- 7. Weissbrod O, Hormozdiari F, Benner C, Cui R, Ulirsch J, Gazal S, et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet. 2020;52(12):1355–63. pmid:33199916
- 8. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4. pmid:26301843
- 9. Kelley DR. Cross-species regulatory sequence activity prediction. PLoS Comput Biol. 2020;16(7):e1008050. pmid:32687525
- 10. Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021;53(3):354–66. pmid:33603233
- 11. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18(10):1196–203. pmid:34608324
- 12. Kumasaka N, Knights AJ, Gaffney DJ. High-resolution genetic mapping of putative causal interactions between regions of open chromatin. Nat Genet. 2019;51(1):128–37. pmid:30478436
- 13. Broekema RV, Bakker OB, Jonkers IH. A practical view of fine-mapping and gene prioritization in the post-genome-wide association era. Open Biol. 2020;10(1):190221. pmid:31937202
- 14. Cerezo M, Sollis E, Ji Y, Lewis E, Abid A, Bircan KO, et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res. 2025;53(D1):D998–1005. pmid:39530240
- 15. Wang J, Huang D, Zhou Y, Yao H, Liu H, Zhai S, et al. CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies. Nucleic Acids Res. 2020;48(D1):D807–16. pmid:31691819
- 16. Wang J, Ouyang L, You T, Yang N, Xu X, Zhang W, et al. CAUSALdb2: an updated database for causal variants of complex traits. Nucleic Acids Res. 2025;53(D1):D1295–301. pmid:39558176
- 17. Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. https://arxiv.org/abs/1412.6980
- 18. Ganaie MA, Hu M, Malik AK, Tanveer M, Suganthan PN. Ensemble deep learning: a review. Engineering Applications of Artificial Intelligence. 2022;115:105151.
- 19. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint. 2015. https://arxiv.org/abs/1503.02531
- 20. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. Pytorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32.
- 21. Falcon W, The PyTorch Lightning Team. PyTorch Lightning. 2019.
- 22. Rakowski A, Monti R, Huryn V, Lemanczyk M, Ohler U, Lippert C. Metadata-guided feature disentanglement for functional genomics. Bioinformatics. 2024;40(Suppl 2):ii4–10. pmid:39230700
- 23. Boyd SP, Vandenberghe L. Convex optimization. Cambridge University Press; 2004.
- 24. Babenko B. Multiple instance learning: algorithms and applications. PubMed. 2008;19.
- 25. Ilse M, Tomczak J, Welling M. Attention-based deep multiple instance learning. In: International Conference on Machine Learning. 2018. p. 2127–36.
- 26. Zacher B, Michel M, Schwalb B, Cramer P, Tresch A, Gagneur J. Accurate Promoter and Enhancer Identification in 127 ENCODE and Roadmap Epigenomics Cell Types and Tissues by GenoSTAN. PLoS One. 2017;12(1):e0169249. pmid:28056037
- 27. Zeng W, Chen S, Cui X, Chen X, Gao Z, Jiang R. SilencerDB: a comprehensive database of silencers. Nucleic Acids Res. 2021;49(D1):D221–8. pmid:33045745
- 28. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, et al. The UCSC genome browser database: update 2006. Nucleic Acids Research. 2006;34(suppl_1):D590–8.
- 29. Fisher RA. On the interpretation of χ 2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society. 1922;85(1):87.
- 30. Shrikumar A, Tian K, Avsec Ž, Shcherbina A, Banerjee A, Sharmin M, et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5. arXiv preprint. 2018. https://arxiv.org/abs/1811.00416
- 31. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. In: International Conference on Machine Learning. 2017. p. 3145–53.
- 32. Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, et al. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Research. 2013;41(D1):D195–202.
- 33. Tanaka E, Bailey T, Grant CE, Noble WS, Keich U. Improved similarity scores for comparing motifs. Bioinformatics. 2011;27(12):1603–9. pmid:21543443
- 34. Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res. 2024;52(D1):D1143–54. pmid:38183205
- 35. Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 2018;28(5):739–50. pmid:29588361
- 36. Chen KM, Wong AK, Troyanskaya OG, Zhou J. A sequence-based global map of regulatory activity for deciphering human genetics. Nature Genetics. 2022;54(7):940–9.
- 37. Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007;81(2):208–27. pmid:17668372
- 38. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. pmid:26432245
- 39. Lambert SA, Wingfield B, Gibson JT, Gil L, Ramachandran S, Yvon F, et al. Enhancing the polygenic score catalog with tools for score calculation and ancestry normalization. Nat Genet. 2024;56(10):1989–94. pmid:39327485
- 40. McFadden D. Regression-based specification tests for the multinomial logit model. Journal of Econometrics. 1987;34(1–2):63–82.
- 41. Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet. 2015;47(3):284–90. pmid:25642633
- 42. Beasley TM, Erickson S, Allison DB. Rank-based inverse normal transformations are increasingly used, but are they merited?. Behav Genet. 2009;39(5):580–95. pmid:19526352
- 43. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. pmid:17701901
- 44. Cai Y, Zhang Y, Loh YP, Tng JQ, Lim MC, Cao Z, et al. H3K27me3-rich genomic regions can function as silencers to repress gene expression via chromatin interactions. Nat Commun. 2021;12(1):719. pmid:33514712
- 45. Gisselbrecht SS, Palagi A, Kurland JV, Rogers JM, Ozadam H, Zhan Y, et al. Transcriptional silencers in drosophila serve a dual role as transcriptional enhancers in alternate cellular contexts. Mol Cell. 2020;77(2):324-337.e8. pmid:31704182
- 46. Ngan CY, Wong CH, Tjong H, Wang W, Goldfeder RL, Choi C, et al. Chromatin interaction analyses elucidate the roles of PRC2-bound silencers in mouse development. Nat Genet. 2020;52(3):264–72. pmid:32094912
- 47. Della Rosa M, Spivakov M. Silencers in the spotlight. Nat Genet. 2020;52(3):244–5. pmid:32094910
- 48. Garton J, Shankar M, Chapman B, Rose K, Gaffney PM, Webb CF. Deficiencies in the DNA Binding Protein ARID3a Alter Chromatin Structures Important for Early Human Erythropoiesis. Immunohorizons. 2021;5(10):802–17. pmid:34663594
- 49. Saadat KASM, Lestari W, Pratama E, Ma T, Iseki S, Tatsumi M, et al. Distinct and overlapping roles of ARID3A and ARID3B in regulating E2F-dependent transcription via direct binding to E2F target genes. International Journal of Oncology. 2021;58(4):1–12.
- 50. Shen M, Li S, Zhao Y, Liu Y, Liu Z, Huan L, et al. Hepatic ARID3A facilitates liver cancer malignancy by cooperating with CEP131 to regulate an embryonic stem cell-like gene signature. Cell Death Dis. 2022;13(8):732. pmid:36008383
- 51. Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, et al. Human demographic history impacts genetic risk prediction across diverse populations. Am J Hum Genet. 2017;100(4):635–49. pmid:28366442
- 52. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nature Communications. 2019;10(1):3328.
- 53. Mars N, Kerminen S, Feng Y-CA, Kanai M, Läll K, Thomas LF, et al. Genome-wide risk prediction of common diseases across ancestries in one million people. Cell Genom. 2022;2(4):None. pmid:35591975
- 54. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51(4):584–91. pmid:30926966
- 55. Sirugo G, Williams SM, Tishkoff SA. The missing diversity in human genetic studies. Cell. 2019;177(1):26–31.
- 56. Fitipaldi H, Franks PW. Ethnic, gender and other sociodemographic biases in genome-wide association studies for the most burdensome non-communicable diseases: 2005-2022. Hum Mol Genet. 2023;32(3):520–32. pmid:36190496
- 57. Wang Y, Guo J, Ni G, Yang J, Visscher PM, Yengo L. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat Commun. 2020;11(1):3865. pmid:32737319
- 58. Saitou M, Dahl A, Wang Q, Liu X. Allele frequency differences of causal variants have a major impact on low cross-ancestry portability of PRS. medRxiv. 2022:2022–10.
- 59. Hou K, Ding Y, Xu Z, Wu Y, Bhattacharya A, Mester R, et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat Genet. 2023;55(4):549–58. pmid:36941441
- 60. Yang J, Ferreira T, Morris AP, Medland SE, Madden PAF, Heath AC, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012;44(4):369–75.
- 61. Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol. 2023;24(1):154. pmid:37370113
- 62. Tseng AM, Eraslan G, Biancalani T, Scalia G. A mechanistically interpretable neural network for regulatory genomics. arXiv preprint. 2024. https://arxiv.org/abs/2410.06211