Fig 1.
Overview of computational workflow employed for interpretation benchmarking.
Following application approval and data selection, genotype quality control (QC) and phenotype processing and filtering were performed. The preprocessed data were then divided into training, validation, and testing sets, and spike-in SNPs and decoy genotypes were constructed. DNN models were trained on the training set and optimized with the validation set. In parallel, association testing was performed by fitting a generalized linear model to the training set with PLINK to obtain GWAS results. Interpretation was then performed on the trained DNN model(s) with established interpretation algorithms, including Saliency, Gradient SHAP, DeepLIFT, and Integrated Gradients. Finally, attribution recall, attribution precision, and attribution consistency were measured for the benchmarked interpretation methods.
Fig 2.
Illustration of the attribution precision metric.
The full set of features consists of real SNPs (left, green) and decoy SNPs (right, red). Within the set of real SNPs, a small subset is truly associated with the phenotype (grey circle; SNPs with associations). A DNN interpretation method identifies a set of top-K SNPs (yellow oval; DL-Salient SNPs), containing three subsets: A (truly associated real SNPs), B (real SNPs lacking true association), and C (decoy SNPs). Since sets B and C are assumed to be comparable in size, the number of decoy SNPs among the top-K most highly attributed SNPs (|C|) is used as an estimate of the number of real SNPs lacking true association (|B|), enabling the calculation of attribution precision as 1 − |C|/(|A| + |B|).
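The decoy-based estimate described above can be sketched in a few lines. This is a minimal illustration, not the study's code; the function and variable names are assumptions, and SNP identifiers are toy values.

```python
def attribution_precision(top_k_snps, decoy_ids):
    """Estimate attribution precision from a top-K attribution set.

    Decoy SNPs found in the top-K set (set C) are used as a proxy
    for real SNPs lacking true association (set B), giving
    precision ~= 1 - |C| / (|A| + |B|), where A u B are the real
    SNPs in the top-K set.
    """
    top_k = set(top_k_snps)
    decoys_in_top_k = top_k & set(decoy_ids)   # set C
    real_in_top_k = top_k - set(decoy_ids)     # set A u B
    if not real_in_top_k:
        return 0.0
    return 1.0 - len(decoys_in_top_k) / len(real_in_top_k)

# Toy example: 8 real SNPs and 2 decoys appear in a top-10 set.
top10 = [f"rs{i}" for i in range(8)] + ["decoy_0", "decoy_1"]
decoys = ["decoy_0", "decoy_1", "decoy_2"]
print(attribution_precision(top10, decoys))  # 1 - 2/8 = 0.75
```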
Table 1.
Attribution recall for DNN attribution algorithms and the linear GWAS model across the top 10%, 5%, and 1% of SNPs ranked by attribution magnitude.
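Recall at a quantile threshold, as tabulated above, is the fraction of truly associated (spike-in) SNPs that rank within the top fraction of SNPs by attribution magnitude. A minimal sketch, assuming attributions are available as a NumPy array indexed by SNP; the names are hypothetical, not from the study's code:

```python
import numpy as np

def attribution_recall(attributions, causal_idx, quantile):
    """Fraction of truly associated SNPs ranked within the top
    `quantile` fraction of SNPs by absolute attribution."""
    scores = np.abs(np.asarray(attributions))
    k = max(1, int(round(quantile * scores.size)))
    top_k = set(np.argsort(scores)[::-1][:k])
    causal = set(causal_idx)
    return len(top_k & causal) / len(causal)

# Toy example: 1,000 SNPs, 3 spike-ins with inflated attributions.
rng = np.random.default_rng(0)
attr = rng.random(1000)
causal = [3, 7, 42]
attr[causal] += 10.0
print(attribution_recall(attr, causal, 0.01))  # 1.0
```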
Fig 3.
Mean attribution recall across quantile thresholds for DNN interpretation methods and GWAS.
Mean recall values are shown for DNN attribution methods with (orange) and without (blue) SmoothGrad, compared with the GWAS baseline (green). Shaded regions represent the mean ± 1 standard deviation across replicates. (A) Dominant-effect recall. Smoothed DNN attribution methods achieved consistently higher recall than both non-smoothed variants and GWAS for dominant synthetic variants. (B) Recessive-effect recall. A similar trend was observed for recessive variants, where smoothed methods maintained greater sensitivity across thresholds. (C) Epistatic-effect recall. Both smoothed and non-smoothed DNN methods recovered measurable epistatic associations, substantially outperforming GWAS, which exhibited near-zero recall.
Table 2.
Attribution precision benchmarking results across the top 20%, 10%, 3%, 2%, and 1% of SNPs ranked by attribution magnitude†.
Fig 4.
Average attribution precision across quantile thresholds for SmoothGrad and non-SmoothGrad variants.
Lines represent the mean attribution precision across attribution algorithms (Saliency, DeepLIFT, Gradient SHAP, and Integrated Gradients) with (orange) and without (blue) SmoothGrad applied. Shaded regions denote ± 1 standard deviation across the algorithms included in each mean curve. Quantile values along the x-axis correspond to the top fraction of most highly attributed SNPs. Precision increased monotonically with stricter quantile thresholds, with SmoothGrad consistently improving attribution specificity relative to non-smoothed variants.
Table 3.
Ensemble consistency as model-wise SNP attribution variability.
Fig 5.
Distribution of SNP-wise relative standard deviations (RSD) of attribution magnitudes across ten ensemble members for each algorithm.
For each method, the left (blue) half represents the non-SmoothGrad variant and the right (orange) half represents the SmoothGrad variant. Lower RSD values indicate higher ensemble consistency. Extended upper tails reflect outlier SNPs exhibiting greater attribution variability across models.
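The SNP-wise RSD underlying these distributions can be computed as the standard deviation of a SNP's attribution magnitude across ensemble members, divided by its mean. A minimal sketch under the assumption that attributions are stored as a (models × SNPs) array; this is illustrative, not the study's implementation:

```python
import numpy as np

def snp_rsd(attr_matrix):
    """Per-SNP relative standard deviation of attribution
    magnitudes across ensemble members.

    attr_matrix: array of shape (n_models, n_snps).
    Returns an array of shape (n_snps,); SNPs with zero mean
    attribution are assigned an RSD of 0.
    """
    mags = np.abs(np.asarray(attr_matrix))
    mean = mags.mean(axis=0)
    std = mags.std(axis=0, ddof=1)
    return np.divide(std, mean, out=np.zeros_like(std), where=mean > 0)

# Toy example: 10 ensemble members, 5 SNPs.
rng = np.random.default_rng(1)
ens = rng.normal(1.0, 0.1, size=(10, 5))
print(snp_rsd(ens).shape)  # (5,)
```

Lower values indicate that the ensemble members agree on a SNP's importance.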
Table 4.
Composite scores, computed as the geometric mean of attribution recall, attribution precision, and ensemble consistency, providing a single quantitative measure of interpretation performance.
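The geometric mean of the three metrics can be sketched as below. This assumes all three inputs are scaled to [0, 1] with higher values indicating better performance (for consistency, e.g. an inverse transform of the RSD); the scaling convention is an assumption, not the study's exact transform.

```python
def composite_score(recall, precision, consistency):
    """Geometric mean of recall, precision, and consistency,
    each assumed to lie in [0, 1] with higher = better."""
    return (recall * precision * consistency) ** (1.0 / 3.0)

# Toy example with illustrative metric values.
print(round(composite_score(0.8, 0.9, 0.6), 3))
```

Unlike an arithmetic mean, the geometric mean penalizes a method that scores near zero on any single metric.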