Learning genetic perturbation effects with variational causal inference

doi:10.1371/journal.pcbi.1013194

Fig 1.

Key components of SCCVAE.

(A) Illustration of a structural causal model, where the parameters associated with gene i are annotated. (B) The architecture of SCCVAE. It contains: an expression encoder that maps X to exogenous noise variables Z, a shift encoder that maps p to a shift vector S^p, a structural causal model (e.g., as illustrated in (A)) that maps (Z, S^p) to U^p, and an expression decoder.
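
To make the role of the structural causal model in panel (B) concrete, the following is a minimal sketch of how exogenous noise Z and a shift vector S^p could be propagated through a causal graph to give U^p. It assumes a linear, acyclic mechanism with additive shifts, which the caption does not state; the function name `linear_scm` and the toy adjacency matrix are hypothetical and are not taken from the SCCVAE implementation.

```python
import numpy as np

def linear_scm(z, shift, adjacency):
    """Propagate noise and shift through a linear SCM (illustrative assumption).

    adjacency[i, j] is the weight of the edge i -> j, so each node satisfies
    u_j = sum_i adjacency[i, j] * u_i + z_j + shift_j.
    """
    n = adjacency.shape[0]
    # Rearranged as (I - A^T) u = z + shift; solvable when the graph is acyclic.
    return np.linalg.solve(np.eye(n) - adjacency.T, z + shift)

# Toy chain 0 -> 1 -> 2 with unit edge weights (hypothetical graph).
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
z = np.zeros(3)                 # exogenous noise, set to zero for clarity
s = np.array([1.0, 0.0, 0.0])   # shift applied to node 0 only
u = linear_scm(z, s, A)         # the shift on node 0 reaches its descendants
```

In this toy chain, a shift on the root node propagates to every downstream node, which is the qualitative behavior one would expect from an intervention passed through a causal graph.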

Table 1.

SCCVAE vs baselines on the out-of-distribution task.

Results are averaged across five different splits, each holding out a different set of perturbations for testing; together, the five test sets cover all perturbations. On both all essential genes and the top 50 genes, SCCVAE outperforms control and GEARS on every metric except Pearson correlation, where all three methods achieve similar mean values but SCCVAE has lower variance in its predictions. SCCVAE also outperforms a transformer-based model, scGPT, on both distribution-based metrics. Compared to the simple linear model of Ahlmann-Eltze et al. [21], SCCVAE achieves similar metrics while additionally achieving low distributional loss, which extends its scope beyond that of the linear model.

Fig 2.

(A) Quantitative performance of GEARS and SCCVAE on the OOD task, compared with the control distribution. On average, GEARS learns the control distribution, whereas SCCVAE approximates the ground truth more closely and is comparable to the linear method across all metrics.

This effect is especially pronounced when evaluating all essential genes; for the top 50 genes, a few outlier perturbations have unusually high error. The linear model is limited to bulk analysis and therefore does not include MMD evaluations. (B) UMAP visualizations of SCCVAE and GEARS outputs versus ground truth for select perturbations in the OOD task. Consistent with the quantitative results, GEARS outputs match the control distribution, while SCCVAE outputs match the perturbation-specific ground truth.

Table 2.

Ablation studies.

SCCVAE with a learned graph (SCCVAE) consistently achieves superior MSE and MMD values. Both the conditional model and the model with a pre-specified graph (Causal-GSP) still outperform the GEARS baselines from Table 1, and Causal-GSP outperforms the conditional model, but both are more restrictive than SCCVAE with a learned graph. The models with random causal graphs achieve high error, as expected, but outperform all other models on the fraction same/changed metrics. This is likely because the shifts are tuned to extreme values during training and shift selection to compensate for the poor distributional match of the random graph.

Fig 3.

Distributional loss (MMD) on SCCVAE ablations.

The learned causal graph in the SCCVAE model achieves lower MMD than conditional, sparse causal graph, and random graph equivalents.
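
For readers unfamiliar with the metric, the sketch below shows one common way to estimate the squared maximum mean discrepancy (MMD) between two sets of cells. It assumes an RBF kernel with a fixed bandwidth parameter `gamma` and uses the biased estimator; the kernel, the bandwidth, and the function names are illustrative assumptions and may differ from the choices behind this figure.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) between rows of a and b."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd2(x, y, gamma=0.1):
    """Biased estimate of squared MMD between samples x and y (rows are cells)."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )

# Synthetic data standing in for expression profiles (not real measurements).
rng = np.random.default_rng(0)
control = rng.normal(size=(200, 5))
control_resampled = rng.normal(size=(200, 5))
perturbed = rng.normal(size=(200, 5)) + 2.0

mmd_same = mmd2(control, control_resampled)   # two samples, same distribution
mmd_shifted = mmd2(control, perturbed)        # samples from shifted distribution
```

A lower value indicates that the two samples are harder to tell apart under the chosen kernel, which is why a lower MMD against ground truth is read as a better distributional match.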

Fig 4.

Distance from the control distribution in the observational space vs the latent space (U^p), for all perturbations in each OOD split.

The Euclidean distance in the latent space is strongly correlated with the MMD in the expression space.

Fig 5.

(A) For OOD test perturbations, the shift is selected to minimize MSE of pseudo-bulk predictions.

(B) Shift values closer to zero result in output predictions closer to control, and larger magnitude shift values result in more distinct perturbational distributions. The shift value that most closely matches the ground truth is .
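
The selection step in panel (A) can be sketched as a simple grid search over candidate shift values. The example below assumes a scalar shift, a callable `predict` that returns predicted cells for a given shift, and a target pseudo-bulk profile computed as the mean over cells; all of these names, the candidate grid, and the toy decoder are hypothetical stand-ins, not the procedure used by the authors.

```python
import numpy as np

def select_shift(predict, candidates, target_cells):
    """Return the candidate shift whose pseudo-bulk prediction has the lowest MSE."""
    target_bulk = target_cells.mean(axis=0)  # pseudo-bulk: mean over cells
    errors = [
        np.mean((predict(c).mean(axis=0) - target_bulk) ** 2) for c in candidates
    ]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Toy setup: a hypothetical decoder that moves control cells along one direction.
rng = np.random.default_rng(0)
control = rng.normal(size=(100, 4))
direction = np.array([1.0, -1.0, 0.5, 0.0])

def toy_predict(shift):
    # A shift of zero reproduces control; larger magnitudes move further away.
    return control + shift * direction

truth = control + 2.0 * direction     # synthetic "ground truth" cells
grid = np.linspace(-5.0, 5.0, 21)     # candidate shifts in steps of 0.5
best_shift, best_err = select_shift(toy_predict, grid, truth)
```

In this toy setting the search recovers the shift used to generate the synthetic target, and a shift of zero returns the control cells unchanged, mirroring the behavior described in panel (B).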

Fig 6.

(A) UMAP visualizations of latent causal variables with various functional perturbation modules.

Genes belonging to the same perturbation module are close together within the latent space. (B) When visualizing just the average U^p of each perturbation, the perturbation modules form distinct clusters in the latent space.
