SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data

doi:10.1371/journal.pcbi.1010163

Fig 1.

Schematic and graphical representations of SCRaPL.

Here, we assume the observed data consist of RNA expression and DNA methylation. (1A) Schematic representation of the SCRaPL model. (1B) SCRaPL’s graphical model, depicting the statistical dependencies between observed genomic data (Y_ij1 is RNA expression; Y_ij2 is DNA methylation), their associated latent variables (X_ij1, X_ij2) and feature-specific model parameters (μ_j, Σ_j). The additional parameter π_j is specific to the noise model assigned to RNA expression data and captures zero inflation. Full details are given in the model description section in Methods.


Table 1.

Summary of synthetic data experiments.

In all cases, latent means and standard deviations were set as μ_j1 = 4, μ_j2 = 1, σ_j1 = 3 and σ_j2 = 2. Unless otherwise stated, our simulations were based on: I = 60 cells, J = 300 features, 20% ZI rate on average for the expression data (π_j = 0.20) and an average methylation coverage (n_ij) equal to 275 (sampled from a Uniform distribution with range [50, 500]) across cells and genes. When varying the number of cells, we use I ∈ {5, 10, 25, 50, 100, 200, 400, 800, 1600}. When varying expression ZI, we use π_j ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8}. When varying methylation coverage, we sample n_ij from Uniform distributions with ranges given by [5, 10], [10, 20], [20, 50], [50, 250] and [500, 1000]. Full details are provided in S3 Text.
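The default setting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation code used for the experiments: the caption does not state the noise models or how the true correlations were chosen, so the zero-inflated Poisson with a log link for expression, the binomial with a logistic link for methylation, the uniform draw of ρ_j on [-0.9, 0.9], and the function name `simulate_scrapl_like` are all assumptions made for illustration.

```python
import numpy as np


def simulate_scrapl_like(n_cells=60, n_features=300, zi_rate=0.20,
                         cov_range=(50, 500), seed=0):
    """Draw one synthetic dataset loosely following the Table 1 defaults."""
    rng = np.random.default_rng(seed)

    # Latent means and standard deviations quoted in the caption.
    mu1, mu2 = 4.0, 1.0
    sd1, sd2 = 3.0, 2.0

    # ASSUMPTION: true per-feature correlations drawn uniformly on [-0.9, 0.9].
    rho = rng.uniform(-0.9, 0.9, size=n_features)

    # Correlated bivariate normal latents, one pair per (cell, feature).
    z1 = rng.standard_normal((n_cells, n_features))
    z2 = rng.standard_normal((n_cells, n_features))
    x1 = mu1 + sd1 * z1
    x2 = mu2 + sd2 * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)

    # ASSUMPTION: expression is zero-inflated Poisson with rate exp(x1).
    counts = rng.poisson(np.exp(x1))
    dropout = rng.random((n_cells, n_features)) < zi_rate
    expression = np.where(dropout, 0, counts)

    # Coverage n_ij sampled uniformly over the integer range (inclusive).
    low, high = cov_range
    coverage = rng.integers(low, high + 1, size=(n_cells, n_features))

    # ASSUMPTION: methylated reads are binomial with a logistic link on x2.
    p_meth = 1.0 / (1.0 + np.exp(-x2))
    methylated = rng.binomial(coverage, p_meth)

    return rho, expression, coverage, methylated


rho, expression, coverage, methylated = simulate_scrapl_like()
print(expression.shape, coverage.mean(), (expression == 0).mean())
```

Varying `n_cells`, `zi_rate` or `cov_range` mirrors the three experimental axes listed in the caption.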


Fig 2.

Plots summarizing differences in correlation estimation between SCRaPL, Spearman and Pearson in Experiment 1 with synthetic data.

(2A) Difference between estimated and true correlation as a function of the number of cells for SCRaPL, Spearman and Pearson. (2B) Estimated correlation as a function of true correlation for SCRaPL, Spearman and Pearson in synthetic datasets with 300 genes and 1600 cells. Each dot represents a gene and is color-coded by inference approach.
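To make the comparison concrete, the following minimal sketch shows how Pearson and Spearman estimates computed directly on noisy observations can drift away from the true latent correlation. It is an illustration only: the additive Gaussian noise, the true correlation of 0.8, the sample size, and the helper names `pearson` and `spearman` are assumptions, and the noise is not the count-based noise model used in the paper.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation as Pearson on ranks (ties are not averaged,
    which is acceptable here because the data are continuous)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


rng = np.random.default_rng(0)
n_cells, true_rho = 2000, 0.8  # hypothetical values for illustration

# Latent pair with unit variances and correlation true_rho.
z1 = rng.standard_normal(n_cells)
z2 = rng.standard_normal(n_cells)
latent_a = z1
latent_b = true_rho * z1 + np.sqrt(1.0 - true_rho**2) * z2

# ASSUMPTION: observations are the latents plus independent Gaussian noise.
noisy_a = latent_a + rng.standard_normal(n_cells)
noisy_b = latent_b + rng.standard_normal(n_cells)

print("latent  pearson/spearman:", pearson(latent_a, latent_b),
      spearman(latent_a, latent_b))
print("noisy   pearson/spearman:", pearson(noisy_a, noisy_b),
      spearman(noisy_a, noisy_b))
```

Under these assumptions the estimates on the noisy data shrink toward zero, which is the kind of attenuation a latent-variable model is intended to correct.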


Fig 3.

Summary of experiments on real data.

Figures summarizing the most important points from synthetic and real data experiments. (3A, 3B) Bayesian volcano plots for mESC and mEBC data respectively: scatter plots of posterior probability under the null hypothesis (in log scale) as a function of posterior median correlation. Each dot represents a feature and is colored according to the method that labels it as a significant association. (3C, 3D) Venn diagrams summarizing detection rates for SCRaPL, Pearson and Spearman in mESC and mEBC data. By accounting for different sources of noise, SCRaPL detects a large set of the features identified by frequentist alternatives. SCRaPL also uncovers an additional large set that would be impossible for frequentist methods to identify in a robust way.


Fig 4.

SCRaPL’s behavior compared to Pearson/Spearman correlation at the micro and macro scale.

In all panels apart from 4D, the scatter plot depicts raw data for chosen features, color-coded by CpG coverage, with normalized expression plotted in the log(1 + x) scale. The violin plots depict the posterior correlation densities estimated by SCRaPL for the raw data shown on their left-hand side. (4A) Example where both SCRaPL and Pearson/Spearman identify the feature’s association as significant. (4B) Example where only Pearson/Spearman identifies the feature’s association as significant. (4C) Example where only SCRaPL identifies the feature’s association as significant. (4D) Scatter plots demonstrating the negative/positive relationship between alternative correlation estimates and CpG coverage/% zeros in expression respectively. (In Fig 4D, the posterior mean correlation and the Pearson correlation ρ_prs are shown for feature j.)


Fig 5.

Cell label transfer from expression to accessibility data for raw (5A) and SCRaPL-preprocessed (5B) data.

Visualization of scRNA and scATAC data on the same plot for raw (5C) and SCRaPL-preprocessed (5D) data.


Fig 6.

DIC difference between models with and without zero inflation for mESC and mEBC data.

The more negative the difference, the stronger the evidence in favor of the model with zero inflation on the gene expression component, and vice versa. As a visual reference, zero is marked with a dashed red line.
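For reference, the sign convention can be written out using the standard definition of the deviance information criterion. The caption does not say which DIC variant was computed, so the form below, and the symbols \theta, D, p_D and \Delta\mathrm{DIC}_j, are assumptions introduced here for illustration.

```latex
% Deviance and its posterior summaries (standard definition, assumed here)
D(\theta) = -2 \log p(y \mid \theta), \qquad
\bar{D} = \mathbb{E}_{\theta \mid y}\!\left[ D(\theta) \right], \qquad
p_D = \bar{D} - D(\bar{\theta})

% DIC and the per-feature difference consistent with the caption's sign convention
\mathrm{DIC} = \bar{D} + p_D, \qquad
\Delta\mathrm{DIC}_j = \mathrm{DIC}_j^{\text{ZI}} - \mathrm{DIC}_j^{\text{no ZI}}
```

With this convention, \Delta\mathrm{DIC}_j < 0 favors the zero-inflated model for feature j, matching the reading given in the caption.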
