A novel batch-effect correction method for scRNA-seq data based on Adversarial Information Factorization

doi:10.1371/journal.pcbi.1011880

Fig 1.

t-SNE visualization of a scRNA-seq dataset of human dendritic cells.

This dataset [10] comprises 4 blood cell types and 2 batches. The cells are colored by either batch label (left) or cell type label (right).

Fig 2.

Architecture of Adversarial Information Factorization (AIF).

The model comprises three blocks: the CVAE in blue, the GAN network in pink, and the auxiliary network in green. x is the original cell’s gene expression and y is the true batch label. The model also produces a latent vector, a reconstructed gene expression for the cell, and three predicted batch labels, one based on x and the other two based on the model's intermediate outputs.

Fig 3.

t-SNE visualizations of the original and batch-effect corrected data.

The t-SNE is computed for the original data (1st column) and each method’s corrected data (columns 2 to 9) on the three datasets’ log-normalized counts (rows 1 to 6) and raw counts (rows 7 to 10). The cells are colored by batch label (odd rows) and cell type label (even rows).

Table 1.

Comparison of the methods based on the clustering metrics computed on the full datasets.

Fig 4.

F1 score of the up- and down-regulated DEGs as a function of the log-fold-change threshold.

The results are shown for the highly variable genes ('HVG') and all genes ('All'). The reported log-fold-change values are in base 2.
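To make the quantity plotted here concrete, the following is a minimal sketch of how an F1 score per log2-fold-change threshold could be computed for up- and down-regulated genes. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy gene names and values, and the rule of calling a gene differentially expressed purely from its log2 fold change (with the ground-truth sets held fixed) are all hypothetical.

```python
def f1_score(predicted, truth):
    """F1 between two sets of gene names; 0.0 when they share no gene."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def f1_by_threshold(log2fc, true_up, true_down, thresholds):
    """Return (threshold, F1 up, F1 down) for each log2 fold-change threshold.

    Assumption: a gene is called up-regulated when log2FC >= t and
    down-regulated when log2FC <= -t; no p-value filter is applied.
    """
    rows = []
    for t in thresholds:
        called_up = {gene for gene, value in log2fc.items() if value >= t}
        called_down = {gene for gene, value in log2fc.items() if value <= -t}
        rows.append((t, f1_score(called_up, true_up), f1_score(called_down, true_down)))
    return rows


# Toy log2 fold changes estimated from corrected data (hypothetical values).
log2fc = {"g1": 2.0, "g2": 1.2, "g3": 0.4, "g4": -0.3, "g5": -1.5, "g6": -2.5}
true_up = {"g1", "g2"}      # hypothetical ground-truth up-regulated genes
true_down = {"g5", "g6"}    # hypothetical ground-truth down-regulated genes

curve = f1_by_threshold(log2fc, true_up, true_down, [0.5, 1.0, 2.0])
for threshold, f1_up, f1_down in curve:
    print(f"log2FC >= {threshold}: F1 up = {f1_up:.2f}, F1 down = {f1_down:.2f}")
```

In this toy case, raising the threshold past the weaker true DEGs lowers recall and therefore the F1 score, which is the kind of trend such a curve is meant to expose.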

Table 2.

Comparison of the methods’ AUC scores for differentially expressed genes on the simulated datasets (all versions of Dataset 3 and Dataset 4).

Fig 5.

t-SNE visualizations of the original and batch-effect corrected data for the AML dataset.

The t-SNE is computed for the original data (top row) or AIF’s batch-effect-corrected data (bottom row), after a preliminary PCA. The cells are colored by batch label (left), cell type label (middle), or major cell type label (right).

Table 3.

Comparison of the clustering results on the AML dataset.

Fig 6.

Relative clustering performance per patient.

The relative clustering performance is the ratio of the corrected data’s performance to the original data’s performance for each clustering metric, with the metrics scaled beforehand. The metrics are based on Louvain clustering of each patient’s t-SNE embeddings, computed after a preliminary PCA, using either the cell types (blue) or the major cell types (yellow). Each metric is computed for cell type purity (CT), batch mixing (B), and the combination of both (F1).
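The ratio described in this caption can be sketched in a few lines. The example below is a hedged illustration only: it assumes the metrics are already scaled to [0, 1] with higher meaning better, and it assumes the combined F1 is the harmonic mean of the cell type purity and batch mixing scores. The caption states neither detail, and the function names and numeric values are hypothetical.

```python
def harmonic_f1(cell_type_score, batch_score):
    """Combine two scaled scores with a harmonic mean (assumed definition of F1)."""
    total = cell_type_score + batch_score
    if total == 0:
        return 0.0
    return 2 * cell_type_score * batch_score / total


def relative_performance(original, corrected):
    """Ratio corrected / original for each metric; values above 1 mean improvement.

    Assumption: every original score is strictly positive.
    """
    return {name: corrected[name] / original[name] for name in original}


# Hypothetical scaled scores for one patient, before and after correction.
original = {"CT": 0.8, "B": 0.4}
corrected = {"CT": 0.6, "B": 0.8}
original["F1"] = harmonic_f1(original["CT"], original["B"])
corrected["F1"] = harmonic_f1(corrected["CT"], corrected["B"])

relative = relative_performance(original, corrected)
for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}")
```

With these toy numbers the correction trades some cell type purity for better batch mixing, and the combined ratio summarizes whether that trade was favorable overall.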

Table 4.

Average per-patient relative clustering performance.
