Fig 1.
The input is the gene expression vector of a bulk sample. BLUE contains two branches. The top branch in the figure is a U-Net that outputs cell-type-specific gene expression profiles. The bottom branch is a 5-layer fully connected MLP that outputs cell-type proportions. The two branches are trained jointly, end-to-end, on pseudobulk RNA-seq samples simulated from scRNA-seq reference datasets.
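The pseudobulk simulation mentioned in this caption can be illustrated with a minimal sketch. It is not the authors' simulation code: the function name `simulate_pseudobulk`, the way proportions are drawn (normalized uniform weights), and the default of 500 cells per sample are all assumptions made for illustration. It only shows the general idea of summing sampled single cells so that the bulk vector, the cell-type proportions, and the cell-type-specific expression are known by construction.

```python
import random


def simulate_pseudobulk(cells_by_type, n_cells=500, seed=0):
    """Build one pseudobulk sample from single-cell expression vectors.

    cells_by_type maps a cell-type name to a list of per-cell expression
    vectors (lists of floats, all of the same length).
    Returns (bulk, proportions, per_type): the summed bulk vector, the
    realized cell-type proportions, and the summed expression per cell type.
    """
    rng = random.Random(seed)
    types = sorted(cells_by_type)
    # Assumption: sampling weights are uniform random draws; the source
    # does not state how proportions are drawn.
    weights = [rng.random() for _ in types]
    assigned = rng.choices(types, weights=weights, k=n_cells)

    n_genes = len(cells_by_type[types[0]][0])
    bulk = [0.0] * n_genes
    per_type = {t: [0.0] * n_genes for t in types}
    for t in assigned:
        cell = rng.choice(cells_by_type[t])  # sample cells with replacement
        for g, value in enumerate(cell):
            per_type[t][g] += value
            bulk[g] += value

    # Ground-truth proportions are the realized fractions of sampled cells.
    proportions = {t: assigned.count(t) / n_cells for t in types}
    return bulk, proportions, per_type
```

In such a scheme, `bulk` would play the role of the network input, while `proportions` and `per_type` would serve as training targets for the two branches.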
Fig 2.
BLUE achieves high correlations with ground truth on simulated and real-world PBMC datasets.
a, The workflow for simulating bulk samples from scRNA-seq reference datasets. Training and validation pseudobulk samples were simulated from three publicly available PBMC scRNA-seq datasets. b, Scatter plot of predicted cell-type proportions vs. ground truth proportions for 100 randomly generated PBMC ctrl pseudobulk samples (blue dots) and 100 PBMC stim pseudobulk samples (orange dots). L1 error and CCC score are shown in each subfigure. c, Scatter plot of predicted cell-type proportions vs. ground truth proportions for real PBMC bulk samples, with experimentally measured cell-type proportions serving as the ground truth. d, Heatmap (in scale) of predicted cell-type-specific gene expression profiles (GEPs) for three real bulk samples: 925L, 9JD4, G4YW. The left heatmap shows the cell-type-specific GEPs measured by bulk RNA-seq of FACS-sorted cell types. The right heatmap shows the cell-type-specific GEPs predicted by BLUE. Each column is one GEP vector for one cell type in one sample. Rows are cell-type-specific DEGs ordered by their expression level within the cell type. The top 300 DEGs for each cell type are visualized. e, Heatmap (in scale) of predicted cell-type-specific GEPs for the average of simulated bulk samples. Each column is the average gene expression of the same cell type across all 100 simulated samples under the same condition. Rows are genes differentially expressed among cell types or between the two conditions (ctrl vs. stim).
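The two scores reported in panels b and c can be computed as in the sketch below. It assumes that "CCC" denotes Lin's concordance correlation coefficient and that "L1 error" is the mean absolute difference between predicted and true proportions; the caption does not define either, so both definitions and the function names are assumptions.

```python
from statistics import fmean


def l1_error(pred, truth):
    """Mean absolute difference (assumed definition of the L1 error)."""
    return fmean([abs(p - t) for p, t in zip(pred, truth)])


def ccc(pred, truth):
    """Lin's concordance correlation coefficient (population moments)."""
    mean_p, mean_t = fmean(pred), fmean(truth)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in truth])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, truth)])
    # Unlike Pearson's r, the denominator penalizes shifts in mean and scale.
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

Under this definition a prediction that is perfectly correlated with the ground truth but offset by a constant scores below 1, which is why CCC is a stricter agreement measure than Pearson correlation for proportion estimates.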
Fig 3.
BLUE captures distinct gene expression patterns in real bulk RNA-seq datasets and outperforms existing deconvolution methods.
a, Gene expression values under healthy and T2D conditions (in space). Each column is a cell-type-specific GEP averaged over cells of the same cell type in all samples under one condition (healthy or T2D). Each row is a gene that Segerstolpe et al. identified as differentially expressed between the two conditions. b, Scatter plot of the gene expression difference. The x-axis shows the ground truth differences, calculated by subtracting the average expression profile under the T2D condition from the average expression profile under the healthy condition. The y-axis shows the predicted differences, calculated in the same way. The dashed line is the linear fit of y on x, with R² = 0.52 and Pearson correlation r = 0.72. c, Scatter plot of predicted beta-cell proportions vs. hemoglobin A1c (HbA1c) level. From left to right, the subfigures show predictions from CIBERSORTx, MuSiC, and BLUE. Red lines are linear regression fits. R² and Pearson correlation r are shown in each plot. Blue dots mark samples with a normal HbA1c level (under 6.5); yellow triangles mark samples with an HbA1c level that corresponds to the T2D condition.
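The quantities in panel b can be reproduced with a short sketch: the per-gene difference (healthy minus T2D), a least-squares line, its R², and the Pearson correlation. The function names are hypothetical and the code is an illustration, not the authors' analysis script. For a least-squares line with an intercept, R² equals r², which matches the reported values (0.72² ≈ 0.52).

```python
from math import sqrt
from statistics import fmean


def condition_difference(healthy, t2d):
    """Per-gene difference: healthy average minus T2D average."""
    return [h - d for h, d in zip(healthy, t2d)]


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line."""
    my = fmean(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

Here `x` would hold the ground truth differences and `y` the predicted differences, each produced by `condition_difference` from the corresponding averaged profiles.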
Fig 4.
An integrative framework for patient subtyping based on the output from BLUE.
BLUE was trained on one independent AML scRNA-seq reference dataset from [19] and applied to deconvolve the TCGA-AML dataset into cell-type proportions and sample-specific, cell-type-specific GEPs. The top branch defines patient groups based on predicted cell-type proportions, similar to the analysis workflow in one previous study [22]. The GMP patient group exhibits the most favorable prognosis compared to the other three patient groups in the TCGA-AML cohort. This survival pattern could not be validated in the TARGET-AML cohort. The bottom branch defines TCGA patient groups based on the predicted cell-type-specific GEPs, and the resulting three groups exhibit distinct survival differences. The same survival pattern was also observed in the TARGET-AML patient cohort, after a simple classification pipeline based on the expression of selected genes was applied to classify TARGET-AML patients.