The authors have declared that no competing interests exist.

Quantifying cell-type proportions and their corresponding gene expression profiles in tissue samples would enhance understanding of the contributions of individual cell types to the physiological states of the tissue. Current approaches that address tissue heterogeneity have drawbacks. Experimental techniques, such as fluorescence-activated cell sorting and single-cell RNA sequencing, are expensive. Computational approaches that use expression data from heterogeneous samples are promising, but most current methods estimate either cell-type proportions or cell-type-specific expression profiles and require the other as input. Although such partial deconvolution methods have been successfully applied to tumor samples, the additional input required may be unavailable. We introduce a novel complete deconvolution method, CDSeq, that uses only RNA-Seq data from bulk tissue samples to simultaneously estimate both cell-type proportions and cell-type-specific expression profiles. Using several synthetic and real experimental datasets with known cell-type composition and cell-type-specific expression profiles, we compared CDSeq’s complete deconvolution performance with seven other established deconvolution methods. Complete deconvolution using CDSeq represents a substantial technical advance over partial deconvolution approaches and will be useful for studying cell mixtures in tissue samples. CDSeq is available in a GitHub repository (MATLAB and Octave code):

Understanding the cellular composition of bulk tissues is critical for investigating the underlying mechanisms of many biological processes. Single-cell sequencing is a promising technique; however, it is expensive and the analysis of single-cell data is non-trivial. Therefore, tissue samples are still routinely processed in bulk. To estimate cell-type composition using bulk gene expression data, computational deconvolution methods are needed. Many deconvolution methods have been proposed; however, they often estimate only cell-type proportions using reference cell-type gene expression profiles, which in many cases may not be available. We present a novel complete deconvolution method that uses only bulk gene expression data to simultaneously estimate cell-type-specific gene expression profiles and sample-specific cell-type proportions. Using multiple RNA-Seq and microarray datasets with known cell-type composition, we showed that our method accurately determined the cell-type composition. Because the method requires a single input to determine both cell-type proportions and cell-type-specific expression profiles, we expect it to be beneficial to biologists and to facilitate the identification of mechanisms underlying many biological processes.

The measured expression of a gene in a bulk sample reflects the expression of that gene in every cell in the sample. Consequently, the measured gene expression profile (GEP) of a tissue sample is commonly regarded as a weighted average of the GEPs of the different component cell types [

The heterogeneous nature of bulk tissue samples complicates the interpretation of bulk measurements such as RNA-Seq. Often researchers are interested in understanding whether an experimental treatment targets one particular cell type in a heterogeneous tissue or in investigating possible sources of variation among samples [

Deconvolution can also be approached computationally using GEP profiles from collections of bulk tissue samples [

Our goal was to develop a complete deconvolution method using only bulk RNA-Seq data by estimating cell-type proportions and cell-type-specific GEPs simultaneously. The underlying model was based on latent Dirichlet allocation (LDA) [

Using only bulk RNA-Seq expression data for multiple samples as input, CDSeq provides estimates of both cell-type-specific GEPs and sample-specific cell-type proportions simultaneously (

Heterogeneous samples consist of different cell types. The bulk RNA-Seq profile represents a weighted average of the expression profiles of the constituent cell types. CDSeq takes as input the bulk RNA-Seq data for a collection of samples and performs complete deconvolution, which outputs estimates of both the cell-type-specific expression profiles and the cell-type proportions for each sample. This figure depicts a simple scenario of six biological samples comprising four cell types, each with gene expression measurements on eight genes.

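To describe our model and the statistical inference scheme, we first introduce the notation. Let M denote the number of samples, T the number of cell types, and G the number of genes. Let θ_{i} = (θ_{i,1}, ⋯, θ_{i,T}) ∈ S^{T}, where S^{T} denotes a (T−1)-simplex, be the cell-type proportions of sample i, and let φ_{t} = (φ_{t,1}, ⋯, φ_{t,G}) ∈ S^{G}, where S^{G} denotes a (G−1)-simplex, be the GEP of cell type t. The matrices θ = [θ_{1}, ⋯, θ_{M}] and φ = [φ_{1}, ⋯, φ_{T}] encapsulate all the features that we seek to estimate from the data based on our model.

We denote the true GEP of heterogeneous sample i by Φ_{i} = (Φ_{i,1}, Φ_{i,2}, ⋯, Φ_{i,G}) ∈ S^{G}. Φ_{i} is a weighted average of the pure cell-type GEPs with weights given by the sample-specific cell-type proportions, namely, Φ_{i} = ∑_{t=1}^{T} θ_{i,t} φ_{t}.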
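The weighted-average relation can be illustrated numerically. The following is a minimal sketch, not the CDSeq implementation: it assumes pure GEPs stored as columns of a gene-by-cell-type matrix and proportions as columns of a cell-type-by-sample matrix, each column summing to one. The array names `phi`, `theta`, and `Phi` and the randomly drawn values are illustrative assumptions; the sizes follow the toy scenario of eight genes, four cell types, and six samples.

```python
import numpy as np

rng = np.random.default_rng(0)
G, T, M = 8, 4, 6  # genes, cell types, samples (toy sizes)

# phi[:, t] is the GEP of cell type t; each column sums to 1 (illustrative values).
phi = rng.dirichlet(np.ones(G), size=T).T      # shape (G, T)

# theta[:, i] holds the cell-type proportions of sample i; each column sums to 1.
theta = rng.dirichlet(np.ones(T), size=M).T    # shape (T, M)

# Bulk GEP of every sample: proportion-weighted average of the pure GEPs.
Phi = phi @ theta                              # shape (G, M)

print(Phi.shape)
print(Phi.sum(axis=0))  # each bulk profile is again a point on the simplex
```

Because both factors are column-normalized, every column of `Phi` also sums to one, which is what makes the bulk profile itself a valid GEP.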

We do not observe the true Φ_{i} directly but instead observe reads from each sample, and we can obtain the read assignments to genes. Assume that the length of every sequenced read, denoted L, is constant. Let x_{i,j} denote read j from sample i (the number of possible values of x_{i,j} depends on the gene and its length), and let categorical random variable g_{i,j} ∈ {1, ⋯, G} denote the gene to which read x_{i,j} is assigned. Both x_{i,j} and g_{i,j} are observed for j = 1, ⋯, N_{i}, where N_{i} denotes the number of reads from sample i. With ℓ_{k} denoting the length of transcript k, the gene that generated read x_{i,j} has length ℓ_{g}, which is the total length of all the transcripts comprising the gene after projection into genomic coordinates. All the analyses reported here were done at the gene level.

Different cell types may generate different amounts of RNA owing to their varying sizes; therefore, we employ a Poisson random variable with parameter η_{t} to model the number of reads generated from cell type t, and we write η = (η_{1}, ⋯, η_{T}). Parameter

Finally, to complete the specification of our model, we need to be able to assign reads in the heterogeneous sample to individual cell types; thus, we introduce a latent categorical random variable z_{i,j} ∈ {1, ⋯, T} denoting the cell type that generated read x_{i,j}. Our model specifies that RNA-Seq reads from bulk tissues are generated as follows:

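Generate gene expression profiles for the different cell types, i.e., draw φ_{t} from a Dirichlet prior for each cell type t.

Choose θ_{i} from a Dirichlet prior for each sample i.

For each of the N_{i} RNA-Seq reads in sample i, where N_{i} denotes the total reads of sample i:

Choose a cell type z_{i,j} according to the cell-type composition of sample i,

Choose a gene g_{i,j} according to the GEP of the chosen cell type,

Generate a read sequence x_{i,j} by uniformly choosing one of the possible read positions in gene g_{i,j}.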
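The generative steps above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the CDSeq code: it follows the plain LDA structure (Dirichlet priors, a cell type per read, then a gene per read) and deliberately ignores gene length, per-cell-type RNA amounts, and the read sequence itself. The function name `simulate_read_counts` and the hyperparameter arguments `alpha` and `beta` are hypothetical.

```python
import numpy as np

def simulate_read_counts(n_reads, alpha, beta, seed=0):
    """Simulate a gene-by-sample read-count matrix from a simplified LDA-style model.

    n_reads: number of reads per sample (length M)
    alpha:   Dirichlet parameter for cell-type proportions (length T), assumed
    beta:    Dirichlet parameter for gene expression profiles (length G), assumed
    """
    rng = np.random.default_rng(seed)
    T, G, M = len(alpha), len(beta), len(n_reads)

    phi = rng.dirichlet(beta, size=T)      # (T, G): one GEP per cell type
    theta = rng.dirichlet(alpha, size=M)   # (M, T): proportions per sample

    counts = np.zeros((G, M), dtype=int)
    for i, n in enumerate(n_reads):
        # Step 1: choose a cell type for every read in sample i.
        z = rng.choice(T, size=n, p=theta[i])
        # Step 2: choose a gene for every read given its cell type.
        # Reads sharing a cell type are drawn together as one multinomial.
        for t in range(T):
            n_t = int(np.count_nonzero(z == t))
            counts[:, i] += rng.multinomial(n_t, phi[t])
    return counts, theta, phi

counts, theta, phi = simulate_read_counts(
    n_reads=[1000, 2000, 1500], alpha=np.ones(3), beta=np.ones(5)
)
print(counts)
```

Grouping reads by cell type before drawing genes is only a convenience; it yields the same distribution as drawing one gene per read.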

To this end, a graphical model of CDSeq is presented in

The light blue nodes, including x_{ij} (reads) and g_{ij} (gene assignments), denote the values of observable random variables (either measured in the study or established in previous studies), whereas the white nodes, including z_{ij} (cell-type assignments), are unobservable random variables that need to be inferred from data. The outer box represents samples where

The cell types delineated by CDSeq are mathematical entities that must be matched to corresponding biological cell types. Matching the CDSeq cell types to actual cell types requires a list of reference cell-type-specific GEPs and a metric of similarity (for example, Pearson’s correlation coefficient or Kullback-Leibler divergence) (
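One simple way to perform such a matching with Pearson's correlation is sketched below. This is an illustrative assumption rather than the procedure used by CDSeq: each estimated profile is assigned to its most correlated reference profile, and assignments below a cutoff are left unmatched. The function name `match_cell_types` is hypothetical, and the default cutoff of 0.6 simply mirrors the correlation threshold quoted for the PBMC analysis.

```python
import numpy as np

def match_cell_types(estimated, reference, threshold=0.6):
    """Match estimated GEPs to reference GEPs by Pearson correlation.

    estimated: (G, T) array, one column per estimated cell type
    reference: (G, K) array, one column per reference cell type (same gene order)
    Returns a list with the best reference index (or None) for each estimated
    cell type, plus the (T, K) correlation matrix.
    """
    T = estimated.shape[1]
    # np.corrcoef treats rows as variables, so pass the transposed matrices
    # and keep only the estimated-versus-reference block.
    corr = np.corrcoef(estimated.T, reference.T)[:T, T:]
    best = corr.argmax(axis=1)
    matches = [int(k) if corr[t, k] >= threshold else None
               for t, k in enumerate(best)]
    return matches, corr

rng = np.random.default_rng(1)
reference = rng.random((200, 3))
estimated = reference[:, [2, 0]] + 0.01 * rng.standard_normal((200, 2))
matches, corr = match_cell_types(estimated, reference)
print(matches)
```

This greedy rule allows two estimated cell types to map to the same reference type; a one-to-one assignment would need an additional step.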

In CDSeq, the Gibbs sampler iteratively assigns a cell type to each read using a binary search with a time complexity of
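The binary-search step can be illustrated in isolation. The sketch below draws an index from an unnormalized discrete distribution by searching its cumulative sums; it is a generic illustration of the technique, not the CDSeq sampler, and the function name `sample_index` is hypothetical. In a Gibbs sampler the weights would be the conditional probabilities of each cell type for the current read.

```python
import bisect
import itertools
import random

def sample_index(weights, rng=random):
    """Draw an index with probability proportional to non-negative weights."""
    # Running totals: cumulative[k] = weights[0] + ... + weights[k].
    cumulative = list(itertools.accumulate(weights))
    # Uniform draw on [0, total).
    u = rng.random() * cumulative[-1]
    # Binary search for the first running total strictly greater than u.
    idx = bisect.bisect_right(cumulative, u)
    # Guard against a floating-point draw that rounds up to the total.
    return min(idx, len(weights) - 1)

rng = random.Random(0)
draws = [sample_index([1.0, 3.0], rng) for _ in range(4000)]
print(draws.count(1) / len(draws))  # should be close to 0.75
```

The search itself is logarithmic in the number of weights; building the cumulative sums is linear.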

CDSeq is an unsupervised learning method that aims to discover latent patterns in data without any labeling or prior knowledge. The GEPs of the cell types identified by CDSeq may not closely match any available pure cell line GEPs. This issue may arise because highly correlated GEPs of multiple cell types or subtypes complicate the deconvolution problem and render CDSeq less able to definitively separate cell types. The issue is exacerbated in deep deconvolution, which refers to the problem of using a whole blood or peripheral blood mononuclear cell (PBMC) sample to estimate the proportions and gene expression profiles of a greater number of cell subtypes, going further down into the hematopoietic tree [

To apply the quasi-unsupervised approach, one could simply append a set of pure cell line GEPs to the GEPs of the bulk samples for the same genes. Each appended pure cell line GEP is treated as a “bulk” sample by CDSeq. For example, let _{G×M} denote bulk RNA-Seq data for
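The augmentation amounts to concatenating columns. A minimal sketch follows, assuming both matrices are gene-by-sample arrays whose rows refer to the same genes in the same order; the function name `append_reference_profiles` is hypothetical and is not part of CDSeq.

```python
import numpy as np

def append_reference_profiles(bulk, reference):
    """Append pure cell line GEPs to bulk data as extra "samples".

    bulk:      (G, M) array of bulk expression, genes in rows
    reference: (G, K) array of pure cell line GEPs for the same genes
    Returns a (G, M + K) array to be used as input for deconvolution.
    """
    bulk = np.asarray(bulk)
    reference = np.asarray(reference)
    if bulk.shape[0] != reference.shape[0]:
        raise ValueError("bulk and reference must cover the same genes")
    return np.hstack([bulk, reference])

# Toy example: 20 bulk samples plus 22 reference profiles gives 42 columns.
augmented = append_reference_profiles(np.ones((5, 20)), np.zeros((5, 22)))
print(augmented.shape)
```

After deconvolution, the columns corresponding to the appended profiles can be dropped from the estimated proportions, since only the original bulk samples are of interest.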

We compared CDSeq to seven competing deconvolution methods using their default settings when applicable (

Deconvolution methods | Estimate proportions | Estimate GEPs | Reference | Dataset |
---|---|---|---|---|
CDSeq | ✓ | ✓ | | ①-⑥ |
CIBERSORT | ✓ | | [ | ①-⑥ |
DeconRNAseq | ✓ | | [ | ①-⑥ |
UNDO | ✓ | | [ | ①-② |
csSAM | | ✓ | [ | ①-③ |
DSA | ✓ | | [ | ①-③ |
deconf | ✓ | ✓ | [ | ①-⑥ |
ssKL | ✓ | ✓ | [ | ①-⑥ |

*

We first benchmarked CDSeq on synthetic mixtures with known compositions that we created numerically from publicly available GEPs from Cold Spring Harbor Laboratory. In this synthetic numerical experiment, we amplified the potential bias between RNA proportions and cell-type proportions by artificially increasing the RNA amount of certain cell types before mixing them together to generate the synthetic samples. We generated 40 synthetic samples where each sample was a combination of six different cell types in different proportions (
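The idea of amplifying the gap between RNA proportions and cell-type proportions can be illustrated with a small numerical sketch. This is not the authors' simulation procedure: the function name `make_synthetic_mixtures`, the Dirichlet draw for the cell proportions, and the `rna_per_cell` scaling factors are all assumptions made for illustration. Each cell type's share of the RNA is taken to be its share of the cells multiplied by its relative RNA amount, then renormalized.

```python
import numpy as np

def make_synthetic_mixtures(pure_gep, rna_per_cell, n_samples=40, seed=0):
    """Create synthetic bulk profiles from pure GEPs.

    pure_gep:     (G, T) array, one normalized GEP per column
    rna_per_cell: length-T relative RNA amount per cell for each cell type
    Returns bulk (G, n_samples), cell proportions (T, n_samples),
    and RNA proportions (T, n_samples).
    """
    rng = np.random.default_rng(seed)
    T = pure_gep.shape[1]
    # Random cell-type proportions for each sample (columns sum to 1).
    cell_props = rng.dirichlet(np.ones(T), size=n_samples).T
    # Cell types with more RNA per cell contribute more of the RNA pool.
    rna = cell_props * np.asarray(rna_per_cell, dtype=float)[:, None]
    rna_props = rna / rna.sum(axis=0, keepdims=True)
    # Bulk expression mixes the pure GEPs by RNA proportion, not cell proportion.
    bulk = pure_gep @ rna_props
    return bulk, cell_props, rna_props

rng = np.random.default_rng(2)
pure = rng.dirichlet(np.ones(10), size=6).T   # six toy cell types, ten genes
bulk, cell_props, rna_props = make_synthetic_mixtures(pure, [5, 1, 1, 1, 1, 1])
print(bulk.shape, cell_props[0, :3], rna_props[0, :3])
```

With all entries of `rna_per_cell` equal, RNA and cell proportions coincide; unequal entries produce the bias that a deconvolution method must correct to recover cell proportions.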

In estimating cell-type proportions, CDSeq outperformed CIBERSORT, showing smaller differences between the true and estimated proportions for each cell type and, consequently, smaller root mean square error (RMSE) (
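For reference, a per-cell-type RMSE between true and estimated proportions can be computed as in the following sketch. The function name `rmse_per_cell_type` and the choice to average over samples within each cell type are assumptions for illustration; the exact aggregation used for the reported figures is not specified here.

```python
import numpy as np

def rmse_per_cell_type(true_props, est_props):
    """Root mean square error across samples, computed per cell type.

    true_props, est_props: (T, M) arrays, rows are cell types, columns samples.
    Returns a length-T array of RMSE values.
    """
    diff = np.asarray(est_props, dtype=float) - np.asarray(true_props, dtype=float)
    return np.sqrt(np.mean(diff ** 2, axis=1))

true_props = [[0.2, 0.4], [0.8, 0.6]]
est_props = [[0.3, 0.3], [0.7, 0.7]]
print(rmse_per_cell_type(true_props, est_props))
```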

We ran CDSeq with six cell types,

In estimating GEPs, CDSeq and csSAM performed comparably, although CDSeq achieved 64% lower RMSE values than csSAM (

(A) RMSEs of sample-specific cell-type proportion estimations; (B) RMSEs of cell-type-specific GEPs estimations.

Our second performance evaluation used data from a designed experiment in which 32 mixture samples were created by combining RNA isolated from four pure cell lines in known proportions (

We ran CDSeq with six cell types,

(A) RMSEs of sample-specific cell-type proportion estimations; (B) RMSEs of cell-type-specific GEPs estimations.

We evaluated CDSeq using the experimental data set designed for csSAM [

Comparisons with CIBERSORT and csSAM on mixtures of liver, brain and lung cells. (A) Residual of proportion estimation; (B) Radar plot of RMSE for proportion estimation; (C) Residual of GEPs estimation; (D) Radar plot of RMSE for GEPs estimation.

To test the performance of CDSeq on some extreme cases, we applied CDSeq to a set of GEPs from pure cell lines. We chose LM22 designed by Newman et al. [

We evaluated CDSeq against flow-cytometry measurements of leukocyte content in solid tumors. Data comprised GEPs from 14 bulk follicular lymphoma samples and corresponding flow-cytometry measurements [

We ran CDSeq with 22 cell types,

To assess CDSeq’s performance on deep deconvolution, we used a set of 20 PBMC samples [

To improve estimation, we applied the quasi-unsupervised strategy, appending the 22 GEPs of LM22 to the 20 samples for 42 samples in total. Using the 0.6 correlation threshold to match CDSeq-identified cell types to the corresponding 22 leukocyte subtypes, we found that the quasi-unsupervised strategy improved CDSeq’s performance (

We applied CDSeq using the quasi-unsupervised learning strategy and ran CDSeq with 22 cell types,

We next compared CDSeq-estimated cell-type proportions of these nine cell subtypes to flow-cytometry measurements. However, since CDSeq could not distinguish between naive B cells and memory B cells, we combined these two types into one overall B cell type, resulting in eight total subtypes (

For six of the eight subtypes, the CDSeq-estimated relative proportions were significantly correlated (

In the analyses above, we fixed the number of cell types at its known value. CDSeq can, however, estimate the number of constituent cell types in a collection of samples, if necessary, by maximizing the posterior distribution (

Applying this method to the synthetic data and to the data on mixed RNA described above correctly estimated the number of cell types in each case (
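The selection rule itself is a search over candidate values. The sketch below assumes a user-supplied callable `log_posterior(T)` that fits the model with T cell types and returns the resulting log posterior; both that callable and the function name `select_num_cell_types` are hypothetical and do not correspond to a documented CDSeq interface.

```python
def select_num_cell_types(candidates, log_posterior):
    """Return the candidate number of cell types with the largest log posterior.

    candidates:    iterable of integers to try
    log_posterior: hypothetical callable mapping T to a log posterior value
    """
    scores = {T: log_posterior(T) for T in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for a fitted model: the score peaks at six cell types.
best, scores = select_num_cell_types(range(2, 11), lambda T: -(T - 6) ** 2)
print(best)
```

In practice each evaluation requires a full model fit, so the candidate range should be kept as narrow as prior knowledge allows.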

The maximum of the log posterior provides an estimate of the number of cell types. (A) synthetic data; (B) mixed RNA data. In each data set, the method correctly estimated the number of cell types.

As a complete deconvolution method, CDSeq has many advantages over existing partial deconvolution methods, such as csSAM [

In addition, our probabilistic model is conceptually more advanced than methods using matrix decomposition [

Our proposed model extended the original LDA model in two primary ways that would be unnecessary in the context of natural language processing, but are crucial for RNA-Seq data. First, we built in a dependence of gene expression on gene length. Second, we accommodated possibly different amounts of RNA per cell from cell types whose cells differ in size when estimating the proportion of cells of each type in the sample. In addition, instead of specifying the number of cell types a priori, we provided an algorithm that allows the data to guide selection of the number of cell types. Finally, we proposed a quasi-unsupervised learning strategy that augments the input data (GEPs from mixed samples) with additional GEPs from pure cell lines that are anticipated to be components of the mixture.

We systematically compared the performance of CDSeq with seven competing deconvolution methods: CIBERSORT [

CDSeq, an unsupervised data mining tool, is fully data-driven and allows simultaneous estimation of both cell-type-specific GEPs and sample-specific cell mixing proportions. In some real data analyses where constituent cell types had highly correlated GEPs, the cell types found by CDSeq lacked a one-to-one correspondence with the known component cell lines. Our quasi-unsupervised approach ameliorates this problem. It involves augmenting the available GEPs from heterogeneous samples with GEPs from pure cultures of the cell types anticipated to be constituents. We showed that this quasi-unsupervised approach can improve CDSeq’s performance in the lymphoma and deep deconvolution examples. In practice, whether to apply the quasi-unsupervised approach depends on the goal of the study. If a user is interested in deep deconvolution, where one would like to know the proportions of related cell subtypes (e.g., different T cell subpopulations in samples), then the quasi-unsupervised approach is recommended. In this case, the appended pure cell line GEPs should be those of the T cell subpopulations. Furthermore, inclusion of such cell line GEPs does not preclude identification of cell types other than the appended pure cell lines.

To improve CDSeq’s computational efficiency, we developed a data dilution strategy that can speed up the algorithm while retaining the accuracy of estimation (

A limitation of the current CDSeq model is the impossibility of fine-tuning the hyperparameters to obtain optimal results without ground truth. In practice, we suggest setting

In addition, the RNA-Seq mixtures generated in this work can serve as a valuable benchmarking dataset for other deconvolution methods.

We expect that CDSeq will prove valuable for the analysis of cellular heterogeneity in bulk RNA-Seq data. This computational method provides a practical and promising alternative to methods that require expensive laboratory apparatus and extensive labor to isolate individual cells from heterogeneous samples, which could also entail loss of a systems perspective. Application of CDSeq will aid in deciphering complex genomic data from heterogeneous tissues.

We are grateful to Dr. Jiajia Wang and Dr. Zongli Xu for their comments and suggestions. We thank the Integrative Bioinformatics Group and the Epigenomics Core for assistance with RNA sequencing and data quality control. We thank the Computational Biology Facility for computing time.