Matrix prior for data transfer between single cell data types in latent Dirichlet allocation

doi:10.1371/journal.pcbi.1011049

Fig 1.

Simulation experiments show that the matrix prior improves the concordance between inferred topics and the ground truth compared to the uniform prior.

Results from the true matrix simulation (A, B) and the inferred matrix simulation (C, D) are shown for different numbers of cells in the target dataset (different colors). A, B: MSE between the ground truth and the LDA with a ground truth matrix prior (y-axis) is plotted against the MSE between the ground truth and the uniform symmetric prior LDA (x-axis), for the cell-topic matrix (A) and the topic-gene matrix (B). Each point represents one independently simulated dataset, with a unique true cell-topic matrix and topic-gene matrix. C, D: MSE to the ground truth for the LDA with a matrix prior inferred from a simulated reference dataset is shown for different reference dataset sizes, for the cell-topic matrices (C) and the topic-gene matrices (D). The blue line is the line y = x.
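
The comparison plotted in this figure can be illustrated with a minimal sketch of the MSE between an inferred matrix and its ground truth. The function name `matrix_mse` and the toy matrices are hypothetical, and the sketch assumes the topics of the inferred matrix are already aligned with those of the ground truth; the caption does not say how alignment was handled.

```python
import numpy as np

def matrix_mse(estimate, truth):
    """Mean squared error over all entries of two equally shaped matrices."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if estimate.shape != truth.shape:
        raise ValueError("estimate and truth must have the same shape")
    return float(np.mean((estimate - truth) ** 2))

# Toy cell-topic matrices (rows: cells, columns: topics); values are illustrative only.
truth = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
with_matrix_prior = np.array([[0.6, 0.4],
                              [0.2, 0.8]])
with_uniform_prior = np.array([[0.5, 0.5],
                               [0.4, 0.6]])

mse_prior = matrix_mse(with_matrix_prior, truth)     # y-axis value in panels A and B
mse_uniform = matrix_mse(with_uniform_prior, truth)  # x-axis value in panels A and B
```

A point with `mse_prior < mse_uniform` falls below the line y = x, which is the region where the matrix prior is closer to the ground truth than the uniform prior.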

Fig 2.

Correlation of the target set with the matrix prior versus the joint model.

The Pearson correlation between LDA results on the target dataset for the matrix prior LDA and the full dataset uniform symmetric LDA (the “joint model”) increases as c_B increases. Pearson r values are plotted as a function of c_B for the cell-topic (top row) and topic-gene (bottom row) matrices for LDA experiments on four different datasets: C. elegans scATAC-seq data (first column), SHARE-seq mouse skin scATAC-seq data with the peak vocabulary translated to genes (second column), SHARE-seq mouse skin scRNA-seq data (third column), and SHARE-seq mouse skin scATAC-seq data using the peak vocabulary (fourth column). The dotted horizontal lines indicate the correlation between the uniform prior LDA and the joint model.

Fig 3.

Qualitative improvements when using the matrix prior.

Increasing the weight of the matrix prior (bottom row) shows a qualitative improvement in the ability of the target dataset LDA to discriminate among cell types compared to the uniform prior (top row). UMAP embeddings of the cell-topic matrices from SHARE-seq mouse skin scATAC-seq data using the peak vocabulary are trained with different values of c_B (different columns). Scatter points representing cells are colored by their published cell type annotations.

Fig 4.

Quantitative improvement of perplexity values with the matrix prior.

Perplexity values (y-axis) demonstrate quantitative improvement of the LDA model after using the matrix prior (darker colors) compared to the uniform prior (lighter colors) for various values of the weight of the prior (x-axis). The same procedure was used for both the SHARE-seq data set (blue) and the C. elegans data set (red). Each point is a separate split of the target data into a test and a training set.
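
For readers unfamiliar with the metric, the following is a minimal sketch of the standard held-out perplexity for a topic model, where lower values indicate a better fit. It is not taken from the paper: the function name `perplexity`, the argument names, and the toy inputs are assumptions, and the exact evaluation procedure behind this figure may differ.

```python
import numpy as np

def perplexity(counts, theta, beta):
    """Standard topic-model perplexity on a count matrix.

    counts: cells x genes matrix of held-out counts
    theta:  cells x topics matrix, each row sums to 1
    beta:   topics x genes matrix, each row sums to 1
    """
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(theta, dtype=float) @ np.asarray(beta, dtype=float)
    # Small constant guards against log(0); an assumption of this sketch.
    log_likelihood = np.sum(counts * np.log(probs + 1e-12))
    return float(np.exp(-log_likelihood / counts.sum()))

# Toy check: a model that spreads probability evenly over 4 genes
# has a perplexity equal to the vocabulary size.
toy_counts = np.array([[1, 2, 0, 3]])
toy_theta = np.array([[0.5, 0.5]])
toy_beta = np.full((2, 4), 0.25)
toy_perplexity = perplexity(toy_counts, toy_theta, toy_beta)
```

In this toy case the perplexity is approximately 4, the number of genes, which is the value expected from an uninformative model; a model that fits the held-out counts better would score lower.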
