Fig 1.
Structure of scAAnet for non-linear archetypal analysis with a ZINB reconstruction loss.
NLL: Negative log-likelihood.
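For readers unfamiliar with the loss named in this caption, the following is a minimal sketch of a per-count ZINB negative log-likelihood. It assumes the common mean/inverse-dispersion parameterization of the negative binomial; the names `mu`, `theta` and `pi` (dropout probability) are illustrative and are not taken from scAAnet itself.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of one count under a zero-inflated NB.
    pi is the assumed probability of an excess (dropout) zero, 0 <= pi < 1."""
    log_nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero arises either from dropout or from the NB component itself.
        return -math.log(pi + (1.0 - pi) * math.exp(log_nb))
    # A non-zero count can only come from the NB component.
    return -(math.log(1.0 - pi) + log_nb)
```

With `pi = 0` the expression reduces to the plain NB negative log-likelihood, which is one way to see how the NB and ZINB losses relate.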
Table 1.
Parameters for different distributions.
Fig 2.
Performance of scAAnet and other methods on simulated scRNA-seq datasets.
(a) Results from datasets that were simulated under NB distributions. Panels from left to right show the MSE between inferred cell usages and true usages of the 4 GEPs, the Pearson correlation between inferred GEPs and true GEPs, and the MSE between reconstructed scaled means and true means. (b) Corresponding results from datasets that were simulated under the ZINB distribution (λ = 0.3). Error bars show 95% confidence intervals around the means. D.E.: differential expression.
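The two evaluation metrics named in this caption can be sketched as follows. This is an illustrative implementation on plain Python lists, not the evaluation code used for the figure; in particular, it assumes that inferred GEPs have already been matched to their true counterparts, a step that is not shown here.

```python
import math

def mse(estimated, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth)

def pearson(a, b):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

A lower MSE indicates better recovery of usages or means, whereas a Pearson correlation closer to 1 indicates better recovery of a GEP.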
Fig 3.
MDS interpolation visualization of archetypal space.
(a) MDS visualization of archetypal space recovered from LDVAE. (b) MDS visualization of archetypal space recovered from scVI. (c) MDS visualization of archetypal space recovered from AAnet. (d) MDS visualization of archetypal space recovered from scAAnet using ZINB loss. Red dots are the locations of the inferred archetypes. Figures in each column are colored by the true cell usage of the corresponding archetype (GEP). Similar results of scAAnet using other distribution-based loss terms can be found in S2 Fig.
Fig 4.
Comparison of DEG identification using continuous usages versus discrete group assignments.
(a) ROC AUC results for datasets simulated under NB distributions. (b) Results for datasets simulated under ZINB distributions with λ = 0.3. (c) Results for datasets simulated under ZINB distributions with λ = 0.1. (d) Results for datasets simulated under ZINB distributions with λ = 0.0. ROC AUC values were calculated across different signal-to-noise ratio levels. Each box-and-whisker plot is based on ten simulated datasets. Central lines represent medians, boxes represent the interquartile range (IQR), and the upper/lower whiskers extend to the largest/smallest value no further than 1.5 × IQR from the box. ROC: receiver operating characteristic; AUC: area under the curve.
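As a reminder of what the ROC AUC in this figure measures, the sketch below computes it from its rank interpretation: the probability that a randomly chosen true DEG receives a higher score than a randomly chosen non-DEG, with ties counted as one half. The function name and the 0/1 label encoding are assumptions made for illustration; the figure itself may have been produced with a library routine.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.
    labels: 1 for a true DEG, 0 otherwise (assumed encoding).
    scores: larger means stronger evidence of differential expression."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count pairs where the positive outranks the negative; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because three of the four positive/negative pairs are ordered correctly.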
Fig 5.
scAAnet identified archetypes with correspondence to known cell types in the pancreatic islet dataset.
(a) The UMAP visualization using the top 35 PCs of the observed scRNA-seq data (left), using the inferred cell usage matrix from the encoder (middle), and using the reconstructed expression profile from the output layer of scAAnet (right). UMAPs are colored by 10 known cell clusters. The average Silhouette scores for the three UMAPs are 0.604, 0.398 and 0.613, respectively. Black dots are locations of cells that have the largest usage of the corresponding GEP (marked in Arabic numerals). (b) UMAPs from the observed data colored by inferred cell usage for each GEP. (c) Heatmap showing the usage of all GEPs (rows) in all cells (columns). Cells are ordered by hierarchical clustering. (d) Heatmap showing the percentage of cells with usage > 25% of each GEP (rows) in each cell type (columns). Colors in c and d are coded in the same way as those in a. (e) Normalized gene scores (Methods) of known markers (columns) in each GEP (rows). (f) Mean z-scores of known markers (columns) in each cell cluster (rows).
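The summary shown in panel d can be sketched as follows: for each cell type, count the fraction of cells whose usage of a given GEP exceeds 25%. The data layout (one usage vector per cell, one label per cell) and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def percent_high_usage(usages, cell_types, threshold=0.25):
    """Return {cell_type: [percent of cells with usage > threshold, per GEP]}.
    usages: one list of GEP usages per cell (assumed layout).
    cell_types: one label per cell, in the same order."""
    n_geps = len(usages[0])
    counts = defaultdict(lambda: [0] * n_geps)
    totals = defaultdict(int)
    for usage, cell_type in zip(usages, cell_types):
        totals[cell_type] += 1
        for g, u in enumerate(usage):
            if u > threshold:
                counts[cell_type][g] += 1
    return {ct: [100.0 * c / totals[ct] for c in counts[ct]] for ct in totals}
```

Because a cell can exceed the threshold for more than one GEP, the percentages for a given cell type need not sum to 100.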
Fig 6.
Enriched GO terms and top DEGs of GEP 2 and GEP 7 in the pancreatic islet dataset.
(a) The top 10 enriched GO biological process terms using 26 significantly upregulated genes of GEP 2. (b) The expression levels of the top 20 DEGs of GEP 2 with the largest z-scores from the DEG test covaried with the inferred usage of GEP 2. (c) The top 10 enriched GO biological process terms using 84 significantly upregulated genes of GEP 7. (d) The top 20 DEGs of GEP 7 with the largest z-scores. Cells in b and d were sorted in increasing order based on the inferred cell usage of GEP 2 and GEP 7, respectively.
Fig 7.
scAAnet identified 3 GEPs in the lung fibroblast and myofibroblast cells.
(a) The UMAP visualization (Methods) of observed scRNA-seq data colored by cell types. (b) The UMAP visualization of observed scRNA-seq data colored by disease groups. Black dots in a and b are locations of cells that have the largest usage of the corresponding GEP (marked in Arabic numerals). (c) UMAPs colored by inferred cell usage for each GEP. (d) Heatmap showing the usage of all GEPs (rows) in all cells (columns). Cells are ordered by hierarchical clustering. (e) Heatmap showing the percentage of cells with usage > 25% of each GEP (rows) in each cell type and disease group (columns). Colors of cell types and disease groups in d and e are coded in the same way as colors in a and b. (f) Box-and-whisker plot of the usage of each GEP in fibroblasts (top) and myofibroblasts (bottom), colored by disease groups (colors are coded in the same way as in b). Central lines represent medians, boxes represent the IQR, and whiskers represent the 5th and 95th percentiles.
Fig 8.
Enriched GO terms and top DEGs of GEPs in the lung fibroblast and myofibroblast cells.
(a) The top 10 enriched GO terms using upregulated DEGs in GEP 1. (b) The expression levels of the top 20 DEGs of GEP 1 covaried with the inferred usage of GEP 1. (c) The top 10 enriched GO terms using upregulated DEGs in GEP 2. (d) The expression levels of the top 20 DEGs of GEP 2 covaried with the inferred usage of GEP 2. (e) The top 10 enriched GO terms using upregulated DEGs in GEP 3. (f) The expression levels of the top 20 DEGs of GEP 3 covaried with the inferred usage of GEP 3. Top 20 genes in b, d and f were selected based on their z-scores. Cells in b, d and f were sorted in increasing order of the inferred cell usage of GEP 1, GEP 2 and GEP 3, respectively. Colors of cell types and disease groups in b, d and f are coded in the same way as colors in Fig 7a and 7b, respectively.
Fig 9.
scAAnet identified 4 GEPs in microglia from the prefrontal cortex dataset.
(a) UMAP of the observed scRNA-seq data colored by microglia subclusters. (b) UMAP of the observed scRNA-seq data colored by AD pathology group. (c) UMAP of the observed scRNA-seq data colored by sex. Microglia subclusters, AD pathology group and sex were provided by the original paper. Black dots in a, b and c are locations of cells that have the largest usage of the corresponding GEP (marked in Arabic numerals). (d) UMAPs colored by inferred cell usage for each GEP. (e) Heatmap showing the enrichment of identified DEGs in each GEP across sets of marker genes of the four microglia subclusters. Colors represent the negative log of adjusted p-values from hypergeometric tests. P-values were adjusted using the Bonferroni correction over all GEPs and subclusters. (f) Heatmap showing the percentage of cells with usage > 25% of each GEP (rows) in each subcluster and AD group (columns). (g) Heatmap showing the percentage of cells with usage > 25% of each GEP (rows) in each subcluster and sex (columns). Colors of subclusters, AD group and sex in f and g are coded in the same way as colors in a, b and c.
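The enrichment statistic in panel e can be illustrated with a one-sided hypergeometric test followed by a Bonferroni adjustment. The sketch below is a generic implementation under stated assumptions: the gene universe size, the argument names, and the use of base-10 logarithms for the color scale are all illustrative, since the caption does not specify them.

```python
from math import comb, log10

def hypergeom_enrichment_p(overlap, universe, markers, degs):
    """P(X >= overlap) when `degs` genes are drawn without replacement from
    `universe` genes, of which `markers` belong to the marker set."""
    upper = min(markers, degs)
    favorable = sum(
        comb(markers, i) * comb(universe - markers, degs - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(universe, degs)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def neg_log10(p):
    """Color-scale value; base 10 is an assumption, not stated in the caption."""
    return -log10(p)
```

Here the number of tests passed to `bonferroni` would be the number of GEP and subcluster combinations, in line with the adjustment described in the caption.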
Fig 10.
Enriched GO terms and top DEGs of GEP 2 and GEP 4 in the prefrontal cortex dataset.
(a) The top 10 enriched GO biological process terms using upregulated DEGs in GEP 2. (b) The expression levels of the top 20 DEGs with the largest z-scores covaried with the inferred usage of GEP 2. (c) The top 10 enriched GO biological process terms using upregulated DEGs in GEP 4. (d) The expression levels of the top 20 DEGs of GEP 4 covaried with the inferred usage of GEP 4. Cells in b and d were sorted in increasing order of the inferred cell usage of GEP 2 and GEP 4, respectively.