Fig 1.
A conceptual illustration of the variational autoencoder architectures.
A: An overview of the system architecture. A pathway-derived prior is generated at the transcript or gene level, as described in the following sections. The three alternative input types (transcript-level, gene-level, and community-level) correspond to three different model variants. B: As a latent variable framework, the model assumes that latent variables (Z) are determinants of the measured data (X). To learn Z, the posterior p(Z|X) is approximated by q(Z|X), which is modeled by the encoder portion of the network. The latent values represent probability distributions, which are implemented in the bottleneck layer as values for mu (μ) and sigma (σ). Finally, the decoder is conceptually equivalent to p(X|Z).
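The encoder and decoder roles described in panel B can be illustrated with a minimal NumPy sketch. It is not the implementation used here: the linear encoder and decoder, the weight names (`w_mu`, `w_logvar`, `w_dec`), the layer sizes, and the standard normal prior in the KL term are all assumptions chosen for illustration, whereas the prior-based variants use a pathway-derived prior.

```python
import numpy as np

def encode(x, w_mu, w_logvar):
    """Toy linear encoder for q(Z|X): returns mu and log(sigma^2) per latent dimension."""
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, logvar):
    """Per-sample KL(q(Z|X) || N(0, I)); the standard normal prior is an assumption."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))                # 4 hypothetical samples, 10 features
w_mu = 0.1 * rng.standard_normal((10, 3))       # hypothetical encoder weights for mu
w_logvar = 0.1 * rng.standard_normal((10, 3))   # hypothetical encoder weights for log-variance
w_dec = 0.1 * rng.standard_normal((3, 10))      # hypothetical decoder weights

mu, logvar = encode(x, w_mu, w_logvar)          # bottleneck parameters (mu, sigma)
z = reparameterize(mu, logvar, rng)             # sampled latent values Z
x_recon = z @ w_dec                             # toy linear decoder: mean of p(X|Z)
kl = kl_to_standard_normal(mu, logvar)          # regularization term per sample
```

In a beta-VAE-style objective, the `kl` term would be multiplied by a weight before being added to the reconstruction loss; that weighting is omitted from this sketch.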
Table 1.
An overview of the structures of the five architectures.
Table 2.
Benchmark results for the five architectures on the test set.
Fig 2.
Reconstruction performance, measured as correlation coefficients between input and output transcriptomes.
A-E: The clustered pair-wise correlation heatmaps of the selected inputs and their reconstructed outputs for A: simpleAE, B: simpleVAE, C: priorVAE, D: beta-simpleVAE, and E: beta-priorVAE. Selected input samples and their corresponding reconstruction outputs are enumerated as 1–20. ‘_Train’ represents the input training sample and ‘_Recon’ represents the reconstructed output. F: The average correlation between each input and its corresponding reconstruction output.
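The summary in panel F can be sketched as a per-sample correlation between each input and its reconstruction, averaged over samples. The sketch below assumes Pearson correlation, which the caption does not specify, and uses synthetic data; the function name `rowwise_pearson` and the array sizes are hypothetical.

```python
import numpy as np

def rowwise_pearson(x, x_recon):
    """Pearson correlation between matching rows of x and x_recon."""
    xc = x - x.mean(axis=1, keepdims=True)
    rc = x_recon - x_recon.mean(axis=1, keepdims=True)
    num = np.sum(xc * rc, axis=1)
    den = np.sqrt(np.sum(xc**2, axis=1) * np.sum(rc**2, axis=1))
    return num / den

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 500))                  # 20 synthetic samples, 500 features
x_recon = x + 0.5 * rng.standard_normal(x.shape)    # stand-in for a model's reconstruction

per_sample_r = rowwise_pearson(x, x_recon)          # one coefficient per sample
mean_r = per_sample_r.mean()                        # average across samples, as in panel F
```

The full pair-wise heatmaps in panels A-E would instead correlate every input with every reconstruction, for example by applying `np.corrcoef` to the stacked matrix of inputs and outputs.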
Fig 3.
The performance of the AE models across several sample classification tasks.
Sample classification was based on multivariate logistic regression models as a function of the latent representation provided by each of the autoencoder architectures.
Fig 4.
The performance of the KEGG-based models across classification tasks.
KEGG-based models are compared to MSigDB-based models. Sample classification was based on multivariate logistic regression models as a function of the latent representation provided by each of the autoencoder architectures.
Fig 5.
The semantic meaningfulness of the latent variables in the prior-based models, shown as the correlation between the biological priors and the latent μ of prior-based models on the test set.
The correlation for each latent dimension is shown in A for priorVAE and in B for beta-priorVAE. Subplot C summarizes these correlations to directly compare the semantic interpretability of the two models.
Fig 6.
Differential latent space analysis of adenocarcinoma vs small cell lung cancer.
Heatmaps show latent values for pathways defined by A: MSigDB and B: KEGG. C and D show the top differentially expressed latent variables, based on p-values, for MSigDB and KEGG, respectively.
Table 3.
Overrepresented KEGG gene sets by traditional differential expression analysis.