A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data

doi:10.1371/journal.pcbi.1011198

Fig 1.

A conceptual illustration of variational autoencoder architectures.

A: An overview of the system architecture. A pathway-derived prior is generated on the transcript or gene level, as described in the following sections. The three alternative input types (transcript-level, gene-level, and community-level) correspond to three different model variants. B: As a latent variable framework, the model assumes that latent variables (Z) are determinants of the measured data (X). To learn Z, the posterior p(Z|X) is approximated by q(Z|X), which is modeled by the encoder portion of the network. The latent values represent probability distributions, which are implemented in the bottleneck layer as values for mu (μ) and sigma (σ). Finally, the decoder is conceptually equivalent to p(X|Z).
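
The bottleneck described in panel B can be illustrated with a minimal NumPy sketch. It assumes diagonal Gaussian latents, the usual reparameterization z = μ + σ·ε, and a closed-form KL term against a Gaussian prior. The function names, the array shapes, and the default standard-normal prior are illustrative assumptions; how the pathway-derived prior enters the loss is not specified in this caption.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def gaussian_kl(mu, sigma, prior_mu=0.0, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(prior_mu, prior_sigma^2) ), summed over latent dims.

    The standard-normal default is an assumption for illustration only;
    a pathway-derived prior would be passed in as prior_mu / prior_sigma.
    """
    per_dim = (
        np.log(prior_sigma / sigma)
        + (sigma**2 + (mu - prior_mu) ** 2) / (2.0 * prior_sigma**2)
        - 0.5
    )
    return per_dim.sum(axis=-1)

rng = np.random.default_rng(0)
mu = np.ones((4, 3))       # hypothetical encoder output: 4 samples, 3 latent dims
sigma = np.ones((4, 3))    # hypothetical encoder output (standard deviations)
z = reparameterize(mu, sigma, rng)   # what the decoder p(X|Z) would consume
kl = gaussian_kl(mu, sigma)          # one KL value per sample
```

In a beta-weighted variant, the KL term would simply be multiplied by a scalar before being added to the reconstruction loss.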

Table 1.

An overview of the structures of the five architectures.

Table 2.

Benchmark results for the five architectures on the test set.

Fig 2.

Reconstruction performances using correlation coefficients between input and output transcriptomes.

A-E: Clustered pairwise correlation heatmaps of the selected inputs and their reconstructed outputs for A: simpleAE, B: simpleVAE, C: priorVAE, D: beta-simpleVAE, and E: beta-priorVAE. Selected input samples and their corresponding reconstruction outputs are enumerated as 1–20. ‘_Train’ denotes the input training sample and ‘_Recon’ denotes the reconstructed output. F: The average correlation between each input and its corresponding reconstruction output.
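
As a rough illustration of the summary in panel F, the sketch below computes one correlation per sample between an input transcriptome and its reconstruction, then averages them. Pearson correlation is an assumption here (the caption says only "correlation coefficients"), and the function name and toy arrays are hypothetical.

```python
import numpy as np

def samplewise_pearson(x, x_recon):
    """Pearson r between each row of x and the matching row of x_recon."""
    xc = x - x.mean(axis=1, keepdims=True)
    rc = x_recon - x_recon.mean(axis=1, keepdims=True)
    num = (xc * rc).sum(axis=1)
    den = np.sqrt((xc**2).sum(axis=1) * (rc**2).sum(axis=1))
    return num / den

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 100))   # toy data: 20 samples, 100 features
x_recon = 2.0 * x + 1.0          # toy "reconstruction": an exact linear rescaling
r = samplewise_pearson(x, x_recon)
mean_r = r.mean()                # the quantity summarized per model in panel F
```

A perfect linear rescaling yields r = 1 for every sample, which shows that this metric rewards preserved expression patterns rather than matching absolute values.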

Fig 3.

The performance of the AE models across several sample classification tasks.

Sample classification was based on multivariate logistic regression models as a function of the latent representation provided by each of the autoencoder architectures.

Fig 4.

The performance of the KEGG-based models across classification tasks.

KEGG-based models are compared to MSigDB-based models. Sample classification was based on multivariate logistic regression models as a function of the latent representation provided by each of the autoencoder architectures.

Fig 5.

The semantic meaningfulness of the latent variables in the prior-based models, shown as the correlation between the biological priors and the latent μ of prior-based models on the test set.

The correlation of each dimension is shown in A: for priorVAE and B: for beta-priorVAE. Subplot C summarizes these correlations to directly compare the semantic interpretability of the two models.
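
One way to picture the per-dimension comparison is sketched below. It assumes the biological prior and the latent μ are both available as samples × dimensions matrices with aligned columns, and that Pearson correlation is used. Neither assumption is confirmed by the caption, and all names are hypothetical.

```python
import numpy as np

def per_dimension_correlation(prior, latent_mu):
    """Pearson r between column j of prior and column j of latent_mu, for every j."""
    n_dims = prior.shape[1]
    return np.array(
        [np.corrcoef(prior[:, j], latent_mu[:, j])[0, 1] for j in range(n_dims)]
    )

rng = np.random.default_rng(0)
prior = rng.normal(size=(50, 4))   # toy prior: 50 test samples, 4 pathway dimensions
latent_mu = prior.copy()
latent_mu[:, 1] *= -1.0            # toy case: dimension 1 is sign-flipped
r = per_dimension_correlation(prior, latent_mu)
summary = np.abs(r).mean()         # one possible single-number summary per model
```

Taking the absolute value before averaging is a design choice made for this sketch: it treats a sign-flipped latent dimension as equally informative, which may or may not match the summary used in subplot C.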

Fig 6.

Differential latent space analysis of adenocarcinoma vs small cell lung cancer.

Heatmaps show latent values for pathways defined by A: MSigDB and B: KEGG. C and D show the top differentially expressed latent variables, ranked by p-value, for MSigDB and KEGG, respectively.

Table 3.

Overrepresented KEGG gene sets by traditional differential expression analysis.
