A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer's disease

doi:10.1371/journal.pcbi.1008099

Fig 1.

Overview of the application of the GANs to bulk RNA-seq data.

RNA-seq analysis of the GSE104775 raw data, comprising 36 WT and AD samples (n = 6/group), yielded 1,208 DEGs between 7M WT and 7M AD. The normalized expression profiles of the 36 samples were subjected to a data augmentation procedure, creating 846 augmented samples. The generator network produces fake gene expression data from random variables in a latent space (z). The discriminator network distinguishes between the augmented real data and the fake data, yielding a loss function that is used to train the weight parameters of both networks. Transition curves describing how the expression of the 1,208 genes changes between WT and AD, a virtual simulation of disease progress, are then obtained by latent space interpolation with the generated fake data. The transition curves were classified into six patterns (P1 to P6), and pathway analysis was performed with the gene lists of the pattern subsets. We identified the order of up- or downregulated pathways, which predicts the pathway cascades.
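The latent space interpolation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes linear interpolation between two latent vectors, and `toy_generator`, its random `weights`, and the dimensions are hypothetical stand-ins for a trained generator network.

```python
import numpy as np

def interpolate_latent(z_start, z_end, n_steps=11):
    """Linearly interpolate between two latent vectors, endpoints included."""
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - alphas) * z_start + alphas * z_end

def toy_generator(z, weights):
    """Hypothetical stand-in for a trained generator: linear map + sigmoid."""
    return 1.0 / (1.0 + np.exp(-(z @ weights)))

rng = np.random.default_rng(0)
latent_dim, n_genes = 8, 5          # illustrative sizes only
weights = rng.normal(size=(latent_dim, n_genes))
z_wt = rng.normal(size=latent_dim)  # latent point standing in for a WT profile
z_ad = rng.normal(size=latent_dim)  # latent point standing in for an AD profile

z_path = interpolate_latent(z_wt, z_ad, n_steps=11)
# One row per interpolation step, one column per gene:
# each column is a transition curve from WT to AD.
curves = toy_generator(z_path, weights)
```

Each column of `curves` plays the role of one gene's transition curve; with a real generator the same path through the latent space would be decoded into full expression profiles.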

Fig 2.

Generation of fake gene profiles.

(A) Distribution plots of all rescaled RLD values for the 846 augmented real samples (blue) and the 846 generated samples (red). 95% of the real values and 93% of the generated values lie within [0, 1]. (B) Correlation coefficient distributions for all pairs within the 846 real data (blue) and for pairs between the 846 real and 846 fake data (red) at the 100k epoch. The two broad peaks represent the correlation coefficients within the same group and between different groups. (C) t-SNE plots at four epochs, with differently colored dots representing the 762 training samples (red), 84 test samples (blue) and 84 generated samples (orange). Remarkably, the t-SNE plots show well-separated clusters per group, which results from the pairwise data augmentation performed within each group. (D) Scatterplots of the rescaled RLD values of the 1,208 genes for six 7M AD samples vs the corresponding resembled fakes. (E) The rescaled RLD values of the Apoe gene for the 36 samples and their resembled fakes (generated; mauve), which were generated at the 100k epoch. The scattered points of the correlation coefficients represent separate evaluations over ten repeated generations of 10,000 fake data. The low variation of the correlation coefficients indicates that the fake data averaged in the latent space are robust and have similar values under the given weight parameters of the generator network.
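The pairwise within-group augmentation mentioned in (C) can be sketched as below. The mixing scheme is an assumption chosen only because it reproduces the stated count: keeping the 6 originals per group and mixing each of the 15 within-group pairs at 9 weights gives 141 samples per group, or 846 over 6 groups. The captions do not state the actual weights, and `augment_group` and the random toy data are hypothetical.

```python
from itertools import combinations
import numpy as np

def augment_group(samples, ratios):
    """Return the originals plus weighted mixes of every within-group pair."""
    augmented = [s for s in samples]
    for i, j in combinations(range(len(samples)), 2):
        for w in ratios:
            augmented.append(w * samples[i] + (1.0 - w) * samples[j])
    return np.stack(augmented)

rng = np.random.default_rng(0)
n_groups, n_per_group, n_genes = 6, 6, 4      # 4 genes only for illustration
ratios = np.linspace(0.1, 0.9, 9)             # assumed mixing weights
groups = rng.random((n_groups, n_per_group, n_genes))

# Mixing only within a group keeps every augmented sample inside its group.
augmented = np.concatenate([augment_group(g, ratios) for g in groups])
print(augmented.shape)  # (846, 4)
```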

Fig 3.

Transition curves of gene expression levels.

(A-B) Transition curves of 17 selected genes from 7M WT to 7M AD. Up- and downregulated genes that are known to be closely related or that have similar names were selected from pathways such as cholesterol metabolism (Apoe, Abca1), microglia pathogen phagocytosis pathway (Trem2, Tyrobp), complement and coagulation cascades (C1qa, C1qb, C1qc), focal adhesion (Col4a1, Col4a2), cholinergic synapse (Kcnq3, Kcnq5, Prkcb, Prkcg), TCA cycle (Mdh1, Mdh2), and dopaminergic synapse (Mapk8, Mapk10). (C) Each curve belongs to a pattern when r (the correlation coefficient with the proposed predefined curve patterns, shown in red) is higher than 0.95 or when the maximum r is above 0.90. (D) Venn diagrams of the number of transition curves belonging to each pattern. The total number of curves in the six patterns is 1,191 (649 upregulated, 542 downregulated). Of the 1,208 DEGs, 17 genes could not be classified into the six patterns.
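The assignment rule in (C) can be sketched as follows, under one reading of the caption: a curve joins every pattern whose r exceeds 0.95, and otherwise joins only its best-matching pattern if that r exceeds 0.90. The sigmoid `templates` are hypothetical placeholders, not the red pattern curves of the figure.

```python
import numpy as np

def classify_curve(curve, templates, strict=0.95, fallback=0.90):
    """Return the indices of the patterns a transition curve is assigned to."""
    r = np.array([np.corrcoef(curve, t)[0, 1] for t in templates])
    hits = np.flatnonzero(r > strict)
    if hits.size > 0:
        return hits.tolist()          # may overlap several patterns
    best = int(np.argmax(r))
    return [best] if r[best] > fallback else []   # [] means unclassified

x = np.linspace(0.0, 1.0, 21)
# Hypothetical templates: early, middle and late increase.
templates = [1.0 / (1.0 + np.exp(-20.0 * (x - m))) for m in (0.2, 0.5, 0.8)]

rng = np.random.default_rng(0)
late_curve = templates[2] + rng.normal(scale=0.02, size=x.size)
falling_curve = 1.0 - templates[2]    # anticorrelated with every template

late_label = classify_curve(late_curve, templates)
falling_label = classify_curve(falling_curve, templates)
```

Allowing several indices per curve matches the overlaps shown in the Venn diagrams of (D); an empty list corresponds to a gene that could not be classified.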

Fig 4.

The false discovery rate and enrichment ratio values for upregulated pathways.

The false discovery rate (FDR) and enrichment ratio (ER) values for the selected pathways, estimated from the gene list of each subset (Up, P1, P2, P3). Each pathway's heatmap was estimated by averaging the transition curves of the genes belonging to the Up list. For some pathways the ER values are very large but the FDR values are not significant. This occurs because the number of genes in subset lists such as P1 and the number of annotated genes in some pathways (such as cell adhesion mediated by integrin) are small; consequently, a few overlapping genes are enough to inflate the ER values. Conversely, pathways with many annotated genes, such as phagocytosis, tend to have very small FDR values even at low enrichment ratios. Hence, we present both the FDR and ER simultaneously.
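The interplay between ER and significance can be illustrated with a small calculation. This sketch assumes the common over-representation definitions: ER is the observed overlap divided by the overlap expected by chance, and the p-value is the hypergeometric upper tail. The counts are invented for illustration, and the multiple-testing correction that turns p-values into FDR values is omitted.

```python
from math import comb

def enrichment_ratio(overlap, list_size, pathway_size, background_size):
    """Observed overlap divided by the overlap expected by chance."""
    expected = list_size * pathway_size / background_size
    return overlap / expected

def hypergeom_pvalue(overlap, list_size, pathway_size, background_size):
    """P(X >= overlap) when list_size genes are drawn without replacement."""
    total = comb(background_size, list_size)
    upper = min(list_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: (overlap, list size, pathway size, background size).
small = (1, 10, 5, 2000)      # short gene list, small pathway
large = (30, 300, 100, 2000)  # long gene list, large pathway

er_small, p_small = enrichment_ratio(*small), hypergeom_pvalue(*small)
er_large, p_large = enrichment_ratio(*large), hypergeom_pvalue(*large)
print(f"small: ER={er_small:.1f}, p={p_small:.3g}")
print(f"large: ER={er_large:.1f}, p={p_large:.3g}")
```

With these invented counts a single overlapping gene gives the small case a far larger ER but a weaker p-value than the large case, which is the pattern the caption describes.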

Fig 5.

The false discovery rate and enrichment ratio values for downregulated pathways.

The false discovery rate (FDR) and enrichment ratio (ER) values for the selected pathways, estimated from the gene list of each subset (Down, P4, P5, P6). Each pathway's heatmap was estimated by averaging the transition curves of the genes belonging to the Down list. Genes were significantly downregulated in pathways involving the exocytosis of neurotransmitters, which show a lower FDR in P5 (green box), and in pathways of specific synaptic functions, which show a lower FDR in P6 (purple box).

Fig 6.

The transition curves and heatmap of genes related to cholesterol biosynthesis and cholesterol metabolism.

(A-B) The transition curves of genes in the cholesterol biosynthesis pathway show very early increases and saturation. (C-D) Cholesterol metabolism, which is a KEGG pathway associated with cholesterol, shows a lower FDR value in P3 and a late increase of transition curves for genes such as Apoe, Abca1, and Ldlr. (E) Schematic diagram of a suggested hypothesis for mutual regulation between Aβ production and cholesterol biosynthesis. Our finding that Aβ regulates cholesterol biosynthesis is denoted by the red outlines. Other parts were constructed based on the referenced literature.
