PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

doi:10.1371/journal.pcbi.1011814

Fig 1.

Overview of pathway transformation using single sample pathway analysis (ssPA).

A. Pathways are represented as sets of molecules, e.g. genes, proteins, and metabolites. B. Pathway transformation by ssPA changes the dimension of an omics dataset from a molecular space to a pathway space. C. This transforms a sample-by-molecule expression or abundance matrix into a sample-by-pathway matrix, where values represent the ‘activity’ of each pathway for each individual sample.
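
The matrix transformation described in this caption can be sketched in a few lines. This is a minimal illustration, not the PathIntegrate implementation: it assumes a simple mean z-score as the pathway score, whereas the captions below mention SVD- and kPCA-based ssPA scores. The names `zscore_columns` and `pathway_scores`, the molecule identifiers, and the toy data are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(X):
    """Standardise each molecule (column) to zero mean and unit variance."""
    out_cols = []
    for col in zip(*X):
        mu, sd = mean(col), pstdev(col)
        # Constant molecules carry no information; map them to 0.
        out_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]


def pathway_scores(X, molecules, pathways):
    """Map a sample-by-molecule matrix to a sample-by-pathway matrix.

    Assumption: the score is the mean z-score of the pathway's measured
    molecules. Other ssPA methods (e.g. SVD, kPCA) define it differently.
    """
    Z = zscore_columns(X)
    index = {m: j for j, m in enumerate(molecules)}
    names = sorted(pathways)
    A = []
    for row in Z:
        scores = []
        for name in names:
            # Only molecules that were actually measured contribute.
            cols = [index[m] for m in pathways[name] if m in index]
            scores.append(mean(row[j] for j in cols) if cols else float("nan"))
        A.append(scores)
    return names, A


# Hypothetical toy data: 3 samples x 4 molecules.
molecules = ["g1", "g2", "m1", "p1"]
X = [
    [1.0, 2.0, 0.5, 3.0],
    [2.0, 4.0, 1.5, 1.0],
    [3.0, 6.0, 2.5, 2.0],
]
pathways = {"pathway_a": {"g1", "g2"}, "pathway_b": {"m1", "p1", "unmeasured"}}

names, A = pathway_scores(X, molecules, pathways)  # A is 3 samples x 2 pathways
```

Here `A` has one row per sample and one column per pathway, so downstream models see pathways rather than individual molecules as features.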

Fig 2.

Pathway transformation enhances sensitivity to low signal-to-noise signals.

The y-axis shows the proportion of Mann-Whitney U (MWU) tests significant at Bonferroni-adjusted p ≤ 0.05, performed on either the pathway-level or the molecular-level data, across the effect sizes shown on the x-axis. Semi-synthetic data were based on the COVID-19 dataset.
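
The quantity on the y-axis, the proportion of Mann-Whitney U tests that remain significant after Bonferroni correction, can be sketched as follows. This is an illustration under stated assumptions: the p-value uses the normal approximation without tie or continuity correction, which may differ from the test implementation used for the figure, and the feature names and values are hypothetical toy data.

```python
import math


def mwu_pvalue(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation.

    Assumption: no tie or continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; ties count one half.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def proportion_significant(pvalues, alpha=0.05):
    """Fraction of tests with p <= alpha / m (Bonferroni threshold)."""
    m = len(pvalues)
    return sum(p <= alpha / m for p in pvalues) / m


# Hypothetical features: (case values, control values).
features = {
    "shifted": (list(range(11, 21)), list(range(1, 11))),
    "unchanged": ([1, 3, 5, 7, 9, 2, 4, 6, 8, 10], list(range(1, 11))),
}

pvalues = [mwu_pvalue(cases, controls) for cases, controls in features.values()]
proportion = proportion_significant(pvalues)
```

Repeating this calculation at each simulated effect size, once on molecular-level features and once on pathway scores, would give the two kinds of curve the figure compares.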

Fig 3.

PathIntegrate Multi-View (left) and Single-View (right) modelling frameworks for multi-omics pathway-based integration.

Frameworks are outlined in terms of their input data, pathway-transformation stage, statistical model, and outputs. Blue data blocks represent omics data which has been transformed from the molecular (X_N×M) space to the pathway (A_N×P) space using ssPA. Both Single-View and Multi-View make use of the same multi-omics pathway set.

Fig 4.

Performance of PathIntegrate and DIABLO vs. effect size, based on semi-synthetic data measured by AUROC.

COPDgene metabolomics and proteomics data were integrated in each model. A. Ability to correctly predict sample outcomes (case vs. control). We compared PathIntegrate Multi-View and Single-View to DIABLO using both molecular and pathway-level multi-omics data. B. Ability to correctly recall the target enriched pathway. We compared DIABLO RGCCA model loadings to the Multi-View MB-PLS VIP and Single-View PLS VIP statistics for pathway importance. C. Comparison of PathIntegrate Multi-View classification performance using the KEGG and Reactome pathway databases, as well as a molecular-level model. D. Effect of sample size on PathIntegrate Multi-View classification performance. For panels A-C, error bars indicate 95% confidence intervals on the mean AUROC (in some cases they appear smaller than the point sizes).
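
For readers unfamiliar with the metric, AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below is a generic illustration, not code from PathIntegrate or DIABLO. The function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: 1 for case, 0 for control. Tied scores count one half.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for two controls and two cases.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auroc(labels, scores)
```

The pairwise form is quadratic in the number of samples, which is acceptable for small cohorts. A rank-based formulation gives the same value more efficiently.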

Fig 5.

PathIntegrate Multi-View applied to COPDgene multi-omics data.

A. Superscores plot based on multi-omics (metabolomics, proteomics, and transcriptomics) pathways across four latent variables. B. Omics view importances across latent variables. Values represent mean and SEM across 100 bootstrap samples. C. Top five pathways per omics block. D. Top 15 pathways across omics blocks categorised by Reactome parent pathway. E. kPCA ssPA scores from top 15 pathways used to cluster samples using Euclidean distance and Ward linkage. F. Heatmap showing Spearman correlation between superscores across four latent variables and clinical metadata. Asterisks indicate Bonferroni p-value ≤ 0.05. Definitions of clinical variables are in Table B in S1 Supporting Information.

Table 1.

Number of Reactome/KEGG pathways accessible in COPDgene and COVID-19 multi-omics datasets.

Table 2.

Performance comparison of PathIntegrate Multi-View using pathways versus using the molecular-level COPDgene dataset (mean AUC and 95% CI, as well as the number of latent variables (LV) used).

In both pathway and molecular-level scenarios the model was used to predict binary COPD status. The molecular-level model was fit both with all molecules available in the datasets, as well as only those mapping to pathways. AUC values are averaged across 5-times repeated 5-fold cross validation.
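
The evaluation scheme named here, 5-times repeated 5-fold cross-validation, yields 25 train/test splits whose AUC values are then averaged. A minimal sketch of how such splits can be generated is shown below. It assumes plain (unstratified) shuffling, and whether the original analysis stratified folds by COPD status is not stated in this caption. The function name `repeated_kfold` and the sample count are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Assumption: folds are not stratified by class label.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train, test


# Hypothetical cohort of 20 samples: 5 repeats x 5 folds = 25 splits.
splits = list(repeated_kfold(n_samples=20))
```

A model would be fit on each `train` set and scored on the matching `test` set, with the 25 resulting AUC values averaged to give the reported mean.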

Fig 6.

Network visualisation with PathIntegrate interactive network explorer.

PathIntegrate Multi-View was applied to COPDgene multi-omics data. A. Multi-omics network view of the global Reactome hierarchy DAG. Only pathways with sufficient coverage (≥ 2 molecules per pathway) are shown as nodes. Edges represent parent-child relationships between pathways as defined by Reactome. Nodes are coloured by Reactome superpathway membership. Node size corresponds to pathway coverage. B. Network view of the ‘Carnitine metabolism’ pathway (zoomed-in subset of (A)) and its close neighbourhood within the Reactome pathway hierarchy. Nodes are coloured by p-values obtained from the PathIntegrate Multi-View model.
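
The node and edge rules in panel A can be sketched as a simple filter over a pathway hierarchy. This is an illustration only and does not reproduce the interactive network explorer: the function name `coverage_filtered_network`, the pathway and molecule names, and the choice to drop edges that touch a removed node are all assumptions.

```python
def coverage_filtered_network(pathways, measured, parent_child, min_coverage=2):
    """Keep pathways with enough measured molecules and the edges between them.

    pathways: dict mapping pathway name to its set of annotated molecules.
    measured: set of molecules present in the dataset.
    parent_child: list of (parent, child) pathway pairs from the hierarchy.
    """
    # Coverage is the number of annotated molecules that were measured.
    coverage = {name: len(members & measured) for name, members in pathways.items()}
    nodes = {name: c for name, c in coverage.items() if c >= min_coverage}
    # Assumption: an edge survives only if both of its endpoints survive.
    edges = [(p, c) for p, c in parent_child if p in nodes and c in nodes]
    return nodes, edges


# Hypothetical hierarchy with one parent and two child pathways.
pathways = {
    "root": {"a", "b", "c", "d"},
    "child_1": {"a", "b"},
    "child_2": {"c", "x"},
}
measured = {"a", "b", "c"}
parent_child = [("root", "child_1"), ("root", "child_2")]

nodes, edges = coverage_filtered_network(pathways, measured, parent_child)
```

The coverage stored for each node could then drive node size, as in the figure, while a separate mapping would supply colours.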

Fig 7.

PathIntegrate Single-View applied to COVID-19 multi-omics data.

A. Kernel density distribution of log₁₀ pathway sizes in the COVID dataset per omics view. Pathway size refers to the number of molecules annotated to each pathway that are present in the COVID datasets. B. Number of pathways with sufficient coverage in the COVID dataset in each omics view. C. Multi-omics pathway features identified using recursive feature elimination from the PathIntegrate Single-View random forest model, ranked by Gini importance. D. Molecular-level importances derived from the ‘ADORA2B mediated anti-inflammatory cytokines production’ (R-HSA-9660821) SVD pathway scores. Datapoints represent the mean and standard deviation of the loadings of each molecule on PC1 across 200 bootstrap samples.

Table 3.

Number of molecules in each omics view in the COPDgene and COVID-19 datasets after processing and identifier mapping.
