PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. PathIntegrate is available as an open-source Python package.


Related work
DIABLO [1] is a supervised method for multi-omics data integration based on generalised canonical correlation analysis (GCCA). It uses singular value decomposition to find a lower-dimensional representation of multiple omics input matrices and selects correlated variables that are associated with the phenotype of interest. It requires the user to specify a design matrix representing the expected correlation between omics datasets in the model. The inputs for DIABLO are scaled N-by-M omics data matrices, which makes it compatible with pathway-transformed data matrices as well.
MOGSA [2] is an unsupervised method for multi-omics data integration, designed to output a matrix of multi-omics single-sample pathway scores. It begins by integrating the data at the molecular level using multiple-factor analysis, then projects a binary matrix of pathway-membership information onto the observations in the latent space, and finally multiplies together the latent space matrices of samples and pathways to produce an N-by-P pathway score matrix. The final pathway score matrix can be decomposed to investigate the contribution of each omics dataset. MOGSA, unlike PathIntegrate and DIABLO, is not a predictive model but rather a method for generating multi-omics pathway scores, which could be used as input to predictive models like PathIntegrate.
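The three steps described for MOGSA can be illustrated with a minimal NumPy sketch. This is not the MOGSA implementation: it assumes a textbook form of multiple-factor analysis (each centred block divided by its first singular value, followed by a global SVD), and the function name `mfa_pathway_scores`, the toy data, and the random membership matrix are all hypothetical.

```python
import numpy as np


def mfa_pathway_scores(blocks, membership, n_factors=2):
    """Sketch of an MFA-style pathway score matrix (illustrative only).

    blocks     : list of (N x M_k) omics matrices sharing the same N samples
    membership : (sum of M_k) x P binary matrix, 1 if a molecule is in a pathway
    returns    : N x P matrix of pathway scores
    """
    weighted = []
    for X in blocks:
        Xc = X - X.mean(axis=0)  # centre each variable
        first_sv = np.linalg.svd(Xc, compute_uv=False)[0]
        weighted.append(Xc / first_sv)  # balance the blocks (MFA weighting)
    Z = np.hstack(weighted)  # N x M concatenated matrix

    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    sample_scores = U[:, :n_factors] * s[:n_factors]  # N x k latent sample space
    loadings = Vt[:n_factors].T  # M x k molecule loadings

    pathway_latent = loadings.T @ membership  # k x P: pathways in latent space
    return sample_scores @ pathway_latent  # N x P pathway scores


# Toy data: 20 samples, two omics blocks (6 and 4 molecules), 3 pathways.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(20, 6)), rng.normal(size=(20, 4))]
membership = rng.integers(0, 2, size=(10, 3)).astype(float)
scores = mfa_pathway_scores(blocks, membership, n_factors=2)
```

Because the scores are a product of block-wise terms, restricting `loadings` and `membership` to the rows of one omics block would give that block's contribution to each pathway score, which is the idea behind decomposing the final matrix by omics dataset.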
Like MOGSA, Multi-Omics Pathway Analysis (MOPA) [3] generates pathway-score matrices, in this case using non-negative tensor decomposition. It is designed for gene-based omics data such as mRNA, methylation, and miRNA data. MOPA uses a two-step process for generating pathway scores: first, it employs a non-negative tensor decomposition to perform feature selection and find genes significantly associated with a phenotype; second, it computes pathway scores from these genes using a method similar to Gene Set Variation Analysis [4]. Like MOGSA, MOPA allows the calculation of an 'omics contribution rate' to understand how different omics contribute to the pathway score calculation.
Multi-Omics Factor Analysis (MOFA) [5] is an unsupervised latent-variable method for multi-omics data integration. It uses group factor analysis to decompose multiple omics matrices into loadings and score matrices, which can be sparse. MOFA could be used with ssPA score matrices as input to form an unsupervised pathway-based multi-omics integration model. Similar to PathIntegrate, users can extract variable importances for each latent factor and the contribution of each omics layer to each factor.
Lilikoi 2.0 [6] is a metabolomics-specific pathway-based deep learning model. It uses Pathifier [7] to produce ssPA scores, which are then input to a deep neural network or to other classifiers such as random forest or logistic regression. It offers prognosis prediction using a Cox proportional hazards model, as well as network-based pathway visualisation options for downstream analysis.
PathwayPCA [8] is a toolkit offering multiple pathway-analysis-based utilities: 1) testing pathway association with an outcome (similar to conventional pathway analysis), 2) extracting important genes within a pathway using sparse modelling, and 3) computing pathway scores on important genes, which can be used as input for multi-omics analysis. The pathway scores are computed using Adaptive, Elastic-net, Sparse PCA (AES-PCA) or Supervised PCA (SuperPCA), introduced by the same authors. Similar to Lilikoi, the pathway-transformed output can be input to various downstream analyses such as survival analysis.
The integrative directed random walk-based method utilizing pathway information (iDRW) [9,10] generates ssPA scores from gene-gene topological interactions within pathways. Combining a gene-gene directed graph based on KEGG pathways with a random walk algorithm, iDRW was used to integrate gene-expression and copy number alteration data, resulting in a pathway score matrix. The authors demonstrated that using iDRW scores improved survival prediction compared to molecular-level data as well as other ssPA scoring approaches.
Finally, we refer the interested reader to a comprehensive review by Maghsoudi et al. [11], which provides a systematic evaluation of 32 integrative pathway analysis methods. While the aforementioned methods all provide useful functionality for either multi-omics integration at the molecular level (DIABLO, MOFA) or the generation of pathway scores at the single-omics level (Lilikoi, PathwayPCA) or the multi-omics level (MOGSA, MOPA, iDRW), none of them provides a framework for pathway-based multi-omics data integration. PathIntegrate seeks to fill this gap, providing a user-friendly Python implementation of the Multi-View and Single-View frameworks, which a) generate multi-omics pathway scores (based on the user's choice of ssPA method), and b) apply state-of-the-art predictive models to identify perturbed pathways. Furthermore, the majority of methods for generating multi-omics pathway scores are not designed to incorporate metabolomics data and are primarily based on gene/protein identifiers. PathIntegrate is specifically designed for (but not limited to) the integration of metabolomics data alongside other omics, providing multi-omics pathways containing gene (ENSEMBL), protein (UniProt), and metabolite (ChEBI) identifiers. In addition, for ease of use and seamless integration with other pipelines, PathIntegrate models are scikit-learn-compatible estimators, enabling the use of the various predictive models and parameter optimisation functions available in the scikit-learn library.
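To illustrate what scikit-learn compatibility makes possible, the sketch below chains a pathway-scoring step and a classifier in a scikit-learn `Pipeline` and tunes it with `GridSearchCV`. It does not use the PathIntegrate API: `PathwayPCAScores` is a hypothetical stand-in transformer that scores each pathway by the first principal component of its member molecules, and the data and pathway definitions are simulated.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class PathwayPCAScores(TransformerMixin, BaseEstimator):
    """Hypothetical ssPA-style transformer: one PCA-based score per pathway."""

    def __init__(self, pathways=None):
        # pathways: dict mapping pathway name -> list of column indices in X
        self.pathways = pathways

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Fit one single-component PCA per pathway on its member columns.
        self.models_ = {
            name: PCA(n_components=1).fit(X[:, cols])
            for name, cols in self.pathways.items()
        }
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        scores = [
            self.models_[name].transform(X[:, cols])
            for name, cols in self.pathways.items()
        ]
        return np.hstack(scores)  # shape: (n_samples, n_pathways)


# Simulated molecular-level data: 60 samples, 12 molecules, 3 toy pathways.
rng = np.random.default_rng(0)
n_samples = 60
X = rng.normal(size=(n_samples, 12))
y = np.repeat([0, 1], n_samples // 2)
X[y == 1, :4] += 1.0  # enrich the molecules of pathway_A in class 1

pathways = {
    "pathway_A": [0, 1, 2, 3],
    "pathway_B": [4, 5, 6, 7],
    "pathway_C": [8, 9, 10, 11],
}

model = Pipeline([
    ("sspa", PathwayPCAScores(pathways=pathways)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Standard scikit-learn hyperparameter search over the whole pipeline.
search = GridSearchCV(model, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

pathway_scores = search.best_estimator_.named_steps["sspa"].transform(X)
```

Placing the pathway transformation inside the pipeline means it is refitted on each training fold during cross-validation, so no information from held-out samples leaks into the pathway scores.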

Pathway database influences model performance
The performance of pathway-based models is strongly dependent on the pathway definitions used. The number and composition of pathways vary between databases, and factors such as size, level of overlap, and ratio of compounds to proteins/genes can all impact the models. We investigated the size of pathways in the Reactome human versus the KEGG human multi-omics pathway databases (those used in this work, where pathways can contain a combination of metabolites, proteins, and genes), and found that KEGG contains larger pathways on average (median size 96 molecules) than Reactome (median size 24 molecules). Reactome, however, contains more pathways (2,583) than KEGG (352). Importantly, these pathway database statistics are influenced by the molecules profiled in the dataset at hand, as only molecules that map to pathway identifiers will be included in the modelling. We investigated the pathway size distribution in two datasets, COPDgene and COVID-19, and found that the general trend was the same: KEGG pathways are generally larger than Reactome pathways (Fig D in S1 Supporting Information).
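The dependence of pathway size on the molecules profiled can be made concrete with a small sketch. The function name, pathway names, and identifiers below are hypothetical toy values, not Reactome or KEGG content; the point is only that pathway size is counted after intersecting each pathway with the identifiers measured in the dataset.

```python
from statistics import median


def pathway_sizes(pathways, profiled, min_size=1):
    """Count, per pathway, the members that were actually profiled.

    pathways : dict mapping pathway name -> iterable of molecule identifiers
    profiled : iterable of identifiers measured in the dataset
    min_size : pathways with fewer covered members are dropped
    """
    profiled = set(profiled)
    sizes = {}
    for name, members in pathways.items():
        covered = len(set(members) & profiled)
        if covered >= min_size:
            sizes[name] = covered
    return sizes


# Toy pathway database and dataset (all identifiers are made up).
toy_pathways = {
    "P1": ["m1", "m2", "m3", "g1"],
    "P2": ["m2", "g2"],
    "P3": ["p9"],
}
toy_profiled = ["m1", "m2", "g1", "g2", "x1"]

sizes = pathway_sizes(toy_pathways, toy_profiled)
median_size = median(sizes.values())
print(sizes, median_size)  # {'P1': 3, 'P2': 2} 2.5
```

Here P3 disappears entirely because none of its members were measured, and P1 shrinks from four molecules to three, which is why database-level statistics and dataset-level statistics can differ.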
We also investigated the pathway annotation levels of genes, proteins, and metabolites, i.e. the percentage of molecules profiled in a dataset with a valid pathway database identifier (ENSEMBL, UniProt, or ChEBI) that are assigned to pathways. Although results are highly dataset- and assay-dependent, when considering the COPDgene and COVID-19 datasets and the Reactome pathway database (Table A in S1 Supporting Information), we found proteomics data to have the highest percentage of total molecules profiled mapping to pathways (>70% for both datasets). Metabolomics data had the lowest percentage of molecules mapping to pathways (16.9% for COPDgene and 23.9% for COVID-19). This is likely due to the specificity of the ChEBI identifiers, particularly for chemical subclasses such as fatty acids, where molecules (e.g. lipids) can be annotated to a very high level of specificity depending on side chain composition, but are not yet annotated in pathway databases at the same level of specificity. Bulk transcriptomics data was not available for the COVID-19 data, but in the COPDgene dataset only 27% of ENSEMBL genes mapped to Reactome pathways. This demonstrates that the annotation issue is not specific to metabolomics data but can also affect sequencing-based omics such as transcriptomics, where thousands of genes are yet to be added to pathways.
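The annotation level defined above is a simple ratio, sketched below for each omics layer. The function name and all identifiers are hypothetical; real use would substitute the dataset's ENSEMBL, UniProt, or ChEBI identifiers and the member lists of the chosen pathway database.

```python
def annotation_coverage(layers, pathways):
    """Percentage of profiled molecules per omics layer found in any pathway.

    layers   : dict mapping omics layer name -> list of profiled identifiers
    pathways : dict mapping pathway name -> iterable of member identifiers
    """
    # Union of every identifier that appears in at least one pathway.
    annotated = set()
    for members in pathways.values():
        annotated.update(members)

    coverage = {}
    for layer, ids in layers.items():
        ids = set(ids)
        coverage[layer] = 100.0 * len(ids & annotated) / len(ids)
    return coverage


# Toy example (all identifiers are made up).
toy_layers = {
    "metabolomics": ["m1", "m2", "m3", "m4"],
    "proteomics": ["p1", "p2"],
}
toy_db = {
    "P1": ["m1", "p1", "g1"],
    "P2": ["p1", "p2"],
}

coverage = annotation_coverage(toy_layers, toy_db)
print(coverage)  # {'metabolomics': 25.0, 'proteomics': 100.0}
```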

Table A in S1 Supporting Information: Percentage of molecules with a valid identifier (ChEBI, UniProt, or ENSEMBL) in single omics mapping to Reactome human pathways. A lower percentage of molecules mapping to pathways means a greater percentage of molecules do not yet map to pathways and are not incorporated into pathway-based analyses.
Fig B in S1 Supporting Information: Fold changes in COPDgene multi-omics data based on either COPD status or gender outcomes.
Fig E in S1 Supporting Information: Comparison of the classification performance of PathIntegrate methods using KEGG and Reactome pathway databases, as well as a molecular-level model, based on semi-synthetic COPDgene data.

Fig I in S1 Supporting Information: Comparison of PathIntegrate classification performance using KEGG and Reactome pathway databases, as well as a molecular-level model, based on semi-synthetic COVID-19 data.

Fig K in S1 Supporting Information: Five-times repeated, nested 5-fold cross-validation results for tuning the number of latent variables in PathIntegrate Multi-View for the COPDgene case study integrating metabolomics, proteomics, and transcriptomics data. X axis shows mean AUC across inner folds. Error bars represent standard deviation.

Fig L in S1 Supporting Information: Preview of the PathIntegrate network explorer app (running on a local host server) showing an example of a multi-omics dataset being analysed. Interactive visualisations are facilitated by the open-source Plotly Dash framework (MIT license). Nodes in the network represent pathways and edges represent parent-child relationships between them. Users can zoom in and hover over nodes to see more information about the pathway.

Fig M in S1 Supporting Information: Reactome hierarchy network (based on coverage in COPDgene multi-omics data) coloured by root pathway membership with full legend. In the interactive app, users can hover over nodes to see detailed information about pathway name, root pathway, and coverage in a dataset.

Table B in S1 Supporting Information: Clinical data definitions for significantly correlated clinical variables from the COPDgene study shown in Fig 4F.

Table C in S1 Supporting Information: Table of notation.