Figures
Abstract
Pathways are composed of proteins forming a network to represent specific biological mechanisms and are often used to measure enrichment scores based on a list of genes in means to measure their biological activity. The pathway analysis is a de facto standard downstream analysis procedure in most genomic and transcriptomic studies. Here, we present MOPA (Multi-Omics Pathway Analysis), which is a multi-omics integrative method that scores individual pathways in a sample wise manner in terms of enriched multi-omics regulatory activity, which we refer to mES (multi-omics Enrichment Score). The mES score reflects the strength of regulatory relations between multi-omics in units of pathways. In addition, MOPA is able to measure how much each omics contribute to mES that may be used to observe what kind of omics are active in a pathway within a sample group (e.g., subtype, gender), which we refer to OCR (Omics Contribution Rate). Using nine different cancer types, 93 clinical features and three types of omics (i.e., gene expression, miRNA and methylation), MOPA was used to search for clinical features that were explainable in context of multi-omics. By evaluating the performance of MOPA, we showed that it yielded higher or at least equal performance compared to previous single and multi-omics pathway analysis tools. We find that the advantage of MOPA is the ability to explain pathways in terms of omics relation using mES and OCR. As one of the results, the TGF-beta signaling pathway was captured as an important pathway that showed distinct mES and OCR values specific to the CMS4 subtype in colon adenocarcinoma. The mES and OCR metrics suggested that the mRNA and miRNA expressions were significantly different from the other subtypes, which was concordant with previous studies. The MOPA software is available at https://github.com/jaeminjj/MOPA.
Citation: Jeon J, Han EY, Jung I (2023) MOPA: An integrative multi-omics pathway analysis method for measuring omics activity. PLoS ONE 18(3): e0278272. https://doi.org/10.1371/journal.pone.0278272
Editor: Tao Huang, Chinese Academy of Sciences, CHINA
Received: August 2, 2022; Accepted: November 13, 2022; Published: March 16, 2023
Copyright: © 2023 Jeon et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The software is available at: https://github.com/jaeminjj/MOPA. A tutorial using the TCGA-COAD colon cancer data is also available at the github repository. The data for the tutorial can be download at: https://cobi.knu.ac.kr/tools/mopa/TCGA_COAD.zip.
Funding: This research was supported by the National Institute of Health in the form of a grant [2022-ER0806-00] and by the Basic Science Research Program, through the National Research Foundation of Korea (NRF), which is funded by the Korea Post-Genome Project, in the form of a grant [NRF-2020M3C9A5085604] to IJ.
Competing interests: NO authors have competing interests.
Introduction
With the growing interest in finding meaningful correlations across different omics data types, the size of multi-omics data is growing exponentially. Such multi-omics data are sampled and analyzed in means to explain certain clinical outcomes or biological phenomena of interest.
Advancement in methods for measuring such omics data is outpacing the development of analytical methods that are able to embrace the many omics types in an integrative manner. The multi-omics research has now become a routine that is rapidly extending into various domains, such as, metagenomics and single-cell [1]. While the advantage of utilizing multi-omics data is evident, the complexity of the analysis is non-trivial. Especially, as multi-omics, or multi-modal, data provide a multi-perspective view to a common event, the interpretation of the result can become relatively difficult. There are a number of valuable resources providing large scale multi-omics data, such as TCGA [2], ENCODE [3] and GTEx [4].
Recently, a number of methods has been proposed that perform integrative analysis using multi-omics data. For example SNF [5], MOFA [6] and iCluster+ [7] are able to integrate a wide range of omics data types as is. MONTI [8], CNAme [9], iGC [10] and MethylMix [11] are also multi-omics analysis methods, but perform the analysis on gene-level transformed data. These methods commonly require the omics data to be sample matched. In many cases, the output is also in units of genes or omics features. To make biological interpretations, a list of omics features need to be further investigated, which frequently involves enrichment tests, such as GSVA [12] or GSEA [13].
Instead of a list of omics features, it would be more helpful if the output was in units of pathways, since a pathway itself represent some biological mechanism, and thus is self explainable. Currently, three multi-omics methods were found whose output is in units of pathways, where one was able to compute pathway scores for each sample. The three methods are ActivePathway [14], multiGSEA [15] and MOGSA (Multi omics pathway Geneset Analysis) [16]. ActivePathway and multiGSEA methods take multi-omics data as input and output p-values for each pathway. ActivePathway computes the p-value of a pathway using a ranked hypergeometric test for each omics, which are later combined as a single p-value. multiGSEA is an extension of GSEA, where it performs the conventional enrichment analysis on each omics data. Similar to ActivePathway, multiGSEA combines the pathway p-values of each omics into a single p-value. On the other hand, MOGSA produces pathway enrichment scores per sample by performing multivariate analysis for dimension reduction to detect latent multi-omics features. In comparison to the multi-omics approaches, GSVA, ssGSEA [17] and z-score [18] are frequently used methods that output pathway enrichment scores based on the single-omics gene expression data. The characteristics of a number of pathway analysis methods are compared in Table 1.
Moreover, since pathway results are used to understand the biological mechanism of a disease or phenotype of interest, pathway scores or p-values are expected to associate with underlying biological semantics of a given experimental condition or clinically available features, such as cancer subtype, gender or age. However, the majority of tools are based on unsupervised methods that aim to discover new groups that exhibit group specific multi-omics profiles.
Here, we propose MOPA (Multi-Omics Pathway Analysis), which is a tool that takes multi-omics data as input and outputs a score of a pathway per sample in association to a phenotype or clinical feature. A high scored pathway implies that the multi-omics profiles of genes within the pathway are different between the clinical feature groups of interest, which therefore may explain their biological difference in units of pathway and in context of multi-omics relation. The performance of MOPA was evaluated using multi-omics data from nine cancer types and 93 clinical features, followed by two detailed use case studies on colon and stomach adenocarcinoma cohorts. Compared to pathway based analysis tools, MOPA showed equal or improved classification performance. Distinct from other methods, MOPA is able to rank pathways by their association strength to a given target label. Most importantly, it provides mES (multi-omics Enrichment Score) and OCR (Omics Contribution Rate) metrics as output that represent the enrichment of multi-omics activity and the contribution of each omics towards the activity in each pathway. The mES and OCR metrics were developed to be used in means to enhance the comprehension of the pathway analysis result.
The mES score is a single value that integrates all three omics. The mES score is computed for each pathway in a single sample. A high mES score indicates that a certain regulatory relationship between the three omics is significantly enriched in a pathway. For example, if genes are strongly suppressed by miRNAs within a pathway, a regulatory relation between mRNA and miRNA omics is present and the pathway will have a high mES. If this was true for the omics relation between mRNA and methylation, a high mES will be computed. Hence, mES indicates the strength of any combination of omics. To supplement the mES, we defined the second metric OCR, which is able to show specifically what type of omics relations are taking part in the mES score.
Using mES and OCR, we illustrated that MOPA was able to detect pathways that were concordant with previously reported results in colon and stomach adenocarcinoma studies in context of multi-omics.
Materials and methods
Ethics statement
The multi-omics data of patients from various cohorts in the TCGA data portal was used. The type of cohorts, number of patients and the clinical features used in this study are provided in the S1 Table. All the data are level 3 data that are publicly available. The controlled access level 1 data was not used. Every patient is de-identified and encoded with TCGA sample codes. The related clinical information of patients also do not contain any private information. The results shown in this study are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. All the data satisfy the freedom-to-public criteria stated at https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/citing-tcga.
Multi-omics data set
The multi-omics data was collected in a case, or patient, matched manner. Thus, cases with matching omics of interest were selected and processed as shown in Fig 1. Here, the groups refer to some clinical features within a cohort that serve as target labels to be predicted by the multi-omics enrichment scores (mES) of pathways computed by MOPA. The details of data preprocessing are described in our previous work [8]. In brief, the multi-omics data from the same set of patients are processed in means to capture gene-regulatory relations between the different omics that significantly differ between the groups.
Multi-omics data are collected per patient in a multi-modal manner.
Each omics layer is composed of different number of features, scale and data type, where the unit of features correspond to the omics specific elements, such as genes, miRNAs or methylation probes. MOPA focuses on detecting gene-regulatory cis-relations across the multi-omics layers. Hence, it is advantageous to convert each omics data into gene-level data, which is also natural for pathway analysis where each node in a pathway network corresponds to a protein or gene. This will result in a transformed data where each omics layer are now in unit of genes and have the same dimensions (i.e., equal set of patients and genes). Due to the massive number of possible combinations of trans-regulatory elements and their targets, it is difficult to create an informative gene × sample × omics tensor, since only a small portion will be meaningful. Hence in this study, the MOPA tool focuses on cis-regulatory elements.
Such property allows us to combine the omics slices into a tensor, or cube, which can be decomposed using a non-negative tensor decomposition method to find latent features. For searching latent features, we used MONTI (Multi-Omics Non-negative Tensor decomposition method for Integrative analysis) [8], which we have recently developed. MONTI takes a multi-omics tensor as input and outputs a subset of selected features that are highly associated to a clinical feature of interest. Based on the selected features, MOPA computes a pathway enrichment score per pathway and per sample, which is finally in units of pathways. In case where target labels are not available, MOPA does not utilize MONTI but searches for clusters with distinct multi-omics patterns. The workflow of MOPA starts by preprocessing omics data to convert them into gene-level data, from which informative features are selected to compute mES and OCR. It is a sequential process of converting the units of omics into gene-level data and finally into pathway-level enrichment scores as depicted in Fig 2.
The sequential conversion of omics to gene and then to pathway level enrichment scores is shown.
MOPA was applied to nine cancer types using mRNA, miRNA and methylation data to predict the outcomes of 93 clinical features using mES. We also aimed to show that the interpretation of the result was advantageous over other methods using OCR. Thus, we selected methods that were able to score pathways in a sample wise manner since the scores reflect the omics activity in a pathway. Collectively, the performance of MOPA was evaluated by comparing the classification accuracy on clinical target outcomes to four pathway scoring methods, of which three are based on single-omics and one multi-omics data. While our interest lies within multi-omics, we included the single-omics pathway analysis tools to show the strength of multi-omics utilization. Among the multi-omics pathway tools, only MOGSA provided a scoring matrix as output. At last, case studies were performed on the colon and stomach adenocarcinoma cohorts data.
Cancer dataset.
Gene expression, miRNA expression and methylation level omics data of nine cancer types were collected from the TCGA [19]. Cases with matching omics data were used. The number of collected samples per cancer type were BRCA (768), COAD (295), HNSC (495), KIRC (317), LUAD (451), PRAD (491), STAD (335), THCA (497) and UCEC (427). Among the many clinical features, 93 categorical features were used for the association study. Since the label of clinical features were partially available, the set of samples for each clinical feature varied. Thus, a set of samples was compiled in respect to available clinical feature labels. Table 2 lists a partial set of the cancer dataset we used, showing the clinical features for each cancer type and their categorical values. The categorical values are referred to as clinical feature groups throughout the study. The detail of the full dataset is provided in the S1 Table.
Use case study dataset.
For validation and presentation of the utility of MOPA, two use case studies were performed on the molecular subtypes in colon and stomach adenocarcinoma cohorts. For each study, we showed that MOPA was able reproduce important biological results specific to the cancer type and its clinical feature groups reported in previous studies. Here, the clinical feature groups refer to attributes from the medical health records, such as cancer subtype, age, gender and cancer stage.
Pathway and annotation data.
The human KEGG [20] pathway database was used as the pathway source. Pathways are comprised of genes and MOPA performs the pathway analysis in a gene-centric manner. Hence, miRNA and methylation beta values were quantified per gene. For miRNA, the average expression of miRNAs that target a specific gene was assigned to that gene. For methylation, the average beta value of probes that are located within 2Kbp upstream of a gene’s transcription start site was assigned to that gene. For genes with no associated miRNAs or methylation probes were assigned with zero. Since pathways are composed of mainly protein coding genes, non-coding genes were excluded. The miRDB [21] was used to group miRNAs per target gene. The methylation probe annotation data from Illumina were used to group probes that are located within the 2kb upstream region of each gene’s transcription start site (TSS).
Methods
The analysis workflow of MOPA consists of three steps as shown in Fig 3. In step 1, multi-omics data are preprocessed and analyzed to detect latent features in the gene-level. If clinical target labels are available, MOPA utilizes the result of MONTI. Otherwise, MOPA performs sample clustering on the mES score to create sample cluster labels. Here, the latent features represent distinct multi-omics patterns across the clinical feature groups or sample clusters in units of genes. In step 2, MOPA computes pathway enrichment scores based on the selected features, which also allows us to measure the contribution of each omics to the pathway enrichment score. In step 3, the resulting pathway enrichment scores are subject to downstream analysis, including ranking pathways and genes by their association strength to the sample labels, constructing multi-omics based pathway networks, survival analysis and visualization of the pathway scores.
The MOPA process is carried out in three steps. Step 1 performs feature selection to remove features with weak multi-omics information. Step 2 computes the mES and OCR based on the selected features in an independent manner. Step 3 performs biological interpretation that includes the construction of group specific pathway-to-pathway networks, survival analysis, UMAP and heatmap plotting.
Step 1: Sample wise gene-level multi-omics feature selection.
Pathway scores are computed based on the latent features that are learned by the gene-level multi-omics integration method MONTI. MONTI takes a set of gene-level converted multi-omics data as input and outputs a list of latent gene features. In case when sample labels, or clinical feature labels, are available, MONTI will further select features that are highly associated to them. MONTI showed to be successful in selecting such features in various cancer types and clinical features. Otherwise, if labels are not available, MOPA skips the feature selection procedure in MONTI and proceeds the analysis in an unsupervised manner. The latent features are computed from the multi-omics tensor using the non-negative tensor factorization method PARAFAC [22], which maps the input multi-omics tensor data into a lower dimensional space. The tensor is decomposed into three feature components, each corresponding to an axis of the original input tensor (i.e., genes, samples and omics). The input to MOPA is a three-way array T, or a tensor, with genes xijk, that is factorized into three loading matrices, S, G, and O with elements sif, gjf, and okf. Here, matrices S, G, and O refer to the decomposed sample, gene, and omics components, respectively. Here, i, j, k and f refer to sample, gene, omics and rank feature indices, respectively. The tensor T is factorized using a predefined number of ranks R, which are referred to as the latent gene-level omics features. An optimal R for each dataset was predetermined based on the residual error of the decomposition and classification accuracy. If sample labels are available, MONTI selects informative features f among the R features, where f ⊂ R. Otherwise, all features are considered for further analysis, where f = R.
MOPA then assigns the features f to samples based on their values in the sample component S. A value sif indicates the association, or weight, of feature f to sample i. Here, the task is to assign features to samples with high association and discard features with weak or no association to any sample. Thus, each sample may be assigned with a different set of features. However, samples within the same clinical feature group or sample cluster are expected to share a common set of features if their multi-omics profile share some common characteristics. To define a threshold for assigning features to samples, first, for each sample i, the vector si is min-max scaled to both normalize the feature values and prevent the case where a sample is not assigned with any feature by Eq 1. Min and Max function in Eq 1 helps to make minimum and maximum value in vector si.
(1)
Then, the probability of features from the cumulative distribution function (CDF) was computed based on the scaled vector by Eq 2. Pkde function means kernel density estimation.
(2)
Now, rank features in sample i have cumulative probabilities, which is denoted as CDFi. The probability of a feature in CDFi reflects its association strength to sample i. By testing a range of CDF thresholds for selecting informative features, 0.6 showed robust results across several different datasets as shown in Fig 4. Thus, 0.6 was used as the threshold for assigning a set of features to sample i, where
. The result will be a list of samples with their assigned rank features
, as shown in Fig 3 (Step 1). The final output of Step 1 is
, which is used to compute the sample wise pathway mES and OCR metrics in Step 2.
Five samples were randomly selected for which the CDF plots were visualized. The horizontal line at 0.6 indicates the CDF threshold, that is used as the criteria for feature selection.
Step 2: Computing multi-omics pathway enrichment score and omics contribution rate.
Here, the multi-omics Enrichment Score (mES) and Omics Contribution Rate (OCR) are computed. mES reflects the multi-omics signal strength of a pathway in each sample, whereas OCR indicates how much each omics contribute to mES.
First, to compute mES, three types of input data are required: 1) the decomposed sample and gene components, S and G, 2) a Gene Matrix Transposed (GMT) file in binary format indicating the gene memberships to pathways and 3) the sample assigned features . The mES is similar to the enrichment score in GSVA, where a high score indicates that a significantly large portion of genes with membership to a pathway of interest are highly activated in terms of multi-omics, compared to non-member genes. Conventionally, the enrichment score is computed based on a list of expression level ranked genes. To extend it to embed multi-omics data, a mixture of multi-omics features in
and
are used in a single sample i. Here,
, is the z-normalized and positive scaled result of
. For each gene j, the sample feature values
is multiplied by
, which are summed for all features
as in Eq 3.
(3)
The above is computed for each gene and each sample, which is the matrix vector product
(Eq 4).
(4)
Here,
is sorted to order genes, which is the input for computing mES using the Kolmogorov–Smirnov (KS) random walk statistic. While multiple omics are used, genes are ordered by a single value
. This is because
embeds the sample specific multi-omics characteristics
as shown in Eq 3. Similar to GSVA, the objective is to measure the similarity of two distributions. Here, the two distributions are one that consists of genes with membership to a pathway, and another with genes without membership. The result will be the cumulative difference, dijt, up to the j-th ordered gene between the two distributions of pathway t in sample i as shown in Eq 5,
(5)
where p(t) is the set of genes in pathway t, I(G(l) ∈ p(t)) the indicator function that outputs 1 if the l-th gene is a member of pathway t and 0 otherwise. q refers to the number of genes in the dataset. dijt in Eq 5 is computed for each sample, gene and pathway. The gene with a high ril value starts calculation and dijt value shows the difference between genes belonging to the pathway and genes not. The smallest dijt is subtracted from the largest dijt, which is the mES of pathway t in sample i (Eq 6),
(6)
Instead of using integer ranks as in GSVA, the continuous values
are used to reflect the difference of multi-omics features.
Second, to compute the omics contribution rate OCR, two types of input data are required: 1) the S, G and O components and 2) the sample assigned features . The contribution of each omics towards the mES is used to interpret how much the pathway is influenced by which omics. Since OCR aims to reflect the activity of omics, it is advantageous to observe it per clinical feature group so that the differently activated omics across the groups can be compared instead of individual samples. Thus, OCR is computed for each clinical feature group, where features that are commonly assigned to samples within a same group are collected. Features shared by 50% of the samples in a group were collected from S′, which are expected to represent the m group’s specific multi-omics profiles and denoted as
. Experimental evaluation results are shown in the results section to show that the assumption of feature sharing tendency is observed in colon and stomach cancer cohorts. The strongest associated feature of each gene is selected from
for every gene in p(t). The omics profiles of features
are then summed to compute
, which is a vector of length L (or number of omics types), using Eq 7.
(7)
The OCR for pathway t of group m for each omics type k is obtained by
(8)
mES embeds each gene’s multi-omics profile per pathway and per sample. Thus, mES is computed without any knowledge of hidden groups in the sample dimension, such as subtypes, gender or cluster. A high mES implies that genes related to the selected features are enriched in the pathway of interest. Each selected feature also has an associated multi-omics profile. Collectively, in biological terms, a high mES implies that genes with a specific multi-omics profile are enriched within the pathway of interest. In the other hand, OCR aggregates group specific features to represent the ratio of their multi-omics profiles per group. Thus, the mES and OCR reflect different biological meanings. While mES reflects the enrichment of a certain multi-omics profile of genes within a pathway in a single sample, OCR shows the activation ratio of each omics for the enriched multi-omics profile in a sample group.
Step 3: Downstream analysis using mES and OCR for aiding biological interpretation.
MOPA provides a number of features to aid interpreting mES and OCR in respect to the biological context of interest. First, survival analysis using mES is supported to further select survival related pathways. For each pathway, the top and bottom 20% samples are labeled and output as high and low mES. Second, using Cytoscape [23], a pathway network is constructed per feature group based on the gene set relation between the pathways (left most figure in Fig 3 Step 3). Here, a node is a pathways in form of a pie chart that visualizes the OCR. The border of a node indicates the survival p-value, whereas the size of the node represent the number of genes in the pathway. A single pathway network is constructed for each sample group. Thus, the pathway networks will have the same topology but will have different forms of pie charts, which is convenient to visualize how the OCR differ between them. Third, to observe if mES represent feature group specific multi-omics characteristics, UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) [24] and heatmap plots are output. Additionally, a list of ranked genes are output per pathway. Among the genes in a pathway, the significance of their multi-omics profile difference between the clinical feature groups may differ, which can be tested by performing ANOVA analysis on , which is the rank weights of gene j in all the samples. The rj is tested whether it is significantly different between the clinical feature group samples (e.g., subtypes). Each gene is tested with Bonferroni corrected p-value. The genes are by the adjusted p-values, which may serve as source for further downstream analysis. The rank of genes implies the multi-omics activity related to the sample i associated features, which is expected to be similar across samples of the same clinical feature group.
Results
The multi-omics pathway analysis of nine cancer types
To validate the results of MOPA, the association between mES, OCR and clinical features was investigated. Three types of omics were used: gene expression (Ge), methylation level (Me) and miRNA expression (Mi). Here, we tested if the results were coherent with previous biological findings and also made effort to observe what types of clinical features were explainable using mES and OCR.
The results were compared to other pathway enrichment tools, which were GSVA, ssGSEA, z-score and MOGSA. While MOPA is able to utilize clinical label information to compute mES and OCR, the supervised feature selection procedure was skipped for fair comparison since the comparing tools are all unsupervised methods. The unsupervised version of MOPA is referred to as MOPA_UN, which was used for every performance evaluation task involving comparison to other tools. The input data for GSVA, ssGSEA and z-score was the gene expression count matrix of the same samples used by MOPA. For MOGSA, the same data used by MOPA was used as input data, which is the gene-level data.
For each cancer type, MOPA analysis was performed on each clinical feature group. A total of 93 clinical features were tested where MOPA was executed for each feature. By training a Multi-Layer Perceptron (MLP) model on the acquired mES matrix for each clinical feature target label, the 10-cross validated average F1 score of sample classification results was measured. The train and test data were split by 9:1 ratio. The pan-cancer clinical feature group classification results as radar charts in Fig 5. Also, the F1-scores were tested for significance between the methods, which are provided in S2 Table.
The F1 score of 93 different clinical features in nine cancer types was measured for each pathway scoring method.
From the result, we observed that not all clinical feature groups were well classified. Among the clinical features, subtype, mutation clusters showed high F1 scores in all the tools. MOPA showed the highest F1 scores in COAD, STAD subtypes, mutation and hypermethylation clusters. Single-omics tools also performed well on subtype and hypermutated categories using gene expression data. While MOGSA also utilizes multi-omics data, the performance on the hypermethylation category in COAD was lower than the single-omics tools. In the STAD hypermethylation case, MOGSA performed better than single-omics tools but was below MOPA. A clear advantage of using multi-omics data was observed in classifying the hypermethylation groups in COAD, STAD, LUAD and PRAD, where single-omics tools showed poor performance, which may be due to the difficulty of predicting methylation clusters through the gene expression data. Especially, gender groups were well classified in the HNSC and KIRC datasets by MOPA compared to the other tools. From the results, we learned that noise is accumulated when incorporating more omics types that may hinder learning group specific features. Thus, how to integrate and analyze the heterogeneous multi-omics type is important in terms of not to loose structural information of each individual omics and further learn relations among them. For performance evaluation, we used four different classification methods as shown in Table 3. Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Multi-layer Perceptron (MLP) classifiers were used to predict the clinical features groups in each cancer type. Overall, MOPA_UN achieved higher F1-scores in all the tasks. While gene expression data was used as input data for the single-omics methods, other single-omics data were also used for performance comparison. The results show that utilizing multi-omics data is advantageous over single-omics data (S1 Fig). There were not enough methods to compare besides MOGSA which outputs a pathway activity score per sample. To compare with a state-of-the-art method, the ActivePathways method was used. ActivePathways is a method that computes a p-value of a pathway for each omics in an individual manner. It takes the p-value of each omics entity as input instead of actual expression data, which has to be provided by the user. The choice of statistical test for computing the p-values are not constrained. Then it combines the p-values to acquire a single p-value, which is then used to measure the significance of a pathway’s activity. Thus, it does not consider the cis-relations between the omics elements and also is not able to provide weight-like pathway score. Since the output of ActivePathways is a list of pathways, we made effort to investigate if the results agreed with the pathways output by MOPA. Using the COAD dataset, ActivePathways reported seven pathways, with significant adj. p-values, as shown in Table 4. Then, the mES scores output by MOPA were tested for significance for the same seven pathways for comparison. Among the seven pathways, six agreed to be significant in both methods. The “Lysosome” pathway had the least significant adj. p-value (0.049) in ActivePathways, whose adj. p-value (0.135) was also least significant in MOPA. Collectively, the output pathways by the two methods were found to agree. ActivePathways also reports the omics that has significant influence to the pathway. However, all pathways showed to be influenced by GE (i.e., gene expression) except the “Focal adhesion” and the “Lysosome” pathways. In comparison, MOPA was able to present the influence of each omics in more detail (pie chart ratios) based on the Omics Contribution Score (OCR) metric (S2 Fig).
While the performance of MOPA is not superior, its performance was at least equal or higher than the comparing methods. Furthermore, the mES and OCR metrics are able to interpret the pathways with richer multi-omics context, which will be discussed later.
We further investigated how the classification performance was effected by different combinations of omics. For the same pan-cancer dataset, all possible combinations of omics, including single-omics, were generated and tested, which is provided in Fig 6. As expected, the multi-omics landscape differed between the cancer types and omics combinations, which showed specific combinations to perform better than others. The adj. p-values for each task in provided in the S3 Table. However, in every cancer type, utilizing all omics (i.e., mRNA, miRNA and methylation) showed robust results and the combination of at least two omics showed significantly improved results than single-omics data in all cancer types. For example, the GeMe (i.e., gene expression and methylation omics) combination in STAD showed significant F1 score improvement in the molecular subtype group in Fig 6. Also, here it can be seen that both Ge and Mi single-omics showed lower F1 scores compared to the combined GeMi dataset, indicating that gene and miRNA relations together better represent STAD subtype specific features.
For each cancer type, the F1 score was measured on the clinical feature with best classification accuracy by utilizing different sets of omics combinations. ‘ALL’ indicates the combination where all three omics were used. Overall, ‘ALL’ showed robust performance.
Experimental verification
To compute mES and OCR, a number of hyperparameters are involved, which were tested for possible fine tuning.
Parametric test of mES.
mES depends on two hyperparameters. The first is the number of features in the input data. The input to MOPA is the decomposed loading matrices output by MONTI, which requires a user provided R value that determines the number of features in the decomposed components. By observing the changing residual of the decomposed matrices using a range of different R values, an optimal R was chosen based on the elbow method for each cancer type. As shown in S3 Fig, the R differed in the COAD, STAD and LUAD cohorts. Interestingly, a large R was expected to yield better performance since more features are learned, which was not the case. R values of 160 and 200 showed good performance in the three cancer types.
The second is the set of selected features , which is a subset of features from the input data and is determined by the CDF threshold. After fixing R, a range of CDF thresholds were tested. A higher threshold causes less number of features to be selected, which will cause loss of information when computing mES. In the other hand, a low threshold will select many of features that will weaken clinical feature group specific features. A threshold of 0.6 showed the most robust results in the three cancer cohorts, as shown in the S4 Fig.
Evaluation of OCR.
OCR is computed based on the features shared among the samples within the same clinical feature group. Such criteria is based on the assumption that each group share a common multi-omics profile. To test if this is true, instead of using the shared set of group features , all the features
were used to measure similarity between the samples with jaccard similarity. As shown in Fig 7, even if we did not limit the set of features to group specific features, subtype sample groups in COAD and STAD similar features.
(A) The jaccard similarity of sample features of COAD and STAD subtypes without using labels are shown. (B) The proportion of group shared and group specific features in COAD and STAD in MOPA_UN. (C) The proportion of group shared and group specific features in COAD and STAD in MOPA using label information.
Such pattern was more evident using the group shared features. The Venn diagram in Fig 7C, shows the overlap of features between the groups, where it can be seen that the majority of features were assigned specifically to each group.
Performance test without clinical feature labels
mES and OCR are computed using the preselected features by MONTI with clinical label information. Such supervised approach limits the analysis to association studies and cannot be used to detect novel sample subgroups. Thus, MOPA was further tested if it was able to detect group specific pathways in unsupervised manner. For such purpose, the feature selection process within MONTI was skipped, so that MOPA only used the CDF threshold for filtering low informative features. The samples were clustered using the mES matrix using the K-means algorithm to detect multi-omics sample clusters. The cluster numbers were then used as labels to measure the classification performance. Similarly, the pathway scores of the comparing tools were used for clustering in Fig 8. The clustering quality was compared by the Adjusted Rand Index (ARI) using the molecular subtypes as ground truth labels since they were shown to be most well discriminated. For robust evaluation, 30% of samples of each subtype group were bootstrap sampled for 1,000 times from which the ARI was measured. As shown in Fig 9, MOPA and the unlabeled version MOPA_UN showed significantly higher ARI scores than the other tools.
The (A) COAD and (B) STAD ground truth subtype labels were well recovered.
The ARI of COAD and STAD samples clusters was measured using their molecular subtype labels as ground truth labels (color coded).
Similarly, the UMAP plot indicates that the subtype group samples were successfully identified by MOPA_UN. Meanwhile, ssGSEA also showed a high ARI score and well discrimination of subtype groups in COAD (Fig 10A). Also in STAD, subtype groups were well distinguished (Fig 10B).
(A) COAD subtype color coded UMAPs and (B) STAD subtype color coded UMAPs.
In case feature labels are available, MOPA performs feature selection in a semi-supervised manner, which involves the feature selection process of MONTI in addition to the feature filtering by CDF. However, MOPA is able to compute mES and OCR without labels in case clinical feature label information is not available. While mES does not require any label information, OCR does. Thus, by performing simple K-means clustering on the mES matrix, sample clusters can be identified, which will serve as sample group labels that are required to compute OCR. In Fig 10, it can be seen that the actual COAD and STAD sample clusters well recovered the subtype sample groups.
Use case study
From two case studies, COAD and STAD, we present a detailed analysis procedure based on mES and OCR to show how MOPA was able to reproduce previously reported biologically important results.
Use case Study 1—COAD.
From the COAD data, a total of 106 pathways with significant survival p-values were detected to be associated with COAD molecular subtypes. For the survival analysis, two group of samples were defined in terms of the mES score. The top 20% and bottom 20% samples in respect to the mES score were defined as high mES and low mES groups. As shown in Fig 11D, the most significant pathways were the ‘Salivary secretion’, ‘Complement and coagulation cascades’ and ‘Staphylococcus aureus infection’.
(A), (B) and (C) represent the OCR of the pathways, respectively. (D) The survival plots of the three pathways.
To reason why these three pathways were selected, we investigated how the high mES and low mES groups were associated with the COAD CMS1, CMS2, CMS3 and CMS4 molecular subtypes. The number of samples and percentage per subtype were counted in each high and low mES groups as show in Table 5. The sample composition of all pathways are provided in the S4 Table. Here we observed that the subtype sample composition significantly differed between the subtypes. Especially, the majority of CMS4 samples were found in the high mES group, while CMS2 and CMS3 samples were both found in the low mES groups in the three pathways. Whereas, CMS1 did not show clear preference to either group, which implies it is located between the high and low mES range. Furthermore, it is observed that the sample composition of CMS2 and CMS3 subtypes are similar. Such observation was also reported in the previous study [25], where it was shown that the CMS2 and CMS3 subtypes were difficult to disambiguate. While such result only used gene expression data, other studies showed that the two subtypes had distinctively different methylation profiles [26]. Such observation was also detected by MOPA’s OCR metric, which will be discussed later.
From the result, we observed that the survival probability of high mES group (or CMS4 samples) was significantly lower than other subtype samples, which was also reported in [27]. While examining the relationships among the pathways, we found that the pathways were related to each other through the ‘TGF-beta signaling’ pathway. The over expression of TGF-beta signaling was shown to influence the salivary gland development [28], while similar studies were also present for the other pathways [29, 30]. While the survival p-value of the ‘TGF-beta signaling’ pathway was not as significant as the others, the high and low mES sample composition was very similar to them.
We further examined the ‘Salivary secretion’ pathway to observe what genes contributed to the mES score the most. Each gene of the pathway were subject to ANOVA testing using , which reflects the mES score in units of genes. As a result, 32 genes showed to be significantly associated with one or more CMS subtypes, among which BEST2 (Bestrophin 2) was the top gene (Bonferroni p.adj = 1.58E-33). BEST2 is identified as one of the methylation markers for detecting the prognosis of colon cancer [31].
According to the OCR of the ‘TGF-beta signaling pathway’ in Fig 12, we observed that the CMS4 subtype had a distinctively different ratio of omics activation. The gene and miRNA expression significantly differed between CMS4 and the other subtypes, which caused CMS4 to have a high mES. This was also observed in three other pathways, which were the ‘Salivary secretion pathway’, ‘Staphylococcus aureus infection pathway’ and the ‘Complement and coagulation cascades pathway’. Collectively, it implies that the CMS4 subtype yields a very different multi-omics landscape in Fig 13.
The OCR and mES of the ‘TGF-beta signaling’ pathway is shown. (A) The OCR of TGF-beta signaling pathway for each subtype. (B) The average gene omics values in the pathway (left) and the mES value for each subtype (right). The statistical significance is indicated.
Here, each node represents a pathway, where the pie chart indicates the OCR and the border the survival p-value. Each edge represents how many genes are shared within each pathways (node). The OCR values according to each group were visualized from the COAD data.
To compare the complete set of pathways with significant survival p-values, a pathway network specific to each subtype was constructed using Cytoscape as shown in Figs 14 and 15. From the networks, a clear difference in OCR among the four subtypes was captured. While the gene expression profile of CMS2 and CMS3 did not show significant difference as reported in [26], the two subtypes clearly differed in terms of OCR, especially by methylation.
Here, each node represents a pathway, where the pie chart indicates the OCR and the border the survival p-value. Each edge represents how many genes are shared within each pathways(node). The OCR values according to each group were visualized from the COAD data.
(A), (B) and (C) represent the OCR of the pathways, respectively. (D) The survival plots of the three pathways.
Use case Study 2—STAD.
In the STAD molecular subtype (i.e., CIN, EBV, GS and MSI) data, 34 pathways showed significant survival p-values (Fig 15D). Similar to case study 1, two of the top three pathways showed significant inverse high and low mES sample compositions between the EBV and GS subtypes (S5 Table), which were the ‘Adipocytokine signaling’ and the ‘Systemic lupus erythematosuslineage’ pathways, which are known to be biologically related [32, 33]. The OCR of each subtype of the three pathways are shown in Fig 15. The EBV and GS subtypes were previously reported to have different survival rates [34] Interestingly, the OCR ratio of miRNA in all pathways, including the three mentioned above, was very small only in the CIN subtype. Such observation was also made in the previous [35]. As important genes in the ‘Adipocytokine signaling’ pathway, NFKB1 and STATS were found, which are known to affect the immunity of gastric cancer [36]. Since, the EBV and GS subtypes showed to be very different form each other, pathway networks of the two were constructed and visualized as shown in Fig 16.
Here, each node represents a pathway, where the pie chart indicates the OCR and the border the survival p-value. The OCR values of the STAD subtypes (A) CIN and (B) GS were visualized.
Collectively, through the two case studies, we showed that the mES and OCR metrics were useful to interpret the biological functions of the pathways in relation to survival. Both mES and OCR output a result in units of pathways. However, mES takes into account the feature values of each gene with membership to a pathway. Thus, the mES score reflects the multi-omics activity of a certain set of genes. In the other hand, OCR reflects the multi-omics activity in a more broader aspect. If the each different omics are similarly activated in a pathway, the OCR will show equal ratio of each. However, this does not imply that the same genes in the pathway have similar multi-omics activity. Instead, OCR represents the global activity of each omics in a pathway from which we can quickly see what types of omics are effecting the pathway.
Discussion
MOPA aims to provide a view that aids in interpreting a pathway in terms of omics activation. By comparing two or more conditions, it is difficult to determine how the multi-omics activity differ between them. To provide such insight, the mES and OCR metrics were developed, which reflects the activity of each omics in unit of pathway. Such approach is easier to understand compared to an output that is a list of genes, which requires further enrichment analysis. Using a nine different cancer multi-omics data sets MOPA showed equal or higher performance to competing tools. Through the two case studies, the mES and OCR metrics showed to well reflect previous biological findings of colon and gastric cancer types.
While only three omics types was used in this study, MOPA does not limit the type of and also is not tailored to cancer studies. However, due to the nature of the tensor decomposition technique, MOPA may not perform well in cases where the dataset is small since number of ranks is constrained to the minimum dimension. In case of a small sample dataset, only a few features will be available to work with, which may not be enough to embed the complex multi-omics relations. By rigorous testing with different number of ranks in S5 Fig, we found that at least a rank of 120 was required to capture biological meaningful relations across the three omics types used in this study.
As a drawback, we found that the running time of MOPA was relatively longer than the compared methods. MOPA took approximately 13 minutes when using the COAD data, which was comprised of 14,464 genes, three omics types and 234 patients. ActivePathways came second with approximately nine minutes. The other tools completed under a minute. The longer execution time in MOPA was mainly due to the tensor decomposition task and the mES score computation.
The multi-omics domain is quickly expanding to other domains such as single-cell COVID studies, we expect that the concept of MOPA can also be applied to such large and more complicated datasets. We find that an easy interpretation of high dimensional and complex omics data will be helpful for interpreting the underlying biology in a multi-perspective view.
Supporting information
S1 Table. Pan-cancer dataset.
The datasets used in this study.
https://doi.org/10.1371/journal.pone.0278272.s001
(XLSX)
S2 Table. Adjusted p-values computed for the F1-score difference between methods.
https://doi.org/10.1371/journal.pone.0278272.s002
(XLSX)
S3 Table. Adjusted p-values computed for the F1-score difference between the multi-omics combinations(Ge,Me,Mi).
https://doi.org/10.1371/journal.pone.0278272.s003
(XLSX)
S4 Table. Survival data of COAD.
According to mES, the top 20 percent and the bottom 20 percent of samples were chosen and labels as mES and and mES low respectively. They were used for the survival analysis in COAD. The p-value is calculated according to each pathway, and the lower the value, the greater the difference in people’s survival according to mES. It shows how many samples belong to the top or bottom according to each clinical data type.
https://doi.org/10.1371/journal.pone.0278272.s004
(XLSX)
S5 Table. Survival data of STAD.
According to mES, the top 20 percent and the bottom 20 percent of samples were chosen and labels as mES and and mES low respectively. They were used for the survival analysis in STAD. The p-value is calculated according to each pathway, and the lower the value, the greater the difference in people’s survival according to mES. It shows how many samples belong to the top or bottom according to each clinical data type.
https://doi.org/10.1371/journal.pone.0278272.s005
(XLSX)
S1 Fig. Comparison of F1-score between the multi-omics and single-omics data.
(A) compares multi-omics to methylation, and (B) compares multi-omics to miRNA single-omics data. (C) compares multi-omics to methylation, and (D) compares multi-omics to miRNA single-omics data.
https://doi.org/10.1371/journal.pone.0278272.s006
(DOCX)
S3 Fig. F1-scores with varying rank numbers during tensor decomposition.
Depending on the number of samples and cancer, the number of ranks affects performance. Performance was compared according to the number of ranks.
https://doi.org/10.1371/journal.pone.0278272.s008
(DOCX)
S4 Fig. F1-score according to the threshold.
When CDF threshold uses MOPA, it becomes a cut-off criterion for each sample. Performance is significantly affected when it is determined how many rank features are selected for each sample according to this threshold.
https://doi.org/10.1371/journal.pone.0278272.s009
(DOCX)
S5 Fig. Performance changes according to the number of ranks.
The cancer subtype classification accuracy of COAD, and STAD was measured using features selected by the L1 and L2 method with different ranks in MONTI.
https://doi.org/10.1371/journal.pone.0278272.s010
(DOCX)
References
- 1. Lee J, Hwang D, et al. Single-cell multiomics: technologies and data analysis methods. Experimental & Molecular Medicine. 2020;52(9):1428–1442. pmid:32929225
- 2. Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary oncology. 2015;19(1A):A68. pmid:25691825
- 3. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–108. pmid:22955620
- 4. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al. The genotype-tissue expression (GTEx) project. Nature genetics. 2013;45(6):580–585.
- 5. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nature methods. 2014;11(3):333–337. pmid:24464287
- 6. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Molecular systems biology. 2018;14(6):e8124. pmid:29925568
- 7. Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proceedings of the National Academy of Sciences. 2013;110(11):4245–4250. pmid:23431203
- 8. Jung I, Kim M, Rhee S, Lim S, Kim S. MONTI: A multi-omics non-negative tensor decomposition framework for gene-level integrative analysis. Frontiers in Genetics. 2021;12.
- 9. Louhimo R, Hautaniemi S. CNAmet: an R package for integrating copy number, methylation and expression data. Bioinformatics. 2011;27(6):887–888. pmid:21228048
- 10. Lai YP, Wang LB, Wang WA, Lai LC, Tsai MH, Lu TP, et al. iGC—an integrated analysis package of gene expression and copy number alteration. BMC bioinformatics. 2017;18(1):1–9. pmid:28088185
- 11. Cedoz PL, Prunello M, Brennan K, Gevaert O. MethylMix 2.0: an R package for identifying DNA methylation genes. Bioinformatics. 2018;34(17):3044–3046. pmid:29668835
- 12. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC bioinformatics. 2013;14(1):1–15. pmid:23323831
- 13. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. pmid:16199517
- 14. Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, Abd-Rabbo D, et al. Integrative pathway enrichment analysis of multivariate omics data. Nature communications. 2020;11(1):1–16. pmid:32024846
- 15. Canzler S, Hackermüller J. multiGSEA: A GSEA-based pathway enrichment analysis for multi-omics data. BMC bioinformatics. 2020;21(1):1–13. pmid:33287694
- 16. Meng C, Basunia A, Peters B, Gholami AM, Kuster B, Culhane AC. MOGSA: integrative single sample gene-set analysis of multiple omics data. Molecular & Cellular Proteomics. 2019;18(8):S153–S168. pmid:31243065
- 17. Yi M, Nissley DV, McCormick F, Stephens RM. ssGSEA score-based Ras dependency indexes derived from gene expression data reveal potential Ras addiction mechanisms with possible clinical implications. Scientific reports. 2020;10(1):1–16. pmid:32581224
- 18. Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC bioinformatics. 2005;6(1):1–11. pmid:16156896
- 19. Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, et al. Toward a shared vision for cancer genomic data. New England Journal of Medicine. 2016;375(12):1109–1112. pmid:27653561
- 20. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. 2000;28(1):27–30. pmid:10592173
- 21. Chen Y, Wang X. miRDB: an online database for prediction of functional microRNA targets. Nucleic acids research. 2020;48(D1):D127–D131. pmid:31504780
- 22. Bro R. PARAFAC. Tutorial and applications. Chemometrics and intelligent laboratory systems. 1997;38(2):149–171.
- 23. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research. 2003;13(11):2498–2504. pmid:14597658
- 24.
McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:180203426. 2018.
- 25. Guinney J, Dienstmann R, Wang X, De Reynies A, Schlicker A, Soneson C, et al. The consensus molecular subtypes of colorectal cancer. Nature medicine. 2015;21(11):1350–1356. pmid:26457759
- 26. van Den Berg I, Smid M, Coebergh van den Braak RR, van de Wiel MA, Van Deurzen CH, de Weerd V, et al. A panel of DNA methylation markers for the classification of consensus molecular subtypes 2 and 3 in patients with colorectal cancer. Molecular oncology. 2021;15(12):3348–3362. pmid:34510716
- 27. Thanki K, Nicholls ME, Gajjar A, Senagore AJ, Qiu S, Szabo C, et al. Consensus molecular subtypes of colorectal cancer and their clinical implications. International biological and biomedical journal. 2017;3(3):105. pmid:28825047
- 28. Hall BE, Zheng C, Swaim WD, Cho A, Nagineni CN, Eckhaus MA, et al. Conditional overexpression of TGF-β 1 disrupts mouse salivary gland development and function. Laboratory investigation. 2010;90(4):543–555. pmid:20142803
- 29. Liang JZ, Liang XL, Zhong LY, Wu CT, Zhang J, Wang Y. Comparative Proteome Identifies Complement Component 3-Mediated Immune Response as Key Difference of Colon Adenocarcinoma and Rectal Adenocarcinoma. Frontiers in oncology. 2021;10:3314. pmid:33732636
- 30. Wu J, Ding Y, Wang J, Wang F. Staphylococcus aureus induces TGF-β1 and bFGF expression through the activation of AP-1 and NF-κB transcription factors in bovine mammary epithelial cells. Microbial pathogenesis. 2018;117:276–284. pmid:29452196
- 31. Li BQ, Huang T, Liu L, Cai YD, Chou KC. Identification of colorectal cancer related genes with mRMR and shortest path in protein-protein interaction network. PloS one. 2012;7(4):e33393. pmid:22496748
- 32. Nakajima TE, Yamada Y, Hamano T, Furuta K, Gotoda T, Katai H, et al. Adipocytokine levels in gastric cancer patients: resistin and visfatin as biomarkers of gastric cancer. Journal of gastroenterology. 2009;44(7):685–690. pmid:19430715
- 33. Draborg AH, Duus K, Houen G. Epstein-Barr virus and systemic lupus erythematosus. Clinical and Developmental Immunology. 2012;2012. pmid:22811739
- 34. Sohn BH, Hwang JE, Jang HJ, Lee HS, Oh SC, Shim JJ, et al. Clinical significance of four molecular subtypes of gastric cancer identified by the cancer genome atlas project. Clinical Cancer Research. 2017;23(15):4441–4449. pmid:28747339
- 35. Bass AJ, Thorsson V, Shmulevich I, Reynolds SM, Miller M, Bernard B, et al. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513(7517):202.
- 36. O’Reilly LA, Putoczki TL, Mielke LA, Low JT, Lin A, Preaudet A, et al. Loss of NF-κB1 causes gastric cancer with aberrant inflammation and expression of immune checkpoint regulators in a STAT-1-dependent manner. Immunity. 2018;48(3):570–583. pmid:29562203