A generalized higher-order correlation analysis framework for multi-omics network inference

doi:10.1371/journal.pcbi.1011842

Fig 1.

Comparison between SmCCA, TCCA, and GTCCA.

Visualization of the comparison between Sparse multiple Canonical Correlation Analysis (SmCCA), Tensor Canonical Correlation Analysis (TCCA), and Generalized Tensor Canonical Correlation Analysis (GTCCA). SmCCA captures only pairwise correlations (e.g., between genes and proteins), and TCCA captures only the highest-order correlations (e.g., among genes, proteins, and metabolites). GTCCA considers the combination of highest-order correlations (TCCA) and lower-order correlations (SmCCA).
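The distinction between pairwise and higher-order correlation can be illustrated with a minimal sketch. It assumes one common definition of higher-order correlation (the mean elementwise product of standardized variates); the toy variables `gene`, `protein`, and `metabolite` are hypothetical and are not taken from the paper's data or code.

```python
import numpy as np

def standardize(x):
    """Center a 1-D variate and scale it to unit variance."""
    return (x - x.mean()) / x.std()

def pairwise_corr(a, b):
    """Pearson correlation between two 1-D variates."""
    return float(np.mean(standardize(a) * standardize(b)))

def three_way_corr(a, b, c):
    """Assumed 3-way correlation: mean elementwise product of three standardized variates."""
    return float(np.mean(standardize(a) * standardize(b) * standardize(c)))

rng = np.random.default_rng(0)
n = 5000
# Hypothetical toy data: gene and protein are independent,
# and metabolite is driven by their product plus a little noise.
gene = rng.standard_normal(n)
protein = rng.standard_normal(n)
metabolite = gene * protein + 0.1 * rng.standard_normal(n)

pairs = [
    pairwise_corr(gene, protein),
    pairwise_corr(gene, metabolite),
    pairwise_corr(protein, metabolite),
]
triple = three_way_corr(gene, protein, metabolite)
print("pairwise:", np.round(pairs, 3), "3-way:", round(triple, 3))
```

In this toy setting every pairwise correlation is close to zero while the 3-way term is large, which is the kind of structure a purely pairwise method would miss.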

Fig 2.

SGTCCA-Net conceptual diagram.

Conceptual diagram of Sparse Generalized Tensor Canonical Correlation Network Analysis (SGTCCA-Net). It consists of four steps: (1) the Sparse Generalized Tensor Canonical Correlation Analysis (SGTCCA) algorithm for feature selection; (2) global network construction based on the results from (1); (3) global network pruning; (4) correlation edge filtering.

Fig 3.

SGTCCA-Net workflow.

Workflow of the SGTCCA-Net pipeline for multi-omics network inference (made with BioRender: https://www.biorender.com). The workflow takes four example input data sets: transcriptomics (gene), proteomics (protein), metabolomics (metabolite), and phenotype, each with N subjects. Transcriptomics, proteomics, and metabolomics data each have their own number of features, and phenotype data has one feature. The pipeline first calculates the covariance density of each molecular feature; the algorithm then randomly selects a subset of the molecular features, with features of high covariance density more likely to be selected (Algorithm 1 in S3 Text). Based on the selected features, Generalized Tensor Canonical Correlation Analysis (GTCCA, Algorithm 2) is run to identify molecular features involved in the higher/lower-order correlations of interest. Based on the GTCCA result, an affinity matrix between molecular features is constructed to identify the interactions between selected molecular features (Algorithm 2 in S5 Text). Network pruning then filters out weaker molecular features and edges in the affinity matrix (Algorithm 2 in S5 Text).
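The density-biased subsampling step can be sketched as follows. The caption does not define covariance density, so this sketch assumes a stand-in (the sum of absolute covariances of a feature with all other features); the function names `covariance_density` and `biased_subsample` are hypothetical and do not come from the SGTCCA-Net code.

```python
import numpy as np

def covariance_density(x):
    """Assumed stand-in: sum of absolute covariances of each feature with all others."""
    cov = np.cov(x, rowvar=False)
    np.fill_diagonal(cov, 0.0)  # ignore each feature's own variance
    return np.abs(cov).sum(axis=0)

def biased_subsample(x, n_select, rng):
    """Draw feature indices without replacement, weighted by covariance density."""
    density = covariance_density(x)
    prob = density / density.sum()
    picked = rng.choice(x.shape[1], size=n_select, replace=False, p=prob)
    return np.sort(picked)

rng = np.random.default_rng(1)
n_subjects, n_features = 200, 50
x = rng.standard_normal((n_subjects, n_features))
latent = rng.standard_normal((n_subjects, 1))
x[:, :10] += latent  # hypothetical signal: the first 10 features share a latent factor

density = covariance_density(x)
selected = biased_subsample(x, n_select=15, rng=rng)
print("selected feature indices:", selected)
```

Features that co-vary with many others receive a higher sampling probability, so repeated subsamples tend to revisit the correlated block more often than the pure-noise features.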

Table 1.

Simulated multi-omics data correlation structure for cases 1–3.

Red and “*” indicate that features simulated with this latent factor are considered signal features. In addition to the latent factors and random noise shown in the table, further random noise is added to all simulated molecular features and phenotype data. The first table is for simulation case 1, where all types of phenotype-specific correlation structures are simulated and considered signal; the second table is for simulation case 2, where only 4-way phenotype-specific correlation structures are simulated and considered signal; the third table is for simulation case 3, where all 3-way and pairwise phenotype-specific correlation structures are simulated and considered signal.

Table 2.

Simulation results.

Performance is evaluated through the AUC of the precision-recall curve generated by applying different thresholds to the maximal connection of molecular features to each other. For this simulation, 20 replications are run, and the median AUC is reported with the interquartile range in parentheses. “Best” AUC for SmCCNet and DIABLO denotes that in each replication, 9 SmCCNet/DIABLO models are run and only the highest AUC score is recorded. The first table is the simulation study for setting 1, which uses latent factors simulated from a multivariate normal distribution; the second table is the simulation study for setting 2, where latent factors are simulated from a highly right-skewed distribution; the third table is the simulation study for setting 3, where latent factors are simulated from a multivariate normal distribution (same as setting 1), but strong random noise is imposed on the omics data. Each simulation study contains 3 cases: case 1 means that the signal molecular features are defined as all 4-way, 3-way, and pairwise phenotype-specific correlation structures; case 2 removes the phenotype-specific 4-way correlation structure; case 3 removes the phenotype-specific 3-way and pairwise correlation structures. In each case, the data are simulated with 100 or 200 subjects.
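The evaluation metric described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: each feature is scored by its largest absolute off-diagonal entry in a hypothetical adjacency matrix, scores are assumed to be free of ties across classes, and the area is computed with the trapezoidal rule starting from the conventional point (recall 0, precision 1).

```python
import numpy as np

def max_connection(adjacency):
    """Score each feature by its strongest absolute connection to any other feature."""
    a = np.abs(np.asarray(adjacency, dtype=float)).copy()
    np.fill_diagonal(a, 0.0)
    return a.max(axis=1)

def pr_auc(scores, is_signal):
    """Trapezoidal area under the precision-recall curve over all score thresholds."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    hits = np.asarray(is_signal, dtype=float)[order]
    tp = np.cumsum(hits)
    precision = np.concatenate(([1.0], tp / np.arange(1, len(hits) + 1)))
    recall = np.concatenate(([0.0], tp / hits.sum()))
    return float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2))

# Hypothetical 4-feature network: features 0 and 1 are strongly connected.
adjacency = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.1, 0.0],
])
scores = max_connection(adjacency)
auc_good = pr_auc(scores, [1, 1, 0, 0])  # true signals are the strongly connected pair
auc_bad = pr_auc(scores, [0, 0, 1, 1])   # true signals are the weakly connected pair
print(auc_good, auc_bad)
```

Repeating such a computation over replications and summarizing with the median and interquartile range would mirror the reporting format used in the table.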

Table 3.

Top 5 molecular features from each molecular profile and their individual correlation with respect to tumor purity for TCGA breast cancer data (with p-value).

Fig 4.

Enrichment analysis results for TCGA breast cancer data with respect to tumor purity.

(a) The top pathways that are associated with the final network. (b) Protein-protein interaction (PPI) network for the multi-omics network from SGTCCA-Net with respect to tumor purity colored by clusters. Clusters are generated based on the Molecular Complex Detection (MCODE) algorithm (Cluster 0.0 has been hidden because of the cluster size).

Fig 5.

SGTCCA-Net network with top 10 molecular features from each molecular profile.

Multi-omics network module for TCGA breast cancer data with respect to tumor purity. Nodes are genes (red), miRNAs (blue) and RPPAs (green). The edge color denotes positive correlation (red) or negative correlation (blue) between molecular features with the width denoting the strength of the connection. Edges are filtered based on Pearson correlation with a threshold of 0.2.
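The edge-filtering rule in this caption can be sketched as below. It assumes the threshold is applied to the absolute Pearson correlation (the caption does not say whether the sign is considered); `filter_edges` and the toy data are hypothetical.

```python
import numpy as np

def filter_edges(adjacency, data, threshold=0.2):
    """Zero out edges whose absolute Pearson correlation falls below the threshold.

    Returns the filtered adjacency and a sign matrix (+1 positive, -1 negative, 0 removed).
    """
    corr = np.corrcoef(data, rowvar=False)
    keep = np.abs(corr) >= threshold
    np.fill_diagonal(keep, False)  # no self-loops
    filtered = np.where(keep, adjacency, 0.0)
    sign = np.sign(corr) * keep
    return filtered, sign

rng = np.random.default_rng(2)
x = rng.standard_normal(500)
# Hypothetical features: 0 and 1 are strongly correlated, 2 is independent noise.
data = np.column_stack([x, x + 0.5 * rng.standard_normal(500), rng.standard_normal(500)])
adjacency = np.ones((3, 3)) - np.eye(3)

filtered, sign = filter_edges(adjacency, data)
print(filtered)
```

The sign matrix corresponds to the red (positive) and blue (negative) edge colors, while the retained adjacency weights would drive the edge widths.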

Table 4.

Top 5 molecular features from each molecular profile for SGTCCA-Net based on PageRank, and their individual correlation with respect to the phenotype for COPDGene data (with p-value).

Ensembl gene ID is included for genes and SomaScan sequence ID is included for proteins.

Fig 6.

Enrichment analysis results for COPDGene data with respect to the phenotype.

(a) The top pathways that are associated with the final network. (b) Protein-protein interaction (PPI) network for the multi-omics network from SGTCCA-Net with respect to the phenotype, colored by clusters. Clusters are generated based on the Molecular Complex Detection (MCODE) algorithm.

Fig 7.

SGTCCA-Net network with top 10 molecular features from each molecular profile.

Multi-omics network module for COPDGene data with respect to the phenotype. Nodes are genes (red), proteins (purple), and metabolites (black). The width and color depth of an edge denote the strength of the connection between two molecular features, and the edge color denotes whether the two nodes are positively correlated (red) or negatively correlated (blue). Edges are filtered based on Pearson correlation with a threshold of 0.2.
