Canonical correlation analysis for multi-omics: Application to cross-cohort analysis

Integrative approaches that simultaneously model multi-omics data have gained increasing popularity because they provide holistic system biology views of multiple or all components in a biological system of interest. Canonical correlation analysis (CCA) is a correlation-based integrative method designed to extract latent features shared between multiple assays by finding the linear combinations of features–referred to as canonical variables (CVs)–within each assay that achieve maximal across-assay correlation. Although widely acknowledged as a powerful approach for multi-omics data, CCA has not been systematically applied to multi-omics data in large cohort studies, which has only recently become available. Here, we adapted sparse multiple CCA (SMCCA), a widely-used derivative of CCA, to proteomics and methylomics data from the Multi-Ethnic Study of Atherosclerosis (MESA) and Jackson Heart Study (JHS). To tackle challenges encountered when applying SMCCA to MESA and JHS, our adaptations include the incorporation of the Gram-Schmidt (GS) algorithm with SMCCA to improve orthogonality among CVs, and the development of Sparse Supervised Multiple CCA (SSMCCA) to allow supervised integration analysis for more than two assays. Effective application of SMCCA to the two real datasets reveals important findings. Applying our SMCCA-GS to MESA and JHS, we identified strong associations between blood cell counts and protein abundance, suggesting that adjustment of blood cell composition should be considered in protein-based association studies. Importantly, CVs obtained from two independent cohorts also demonstrate transferability across the cohorts. For example, proteomic CVs learned from JHS, when transferred to MESA, explain similar amounts of blood cell count phenotypic variance in MESA, explaining 39.0% ~ 50.0% variation in JHS and 38.9% ~ 49.1% in MESA. Similar transferability was observed for other omics-CV-trait pairs. This suggests that biologically meaningful and cohort-agnostic variation is captured by CVs. We anticipate that applying our SMCCA-GS and SSMCCA on various cohorts would help identify cohort-agnostic biologically meaningful relationships between multi-omics data and phenotypic traits.

systematically applied to multi-omics data in large cohort studies, which has only recently become available.Here, we adapted sparse multiple CCA (SMCCA), a widely-used derivative of CCA, to proteomics and methylomics data from the Multi-Ethnic Study of Atherosclerosis (MESA) and Jackson Heart Study (JHS).To tackle challenges encountered when applying SMCCA to MESA and JHS, our adaptations include the incorporation of the Gram-Schmidt (GS) algorithm with SMCCA to improve orthogonality among CVs, and the development of Sparse Supervised Multiple CCA (SSMCCA) to allow supervised integration analysis for more than two assays.Effective application of SMCCA to the two real datasets reveals important findings.Applying our SMCCA-GS to MESA and JHS, we identified strong associations between blood cell counts and protein abundance, suggesting that adjustment of blood cell composition should be considered in protein-based association studies.Importantly, CVs obtained from two independent cohorts also demonstrate transferability across the cohorts.For example, proteomic CVs learned from JHS, when transferred to MESA, explain similar amounts of blood cell count phenotypic variance in MESA, explaining 39.0% ~50.0%variation in JHS and 38.9% ~49.1% in MESA.Similar transferability was observed for other omics-CV-trait pairs.This suggests that biologically meaningful and cohort-agnostic variation is captured by CVs.We anticipate that applying our SMCCA-GS and SSMCCA on various cohorts would help identify cohort-agnostic biologically meaningful relationships between multi-omics data and phenotypic traits.

Author summary
Comprehensive understanding of human complex traits may benefit from incorporation of molecular features from multiple biological layers such as genome, epigenome, transcriptome, proteome, and metabolome.CCA is a correlation-based method for multi-omics data which reduces the dimension of each omic assay to several orthogonal components-commonly referred to as canonical variables (CVs).The widely-used SMCCA method allows

Introduction
In recent years, there has been rapid growth in high-dimensional multi-omics datasets (including DNA methylation, RNA-sequencing, metabolomics, proteomics, genomics, microbiome, etc.).However, careful analyses with integrative methods are needed to fully utilize these rich datasets and provide mechanistic insights into health and disease related outcomes.While many methods have been published [1][2][3], few studies have evaluated these methods on largescale datasets from human samples.In addition, despite quite a few successful examples of integrating two omics data-types [4][5][6][7][8], particularly detection of quantitative trait loci using genomic data, there are much fewer such examples of integrative analyses across more than two omics data types.
One promising method for using multi-omics data to explain phenotypic variation in health outcomes is canonical correlation analysis (CCA) [9].CCA is a statistical technique to identify associations among two assays where each assay contains multiple variables.Specifically, CCA finds a linear combination of variables in each assay that leads to the maximal correlation of the two linear combinations.Principal component analysis (PCA) can be considered as a special case of CCA as the optimization objective is the same in the case that the same data is used for the two assays.CCA is a commonly adopted dimension reduction and information extraction method in genomic studies [1,[10][11][12][13] as increasingly more modern genomic studies collect data from multiple assays.
An extension of CCA by Witten & Tibshirani [1] called sparse multiple CCA (SMCCA) allows for the input of multiple assays.We hypothesized that this method would be helpful for highdimensional multi-omics data exploration and for understanding and extracting omics signatures that reflect biologically relevant variations.Specifically, we here leverage our CCA-based method extended from Witten & Tibshirani's SMCCA to extract low-dimensional latent variables from high-dimensional multi-omics data and use them to explain phenotypic traits, focusing on blood cell indices, along with basic demographic and anthropometric characteristics.We perform CCA-based analyses in two studies with rich multi-omics data in hundreds of individuals, the Multi-Ethnic Study of Atherosclerosis (MESA) and the Jackson Heart Study (JHS).

CCA pipeline
A typical CCA-based method generates orthogonal canonical variables (CVs), which are lowdimensional summaries to represent latent variables underlying the multi-assay input data.Fig 1 is a cartoon illustration where we have three assays (X, Y, and Z) for three samples.Features are assumed to be continuous with no distributional assumptions.For presentation brevity, we only show how we obtain the top 4 CVs.For each assay, CCA infers 4 vectors of weights (e.g., W X1 , W X2 , W X3 , and W X4 for assay X), which leads to four CVs.For example, CV X1 , the top CV for assay X, is obtained by X×W X1 .The weights are inferred by maximizing the correlation of CVs across three assays.Note that in the rightmost CV matrices, each column of a CV matrix is one CV of the corresponding assay.In addition, CVs corresponding to the same column cross assays are expected to have maximal correlation (for instance, CV X1 , CV Y1 , CV Z1 ,are most correlated), while CVs in different columns are expected to be orthogonal or independent from each other in the same assay.

Modified gram-schmidt algorithm improves orthogonality
SMCCA implemented in the PMA R package does not always provide the expected orthogonal CVs, preventing effective extraction of independent CVs and sometimes causing serious multicollinearity issues in subsequent association analysis.For example, Fig 2A and 2B shows results from PMA's implementation of unsupervised SMCCA when applied to MESA proteomics and methylomics data (detailed in Methods) where we observe extensive correlation among the CVs.In the presence of undesired correlated CVs, users will have to perform a secondary filtering step to generate a list of non-redundant CVs, or else variation in omics data captured by the later CVs may overlap with variance captured by former CVs.Therefore, we sought to improve orthogonality among generated CVs for capturing distinct information from the integrated multi-omics data.Specifically, we follow the Gram-Schmidt (GS) strategy [14] which generates CVs sequentially by progressively subtracting the previous CV from the input matrices (detailed in Methods).Fig 2C and 2D shows substantially improved orthogonality among the CVs when applied to the same MESA proteomics and methylomics data.Similar patterns were observed when SMCCA was applied to JHS data (S1 Fig).

Proteomics CVs explain considerable amounts of variation in blood cell counts
We also applied our implementation to proteomics and methylomics data in JHS.As these unsupervised CVs are anticipated to capture shared latent variables underlying the proteomics and methylation datasets, we hypothesized that the CVs may explain a non-negligible amount of variation in various phenotypes.Our primary phenotypes of interest in this work are blood Cartoon illustration of a typical CCA-based method for three assays.X, Y, and Z are three assays with 4, 5, and 6 features respectively.When applying a CCA-based method on them to compute 4 canonical variables (CVs), we would first get their weight matrices W X , W Y , W Z , each of which contains 4 weight vectors.By multiplying each assay matrix (left panel) and its corresponding weight matrix (middle panel), we obtain the CV matrix for the assay (right panel) where each column corresponds to one CV.https://doi.org/10.1371/journal.pgen.1010517.g001cell traits, including white blood cell count (WBC), red blood cell count (RBC) and platelet count (PLT).We also considered age, sex, and body mass index (BMI), as "control" phenotypes which have been widely reported to explain considerable variability in proteomics and methylomics data.For each of the six outcome phenotypes, we fit regression models to estimate the percent of variation explained by the top 50 CVs from each of the two omics data, namely proteomics and methylomics (detailed in Methods).For each cohort (MESA or JHS), we had two sets of CVs, one derived from the cohort's own omics data, the other derived from applying the CV weights inferred from the other cohort.
We found that top CVs, from each of the two omics data, explain considerable amounts of variation in almost all of the outcomes evaluated (Fig 3).For example, top 50 methylomics CVs inferred in JHS explained 72%, 100%, 35%, 37%, 34%, 30% of variation in age, sex, BMI, WBC, RBC, and PLT respectively, in JHS (Fig 3A).We also observe high transferability between MESA and JHS, by first applying SMCCA-GS separately to each cohort and then transferring the inferred CVs to the other cohort.For example, the top 50 methylomics CVs inferred in MESA explained similar amounts of variation in RBC: 33% in MESA (itself) (Fig 3C) and 30% when applied to JHS (Fig 3D).Such high transferability suggests that latent variables learned by CCA might reflect biological processes shared across cohorts.We also note that these r 2 's from methylomics data were most likely under-estimated because the CVs were constructed using the top 10,000 most variable CpG sites (see Methods) instead of the entire ~700,000 sites, for computational reasons.These findings are not surprising: for instance, blood cell composition (notably for white blood cell subtypes) has been long known to influence the methylome.For that reason, in epigenome-wide association studies (EWAS), it has been standard practice to first estimate the leukocyte proportions from methylomics data and adjust for these cell type proportions in subsequent association analysis [15].Given shared precursors for all hematological cell types, we found it relatively unsurprising that RBC and PLT also had a high percent variation explained by methylomics CVs.Similarly, age [16], sex [17,18] and BMI [19] have been known to explain substantial variability in methylomics data, and are commonly adjusted for as covariates.
More interestingly, the amounts of variation in various outcomes explained by top 50 proteomics CVs are even higher, ranging 39% -100% in JHS and 39% -100% in MESA.Large r 2 for age, sex, and BMI are expected since all have been reported to rather broadly affect protein profiles [20,21].However, strikingly, r 2 for blood cell traits are also considerable, and comparable to BMI, 50%, 45%, 39% respectively for WBC, RBC and PLT in JHS using CVs inferred in JHS.Confirming these results, when applying CV weights inferred from MESA to JHS, we obtained similar r 2 's: 44%, 47%, 39% for WBC, RBC and PLT respectively.Similar patterns were also observed in MESA using both MESA and JHS derived weights.These considerable amounts of variations in blood cell counts explained by top proteomics CVs have important implications for association studies involving proteomics data: we should consider adjusting for blood cell proportions in these association studies, under the same rationale in EWAS (variability driven by blood cell subtype abundance is likely not of interest for many disease outcomes of interest whose association with proteomics data is being examined).

CVs vs Principal Components (PCs)
Although CVs are inferred jointly from multi-omics data, we have focused on analyzing CVs from each omics data type separately for their predictive power of outcomes of interest.Thus, we naturally are interested in comparing the CCA-based approach with the standard PCA approach since we can obtain PCs separately for each omics data.Note first that we expect larger and more assay-specific batch effects in JHS than MESA.For example, JHS proteomics data was generated in 3 batches [22], and separately from the methylomics data.In contrast, MESA proteomics and methylomics data were all generated through the MESA TOPMed pilot over a short time period [23,24].Results shown in Fig 4 supported our expectations: overall we observe that a lower number of JHS inferred CVs are needed to explain the outcomes with higher r 2 compared to JHS inferred PCs, indicating that top CVs inferred from JHS data tend to capture biological variations while top PCs tend to reflect more assay-specific technical variations.We note that this is supported by the stronger association for CVs vs PCs with technical factors (S6 and S7 Figs), notably for proteomics data which has been subjected to less pre-processing to account for technical effects related to batch/plate (prior to any of the analyses conducted here).The contrasts are most pronounced with age and WBC for proteomics data, and with age for methylomics data.For example, in JHS, proteomics-CV1 explained 33% variation in WBC (blue "+" on the leftmost side of Similar advantages of CVs over PCs were observed in MESA, but were less pronounced as expected due to the smaller and less assay-specific batch effects in MESA.Reassuringly, applying JHS inferred CV weights to MESA showed advantages similar to those in JHS, more pronounced than using CVs inferred in MESA itself, further demonstrating the power of CVs to capture biologically relevant variations under the presence of assay-specific batch effects.

Supervised sparse multiple CCA
Extending supervised sparse CCA to supervised sparse Multiple CCA.So far, we have generated and evaluated unsupervised CCA where the CVs are inferred from multi-omics data only, without considering any outcomes of interest.Although we assessed the relationship between unsupervised CVs and several outcomes of interest, the CVs themselves were inferred without knowledge of the outcomes.In practice, when we are primarily interested in a particular outcome, supervised approaches can be more effective and powerful.The PMA R package implements a sparse supervised CCA (SSCCA) method.However, this implementation only accepts two omics data at a time, which limits our capabilities in real datasets where there are more than two assays.For instance, in both MESA and JHS, we also have whole genome sequencing (WGS) data [25].We implemented a sparse supervised multiple CCA (SSMCCA) method to accommodate more than two assays of omics data.Our implementation follows the idea in Witten et al., (2009) [1] where a feature selection step is performed within each assay to retain (by default) top ~80% features most correlated with the outcome of interest.Features selected from each assay form new input matrices to which we then apply our implementation of unsupervised SMCCA with the adapted Gram-Schmidt algorithm.
To ensure our SSMCCA implementation generates sensible supervised CVs, we first compared results from PMA's SSCCA implementation, when there are two assays of data.Specifically, we compared correlations between inferred supervised CV1 and the corresponding outcomes of interest.We compared SSCCA and our SSMCCA by running two methods with 100 different random seeds and for each seed, testing the variation of each outcome explained by supervised proteomics CVs and supervised methylomics CVs (Fig 5).We found that in most cases, the amount of variation in outcomes captured by SSMCCA CVs is comparable or significantly higher than SSCCA, indicated by large red circles.For example, Fig 5A third row third column (red "×" on Fig 5A ) shows a large red circle which annotates a case where our SSMCCA outperforms the original SSCCA.In this example, SSMCCA proteomics CV1 explains 4.17% variation in PLT in MESA, while SSCCA 3.48% (p-value = 8E-9 for difference).The larger the difference, the darker the color.In a few cases, the amount of variation captured by SSMCCA CV1 is significantly smaller than SSCCA CV1.For example, Fig 5B row 2 column 1 shows a large blue circle (light blue "+" on Fig 5B ) which indicates a case where the original SSCCA outperforms our SSMCCA.However, although the difference in terms of percent variation explained in RBC by SSCCA vs SSMCCA methylomics CV1 is highly significant (p-value = 3E-28), the absolute difference (4.27E-8 percent variance explained) is tiny, suggesting the difference between the performance of two methods is negligible.

Biologically meaningful features detected by SSMCCA
We applied SSMCCA to three assays-proteomics, methylomics, and genotypes-from MESA to obtain 50 CVs for each assay, and then used standard regression models to assess associations with phenotypes-age, BMI, WBC, RBC, and PLT.CV-phenotype pairs were considered to be significantly associated when p-value < 1E-4 (Bonferroni correction), adjusting for covariates detailed in S2 Table .In MESA, we identified 58 significant CV-phenotype pairs, and 5 of them were validated in JHS with the same p-value threshold of 8.62E-4 and same direction of effect (S3 Table ).For example, WBC and proteomics CV3 were strongly associated in both cohorts (p-value = 2.7E-15 in MESA, 6.8E-16 in JHS, S3 Table ).Features with high absolute weight coefficients in this CV (S5 Table) are biologically relevant for WBC.For example, stem cell factor soluble receptor, which has the highest weight, is known to play a key role in hematopoiesis [26].Lipocalin 2, with the second highest weight, has been reported to be associated with human neutrophil granules [27].
For each phenotype, we then assembled all features from each assay (i.e., both methylomics and proteomics) with non-zero weight for phenotype-associated CVs in S3 and S4 Tables, and annotated each feature to a gene, on which we performed pathway enrichment analysis (described in Methods).For comparison, we also performed the same pathway enrichment analysis using features individually associated with each phenotype, where association is declared when FDR < 5% for each assay-phenotype-cohort combination.Comparing these two sets of pathway enrichment results, we found several pathways only revealed (p.adjust < 0.05) by our SSMCCA, including the growth factor binding gene ontology (GO) [28,29] term and the DisGeNET [30] progressive chronic graft-versus-host disease (GVHD) and polypoidal choroidal vasculopathy genesets.All of these pathways have been reported to be related to BMI in previous literature [31][32][33][34].Assigning CpG sites to genes is a challenging task.We adopted the simple nearest gene approach.Other reasonable approaches include promoter-centric assignment [35], leveraging differentially methylated regions [36], or using expression quantitative trait methylation (eQTM) [37] information.We explored the eQTM approach as we have both methylation and gene expression measurements in a subset of samples in JHS and MESA.However, due to limited number of CpGs included in significant CVs, we had only 25-257 genes (with the number of genes implicated by CpGs varying across different outcomes, detailed in S6 Table) based on significant eQTMs (Methods) for pathway enrichment analysis (results summarized in S7 and S8 Tables).We anticipate to benefit more from this approach when eQTM sample size increases.

Discussion
Large quantities of data across multiple omics (transcriptomics, proteomics, metabolomics, genomics, methylomics, etc.) modalities is currently being generated, for example through efforts funded by NIH's Precision Medicine Initiative [25,35] as well as other large federally funded studies [36].These high dimensional and complex multi-assay data are unfortunately still too often analyzed only separately (e.g., applying PCA separately to genotype, gene expression, or methylation data) or in a pairwise manner (for example mQTL analysis examining relationships between genome and methylome, or pQTL analysis examining relationships between genome and proteome).Many innovative methods have been proposed (https:// github.com/mikelove/awesome-multi-omics[accessed on 2022-07-25]) for integrative analysis but evaluations in large-scale real omics data are still lacking, with fewer impartial appraisals available to guide method choice in practice.
In the work presented here, we apply CCA-based methods to complex multi-omics datasets to assess their capabilities and limitations.In particular, for the widely used PMA implementation of the SMCCA methods, we identified two limitations: non-orthogonal CVs and inability to accommodate more than two assays for supervised analysis.We provide method extensions, SMCCA-GS and SSMCCA, to address the two limitations.Applying SMCCA-GS to real data in MESA and JHS, we found that CVs are consistent and transferable across cohorts, suggesting that CVs capture constitutive biological relationships shared across cohorts, and are not driven primarily by assay-specific technical variation.This cross-cohort consistency, to our knowledge, has not been well explored in the literature and has important implications for making method choices (e.g., CCA vs PCA) for multi-omics data with or without extensive assay-specific batch effects.
Importantly, our CCA-based analyses reveal that blood cell indices are substantially associated with multiple omics assays including methylomics and proteomics.The former association has been widely appreciated and exerted paradigm-shifting impact on analysis: in methylation association studies, white blood cell composition is adjusted for in methylation analyses in standard practice.The latter association, where CVs from proteomics data showed even more pronounced association with blood cell indices, has been under-appreciated, with blood cell traits not considered in most current proteomic analyses [22,[37][38][39].Our findings indicate that blood cell composition should be accounted for (or at least considered) in protein association studies where feasible, similar to what is standard practice for methylation studies.Each row corresponds to a phenotype (from bottom to top, Age, BMI, WBC, RBC, and PLT).Circle size reflects the significance of the difference in variation explained between two methods.Color reflects the size of difference between the variation of phenotype explained by SSCCA and our SSMCCA.Therefore, a larger circle means a more significant difference between the two methods.Note that we use rectangles for insignificant difference with p > 0.01.Red means that our SSMCCA explains more phenotypic variation while blue means that SSCCA explains more.The darker the color, the larger the difference (the scale is different for parts A and B, annotated in "diff" column on side of figure).https://doi.org/10.1371/journal.pgen.1010517.g005 As demonstrated in Fig 4 , our SMCCA-GS is in some cases more useful than PCA in explaining variability in phenotypes, using an identical number of PCs/CVs.However, there are also many cases where the methods are nearly equivalent.We hypothesize that our SMCCA-GS demonstrates more consistent advantages in explaining trait variability in JHS versus MESA due to the presence of more substantial JHS batch effects.Due to funding limitations, JHS proteomics and metabolomics data was generated in multiple batches across several years, while the MESA data used here was generated concurrently, funded by NHLBI's TOPMed program.Thus, for proteomics in particular, more batch effects are anticipated in JHS; our SMCCA-GS is particularly advantageous in cases where there is increased assay-specific technical variation.
In multi-omics data, it is commonplace to have drastic differences in the dimension of different omics data.For example, methylomics data, when generated by the widely used Illumina MethylationEPIC BeadChip array, contains almost 10 6 features; transcriptomics data are commonly summarized into ~10 4 expressed genes; and metabolomics and proteomics typically even smaller: only ~10 2 −10 3 features depending on the platforms used.Witten et al. (2009) [1], introducing the SMCCA method, analyzed data with 19,672 gene expression measurements and 2,149 comparative genomic hybridization measurements, showing that their method could accommodate such imbalance.Our methods, derived from SMCCA, are also expected to accommodate omics dimension imbalance.In our analyses, results using ~700K CpG sites, while computationally challenging to fit repeatedly, led to similar conclusions as using top 10,000 CpG sites (detailed in Methods and S4 and S5 Figs), further suggesting the robustness of sparse CCA methods to imbalance in omics dimension.
We note that CCA-based methods as implemented in our analyses still have several key limitations.Notably, we had to considerably reduce the dimensionality of methylation array and sequencing data in order for our CCA-based method to be computationally feasible (at least for the repeated analyses necessary for methods development and testing).While we were able to fit models for the entire set of CpG sites a single time, with similar overall results in terms of phenotype variance explained (S4 and S5 Figs), our SMCCA-GS approach will require further innovation to be scalable for large-scale datasets.Recently developed methods allow for efficient calculation of generalized CCA solutions across reduced dimensions of each distinct assay, which alleviates some of the computational issues that arise, though sparse identification of individual omics features from the original assay data may still be desired [40].

Cohorts
Ethics statement.All participants included in this analysis provided written, informed consent for use of genetic and multi-omics data, and all study protocols conform to the 1975 Declaration of Helsinki guidelines.The Jackson Heart Study (JHS) and Multi-Ethnic Study of Atherosclerosis (MESA) studies were approved by the Institutional Review Boards of all participating institutions.
JHS. JHS recruited 5,306 African American participants from the Jackson, Mississippi, metropolitan tri-county area (Hinds, Madison, and Rankin) into a prospective, community-based cohort designed to investigate risk factors for cardiovascular disease among African Americans [41][42][43].Demographics of JHS individuals involved in the analysis are displayed in S1A Table.
Multi-omics data utilized in JHS analyses including methylomics (n = 1,750, Illumina Methyla-tionEPIC BeadChip array) [44] and proteomics (n = 2,144, SOMAscan 1.3k array) [22], both from the baseline visit, and whole genome sequencing (WGS) data as described below.Methylation levels are quantified by beta values [45].Traits examined include age, sex, BMI, and hematological traits (WBC, RBC, and PLT).We limited our analyses in JHS to individuals with complete data for proteomics, methylomics, and traits examined (total n = 881, S2A Fig) .MESA.The MESA study was initiated in July 2000 to investigate the prevalence, correlates, and progression of subclinical cardiovascular disease (CVD) in a population-based sample of 6,814 men and women aged 45-84 years.The cohort was selected from six US field centers.Based on self-reported race/ethnicity, approximately 38% of the cohort are White, 28% African American, 23% Hispanic, and 11% Chinese American.More demographic information of MESA individuals involved in the analysis is in S1B Table.
Longitudinal multi-omics data was generated in MESA through a pilot program from NHLBI's Trans-Omics for Precision Medicine Initiative (TOPMed) at exam 1 (2000-2002) and exam 5 (2010-2011), including ~1,000 participants for each exam with methylomics data (Illumina MethylationEPIC BeadChip array) [45] and proteomics (SOMAscan 1.3k array) [22,23].Methylation levels are quantified by beta values [45].WGS data are described below.Basic covariates examined include age, sex, BMI, recruitment site, self-reported race/ethnicity, and the same hematological traits as in JHS.We limited our analyses in MESA to individuals with complete data for proteomics, methylomics, and phenotypes examined (total n = 777, S2B Fig) .Use of the same platforms for multi-omics assessment as in JHS allowed comparison analyses for CVs derived by SMCCA-GS or SSMCCA across cohorts.

Whole Genome Sequencing (WGS) data
Genotypes are derived from TOPMed WGS data (freeze 8).Data harmonization, variant discovery, and genotype calling were previously described [25,46].In our analysis, to reduce data dimensionality, we first extracted SNPs associated with blood cell traits from Chen et al. (2020) [47] and highly correlated (linkage disequilibrium r 2 > 0.8 where r 2 is the in-sample squared Pearson correlation between the corresponding genotype vectors) variants were removed, resulting in 3,789 SNPs for JHS and 3,562 SNPs for MESA in our supervised CCA analysis.Genotypes are coded into numerical values 0, 1, and 2 for our analysis.Population principal components calculated by PC-AiR [48] were adjusted for as covariates.In addition, for WBC, we additionally adjusted for the Duffy null polymorphism (SNP rs2814778 at chromosome 1q23.2) [49].

Transcriptomics
We involve transcriptomics data in eQTM analysis to map our selected 10k CpG sites to genes for pathway enrichment analysis, but we do not include transcriptomics in our multi-omics integration analysis because a considerable number of individuals could not be included in the analysis if we incorporate transcriptomics (S2 Fig) .For both JHS and MESA, RNA-seq was measured from peripheral blood mononuclear cells and normalized to transcript per million (by Northwest Genomics Center for MESA, as previously described [50], NWGC for JHS using similar pipelines).

Initial quality control (QC) and transformation of multi-omics data
In both cohorts, we applied QC on each assay including sample outlier removal and feature filtering.For each protein in the proteomics data, we first applied log transformation, followed by inverse normal transformation.After QC, we had 1,317 proteins measured in both cohorts, which made validation across cohorts straightforward.
Methylomics of JHS [44] was normalized using the "noob" normalization method implemented in minfi R package [58,59].We further removed batch, plate, row, and column effects using the ChAMP R package [51].For MESA methylation data, which had already been subjected to functional normalization to reduce batch effects [52], we excluded samples with (1) call rate < 95%; (2) sex mismatches; and (3) concordance between SNP probes and genotypes < 0.8.Methylation levels were marked as missing when the detection p-value was > 0.01, and we imputed these missing values using ChAMP R package [51], as our CCAlike methods cannot accommodate missing data.For both JHS and MESA, CpG sites whose probes overlap any SNP with minor allele frequency (MAF) > 1% were also excluded [53].After QC, we had 754,767 and 741,727 CpG sites for MESA and JHS respectively.For building validation across cohorts, we only kept the 721,334 CpG sites which passed QC in both cohorts.
Finally, we only kept samples with complete data including proteomics, methylomics, and phenotypes (age, BMI, WBC, RBC, PLT, site, race, sex for MESA; age, BMI, WBC, RBC, PLT, sex for JHS), which led to 881 samples for JHS and 777 samples for MESA.We For each assay, we removed the sex chromosome related proteins and CpG sites.We further removed features that are highly correlated [54], at a squared Pearson correlation 0.8 threshold.We adopted a greedy algorithm (Algorithm 1 below) to achieve the dual goal of no highly correlated pairs among a maximal number of features retained.For methylomics, we calculated Pearson correlation using the Python package Deep Graph [54] and after removing highly correlated, further kept 10k CpG sites with the highest variance for the computational efficiency.Our CCA-based methods are computationally intensive.For example, even with these 10k CpG sites (~1.3% of all available CpG sites), on a single core of E5-2680 v3 @ 2.50GHz, the wall time of calculating 50 CVs with our SMCCA-GS on proteomics and methylomics is about 8 ~14 hours for MESA (774 samples) and about 8 ~20 hours for JHS (877 samples); with 20k CpG, the wall time is about 14 ~36 hours for MESA and 16 ~47 hours for JHS.For validating our variance-based feature selection strategy, we also performed the same analysis as

___________________________________________________________________________
Input: Any assay X =(x 1 , x 2 ,� � �,x p ) ⊳ Assay X with p features Initiation: S ¼ fðx i ; x j Þj1 � i; j � p; i 6 ¼ j; corr 2 ðx i ; x j Þ � 0:8g ⊳ Each element of S is a pair of features from X whose correlation square is no less than 0.8 Definition: Features in each pair (x i , x j ) in S are viewed as "neighbors".while S 6 ¼ ; do dict {} ⊳ Initiate an empty dictionary storing number of neighbors of each feature x i for x i (x 1 ,� � �,x p ) do ⊳ Loop each feature x i of assay X Count neighbors of x i dict {x i : number of neighbors} end for Identify x j 2 dict with minimal number of neighbors.Remove any (x j ,�) from S. Remove x j from X. end while Output: X ⊳ After removing features above, features remaining in X are in low correlations while maximizing the feature size.

Association analysis between outcomes and CVs/PCs
To quantify the relationship between outcomes and CVs or PCs, we used regression models.Specifically, for continuous outcomes (including age, BMI, WBC, RBC, PLT), we estimated the proportion of variation in outcome that can be explained by CVs or PCs using linear regression models, implemented with the R function "lm", with covariate adjustments outlined in S2 Table .For the binary outcome sex, we employed logistic regression using the "glm" R function and calculated McFadden's pseudo-R-squared using the "PseudoR2" function from the DescTools [55] R package.

Modified gram-schmidt strategy
With PMA implementation, we observed that with our real data where features have complex correlation structure, the weight vectors are sometimes correlated.To mitigate this nonorthogonality issue, we adopt a strategy inspired by Woojoo et al., (2011) [14].In our implementation, we infer CVs sequentially and remove the effects of the former CVs from the input matrix before calculating weights for the next CVs.In particular, we first follow the PMA approach to generate weights for CV1's of all assays, update input matrices following Eq (1) as the new inputs for calculating weights for CV2's, and sequentially update until we obtain prespecified numbers of CVs.Eq (1) and Eq (2) show the procedures for inferring the (j+1)'s CVs with input matrices {X ij } i = 1,� � �,S .
Specifically, {X ij } i = 1,� � �,S are original input matrices for assay i = 1,� � �,S where S is the total number of assays, from which W i1 's and CV i1 's, the first weights and CV1's, are inferred by SMCCA implemented in the PMA R package.

eQTM analysis
We used expression quantitative trait methylation (eQTM) results to alternatively map CpG sites to genes (instead of simply mapping to nearest genes).We used transcriptomic and methylomic measurements for 650 and 496 samples from MESA and JHS, respectively, to perform eQTM analysis for the 316 CpG sites contributing to CVs significantly associated with outcomes.We employed the MatrixEQTL R [56] package to assess the association of each CpG site with its nearby genes in the +/-1Mb neighborhood, while adjusting for age, and sex separately in MESA and JHS.For the multi-ethnic MESA samples, we additionally adjusted for self-reported race/ethnicity and recruitment site.We then conducted meta-analysis using METAL [57], and used a Bonferroni threshold to define significance, identifying 515 significant CpG-gene pairs.Our eQTM analysis successfully mapped 44-112 CpG sites, for each CV significantly associated with outcome, to 25-257 genes (detailed in S6 Table ), based on which we further performed pathway enrichment analysis, following the same process detailed in the section below.

Pathway enrichment analysis
For each CCA-prioritized feature of each assay, we first mapped them to genes, and then performed pathway enrichment analysis on these genes utilizing three databases-DisGeNET [30,58] (enrichDGN function in DOSE R package, with default settings), Gene Ontology (GO) [28,29,59,60] (enrichGO function in clusterProfiler R package, with default settings) and Kyoto Encyclopedia of Genes and Genomes (KEGG) [59][60][61][62][63] (enrichKEGG function in clus-terProfiler R package, with default settings).For methylomics, we explored two methods for mapping CpG sites to genes: (1) mapping them to the nearest genes using annotations provided by Illumina, and (2) mapping CpG sites to genes with significant signals identified in the eQTM analysis presented above.For proteomics, we mapped proteins to genes using annotations released by SomaScan.For background genes in the enrichment analysis, we included genes annotated from features that are associated with outcome, identified in the feature selection step of our SSMCCA (detailed in Section "Extending Supervised Sparse CCA to Supervised Sparse Multiple CCA" above).

Fig 1 .
Fig 1.Cartoon illustration of a typical CCA-based method for three assays.X, Y, and Z are three assays with 4, 5, and 6 features respectively.When applying a CCA-based method on them to compute 4 canonical variables (CVs), we would first get their weight matrices W X , W Y , W Z , each of which contains 4 weight vectors.By multiplying each assay matrix (left panel) and its corresponding weight matrix (middle panel), we obtain the CV matrix for the assay (right panel) where each column corresponds to one CV.

Fig 2 .
Fig 2. Improved orthogonality among CVs by adopting the Gram-Schmidt (GS) strategy.CVs are inferred from MESA proteomics and methylomics data using unsupervised SMCCA.Each row and column represent one CV, ranging from CV1 to CV50.(A-B) Results from the PMA R package, implementation of the original SMCCA methods without the incorporation of GS algorithm.(C-D) Results from our SMCCA-GS, with the GS strategy incorporated.Left panel (A and C) show proteomics CVs, and right panel (B and D) methylomics CVs.https://doi.org/10.1371/journal.pgen.1010517.g002

Fig 3 .
Fig 3. Proportion of variation in outcomes explained by CVs.(A) CVs were inferred using proteomics and methylomics in JHS.The top 50 CVs were used to calculate the r 2 (Y-axis) for each outcome (X-axis).(B) We obtained CVs in JHS by applying the weights inferred from MESA, and then calculated r 2 in the same way as in A. (C) CVs were inferred using proteomics and methylomics in MESA.(D) CVs were obtained in MESA by applying the weights inferred from JHS. https://doi.org/10.1371/journal.pgen.1010517.g003 Fig 4A3) while proteomics-PC1 only explained 7.7% (purple "+" on the leftmost side of Fig 4A3).This noticeable advantage continued until ~20 CVs/PCs.For instance, the top 15 proteomics-CVs in JHS explained 44% variation in WBC (blue "×" in Fig 4A3) while top 15 proteomics-PCs only explained 29% (purple "×" in Fig 4A3).

Fig 4 .
Fig 4. Comparison of r 2 , PCs vs CVs.Each column corresponds to one outcome.Within each panel, top row (JHS) shows results in JHS using JHS-inferred CVs.Second row (JHS->MESA) shows results in MESA, also using JHS-inferred weights.Third row (MESA) shows results in MESA, this time using MESA-inferred CVs.Last row (MESA->JHS) shows results in JHS, also using MESA-inferred weights.(A) Proteomics.Proteomics CVs explain more variation in white blood cell count (WBC) than PCs.For example, proteomics-CV1 explains 33% of the variation in WBC (blue "+" in Fig 4A3), while proteomics-PC1 only explains 7.7% (purple "+" in Fig 4A3).This pattern persists until approximately 20 CVs/PCs.The top 15 proteomics-CVs in JHS explain 44% of the variation in WBC (blue "×"

Fig 5 .
Fig 5. Comparison of SSCCA and SSMCCA.(A) proteomics, and (B) methylomics.Each row corresponds to a phenotype (from bottom to top, Age, BMI, WBC, RBC, and PLT).Circle size reflects the significance of the difference in variation explained between two methods.Color reflects the size of difference between the variation of phenotype explained by SSCCA and our SSMCCA.Therefore, a larger circle means a more significant difference between the two methods.Note that we use rectangles for insignificant difference with p > 0.01.Red means that our SSMCCA explains more phenotypic variation while blue means that SSCCA explains more.The darker the color, the larger the difference (the scale is different for parts A and B, annotated in "diff" column on side of figure).
further identified sample outliers by PCA-IQR plot (Section 2 in S1 Text and S3 Fig).Four outliers in JHSone sample with the largest proteomics IQR (wedge pointed on S3A Fig) and three samples with largest methylomics IQR (wedges pointed on S3B Fig)-were removed; and three outliers in MESA were removed-all three with largest methylomics IQR (wedges pointed on S3D Fig).

Figs 3 and 4 ___________________________________________________________________________Algorithm 1 .
on proteomics and all ~700k CpG sites.The results (S4 and S5 Figs)show similar patterns as those from top 10k CpG sites (Figs3 and 4).Remove Highly Correlated Features within Each Assay.