Skip to main content
Advertisement
  • Loading metrics

Canonical correlation analysis for multi-omics: Application to cross-cohort analysis

  • Min-Zhi Jiang,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Applied Physical Sciences, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

  • François Aguet,

    Roles Writing – review & editing

    Affiliation Illumina Artificial Intelligence Laboratory, Illumina, Inc., San Diego, California, United States of America

  • Kristin Ardlie,

    Roles Writing – review & editing

    Affiliation The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America

  • Jiawen Chen,

    Roles Visualization, Writing – review & editing

    Affiliation Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

  • Elaine Cornell,

    Roles Writing – review & editing

    Affiliation Laboratory for Clinical Biochemistry Research, University of Vermont, Burlington, Vermont, United States of America

  • Dan Cruz,

    Roles Writing – review & editing

    Affiliation Department of Medicine, Cardiology, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America

  • Peter Durda,

    Roles Writing – review & editing

    Affiliation Department of Pathology & Laboratory Medicine, University of Vermont, Colchester, Vermont, United States of America

  • Stacey B. Gabriel,

    Roles Writing – review & editing

    Affiliation The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America

  • Robert E. Gerszten,

    Roles Writing – review & editing

    Affiliation Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America

  • Xiuqing Guo,

    Roles Writing – review & editing

    Affiliation Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, University of California at Los Angeles, Torrance, California, United States of America

  • Craig W. Johnson,

    Roles Writing – review & editing

    Affiliation Department of Biostatistics, University of Washington at Seattle, Seattle, Washington, United States of America

  • Silva Kasela,

    Roles Writing – review & editing

    Affiliation New York Genome Center, New York, New York, United States of America

  • Leslie A. Lange,

    Roles Writing – review & editing

    Affiliation Department of Epidemiology, Department of Medicine, Division of Biomedical Informatics and Personalized Medicine, Lifecourse Epidemiology of Adiposity & Diabetes Center, Aurora, Colorado, United States of America

  • Tuuli Lappalainen,

    Roles Writing – review & editing

    Affiliation New York Genome Center, New York, New York, United States of America

  • Yongmei Liu,

    Roles Writing – review & editing

    Affiliation Department of Medicine, Cardiology and Neurology, Duke University Medical Center, Durham, North Carolina, United States of America

  • Alex P. Reiner,

    Roles Writing – review & editing

    Affiliation Department of Epidemiology, University of Washington, Seattle, Washington, United States of America

  • Josh Smith,

    Roles Writing – review & editing

    Affiliation Northwest Genomic Center, University of Washington, Seattle, Washington, United States of America

  • Tamar Sofer,

    Roles Writing – review & editing

    Affiliation Department of Biostatistics, Harvard Medical School, Medicine-Brigham and Women’s Hospital, Boston, Massachusetts, United States of America

  • Kent D. Taylor,

    Roles Writing – review & editing

    Affiliation Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, University of California at Los Angeles, Torrance, California, United States of America

  • Russell P. Tracy,

    Roles Writing – review & editing

    Affiliation Department of Pathology & Laboratory Medicine, University of Vermont, Colchester, Vermont, United States of America

  • David J. VanDenBerg,

    Roles Writing – review & editing

    Affiliation Department of Preventive Medicine, University of Southern California, Los Angeles, California, United States of America

  • James G. Wilson,

    Roles Resources, Writing – review & editing

    Affiliation Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America

  • Stephen S. Rich,

    Roles Resources, Writing – review & editing

    Affiliation Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, Virginia, United States of America

  • Jerome I. Rotter,

    Roles Resources, Writing – review & editing

    Affiliation Department of Pediatrics, Genomic Outcomes, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, University of California at Los Angeles, Torrance, California, United States of America

  • Michael I. Love ,

    Contributed equally to this work with: Michael I. Love, Laura M. Raffield, Yun Li

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    milove@email.unc.edu (MIL); laura_raffield@unc.edu (LMR); yunli@med.unc.edu (YL)

    Affiliations Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

  • Laura M. Raffield ,

    Contributed equally to this work with: Michael I. Love, Laura M. Raffield, Yun Li

    Roles Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    milove@email.unc.edu (MIL); laura_raffield@unc.edu (LMR); yunli@med.unc.edu (YL)

    Affiliation Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

  • Yun Li ,

    Contributed equally to this work with: Michael I. Love, Laura M. Raffield, Yun Li

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    milove@email.unc.edu (MIL); laura_raffield@unc.edu (LMR); yunli@med.unc.edu (YL)

    Affiliations Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

  •  [ ... ],
  • NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Analysis Working Group

    Members are listed in S1 Acknowledgement.

    Members are listed in S1 Acknowledgement.

  • [ view all ]
  • [ view less ]

Abstract

Integrative approaches that simultaneously model multi-omics data have gained increasing popularity because they provide holistic system biology views of multiple or all components in a biological system of interest. Canonical correlation analysis (CCA) is a correlation-based integrative method designed to extract latent features shared between multiple assays by finding the linear combinations of features–referred to as canonical variables (CVs)–within each assay that achieve maximal across-assay correlation. Although widely acknowledged as a powerful approach for multi-omics data, CCA has not been systematically applied to multi-omics data in large cohort studies, which has only recently become available. Here, we adapted sparse multiple CCA (SMCCA), a widely-used derivative of CCA, to proteomics and methylomics data from the Multi-Ethnic Study of Atherosclerosis (MESA) and Jackson Heart Study (JHS). To tackle challenges encountered when applying SMCCA to MESA and JHS, our adaptations include the incorporation of the Gram-Schmidt (GS) algorithm with SMCCA to improve orthogonality among CVs, and the development of Sparse Supervised Multiple CCA (SSMCCA) to allow supervised integration analysis for more than two assays. Effective application of SMCCA to the two real datasets reveals important findings. Applying our SMCCA-GS to MESA and JHS, we identified strong associations between blood cell counts and protein abundance, suggesting that adjustment of blood cell composition should be considered in protein-based association studies. Importantly, CVs obtained from two independent cohorts also demonstrate transferability across the cohorts. For example, proteomic CVs learned from JHS, when transferred to MESA, explain similar amounts of blood cell count phenotypic variance in MESA, explaining 39.0% ~ 50.0% variation in JHS and 38.9% ~ 49.1% in MESA. Similar transferability was observed for other omics-CV-trait pairs. This suggests that biologically meaningful and cohort-agnostic variation is captured by CVs. We anticipate that applying our SMCCA-GS and SSMCCA on various cohorts would help identify cohort-agnostic biologically meaningful relationships between multi-omics data and phenotypic traits.

Author summary

Comprehensive understanding of human complex traits may benefit from incorporation of molecular features from multiple biological layers such as genome, epigenome, transcriptome, proteome, and metabolome. CCA is a correlation-based method for multi-omics data which reduces the dimension of each omic assay to several orthogonal components–commonly referred to as canonical variables (CVs). The widely-used SMCCA method allows effective dimension reduction and integration of multi-omics data, but suffers from potentially highly correlated CVs when applied to high-dimensional omics data. Here, we improve the statistical independence among the CVs by adopting a variation of the GS algorithm. We applied our SMCCA-GS method to proteomic and methylomic data from two cohort studies, MESA and JHS. Our results reveal a pronounced effect of blood cell counts on protein abundance, suggesting blood cell composition adjustment in protein-based association studies may be necessary. Finally, we present SSMCCA which allows supervised CCA analysis for the association between one phenotype of interest and more than two assays. We anticipate that SMCCA-GS would help reveal meaningful system-level factors from biological processes involving features from multiple assays; and SSMCCA would further empower interrogation of these factors for phenotypic traits related to health and diseases.

Introduction

In recent years, there has been rapid growth in high-dimensional multi-omics datasets (including DNA methylation, RNA-sequencing, metabolomics, proteomics, genomics, microbiome, etc.). However, careful analyses with integrative methods are needed to fully utilize these rich datasets and provide mechanistic insights into health and disease related outcomes. While many methods have been published [13], few studies have evaluated these methods on large-scale datasets from human samples. In addition, despite quite a few successful examples of integrating two omics data-types [48], particularly detection of quantitative trait loci using genomic data, there are much fewer such examples of integrative analyses across more than two omics data types.

One promising method for using multi-omics data to explain phenotypic variation in health outcomes is canonical correlation analysis (CCA) [9]. CCA is a statistical technique to identify associations among two assays where each assay contains multiple variables. Specifically, CCA finds a linear combination of variables in each assay that leads to the maximal correlation of the two linear combinations. Principal component analysis (PCA) can be considered as a special case of CCA as the optimization objective is the same in the case that the same data is used for the two assays. CCA is a commonly adopted dimension reduction and information extraction method in genomic studies [1,1013] as increasingly more modern genomic studies collect data from multiple assays.

An extension of CCA by Witten & Tibshirani [1] called sparse multiple CCA (SMCCA) allows for the input of multiple assays. We hypothesized that this method would be helpful for high-dimensional multi-omics data exploration and for understanding and extracting omics signatures that reflect biologically relevant variations. Specifically, we here leverage our CCA-based method extended from Witten & Tibshirani’s SMCCA to extract low-dimensional latent variables from high-dimensional multi-omics data and use them to explain phenotypic traits, focusing on blood cell indices, along with basic demographic and anthropometric characteristics. We perform CCA-based analyses in two studies with rich multi-omics data in hundreds of individuals, the Multi-Ethnic Study of Atherosclerosis (MESA) and the Jackson Heart Study (JHS).

Results

CCA pipeline

A typical CCA-based method generates orthogonal canonical variables (CVs), which are low-dimensional summaries to represent latent variables underlying the multi-assay input data. Fig 1 is a cartoon illustration where we have three assays (X, Y, and Z) for three samples. Features are assumed to be continuous with no distributional assumptions. For presentation brevity, we only show how we obtain the top 4 CVs. For each assay, CCA infers 4 vectors of weights (e.g., WX1, WX2, WX3, and WX4 for assay X), which leads to four CVs. For example, CVX1, the top CV for assay X, is obtained by X×WX1. The weights are inferred by maximizing the correlation of CVs across three assays. Note that in the rightmost CV matrices, each column of a CV matrix is one CV of the corresponding assay. In addition, CVs corresponding to the same column cross assays are expected to have maximal correlation (for instance, CVX1, CVY1, CVZ1,are most correlated), while CVs in different columns are expected to be orthogonal or independent from each other in the same assay.

thumbnail
Fig 1. Cartoon illustration of a typical CCA-based method for three assays.

X, Y, and Z are three assays with 4, 5, and 6 features respectively. When applying a CCA-based method on them to compute 4 canonical variables (CVs), we would first get their weight matrices WX, WY, WZ, each of which contains 4 weight vectors. By multiplying each assay matrix (left panel) and its corresponding weight matrix (middle panel), we obtain the CV matrix for the assay (right panel) where each column corresponds to one CV.

https://doi.org/10.1371/journal.pgen.1010517.g001

Modified gram-schmidt algorithm improves orthogonality

SMCCA implemented in the PMA R package does not always provide the expected orthogonal CVs, preventing effective extraction of independent CVs and sometimes causing serious multicollinearity issues in subsequent association analysis. For example, Fig 2A and 2B shows results from PMA’s implementation of unsupervised SMCCA when applied to MESA proteomics and methylomics data (detailed in Methods) where we observe extensive correlation among the CVs. In the presence of undesired correlated CVs, users will have to perform a secondary filtering step to generate a list of non-redundant CVs, or else variation in omics data captured by the later CVs may overlap with variance captured by former CVs. Therefore, we sought to improve orthogonality among generated CVs for capturing distinct information from the integrated multi-omics data. Specifically, we follow the Gram–Schmidt (GS) strategy [14] which generates CVs sequentially by progressively subtracting the previous CV from the input matrices (detailed in Methods). Fig 2C and 2D shows substantially improved orthogonality among the CVs when applied to the same MESA proteomics and methylomics data. Similar patterns were observed when SMCCA was applied to JHS data (S1 Fig).

thumbnail
Fig 2. Improved orthogonality among CVs by adopting the Gram–Schmidt (GS) strategy.

CVs are inferred from MESA proteomics and methylomics data using unsupervised SMCCA. Each row and column represent one CV, ranging from CV1 to CV50. (A-B) Results from the PMA R package, implementation of the original SMCCA methods without the incorporation of GS algorithm. (C-D) Results from our SMCCA-GS, with the GS strategy incorporated. Left panel (A and C) show proteomics CVs, and right panel (B and D) methylomics CVs.

https://doi.org/10.1371/journal.pgen.1010517.g002

Proteomics CVs explain considerable amounts of variation in blood cell counts

We also applied our implementation to proteomics and methylomics data in JHS. As these unsupervised CVs are anticipated to capture shared latent variables underlying the proteomics and methylation datasets, we hypothesized that the CVs may explain a non-negligible amount of variation in various phenotypes. Our primary phenotypes of interest in this work are blood cell traits, including white blood cell count (WBC), red blood cell count (RBC) and platelet count (PLT). We also considered age, sex, and body mass index (BMI), as “control” phenotypes which have been widely reported to explain considerable variability in proteomics and methylomics data. For each of the six outcome phenotypes, we fit regression models to estimate the percent of variation explained by the top 50 CVs from each of the two omics data, namely proteomics and methylomics (detailed in Methods). For each cohort (MESA or JHS), we had two sets of CVs, one derived from the cohort’s own omics data, the other derived from applying the CV weights inferred from the other cohort.

We found that top CVs, from each of the two omics data, explain considerable amounts of variation in almost all of the outcomes evaluated (Fig 3). For example, top 50 methylomics CVs inferred in JHS explained 72%, 100%, 35%, 37%, 34%, 30% of variation in age, sex, BMI, WBC, RBC, and PLT respectively, in JHS (Fig 3A). We also observe high transferability between MESA and JHS, by first applying SMCCA-GS separately to each cohort and then transferring the inferred CVs to the other cohort. For example, the top 50 methylomics CVs inferred in MESA explained similar amounts of variation in RBC: 33% in MESA (itself) (Fig 3C) and 30% when applied to JHS (Fig 3D). Such high transferability suggests that latent variables learned by CCA might reflect biological processes shared across cohorts. We also note that these r2‘s from methylomics data were most likely under-estimated because the CVs were constructed using the top 10,000 most variable CpG sites (see Methods) instead of the entire ~700,000 sites, for computational reasons. These findings are not surprising: for instance, blood cell composition (notably for white blood cell subtypes) has been long known to influence the methylome. For that reason, in epigenome-wide association studies (EWAS), it has been standard practice to first estimate the leukocyte proportions from methylomics data and adjust for these cell type proportions in subsequent association analysis [15]. Given shared precursors for all hematological cell types, we found it relatively unsurprising that RBC and PLT also had a high percent variation explained by methylomics CVs. Similarly, age [16], sex [17,18] and BMI [19] have been known to explain substantial variability in methylomics data, and are commonly adjusted for as covariates.

thumbnail
Fig 3. Proportion of variation in outcomes explained by CVs.

(A) CVs were inferred using proteomics and methylomics in JHS. The top 50 CVs were used to calculate the r2 (Y-axis) for each outcome (X-axis). (B) We obtained CVs in JHS by applying the weights inferred from MESA, and then calculated r2 in the same way as in A. (C) CVs were inferred using proteomics and methylomics in MESA. (D) CVs were obtained in MESA by applying the weights inferred from JHS.

https://doi.org/10.1371/journal.pgen.1010517.g003

More interestingly, the amounts of variation in various outcomes explained by top 50 proteomics CVs are even higher, ranging 39% - 100% in JHS and 39% - 100% in MESA. Large r2 for age, sex, and BMI are expected since all have been reported to rather broadly affect protein profiles [20,21]. However, strikingly, r2 for blood cell traits are also considerable, and comparable to BMI, 50%, 45%, 39% respectively for WBC, RBC and PLT in JHS using CVs inferred in JHS. Confirming these results, when applying CV weights inferred from MESA to JHS, we obtained similar r2‘s: 44%, 47%, 39% for WBC, RBC and PLT respectively. Similar patterns were also observed in MESA using both MESA and JHS derived weights. These considerable amounts of variations in blood cell counts explained by top proteomics CVs have important implications for association studies involving proteomics data: we should consider adjusting for blood cell proportions in these association studies, under the same rationale in EWAS (variability driven by blood cell subtype abundance is likely not of interest for many disease outcomes of interest whose association with proteomics data is being examined).

CVs vs Principal Components (PCs)

Although CVs are inferred jointly from multi-omics data, we have focused on analyzing CVs from each omics data type separately for their predictive power of outcomes of interest. Thus, we naturally are interested in comparing the CCA-based approach with the standard PCA approach since we can obtain PCs separately for each omics data. Note first that we expect larger and more assay-specific batch effects in JHS than MESA. For example, JHS proteomics data was generated in 3 batches [22], and separately from the methylomics data. In contrast, MESA proteomics and methylomics data were all generated through the MESA TOPMed pilot over a short time period [23,24]. Results shown in Fig 4 supported our expectations: overall we observe that a lower number of JHS inferred CVs are needed to explain the outcomes with higher r2 compared to JHS inferred PCs, indicating that top CVs inferred from JHS data tend to capture biological variations while top PCs tend to reflect more assay-specific technical variations. We note that this is supported by the stronger association for CVs vs PCs with technical factors (S6 and S7 Figs), notably for proteomics data which has been subjected to less pre-processing to account for technical effects related to batch/plate (prior to any of the analyses conducted here). The contrasts are most pronounced with age and WBC for proteomics data, and with age for methylomics data. For example, in JHS, proteomics-CV1 explained 33% variation in WBC (blue “+” on the leftmost side of Fig 4A3) while proteomics-PC1 only explained 7.7% (purple “+” on the leftmost side of Fig 4A3). This noticeable advantage continued until ~20 CVs/PCs. For instance, the top 15 proteomics-CVs in JHS explained 44% variation in WBC (blue “×” in Fig 4A3) while top 15 proteomics-PCs only explained 29% (purple “×” in Fig 4A3). Similar advantages of CVs over PCs were observed in MESA, but were less pronounced as expected due to the smaller and less assay-specific batch effects in MESA. Reassuringly, applying JHS inferred CV weights to MESA showed advantages similar to those in JHS, more pronounced than using CVs inferred in MESA itself, further demonstrating the power of CVs to capture biologically relevant variations under the presence of assay-specific batch effects.

thumbnail
Fig 4. Comparison of r2, PCs vs CVs.

Each column corresponds to one outcome. Within each panel, top row (JHS) shows results in JHS using JHS-inferred CVs. Second row (JHS->MESA) shows results in MESA, also using JHS-inferred weights. Third row (MESA) shows results in MESA, this time using MESA-inferred CVs. Last row (MESA->JHS) shows results in JHS, also using MESA-inferred weights. (A) Proteomics. Proteomics CVs explain more variation in white blood cell count (WBC) than PCs. For example, proteomics-CV1 explains 33% of the variation in WBC (blue “+” in Fig 4A3), while proteomics-PC1 only explains 7.7% (purple “+” in Fig 4A3). This pattern persists until approximately 20 CVs/PCs. The top 15 proteomics-CVs in JHS explain 44% of the variation in WBC (blue “×” in Fig 4A3), while the top 15 proteomics-PCs explain only 29% (purple “×” in Fig 4A3). (B) Methylomics. In each sub-figure, X-axis indicates the number of CVs or PCs used and Y-axis the proportion of variation explained in the outcome (i.e., r2).

https://doi.org/10.1371/journal.pgen.1010517.g004

Supervised sparse multiple CCA

Extending supervised sparse CCA to supervised sparse Multiple CCA.

So far, we have generated and evaluated unsupervised CCA where the CVs are inferred from multi-omics data only, without considering any outcomes of interest. Although we assessed the relationship between unsupervised CVs and several outcomes of interest, the CVs themselves were inferred without knowledge of the outcomes. In practice, when we are primarily interested in a particular outcome, supervised approaches can be more effective and powerful. The PMA R package implements a sparse supervised CCA (SSCCA) method. However, this implementation only accepts two omics data at a time, which limits our capabilities in real datasets where there are more than two assays. For instance, in both MESA and JHS, we also have whole genome sequencing (WGS) data [25]. We implemented a sparse supervised multiple CCA (SSMCCA) method to accommodate more than two assays of omics data. Our implementation follows the idea in Witten et al., (2009) [1] where a feature selection step is performed within each assay to retain (by default) top ~80% features most correlated with the outcome of interest. Features selected from each assay form new input matrices to which we then apply our implementation of unsupervised SMCCA with the adapted Gram-Schmidt algorithm.

To ensure our SSMCCA implementation generates sensible supervised CVs, we first compared results from PMA’s SSCCA implementation, when there are two assays of data. Specifically, we compared correlations between inferred supervised CV1 and the corresponding outcomes of interest. We compared SSCCA and our SSMCCA by running two methods with 100 different random seeds and for each seed, testing the variation of each outcome explained by supervised proteomics CVs and supervised methylomics CVs (Fig 5). We found that in most cases, the amount of variation in outcomes captured by SSMCCA CVs is comparable or significantly higher than SSCCA, indicated by large red circles. For example, Fig 5A third row third column (red “×” on Fig 5A) shows a large red circle which annotates a case where our SSMCCA outperforms the original SSCCA. In this example, SSMCCA proteomics CV1 explains 4.17% variation in PLT in MESA, while SSCCA 3.48% (p-value = 8E-9 for difference). The larger the difference, the darker the color. In a few cases, the amount of variation captured by SSMCCA CV1 is significantly smaller than SSCCA CV1. For example, Fig 5B row 2 column 1 shows a large blue circle (light blue “+” on Fig 5B) which indicates a case where the original SSCCA outperforms our SSMCCA. However, although the difference in terms of percent variation explained in RBC by SSCCA vs SSMCCA methylomics CV1 is highly significant (p-value = 3E-28), the absolute difference (4.27E-8 percent variance explained) is tiny, suggesting the difference between the performance of two methods is negligible.

thumbnail
Fig 5. Comparison of SSCCA and SSMCCA.

(A) proteomics, and (B) methylomics. Each row corresponds to a phenotype (from bottom to top, Age, BMI, WBC, RBC, and PLT). Circle size reflects the significance of the difference in variation explained between two methods. Color reflects the size of difference between the variation of phenotype explained by SSCCA and our SSMCCA. Therefore, a larger circle means a more significant difference between the two methods. Note that we use rectangles for insignificant difference with p > 0.01. Red means that our SSMCCA explains more phenotypic variation while blue means that SSCCA explains more. The darker the color, the larger the difference (the scale is different for parts A and B, annotated in “diff” column on side of figure).

https://doi.org/10.1371/journal.pgen.1010517.g005

Biologically meaningful features detected by SSMCCA

We applied SSMCCA to three assays–proteomics, methylomics, and genotypes–from MESA to obtain 50 CVs for each assay, and then used standard regression models to assess associations with phenotypes–age, BMI, WBC, RBC, and PLT. CV-phenotype pairs were considered to be significantly associated when p-value < 1E-4 (Bonferroni correction), adjusting for covariates detailed in S2 Table. In MESA, we identified 58 significant CV-phenotype pairs, and 5 of them were validated in JHS with the same p-value threshold of 8.62E-4 and same direction of effect (S3 Table). For example, WBC and proteomics CV3 were strongly associated in both cohorts (p-value = 2.7E-15 in MESA, 6.8E-16 in JHS, S3 Table). Features with high absolute weight coefficients in this CV (S5 Table) are biologically relevant for WBC. For example, stem cell factor soluble receptor, which has the highest weight, is known to play a key role in hematopoiesis [26]. Lipocalin 2, with the second highest weight, has been reported to be associated with human neutrophil granules [27].

For each phenotype, we then assembled all features from each assay (i.e., both methylomics and proteomics) with non-zero weight for phenotype-associated CVs in S3 and S4 Tables, and annotated each feature to a gene, on which we performed pathway enrichment analysis (described in Methods). For comparison, we also performed the same pathway enrichment analysis using features individually associated with each phenotype, where association is declared when FDR < 5% for each assay-phenotype-cohort combination. Comparing these two sets of pathway enrichment results, we found several pathways only revealed (p.adjust < 0.05) by our SSMCCA, including the growth factor binding gene ontology (GO) [28,29] term and the DisGeNET [30] progressive chronic graft-versus-host disease (GVHD) and polypoidal choroidal vasculopathy genesets. All of these pathways have been reported to be related to BMI in previous literature [3134].

Assigning CpG sites to genes is a challenging task. We adopted the simple nearest gene approach. Other reasonable approaches include promoter-centric assignment [35], leveraging differentially methylated regions [36], or using expression quantitative trait methylation (eQTM) [37] information. We explored the eQTM approach as we have both methylation and gene expression measurements in a subset of samples in JHS and MESA. However, due to limited number of CpGs included in significant CVs, we had only 25–257 genes (with the number of genes implicated by CpGs varying across different outcomes, detailed in S6 Table) based on significant eQTMs (Methods) for pathway enrichment analysis (results summarized in S7 and S8 Tables). We anticipate to benefit more from this approach when eQTM sample size increases.

Discussion

Large quantities of data across multiple omics (transcriptomics, proteomics, metabolomics, genomics, methylomics, etc.) modalities is currently being generated, for example through efforts funded by NIH’s Precision Medicine Initiative [25,35] as well as other large federally funded studies [36]. These high dimensional and complex multi-assay data are unfortunately still too often analyzed only separately (e.g., applying PCA separately to genotype, gene expression, or methylation data) or in a pairwise manner (for example mQTL analysis examining relationships between genome and methylome, or pQTL analysis examining relationships between genome and proteome). Many innovative methods have been proposed (https://github.com/mikelove/awesome-multi-omics [accessed on 2022-07-25]) for integrative analysis but evaluations in large-scale real omics data are still lacking, with fewer impartial appraisals available to guide method choice in practice.

In the work presented here, we apply CCA-based methods to complex multi-omics datasets to assess their capabilities and limitations. In particular, for the widely used PMA implementation of the SMCCA methods, we identified two limitations: non-orthogonal CVs and inability to accommodate more than two assays for supervised analysis. We provide method extensions, SMCCA-GS and SSMCCA, to address the two limitations. Applying SMCCA-GS to real data in MESA and JHS, we found that CVs are consistent and transferable across cohorts, suggesting that CVs capture constitutive biological relationships shared across cohorts, and are not driven primarily by assay-specific technical variation. This cross-cohort consistency, to our knowledge, has not been well explored in the literature and has important implications for making method choices (e.g., CCA vs PCA) for multi-omics data with or without extensive assay-specific batch effects.

Importantly, our CCA-based analyses reveal that blood cell indices are substantially associated with multiple omics assays including methylomics and proteomics. The former association has been widely appreciated and exerted paradigm-shifting impact on analysis: in methylation association studies, white blood cell composition is adjusted for in methylation analyses in standard practice. The latter association, where CVs from proteomics data showed even more pronounced association with blood cell indices, has been under-appreciated, with blood cell traits not considered in most current proteomic analyses [22,3739]. Our findings indicate that blood cell composition should be accounted for (or at least considered) in protein association studies where feasible, similar to what is standard practice for methylation studies.

As demonstrated in Fig 4, our SMCCA-GS is in some cases more useful than PCA in explaining variability in phenotypes, using an identical number of PCs/CVs. However, there are also many cases where the methods are nearly equivalent. We hypothesize that our SMCCA-GS demonstrates more consistent advantages in explaining trait variability in JHS versus MESA due to the presence of more substantial JHS batch effects. Due to funding limitations, JHS proteomics and metabolomics data was generated in multiple batches across several years, while the MESA data used here was generated concurrently, funded by NHLBI’s TOPMed program. Thus, for proteomics in particular, more batch effects are anticipated in JHS; our SMCCA-GS is particularly advantageous in cases where there is increased assay-specific technical variation.

In multi-omics data, it is commonplace to have drastic differences in the dimension of different omics data. For example, methylomics data, when generated by the widely used Illumina MethylationEPIC BeadChip array, contains almost 106 features; transcriptomics data are commonly summarized into ~104 expressed genes; and metabolomics and proteomics typically even smaller: only ~102−103 features depending on the platforms used. Witten et al. (2009) [1], introducing the SMCCA method, analyzed data with 19,672 gene expression measurements and 2,149 comparative genomic hybridization measurements, showing that their method could accommodate such imbalance. Our methods, derived from SMCCA, are also expected to accommodate omics dimension imbalance. In our analyses, results using ~700K CpG sites, while computationally challenging to fit repeatedly, led to similar conclusions as using top 10,000 CpG sites (detailed in Methods and S4 and S5 Figs), further suggesting the robustness of sparse CCA methods to imbalance in omics dimension.

We note that CCA-based methods as implemented in our analyses still have several key limitations. Notably, we had to considerably reduce the dimensionality of methylation array and sequencing data in order for our CCA-based method to be computationally feasible (at least for the repeated analyses necessary for methods development and testing). While we were able to fit models for the entire set of CpG sites a single time, with similar overall results in terms of phenotype variance explained (S4 and S5 Figs), our SMCCA-GS approach will require further innovation to be scalable for large-scale datasets. Recently developed methods allow for efficient calculation of generalized CCA solutions across reduced dimensions of each distinct assay, which alleviates some of the computational issues that arise, though sparse identification of individual omics features from the original assay data may still be desired [40].

Methods

Cohorts

Ethics statement.

All participants included in this analysis provided written, informed consent for use of genetic and multi-omics data, and all study protocols conform to the 1975 Declaration of Helsinki guidelines. The Jackson Heart Study (JHS) and Multi-Ethnic Study of Atherosclerosis (MESA) studies were approved by the Institutional Review Boards of all participating institutions.

JHS.

JHS recruited 5,306 African American participants from the Jackson, Mississippi, metropolitan tri-county area (Hinds, Madison, and Rankin) into a prospective, community-based cohort designed to investigate risk factors for cardiovascular disease among African Americans [4143]. Demographics of JHS individuals involved in the analysis are displayed in S1A Table.

Multi-omics data utilized in JHS analyses including methylomics (n = 1,750, Illumina MethylationEPIC BeadChip array) [44] and proteomics (n = 2,144, SOMAscan 1.3k array) [22], both from the baseline visit, and whole genome sequencing (WGS) data as described below. Methylation levels are quantified by beta values [45]. Traits examined include age, sex, BMI, and hematological traits (WBC, RBC, and PLT). We limited our analyses in JHS to individuals with complete data for proteomics, methylomics, and traits examined (total n = 881, S2A Fig).

MESA.

The MESA study was initiated in July 2000 to investigate the prevalence, correlates, and progression of subclinical cardiovascular disease (CVD) in a population-based sample of 6,814 men and women aged 45–84 years. The cohort was selected from six US field centers. Based on self-reported race/ethnicity, approximately 38% of the cohort are White, 28% African American, 23% Hispanic, and 11% Chinese American. More demographic information of MESA individuals involved in the analysis is in S1B Table.

Longitudinal multi-omics data was generated in MESA through a pilot program from NHLBI’s Trans-Omics for Precision Medicine Initiative (TOPMed) at exam 1 (2000–2002) and exam 5 (2010–2011), including ~ 1,000 participants for each exam with methylomics data (Illumina MethylationEPIC BeadChip array) [45] and proteomics (SOMAscan 1.3k array) [22,23]. Methylation levels are quantified by beta values [45]. WGS data are described below. Basic covariates examined include age, sex, BMI, recruitment site, self-reported race/ethnicity, and the same hematological traits as in JHS. We limited our analyses in MESA to individuals with complete data for proteomics, methylomics, and phenotypes examined (total n = 777, S2B Fig). Use of the same platforms for multi-omics assessment as in JHS allowed comparison analyses for CVs derived by SMCCA-GS or SSMCCA across cohorts.

Whole Genome Sequencing (WGS) data

Genotypes are derived from TOPMed WGS data (freeze 8). Data harmonization, variant discovery, and genotype calling were previously described [25,46]. In our analysis, to reduce data dimensionality, we first extracted SNPs associated with blood cell traits from Chen et al. (2020) [47] and highly correlated (linkage disequilibrium r2 > 0.8 where r2 is the in-sample squared Pearson correlation between the corresponding genotype vectors) variants were removed, resulting in 3,789 SNPs for JHS and 3,562 SNPs for MESA in our supervised CCA analysis. Genotypes are coded into numerical values 0, 1, and 2 for our analysis. Population principal components calculated by PC-AiR [48] were adjusted for as covariates. In addition, for WBC, we additionally adjusted for the Duffy null polymorphism (SNP rs2814778 at chromosome 1q23.2) [49].

Transcriptomics

We involve transcriptomics data in eQTM analysis to map our selected 10k CpG sites to genes for pathway enrichment analysis, but we do not include transcriptomics in our multi-omics integration analysis because a considerable number of individuals could not be included in the analysis if we incorporate transcriptomics (S2 Fig). For both JHS and MESA, RNA-seq was measured from peripheral blood mononuclear cells and normalized to transcript per million (by Northwest Genomics Center for MESA, as previously described [50], NWGC for JHS using similar pipelines).

Initial quality control (QC) and transformation of multi-omics data

In both cohorts, we applied QC on each assay including sample outlier removal and feature filtering. For each protein in the proteomics data, we first applied log transformation, followed by inverse normal transformation. After QC, we had 1,317 proteins measured in both cohorts, which made validation across cohorts straightforward.

Methylomics of JHS [44] was normalized using the “noob” normalization method implemented in minfi R package [58,59]. We further removed batch, plate, row, and column effects using the ChAMP R package [51]. For MESA methylation data, which had already been subjected to functional normalization to reduce batch effects [52], we excluded samples with (1) call rate < 95%; (2) sex mismatches; and (3) concordance between SNP probes and genotypes < 0.8. Methylation levels were marked as missing when the detection p-value was > 0.01, and we imputed these missing values using ChAMP R package [51], as our CCA-like methods cannot accommodate missing data. For both JHS and MESA, CpG sites whose probes overlap any SNP with minor allele frequency (MAF) > 1% were also excluded [53]. After QC, we had 754,767 and 741,727 CpG sites for MESA and JHS respectively. For building validation across cohorts, we only kept the 721,334 CpG sites which passed QC in both cohorts.

Finally, we only kept samples with complete data including proteomics, methylomics, and phenotypes (age, BMI, WBC, RBC, PLT, site, race, sex for MESA; age, BMI, WBC, RBC, PLT, sex for JHS), which led to 881 samples for JHS and 777 samples for MESA. We further identified sample outliers by PCA-IQR plot (Section 2 in S1 Text and S3 Fig). Four outliers in JHS–one sample with the largest proteomics IQR (wedge pointed on S3A Fig) and three samples with largest methylomics IQR (wedges pointed on S3B Fig)–were removed; and three outliers in MESA were removed–all three with largest methylomics IQR (wedges pointed on S3D Fig).

For each assay, we removed the sex chromosome related proteins and CpG sites. We further removed features that are highly correlated [54], at a squared Pearson correlation 0.8 threshold. We adopted a greedy algorithm (Algorithm 1 below) to achieve the dual goal of no highly correlated pairs among a maximal number of features retained. For methylomics, we calculated Pearson correlation using the Python package Deep Graph [54] and after removing highly correlated, further kept 10k CpG sites with the highest variance for the computational efficiency. Our CCA-based methods are computationally intensive. For example, even with these 10k CpG sites (~1.3% of all available CpG sites), on a single core of E5-2680 v3 @ 2.50GHz, the wall time of calculating 50 CVs with our SMCCA-GS on proteomics and methylomics is about 8 ~ 14 hours for MESA (774 samples) and about 8 ~ 20 hours for JHS (877 samples); with 20k CpG, the wall time is about 14 ~ 36 hours for MESA and 16 ~ 47 hours for JHS. For validating our variance-based feature selection strategy, we also performed the same analysis as Figs 3 and 4 on proteomics and all ~700k CpG sites. The results (S4 and S5 Figs) show similar patterns as those from top 10k CpG sites (Figs 3 and 4).

-----------------------------------------------------------------------------------------------------

Algorithm 1. Remove Highly Correlated Features within Each Assay.

-----------------------------------------------------------------------------------------------------

Input: Any assay X =(x1, x2,⋯,xp) ⊳ Assay X with p features

Initiation:

⊳ Each element of S is a pair of features from X whose correlation square is no less than 0.8

Definition: Features in each pair (xi, xj) in S are viewed as “neighbors”.

while S ≠ ∅ do

 dict←{} ⊳ Initiate an empty dictionary storing number of neighbors of each feature xi

   for xi←(x1,⋯,xp) do ⊳ Loop each feature xi of assay X

     Count neighbors of xi

     dict←{xi: number of neighbors}

   end for

   Identify xj ∈ dict with minimal number of neighbors.

   Remove any (xj,∙) from S.

   Remove xj from X.

end while

Output: X ⊳ After removing features above, features remaining in X are in low correlations while maximizing the feature size.

Association analysis between outcomes and CVs/PCs

To quantify the relationship between outcomes and CVs or PCs, we used regression models. Specifically, for continuous outcomes (including age, BMI, WBC, RBC, PLT), we estimated the proportion of variation in outcome that can be explained by CVs or PCs using linear regression models, implemented with the R function “lm”, with covariate adjustments outlined in S2 Table. For the binary outcome sex, we employed logistic regression using the “glm” R function and calculated McFadden’s pseudo-R-squared using the “PseudoR2” function from the DescTools [55] R package.

Modified gram-schmidt strategy

With PMA implementation, we observed that with our real data where features have complex correlation structure, the weight vectors are sometimes correlated. To mitigate this non-orthogonality issue, we adopt a strategy inspired by Woojoo et al., (2011) [14]. In our implementation, we infer CVs sequentially and remove the effects of the former CVs from the input matrix before calculating weights for the next CVs. In particular, we first follow the PMA approach to generate weights for CV1’s of all assays, update input matrices following Eq (1) as the new inputs for calculating weights for CV2’s, and sequentially update until we obtain pre-specified numbers of CVs. Eq (1) and Eq (2) show the procedures for inferring the (j+1)‘s CVs with input matrices {Xij}i = 1,⋯,S.

(1)(2)

Specifically, {Xij}i = 1,⋯,S are original input matrices for assay i = 1,⋯,S where S is the total number of assays, from which Wi1‘s and CVi1‘s, the first weights and CV1’s, are inferred by SMCCA implemented in the PMA R package.

eQTM analysis

We used expression quantitative trait methylation (eQTM) results to alternatively map CpG sites to genes (instead of simply mapping to nearest genes). We used transcriptomic and methylomic measurements for 650 and 496 samples from MESA and JHS, respectively, to perform eQTM analysis for the 316 CpG sites contributing to CVs significantly associated with outcomes. We employed the MatrixEQTL R [56] package to assess the association of each CpG site with its nearby genes in the +/- 1Mb neighborhood, while adjusting for age, and sex separately in MESA and JHS. For the multi-ethnic MESA samples, we additionally adjusted for self-reported race/ethnicity and recruitment site. We then conducted meta-analysis using METAL [57], and used a Bonferroni threshold to define significance, identifying 515 significant CpG-gene pairs. Our eQTM analysis successfully mapped 44–112 CpG sites, for each CV significantly associated with outcome, to 25–257 genes (detailed in S6 Table), based on which we further performed pathway enrichment analysis, following the same process detailed in the section below.

Pathway enrichment analysis

For each CCA-prioritized feature of each assay, we first mapped them to genes, and then performed pathway enrichment analysis on these genes utilizing three databases–DisGeNET [30,58] (enrichDGN function in DOSE R package, with default settings), Gene Ontology (GO) [28,29,59,60] (enrichGO function in clusterProfiler R package, with default settings) and Kyoto Encyclopedia of Genes and Genomes (KEGG) [5963] (enrichKEGG function in clusterProfiler R package, with default settings). For methylomics, we explored two methods for mapping CpG sites to genes: (1) mapping them to the nearest genes using annotations provided by Illumina, and (2) mapping CpG sites to genes with significant signals identified in the eQTM analysis presented above. For proteomics, we mapped proteins to genes using annotations released by SomaScan. For background genes in the enrichment analysis, we included genes annotated from features that are associated with outcome, identified in the feature selection step of our SSMCCA (detailed in Section “Extending Supervised Sparse CCA to Supervised Sparse Multiple CCA” above).

Supporting information

S1 Fig. Improved orthogonality among CVs by adopting the Gram–Schmidt (GS) strategy.

CVs are inferred from JHS proteomics and methylomics data using unsupervised SMCCA. Each row and column represent one CV, ranging from CV1 to CV50. (A-B) Results from the PMA package, implementation of the original SMCCA methods without the incorporation of GS algorithm. (C-D) Results from our SMCCA-GS, with the GS strategy incorporated. Left panel (A and C) show proteomics CVs, and right panel (B and D) from methylomics CVs.

https://doi.org/10.1371/journal.pgen.1010517.s001

(TIF)

S2 Fig. Sample size for each cohort.

(A) JHS: 881 participants have complete proteomics, methylation, and phenotype information; 496 participants have complete transcriptomics, methylation, and phenotype information. (B) MESA: 777 participants have complete proteomics, methylation, and phenotype information; 650 participants have complete transcriptomics, methylation, and phenotype information.

https://doi.org/10.1371/journal.pgen.1010517.s002

(TIF)

S3 Fig. PCA-IQR plots.

Each dot in the plot represents one individual. X-axis is the interquartile range (IQR) while Y-axis is the top principal component (PC). (A) JHS proteomics: one outlier was detected, marked by the wedge pointer; (B) JHS methylomics: three outliers were detected; (C) MESA proteomics: MESA: no outliers; (D) MESA methylomic: three outliers were detected.

https://doi.org/10.1371/journal.pgen.1010517.s003

(TIF)

S4 Fig. Proportion of variation in outcomes explained by CVs inferred with all CpG sites included.

(A) CVs were inferred using proteomics and all ~700k CpG sites in JHS. The top 50 CVs were used to calculate the r2 (Y-axis) for each outcome (X-axis). (B) We obtained CVs in JHS by applying the weights inferred from MESA, and then calculated r2 in the same way as in A. (C) CVs were inferred using proteomics and all ~700k CpG sites in MESA. (D) CVs were obtained in MESA by applying the weights inferred from JHS.

https://doi.org/10.1371/journal.pgen.1010517.s004

(TIF)

S5 Fig. Comparison of r2, PCs vs CVs, inferred with all CpG sites included.

Each column is for one outcome. Top row (JHS) shows results in JHS using JHS-inferred CVs. Second row (JHS->MESA) shows results in MESA, also using JHS-inferred CV weights. Third row (MESA) shows results in MESA, this time using MESA-inferred CVs. Last row (MESA->JHS) shows results in JHS, also using MESA-inferred CV weights. (A) Proteomics. (B) Methylomics. In each sub-figure, X-axis indicates the number of CVs or PCs used and Y-axis the proportion of variation explained in the (i.e., r2).

https://doi.org/10.1371/journal.pgen.1010517.s005

(TIF)

S6 Fig. Association with proteomics-specific technical variables, CVs vs PCs.

For JHS, the proteomics technical variable is batch-plate combination status. For MESA, the proteomics technical variable is plate.

https://doi.org/10.1371/journal.pgen.1010517.s006

(TIF)

S7 Fig. Association with methylation-specific technical variables, CVs vs PCs.

For JHS, the methylomics technical variable is group-plate combination status. For MESA, the methylomics technical variables are (A) “Batch Scan”, and (B) “Level1 Batch”.

https://doi.org/10.1371/journal.pgen.1010517.s007

(TIF)

S1 Table. Demographics of (A) JHS and (B) MESA.

https://doi.org/10.1371/journal.pgen.1010517.s008

(XLSX)

S2 Table. Covariate adjustments of each omics data of each cohort.

https://doi.org/10.1371/journal.pgen.1010517.s009

(XLSX)

S3 Table. Supervised CVs inferred in MESA significantly associated with each phenotype and validated in JHS.

https://doi.org/10.1371/journal.pgen.1010517.s010

(XLSX)

S4 Table. Supervised CVs inferred in JHS significantly associated with each phenotype and validated in MESA.

https://doi.org/10.1371/journal.pgen.1010517.s011

(XLSX)

S5 Table. Proteins identified in CV3 with non-zero weights in MESA.

https://doi.org/10.1371/journal.pgen.1010517.s012

(XLSX)

S7 Table. Pathway Enrichment Analysis Results of JHS.

https://doi.org/10.1371/journal.pgen.1010517.s014

(XLSX)

S8 Table. Pathway Enrichment Analysis Results of MESA.

https://doi.org/10.1371/journal.pgen.1010517.s015

(XLSX)

S1 Acknowledgement. Members of NHLBI TOPMed Consortium and TOPMed Analysis Working Group.

https://doi.org/10.1371/journal.pgen.1010517.s017

(PDF)

Acknowledgments

Molecular data for the TOPMed program was supported by the National Heart, Lung and Blood Institute (NHLBI). Genome sequencing for “NHLBI TOPMed: The Jackson Heart Study” (phs000964.v1.p1) was performed at the Northwest Genomics Center (HHSN268201100037C). Genome sequencing for “NHLBI TOPMed: The Multi-Ethnic Study of Atherosclerosis” (phs001416) was performed at Broad Genomics (3U54HG003067-13S1, HHSN268201600034I). Methylomics for “NHLBI TOPMed: The Multi-Ethnic Study of Atherosclerosis” (phs001416) was performed at the Keck MGC (HHSN268201600034I). RNA-Seq for “NHLBI TOPMed: The Multi-Ethnic Study of Atherosclerosis” (phs001416) was performed at the Northwest Genomics Center (HHSN268201600032I). Core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering, were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination, were provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I), and TOPMed MESA Multi-Omics (HHSN2682015000031/HSN26800004). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed.

The Jackson Heart Study (JHS) is supported and conducted in collaboration with Jackson State University (HHSN268201300049C and HHSN268201300050C), Tougaloo College (HHSN268201300048C), and the University of Mississippi Medical Center (HHSN268201300046C and HHSN268201300047C) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD). The authors also wish to thank the staffs and participants of the JHS.

MESA and the MESA SHARe projects are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, and UL1-TR-001420, UL1TR001881, DK063491, and R01HL105756. Funding for SHARe genotyping was provided by NHLBI Contract N02-HL-64278. Genotyping was performed at Affymetrix (Santa Clara, California, USA) and the Broad Institute of Harvard and MIT (Boston, Massachusetts, USA) using the Affymetrix Genome-Wide Human SNP Array 6.0. The authors thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutes can be found at http://wwwmesa-nhlbi.org.

The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services.

References

  1. 1. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol. 2009;8: Article28. pmid:19572827
  2. 2. Lock EF, Hoadley KA, Marron JS, Nobel AB. JOINT AND INDIVIDUAL VARIATION EXPLAINED (JIVE) FOR INTEGRATED ANALYSIS OF MULTIPLE DATA TYPES. Ann Appl Stat. 2013;7: 523–542. pmid:23745156
  3. 3. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018;14: e8124. pmid:29925568
  4. 4. Consortium GTEx. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369: 1318–1330. pmid:32913098
  5. 5. Võsa U, Claringbould A, Westra H-J, Bonder MJ, Deelen P, Zeng B, et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat Genet. 2021;53: 1300–1310. pmid:34475573
  6. 6. Folkersen L, Gustafsson S, Wang Q, Hansen DH, Hedman ÅK, Schork A, et al. Genomic and drug target evaluation of 90 cardiovascular proteins in 30,931 individuals. Nat Metab. 2020;2: 1135–1148. pmid:33067605
  7. 7. Sun BB, Maranville JC, Peters JE, Stacey D, Staley JR, Blackshaw J, et al. Genomic atlas of the human plasma proteome. Nature. 2018;558: 73–79. pmid:29875488
  8. 8. Zhang J, Dutta D, Köttgen A, Tin A, Schlosser P, Grams ME, et al. Plasma proteome analyses in individuals of European and African ancestry identify cis-pQTLs and models for proteome-wide association studies. Nat Genet. 2022;54: 593–602. pmid:35501419
  9. 9. Hotelling H. The most predictable criterion. J Educ Psychol. 1935;26: 139–142.
  10. 10. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol. 2009;8: Article 1.
  11. 11. Lin D, Zhang J, Li J, Calhoun VD, Deng H-W, Wang Y- P. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics. 2013;14: 245. pmid:23937249
  12. 12. Cichonska A, Rousu J, Marttinen P, Kangas AJ, Soininen P, Lehtimäki T, et al. metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics. 2016;32: 1981–1989. pmid:27153689
  13. 13. Tini G, Marchetti L, Priami C, Scott-Boyer M- P. Multi-omics integration-a comparison of unsupervised clustering methodologies. Brief Bioinform. 2019;20: 1269–1279. pmid:29272335
  14. 14. Woojoo L, Donghwan L, Youngjo L, Yudi P. Sparse Canonical Covariance Analysis for High-throughput Data. Stat Appl Genet Mol Biol. 2011;10: 1–24.
  15. 15. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13: 86. pmid:22568884
  16. 16. Horvath S, Raj K. DNA methylation-based biomarkers and the epigenetic clock theory of ageing. Nat Rev Genet. 2018;19: 371–384. pmid:29643443
  17. 17. Gatev E, Inkster AM, Negri GL, Konwar C, Lussier AA, Skakkebaek A, et al. Autosomal sex-associated co-methylated regions predict biological sex from DNA methylation. Nucleic Acids Res. 2021;49: 9097–9116. pmid:34403484
  18. 18. Grant OA, Wang Y, Kumari M, Zabet NR, Schalkwyk L. Characterising sex differences of autosomal DNA methylation in whole blood using the Illumina EPIC array. Clin Epigenetics. 2022;14: 62. pmid:35568878
  19. 19. Wahl S, Drong A, Lehne B, Loh M, Scott WR, Kunze S, et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature. 2017;541: 81–86. pmid:28002404
  20. 20. Zaghlool SB, Sharma S, Molnar M, Matías-García PR, Elhadad MA, Waldenberger M, et al. Revealing the role of the human blood plasma proteome in obesity using genetic drivers. Nat Commun. 2021;12: 1279. pmid:33627659
  21. 21. Lehallier B, Gate D, Schaum N, Nanasi T, Lee SE, Yousef H, et al. Undulating changes in human plasma proteome profiles across the lifespan. Nat Med. 2019;25: 1843–1850. pmid:31806903
  22. 22. Katz DH, Tahir UA, Bick AG, Pampana A, Ngo D, Benson MD, et al. Whole Genome Sequence Analysis of the Plasma Proteome in Black Adults Provides Novel Insights Into Cardiovascular Disease. Circulation. 2022;145: 357–370. pmid:34814699
  23. 23. Schubert R, Geoffroy E, Gregga I, Mulford AJ, Aguet F, Ardlie K, et al. Protein prediction for trait mapping in diverse populations. PLoS One. 2022;17: e0264341. pmid:35202437
  24. 24. Raffield LM, Dang H, Pratte KA, Jacobson S, Gillenwater LA, Ampleford E, et al. Comparison of Proteomic Assessment Methods in Multiple Cohort Studies. Proteomics. 2020;20: e1900278. pmid:32386347
  25. 25. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590: 290–299. pmid:33568819
  26. 26. Broudy VC. Stem Cell Factor and Hematopoiesis. Blood. 1997;90: 1345–1364. pmid:9269751
  27. 27. Kjeldsen L, Bainton DF, Sengeløv H, Borregaard N. Identification of neutrophil gelatinase-associated lipocalin as a novel matrix protein of specific granules in human neutrophils. Blood. 1994;83: 799–807. pmid:8298140
  28. 28. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25: 25–29.
  29. 29. Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49: D325–D334. pmid:33290552
  30. 30. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020;48: D845–D855. pmid:31680165
  31. 31. Rahman A, Hammad MM, Al Khairi I, Cherian P, Al-Sabah R, Al-Mulla F, et al. Profiling of Insulin-Like Growth Factor Binding Proteins (IGFBPs) in Obesity and Their Association With Ox-LDL and Hs-CRP in Adolescents. Front Endocrinol. 2021;12: 727004. pmid:34394011
  32. 32. Saidu NEB, Bonini C, Dickinson A, Grce M, Inngjerdingen M, Koehl U, et al. New Approaches for the Treatment of Chronic Graft-Versus-Host Disease: Current Status and Future Directions. Front Immunol. 2020;11: 578314. pmid:33162993
  33. 33. Woo SJ, Ahn J, Morrison MA, Ahn SY, Lee J, Kim KW, et al. Analysis of Genetic and Environmental Risk Factors and Their Interactions in Korean Patients with Age-Related Macular Degeneration. PLoS One. 2015;10: e0132771. pmid:26171855
  34. 34. Kikuchi M, Nakamura M, Ishikawa K, Suzuki T, Nishihara H, Yamakoshi T, et al. Elevated C-reactive protein levels in patients with polypoidal choroidal vasculopathy and patients with neovascular age-related macular degeneration. Ophthalmology. 2007;114: 1722–1727. pmid:17400294
  35. 35. All of Us Research Program Investigators, Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, et al. The “All of Us” Research Program. N Engl J Med. 2019;381: 668–676. pmid:31412182
  36. 36. Sanford JA, Nogiec CD, Lindholm ME, Adkins JN, Amar D, Dasari S, et al. Molecular Transducers of Physical Activity Consortium (MoTrPAC): Mapping the Dynamic Responses to Exercise. Cell. 2020;181: 1464–1474. pmid:32589957
  37. 37. Png G, Barysenka A, Repetto L, Navarro P, Shen X, Pietzner M, et al. Mapping the serum proteome to neurological diseases using whole genome sequencing. Nat Commun. 2021;12: 7042. pmid:34857772
  38. 38. Pietzner M, Wheeler E, Carrasco-Zanini J, Cortes A, Koprulu M, Wörheide MA, et al. Mapping the proteo-genomic convergence of human diseases. Science. 2021;374: eabj1541. pmid:34648354
  39. 39. Williams SA, Kivimaki M, Langenberg C, Hingorani AD, Casas JP, Bouchard C, et al. Plasma protein patterns as comprehensive indicators of health. Nat Med. 2019;25: 1851–1857. pmid:31792462
  40. 40. Brown BC, Wang C, Kasela S, Aguet F, Nachun DC, Taylor KD, et al. Multiset correlation and factor analysis enables exploration of multi-omic data. bioRxiv. 2022. p. 2022.07.18.500246.
  41. 41. Taylor HA Jr, Wilson JG, Jones DW, Sarpong DF, Srinivasan A, Garrison RJ, et al. Toward resolution of cardiovascular health disparities in African Americans: design and methods of the Jackson Heart Study. Ethn Dis. 2005;15: S6-4–17. pmid:16320381
  42. 42. Wilson JG, Rotimi CN, Ekunwe L, Royal CDM, Crump ME, Wyatt SB, et al. Study design for genetic analysis in the Jackson Heart Study. Ethn Dis. 2005;15: S6-30–37. pmid:16317983
  43. 43. Carpenter MA, Crow R, Steffes M, Rock W, Heilbraun J, Evans G, et al. Laboratory, reading center, and coordinating center data management methods in the Jackson Heart Study. Am J Med Sci. 2004;328: 131–144. pmid:15367870
  44. 44. Lu AT, Seeboth A, Tsai P- C, Sun D, Quach A, Reiner AP, et al. DNA methylation-based estimator of telomere length. Aging. 2019;11: 5895–5923. pmid:31422385
  45. 45. Do WL, Nguyen S, Yao J, Guo X, Whitsel EA, Demerath E, et al. Associations between DNA methylation and BMI vary by metabolic health status: a potential link to disparate cardiovascular outcomes. Clin Epigenetics. 2021;13: 230. pmid:34937574
  46. 46. TOPMed whole genome sequencing methods: Freeze 8. [cited 2 Mar 2022]. Available: https://topmed.nhlbi.nih.gov/topmed-whole-genome-sequencing-methods-freeze-8
  47. 47. Chen M- H, Raffield LM, Mousas A, Sakaue S, Huffman JE, Moscati A, et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell. 2020;182: 1198–1213.e14. pmid:32888493
  48. 48. Conomos MP, Miller MB, Thornton TA. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet Epidemiol. 2015;39: 276–293. pmid:25810074
  49. 49. Reich D, Nalls MA, Kao WHL, Akylbekova EL, Tandon A, Patterson N, et al. Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet. 2009;5: e1000360. pmid:19180233
  50. 50. Kurniansyah N, Wallace DA, Zhang Y, Yu B, Cade B, Wang H, et al. An integrated multi-omics analysis of sleep-disordered breathing traits across multiple blood cell types. medRxiv. 2022. p. 2022.07.09.
  51. 51. Morris TJ, Butcher LM, Feber A, Teschendorff AE, Chakravarthy AR, Wojdacz TK, et al. ChAMP: 450k Chip Analysis Methylation Pipeline. Bioinformatics. 2014;30: 428–430. pmid:24336642
  52. 52. Fortin J- P, Labbe A, Lemire M, Zanke BW, Hudson TJ, Fertig EJ, et al. Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biol. 2014;15: 503. pmid:25599564
  53. 53. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 2017;45: e22. pmid:27924034
  54. 54. Traxl D, Boers N, Kurths J. Deep Graphs—a general framework to represent and analyze heterogeneous complex systems across scales. arXiv [physics.data-an]. 2016. Available: http://arxiv.org/abs/1604.00971
  55. 55. Signorell A, Aho K, Alfons A, Anderegg N, Aragon T, Arachchige C, et al. DescTools: Tools for Descriptive Statistics. 2017. Available: https://cran.r-project.org/package=DescTools
  56. 56. Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28: 1353–1358. pmid:22492648
  57. 57. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26: 2190–2191. pmid:20616382
  58. 58. Yu G, Wang L- G, Yan G- R, He Q- Y. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics. 2015;31: 608–609. pmid:25677125
  59. 59. Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (Camb). 2021;2: 100141. pmid:34557778
  60. 60. Yu G, Wang L- G, Han Y, He Q- Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16: 284–287. pmid:22455463
  61. 61. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28: 27–30. pmid:10592173
  62. 62. Kanehisa M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 2019;28: 1947–1951. pmid:31441146
  63. 63. Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2022. pmid:36300620