Figures
Abstract
Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. We propose and evaluate two methods to determine the proportion of variance of an output data set that can be explained by an input data set when both data panels are high dimensional. Our approach uses random-effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. Using simulations, we show that the MC-ANOVA method gave nearly unbiased estimates. Estimates produced by Eigen-ANOVA were also nearly unbiased, except when the shared variance was very high (e.g., >0.9). We demonstrate the potential insight that can be obtained from the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken (Gallus gallus) genomes and to the assessment of inter-dependencies between gene expression, methylation, and copy-number-variants in data from breast cancer tumors from humans (Homo sapiens). Our analyses reveal that in chicken breeding populations ~50,000 evenly-spaced SNPs are enough to fully capture the span of whole-genome-sequencing genomes. In the study of multi-omic breast cancer data, we found that the span of copy-number-variants can be fully explained using either methylation or gene expression data and that roughly 74% of the variance in gene expression can be predicted from methylation data.
Citation: de los Campos G, Pook T, Gonzalez-Reymundez A, Simianer H, Mias G, Vazquez AI (2020) ANOVA-HD: Analysis of variance when both input and output layers are high-dimensional. PLoS ONE 15(12): e0243251. https://doi.org/10.1371/journal.pone.0243251
Editor: Roberto Fritsche-Neto, University of Sao Paulo/Luiz de Queiroz Agriculture College, BRAZIL
Received: June 22, 2020; Accepted: November 17, 2020; Published: December 14, 2020
Copyright: © 2020 de los Campos et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: - The wheat data set is publicly available with the BGLR R-package (doi:10.1534/genetics.114.164442) - The Breast Cancer Data can be obtained from The Cancer Genome Atlas (https://www.cancer.gov/tcga) - The whole-genome chicken sequence data was previously used by Ni et al. (2015) and is available at DOI: 10.6070/H47H1GKK - Array data from chromosome 1 of the chicken genomes was previously used by Pook et al. (Genetics, 2019) and is available at https://github.com/tpook92/HaploBlocker/blob/master/Data/chicken_breeding_chromo1.RData.
Funding: GDLC was supported by a Mercator-fellowship of the German Research Foundation (DFG) within the Research Training Group 1644, “Scaling problems in statistics” (grant no. 152112243) at the University of Goettingen. AIV, AGR and GDLC received support from a research grant sponsored by Zoetis. The funder provided support in the form of a research grant, the funds were used to cover university salaries for GDLC, AIV, and AGR. The funders did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.
Competing interests: AIV, AGR, and GDLC received financial support from Zoetis. The funders did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The funding provided by Zoetis did not alter our adherence to PLOS ONE policies on sharing data and materials.
Introduction
Modern genomic data often combine information from multiple data-layers, each of which itself can be high-dimensional. Examples of this include data sets comprising of information from several omics, or those combining genomic information with high-throughput phenotyping (e.g., crop-imaging, milk infrared spectra data). The biological processes underlying each of the data-layers can induce complex dependencies between features within each layer (e.g., linkage disequilibrium among single nucleotide polymorphisms, SNPs) as well as between layers (e.g., the association between DNA and gene expression, GE). The main goal of this study is to develop and to evaluate methods to quantify multivariate-associations in settings in which both the input and output sets are high dimensional.
The methods proposed in this study can be used to answer ubiquitous questions such as: How much of the inter-individual differences in whole-genome sequence genotypes can be predicted using a low-density SNP array? What proportion of variance in GE can be explained by differences in DNA methylation (ME)? How much of the variance in image-derived phenotypes can be predicted from DNA genotypes?
Canonical Correlation Analysis (CCA, [1]), Multivariate-Analysis of Variance (MANOVA, [2]) and Reduced Rank-Regressions, (e.g., Partial Least Squares, PLS, [3]) are three methodologies often used to assess associations in multi-dimensional problems. However, these approaches have limitations that make some of them inadequate for estimating the proportion of variance explained when both the output and input layers are high-dimensional.
Canonical Correlation Analysis extends the concept of the correlation between two random variables to a multivariate context. However, correlation is symmetric by nature. Therefore, CCA cannot address questions regarding the proportion of variance explained when the proportion of variance of one set (e.g., X) that is explained by another set (W) is not equal to the reciprocal (i.e., the proportion of variance of W that can be explained by X). Many multi-layered data sets are not expected to have a symmetric variance-decomposition (we will illustrate this using simulated and experimental data).
Multivariate Analyses of Variance (MANOVA, [4]) is often used for ANOVA when both the response and the explanatory data sets are multi-dimensional. However, MANOVA is based on least-squares projections; therefore, the methodology is not well-suited for cases when data is high dimensional, including rank-deficient cases. Most of the problems that we focus on involve high-dimensional data where the number of features exceeds sample size; thus, making least-squares methods such as MANOVA inadequate.
Reduced-rank regressions [5] and penalized multivariate analysis methods [6] are often used to analyze high-dimensional data. However, the results that one can obtain using regularized methods rely on regularization decisions (e.g., the number of dimensions used in PLS or CCA, or the sparsity parameters in sparse CCA) which cannot be made using fitness (e.g., the likelihood function) or lack-of-fit measures (e.g., residual sum of squares) evaluated in the training data. Thus, these parameters are often tuned to maximize prediction accuracy in testing sets. However, solutions derived by maximizing cross-validation prediction accuracy are not necessarily optimal for inferences because prediction accuracy is highly dependent on the relationship between sample size (n) and model complexity (e.g., number of parameters, p). Thus, in cases where p>>n, optimal prediction accuracy may be achieved with a highly parsimonious model (e.g., a principal component regression based on a few axes) that can produce severely biased estimates of effects. Therefore, to overcome the limitations of existing methods, in this study, we developed and evaluated approaches for estimating the proportion of variance explained when both the input and output sets are high-dimensional.
Results
We developed two methods that use random-effects models to estimate the proportion of variance of independent vectors in the linear span of an output layer that can be explained by regression on an input layer. We considered two approaches for generating a sequence of independent vectors in the linear span of the output layer: A Monte Carlo method (MC-ANOVA) which uses random vectors, and one based on eigenvectors (Eigen-ANOVA).
Setting the stage
Consider a data set consisting of two numeric matrices, Xn×p and Wn×q, holding data for n individuals (rows) and p (X) and q (W) features in columns, respectively. For instance, X may be a matrix with genotype codes at p SNPs and W may be a matrix providing GE levels assessed at q genes. The columns of X = {x1,x2,…,xp} and of W = {w1,w2,…,wq} can be viewed as axes spanning two linear spaces (LX and LW, respectively). The linear spans (extensions to nonlinear settings will be addressed in the discussion) of X and W consist of all the vectors that can be obtained by forming linear combinations of the columns of each of these sets, that is and
, for all real-valued vectors αs = {αs1,…,αsp} and δs = {δs1,…,δsq}. In the following, we will use W as the input set and X as the output set; however, the methods proposed are not symmetric and additional knowledge can be gained by switching the roles of X and W.
For each vector xs∈LX, one can estimate the proportion of variance that can be explained by linear regression on LW using a model of the form
(1)
For cases where q is large, the proportion of variance of xs that can be explained by regression on LX can be estimated by regarding both β and ε as Gaussian independent random variables,
and,
. Upon appropriate scaling of the columns of X (see Materials and Methods for details)
can be interpreted as the proportion of variance of xs that could be explained by regression on the features included in W. The variance parameters involved (
and
) can be estimated using Bayesian or Likelihood methods (e.g., restricted maximum likelihood, REML, [7]), as such methods were designed to handle common challenges of overfitting and collinearity in high dimensional data.
In the preceding paragraph we describe how one can estimate the proportion of variance of a vector in LX (xs) that can be explained by regression on W. Next, we generalize the idea to all vectors in LX. However, LX contains an infinite number of vectors; therefore, some approximation is needed. Perhaps the most natural approach for estimating the proportion of variance of vectors in LX that can be explained by regression on LW is to regress each of the columns of X on W. Such an analysis would produce a sequence of R2 estimates , and the average R2,
, could be used to estimate the overall proportion of variance of X that could be explained by regression on W. However, one limitation of this approach is that the columns of X are not necessarily independent. Many features may cluster (e.g., genes may be co-expressed, or SNPs may be in high linkage-disequilibrium) leading to groups of highly unbalanced sizes. When some features are highly-correlated, the simple average of individual R2-values may be driven by a few clusters of the columns of X. Furthermore, when X is ultra-high dimensional (e.g., hundreds of thousands or million features) estimating
(s = 1,…,p) one-feature-at-a-time will be computationally challenging. Therefore, to address these problems, we propose two methods that use independent vectors from the span of the output set; each of these methods are explained next.
Monte Carlo analysis of variance (MC-ANOVA)
Since LX is infinite, one cannot estimate for all vectors in LX. However, one can ‘explore’ the linear span of the output set by generating random vectors in LX of the form xs = Xαs, where αs is sampled from some distribution. This can be repeated for a large number of vectors in LX to produce a sequence of estimates
, and the resulting sequence can be used to estimate the average proportion of variance explained as well as other features of the distribution of the sequence. The method is summarized in Box 1. Importantly, if αs and αs′ are independent, so are xs and xs′. Indeed, noting that X is not random and assuming that αs and αs′ are sampled independently, we have that
; therefore, xs and xs′ are independent.
Box 1. Monte Carlo analysis of variance (MC-ANOVA)
- Draw a random vector αs from a proper multivariate distribution.
- Form the linear combination xs = Xαs.
- Estimate the proportion of variance of xs
using a random-effects model (expression [1]) with variance parameters estimated using either Bayesian or likelihood-type methods.
- Repeat 1–3 B times (e.g., B = 10,000).
- Use the sequence of estimated R-squared
to approximate the distribution of the
. An estimate of
can be obtained using the median or the average,
, R-sq. in the sequence.
In Box 1 we did not specify how the αs are generated; for this aspect of the algorihtm there are countless possibilities: weights can be sampled from distributions with continuous support (e.g., p-variate Gaussian) or from mixture models with a point of mass at zero. The weights may be independent or correlated, and the distributions may be symmetric or skewed. We will show later on (using data from chicken genomes) that the process used to generate the weights may affect some features of the distribution of the R2 values, albeit not necessarily the mean or the median R2. The possibility of using different processes for generating random vectors in LX gives the MC-ANOVA a great deal of flexibility. For example, this method could be used to assess how the distribution of the proportion of variance explained may change for different trait architectures–we will further explore that flexibility in greater detail in one of the case studies presented below.
Regression using orthogonal basis (Eigen-ANOVA)
An orthogonal basis for the row-space of X can be obtained from the singular-value decomposition of , where UX and VX are the left- and right-singular vectors of X respectively, and DX is a diagonal matrix with the singular values of X in the diagonal. Both UX and VX are orthonormal, thus
and
. Each vector in LX can be represented as a linear combination of the left-singular vectors of X. Therefore, our second method estimates the proportion of variance of each of the left-singular vectors of X that can be explained by regression on W, and produces a global R2 estimate using a weighted sum of the R2 estimated for each singular vector (Box 2, note that
and UX in Box 2 are also the non-zero eigenvalues and the eigenvectors of XX′, respectively).
Box 2. Eigen-ANOVA
- Generate an orthogonal basis for LX; for instance, compute the singular-value decomposition of
where
and
form an orthonormal basis for the row- and column space of X respectively, and DX = Diag{di} is a diagonal matrix with the singular values of X in its diagonal (i = 1,…,r, where r is the rank of X).
- Regress each of the left-singular vectors on LW using a linear model such as that in expression [1] with ui = xs, and estimate the proportion of variance of each vector that can be explained by regression on LW,
.
- Estimate the global proportion of variance of vectors in LX that can be explained by regression on LW using
.
Statistical properties assessed via simulations
We evaluated the statistical performance of the two methods described above using simulated data panels with a known proportion of variance shared between input and output data set. We also compared the performance of the two proposed methods with that of the Partial Least Squares (PLS, [3])–a method commonly used to analyze high dimensional data. We considered two simulation settings. In both cases, the input set was obtained from a wheat (Triticum) genotype data set generated by the International Maize and Wheat Improvement Center (CIMMYT) which contains genotypes at 1,279 DNA-markers assessed in 599 wheat inbred lines (see Materials and Methods for further details on this data set).
We note here that while a genotype matrix contains strictly discrete values (0/1 or -1/1 for inbreed lines and 0/1/2 or -1/0/2 for outbred diploid individuals) the linear span of it includes vectors in Rn. The vectors in the linear span of the genotypes can be thought as ‘breeding values’ formed as linear combinations of genotypes.
In our first simulation setting, W599×1,279 was the genotype matrix and X599×1,279 was obtained by adding iid (independent and identically distributed) Gaussian noise to the genotype matrix. We tuned the variance of the noise to generate scenarios of the proportion of variance of X explained by W ranging from 0 (X was pure noise) to 1 (X = W). For each simulated data set we then estimated the proportion of variance of X explained by regression of W using random-effects models, with variance parameters estimated using REML [7] (see Materials and Methods for details).
The Monte Carlo method estimated the proportion of variance of X explained by W without any noticeable bias (Table 1). However, the regression of the left-singular vectors of X on W in Eigen-ANOVA produced estimates that were downwardly biased in case the true proportion of variance of X explained by W was large (e.g., >0.5). Further inspection of the results for individual MC replicates suggested that the bias of the Eigen-ANOVA method was likely due to a relatively large number of ‘corner’ solutions (zero estimated proportion of variance) which were common for high-order eigenvectors (i.e., those with small eigenvalue)–we illustrate this in an analysis of multi-omic cancer data further below. The use of PLS led to a downwardly bias estimates in cases where the proportion of variance of X explained by W was low (0.1, 0.3, 0.5) and an upwardly biased estimates when proportion of variance explained was high (0.8, 0.9).
We then considered a second simulation setting to contemplate cases involving asymmetric proportion of variance explained. To achieve this, we formed X using a subset of the wheat marker genotypes (5%, 10%, 30%, 50%, 80%, 90%, 95%) and formed W by binding the columns of X with additional columns filled with iid Gaussian noise (Z), that is W599×1,279 = [X599×p, Z599×(1,279−p)] (p<1,279). The columns of X and W were centered and scaled to unit variance; therefore the share of the variance of W explained by X () is known and equal to,
. Similarly, the proportion of variance of X explained by W (
) is 1 because X is included in W.
In our second simulation study, the MC-ANOVA method rendered nearly unbiased estimates of the proportion of variance of one set explained by the other (Table 2). However, the Eigen-ANOVA method and the PLS produced noticeable biases, with Eigen-ANOVA method again generating downwardly biased estimates in cases where the true proportion of variance explained was high, and the PLS generating downwardly (upwardly) biased estimates whenever the true proportion of variance was low (high).
Applications to experimental data
We used the MC-ANOVA and Eigen-ANOVA to quantify the proportion of variance explained in two experimental data sets. The first one contains a set of ultra-high-density (UHD) SNPs from chicken (Gallus gallus) genomes derived from a combination of whole-genome sequencing (WGS) and imputation. We used this data set to assess the proportion of variance of UHD genotypes that can be captured and predicted using low-density SNP sets. The second data set involved three omic-layers (gene expression [GE], methylation [ME], and copy-number-variants [CNVs]) of human (Homo sapiens) female breast cancer patients. We used this data set to assess the proportion of variance at one omic that can be explained by another omic.
Quantifying multi-locus linkage disequilibrium between SNP panels.
The continued reduction of genotyping and sequencing costs has led to a sustained increase in the number of loci that can be genotyped. In plant and animal breeding four typical genotyping options include customized low-density arrays with hundreds to a few thousand SNPs [8], commercial arrays of common SNPs with tens of thousands of SNPs [9], high-density SNP arrays with hundreds of thousands of SNPs [10, 11], and whole-genome sequence-derived SNP genotypes. The number of SNPs that can be derived from WGS varies between populations and sequencing depth but is usually of the order of tens of millions (UHD SNP genotypes). In recent years, several projects have produced large volumes of fully sequenced genomes for various agricultural species and model organisms. However, generating, storing, and fitting models with UHD-genotypes can be logistically, economically, and computationally challenging. Moreover, empirical evidence seems to suggest that using UHD SNP-genotypes does not lead to substantial gains in prediction accuracy relative to models trained using tens of thousands of SNPs [12–15]. This often leads to researchers wondering: How many SNPs are needed to capture (almost all) the information contained in UHD SNP genotypes? We used the MC- and Eigen-ANOVA methods to address precisely this question.
Data consisted of 1.79 million SNP-genotypes for 892 (female and male) chickens from six generations of a purebred commercial brown layer line of Lohmann Tierzucht GmbH. These genotypes originated from a combination of whole-genome sequencing of 25 layers and imputation to UHD of the genomes of 867 that were genotyped a high density (~600,000 SNPs) Affymetrix Axiom Chicken Genotyping Array [16]. Further details about this data set can be found in the Materials and Methods section.
In a first analysis, the output space was the linear space (LX) spanned by the UHD SNP genotypes. The input set, (LW), consisted of low-density genotypes obtained by selecting p (p = 500, 1K, 2K, 3K, 5K, 10K, and 50K) evenly-spaced (in variant counts) SNPs. We estimated the proportion of variance captured by low-density panels using the MC- and Eigen-ANOVA methods. For the MC method we sampled weights from iid standard normal distribution, , and then formed a random vector in LX using xs = Xαs, where X is the matrix of UHD SNP-genotypes. These random vectors were then regressed on the lower-density SNP sets, and the proportion of variance explained was estimated using REML. This was repeated 1,000 times to estimate the distribution of the proportion of variance of vectors in LX explained by each of the low-density SNP-sets. For the Eigen-ANOVA method, we regressed each of the left-singular vectors of the UHD SNP genotypes on the low-density panels.
According to the MC-ANOVA method, the panel containing 500 evenly-spaced SNPs captured about two-thirds of the variance spanned by the UHD SNP genotypes (Fig 1). The proportion of variance of the UHD SNPs explained by low-density panels increased with the number of SNPs in the low-density panels reaching 100% with p> = 10K SNPs. The variance in the proportion of variance captured by low-density panels also decreased with the number of SNPs in the array (Fig 1). Small sample size and small effective population size are further factors that may make 10K SNPs to be sufficient to achieve a very high R-sq.
The Eigen-ANOVA yielded a very similar estimate of the proportion of variance explained as the MC-ANOVA for p = 500. However, for SNP-panels with more than 500 SNPs, the estimated proportion of variance obtained with the Eigen-ANOVA was systematically lower than the one obtained with MC-ANOVA. This agrees with what we found in the simulations where for high R2 the Eigen-ANOVA method gave downwardly biased estimates. (Note that while the MC-ANOVA yields both a point estimate and measures of dispersion (across random vectors) of R2, the Eigen-ANOVA only yields the point-estimates which are shown in Fig 1.)
In the previous application of the MC method, we drew random effect vectors that had weights (drawn from a normal distribution) on all the SNPs of the UHD set. When X contains whole-genome sequence genomes, one can think of the random vectors in LX (xs = Xαs) as additive-genetic traits and of the MC method as exploring many possible of such traits. However, for any trait, the vast majority of variants in the genome are expected to have no effect. The number of variants affecting any trait could vary from very few (simple traits) to hundreds or thousands (complex traits). Therefore, to explore the effect of the trait architecture on the distribution of the proportion of genetic variance of those traits that could be captured by low-density SNP panels, we repeated the previous analyses using random vectors that had 5,10,50,500 non-zero weights–the set of SNPs with non-zero weight were randomly sampled from the UHD-genotypes, and the weights of those SNPs were iid normal (see Materials and Methods for details).
The estimated proportion of variance explained by regression on lower-density SNP panels was, on average, the same across “trait-architectures” (Fig 2). However, the dispersion of the estimated means was, as expected, much larger for simple traits (e.g., 5 ‘causal variants’). For "complex traits" with 500 "causal variants," the proportion of variance explained by regression on 10K or more SNPs was greater than 95% for all MC replicates. However, for simpler traits we had some random vectors with a proportion of variance explained smaller than 0.8. This suggests that, while for highly complex traits low-density SNP arrays of 10K-50K SNPs may be enough to span the variance of the whole genome, for some simple traits, such arrays may not contain enough SNPs in high LD with the causal variants.
Using MC-ANOVA and Eigen-ANOVA to study shared variance in multi-omic data sets.
Cancerous processes involve the deregulation of signaling pathways controlling cell fate and progression, arising from the accumulation of genomic and epigenomics alterations across multiple genes [17, 18]. Genetic and epigenetic modifications can lead to changes in GE, which in turn can lead to changes in downstream (e.g., protein expression) and upstream (e.g., DNA, ME) processes, thus resulting in complex multivariate association patterns between multiple omic-layers.
We used GE, ME and CNV data from breast cancer tumors (n = 593) from The Cancer Genome Atlas (TCGA) to study multivariate associations between those three omics. Details of the technologies used to generate these data, as well as the data quality controls (QC) and editions are described in the Materials and Methods. After QC and editions data consisted of the (log-transformed) expression of 20,319 genes, counts at 11,552 CVN-sites, and ME intensity at 28,241 ME CpG islands. We used the MC- and the Eigen-ANOVA methods to estimate the proportion of variance of one omic that can be explained by regression on another omic; we did this for all pairwise omics combinations (GE~ME, GE~CNV, ME~GE, ME~CVN, CNV~GE, and CVN~ME).
Our results with the MC-ANOVA method indicate that the CNV data were completely explained by both GE and ME (Table 3). About 70% of the variance spanned by ME was explained by GE and vice versa. Finally, CNV explained a relatively small fraction of the variance spanned by either GE or ME. These results suggest that most CNVs have effects in both ME and GE and therefore, variation in CNV can be predicted by ME and GE. However, although there is an association between CNV and both ME and GE, many other factors (e.g., environmental effects) seem to intervene, thus making the proportion of GE or ME explained by CNV relatively small (~20%). Overall the MC- and Eigen-ANOVA methods yielded similar results. However, in cases involving high R2 (CNV~ME, CNV~GE, GE~ME and ME~GE) the Eigen-ANOVA method gave R2 estimates that were lower than those of the MC method. This pattern is consistent with what we observed in the simulation and in the analyses of chicken genomes.
Eigen-vector-specific R2 values obtained with the Eigen-ANOVA method (Fig 3) showed that the R2 values were, in most cases (except GE~CNV and ME~CNV) very high (and in many cases very close to one) for the top-eigenvectors (i.e., those with high eigenvalue), and very small for eigenvectors associated with low eigenvalues. The transition in the R2 profile of individual eigenvectors showed a relatively sharp phase transition from R2 values near one to near-zero values. Overall, our results suggest a relatively good agreement in the patterns captured by the top-eigenvectors across omics.
Points give the proportion of variance for individual eigenvectors. GE = Gene Expression, ME = Methylation, CNV = Copy-number variants (global R2 estimates, derived from random vectors and from the Eigen-ANOVA method are shown in Table 3).
Discussion
Modern genomic data sets often combine information from multiple non-independent data-layers. Quantifying multivariate associations between data layers can shed light on many important aspects of the data. In this study, we developed two procedures to estimate the proportion of variance explained in settings where both the input and output sets are high-dimensional. The proposed approach uses random effects Gaussian models to estimate the proportion of variance of (independent) vectors in the linear span of an output set (X) that can be explained by regression on an input set (W). The resulting R2 estimate is a weighted average of the R2 values obtained from independent vectors. To generate independent vectors, we considered two approaches: The first one (MC-ANOVA) is a Monte Carlo method that uses randomly generated vectors in the linear span of the output set. The second one (Eigen-ANOVA) uses an orthogonal basis for the linear span of X.
The proposed methods share four important features. First, both methods can be used to perform analysis of variance when both explanatory and dependent data are high-dimensional. Second, estimates are entirely based on the likelihood function and there is no need to make regularization decisions (number of dimensions, penalty parameters). Third, for any pair of information sets, the analysis of variance is not necessarily symmetric; therefore, the approach accommodates cases where the proportion of variance of W explained by X is not equal to the reciprocal. Finally, in addition to producing an R2 estimate, the proposed methods can shed light on important aspects of the underlying association patterns (e.g., decomposition of the global R2 on eigen-vector specific R2’s, distribution of R2 over possible vectors in the linear span of the output set).
Our simulations suggest that MC-ANOVA renders nearly unbiased estimates of the proportion of the variance of one set that can be explained by another. However, the Eigen-ANOVA exhibited systematic biases in scenarios in which the true proportion of variance was high. We also evaluated the PLS regression method, and our simulations suggest that PLS lead to upwardly (downwardly) biased estimates whenever the true proportion of variance is high (low). Therefore, for estimation of the proportion of variance explained we recommend using MC-ANOVA. The Eigen-ANOVA method seems to be a valid alternative, provided that the proportion of variance of one set explained by the other is not too high.
Computational considerations
The Eigen-ANOVA requires computing all the eigenvectors of the response matrix (say X) and then estimating proportion of variance of each of the eigenvectors explained by the explanatory matrix (e.g., W). The computational complexity of standard algorithms for singular-value decomposition is O(n3) (assuming n~p). On the other hand, the MC-ANOVA requires forming B random vectors fo the form xs = Xαs; ignoring the cost of sampling the weights, the computational complexity of forming each of this vectors is O(n2), again assuming n~p; thus, in general the MC-ANOVA will be computationally less involved as long as the number of vectors required for accurate estimation is smaller than n. In our experience, a few hundred random vectors (say 300) are enough to estimate the average, median, and SD of the R2. Therefore, whenever the rank of the response matrix is high, the MC-ANOVA has clear computational advantages. These advantages would be particularly clear for very large rank matrices. Finally, we note that the estimation process of both Eigen-ANOVA and MC-ANOVA is 'embarrassingly' parallel since the R2 of each of the vectors (either eigenvectors or random vectors) can be computed independently of each other.
Consistent with our simulation results, the analyses of experimental data showed that in problems involving a high R2 the Eigen-ANOVA method yielded lower estimates of the proportion of variance explained than those obtained with the MC-ANOVA (e.g., see Fig 1 and Table 3). Inspection of the results of the Eigen-ANOVA for individual eigenvectors suggests that the downward bias of the method may originate from ‘corner’ solutions (zero-estimates of R2) for eigenvectors associated with small eigenvalues. Therefore, if the only goal is to estimate the proportion of variance of one set explained by another set, we recommend using the MC-ANOVA method.
The Eigen-ANOVA method yields R2-values for each of the eigenvectors of the output set. This information can help elucidate whether global patterns (e.g., those associated with the top-eigenvectors) in one information set can be predicted from information contained in another information set. For instance, our analysis of the multi-omic breast cancer revealed that the patterns described in the top-eigenvectors derived from GE and ME are very similar; therefore, one should not expect big differences in tumor classifications that are based on the top-eigenvectors derived from either set. Interestingly, we found that in the analyses of omic data the R2 of individual eigenvectors showed a very sharp phase transition, suggesting that eigenvectors associated with intermediate and small eigenvalues may describe omic-specific patterns, or perhaps measurement error associated to each of the techniques.
The MC-ANOVA method can be used to characterize the distribution of the R2 estimates across vectors in the linear span of the output set. We used this feature to study the effect of the trait-architecture on the distribution of the R2 estimates. Our results indicate that while the average R2 does not seem to be affected by the sparsity of the coefficients used to form random vectors (i.e., the αs′), the dispersion and the shape of the distribution highly depend on the process used to generate the weights (Fig 2). Highly sparse weights lead to a distribution of the R2 values that, compared with vectors that were less sparse, had higher dispersion and in some cases (e.g., when the proportion of variance explained was close to 1) was skewed (Fig 2).
An important feature of the methods proposed in this study is that the R2 measure is not symmetric, in contrast to CCA. Our simulation study shows that if the underlying patterns are non-symmetric (e.g., when one of the linear spaces is a subspace of the other) the proposed estimation methods (in particular the MC-ANOVA) can detect the lack of symmetry very well (see Table 2). Interestingly, our analysis of multi-omic data from breast cancer patients exhibited cases where R2 was rather symmetric (e.g., the regression ME~GE and the regression ME~GE) and others that were highly asymmetric (e.g., CNV~GE and GE~CNV). The asymmetric cases suggest that almost all the variability in CNV can be predicted from GE (and ME as well); however, only a fraction of the GE variance can be explained by differences in CNV patterns. This result is consistent with the hypothesis that most CNV have an impact on GE, but GE is also affected by factors other than CNV (e.g., methylation, environmental effects).
In this study, we focused on the application of the Eigen- and MC-ANOVA for problems involving two input sets (e.g., a low- and a high-dimensional SNP array, or two different omics) evaluated on the same set of individuals. However, with slight modifications, the MC-ANOVA method will be useful for evaluating the proportion of variance of vectors in the span of a training set (e.g., all the available genotypes/phenotypes) that could be captured/predicted by regression on a subset of it, e.g., founders, or “proven” individuals, e.g., [19, 20].
The methods discussed in this study are entirely based on linear models. However, both MC-ANOVA and Eigen-ANOVA can easily be extended to consider non-linear relationships by embedding each set using a non-linear mapping. For instance, in the case of SNPs, one could generate a linear space that accounts for additive and non-additive effects by considering contrasts for additive, dominance, and epistatic interactions [21]. More generally, one can consider embedding either X or W by transforming one or both sets using a non-linear mapping f(.) (e.g., Gaussian kernels). Then, the methods presented here could be applied using and
as information sets within the context of Reproducing Kernel Hilbert Spaces regressions (e.g., [22, 23]).
In our applications, we considered one dependent and one explanatory set; however, the methodology presented in this study could be easily adapted to accommodate cases with one output set (e.g., Y) and multiple explanatory sets (e.g., X, and W). This can be done by expanding the model used to estimate R2 [1] by including two random effects, each with its own variance parameter. Such methods could be used to answer potential questions such as what proportion of variance of gene expression may be explained by joint regression on methylation and copy-number-variants.
In conclusion, we developed two methods for estimating the proportion of variance explained in problems in which both the input and output sets are high-dimensional. The MC-ANOVA method provided nearly unbiased estimates across a range of simulation scenarios. In addition to providing estimates of the proportion of variance explained, the two methods can yield useful insight into the association patterns underlying multi-layered high-dimensional data.
Materials and methods
Data sets
The wheat data set used in Simulations 1 and 2 was generated by the International Maize and Wheat Improvement Center (CIMMYT). This data set provides genotypes at 1,279 molecular markers (Diversity Array Technology DNA markers) assessed in 599 wheat inbred lines. Further details about this data set can be found in [24]. The data set is available with the BGLR R-package [25].
The chicken data set used in Case Study 1 consisted of UHD SNP genotypes of 892 female and male chickens from six generations of a purebred commercial brown layer line of Lohmann Tierzucht GmbH. The genomes of 25 layers were sequenced at 8x read-depth. A total of 4.92M (M = million) SNPs were derived from these 25 genome sequences. The remaining layers (n = 867) were genotyped using the Affymetrix Axiom Chicken Genotyping Array [16] which contains ~600K (580,961) SNPs. Ni et al. [28] imputed the SNP-genotypes of those 867 layers to the whole-genome sequence (4.92 SNPs) using BEAGLE 3.3.2 [26] for phasing and MiniMac3 [27] for imputation. For details on the imputing procedure we refer to Ni et al. [28]. This produced a combined genotype file consisting of 4.92M SNPs from 892 = 25+867 genomes. We further filtered the combined genotype file by removing SNPs with minor-allele-frequency smaller than 0.005 (0.5%) and pruning adjacent SNPs that were in (almost) perfect LD (i.e., R2 ≥0.99). A total of 1.79M SNPs passed these last two filters.
The breast cancer data set used in Case Study 2 was from the Cancer Genome Atlas (TCGA https://www.cancer.gov/tcga) and consisted of gene expression (GE), methylation (ME), and copy-number-variants (CNV) data from (n = 593) breast cancer tumors from female breast cancer patients.
Gene expression data (RNA-Sequencing counts) were generated using the Illumina HiSeq RNA V2 platform and DNA methylation profiles were determined using the Illumina HM450 platform. RNA-sequencing data were transformed using the natural logarithm and individual CpG site β-values were summarized at the CpG island level, using the maximum connectivity approach implemented in the WGCNA R package [29]. The CpG island summaries were transformed into M-values (M = β/(1-β) [30]). CNV profiles corresponded to gene-level copy number intensity derived from Affymetrix SNP Array 6.0 platform, using hg19 as reference.
From each of the three omics we removed features with a coefficient of variation smaller than 1% and those with a proportion of missing values greater than 20%. The missing values that remained were imputed using the clustering algorithm described in [31]. After imputation, each feature was adjusted for batch effects using ComBat [32]. After applying the steps described above, the data set used in the analyses consisted of the (log-transformed) expression of 20,319 genes, 11,552 CVN-sites, and ME intensity at 28,241 ME CpG islands.
Restricted maximum likelihood estimation of variance components
For the MC-ANOVA and EIGEN-ANOVA methods, the proportion of variance of one set (e.g., X) explained by the other set (W) was estimated using a random-effects model of the form
where z was either a random linear combination of the columns of X (MC-ANOVA, see Box 1) or one of the eigenvectors of XX′ (Eigen-ANOVA, see Box 2), μ is an intercept, β is a vector of Gaussian random effects,
, and ε is a vector containing error terms, which were also assumed to be Gaussian,
. For computational convenience and without loss of generality, we reparametrized the above model in terms of a random-effects model, of the form
where
, where MVN() stands for Multivariate Normal Distribution. We centered and scaled the columns of W to a standard deviation equal to
; this leads to a covariance structure, WW′ with an average diagonal equal to one; therefore, with this scaling,
can be interpreted as the amount of variance of z captured by regression on W and the ratio
can be interpreted as the proportion of variance of z that can be explained by W.
We estimated the variance components of the above model using Restricted Maximum likelihood (REML [7]) which was implemented with a custom R-script that for optimization uses the bobyqa function of the minqa R-package [33]. The scripts used to fit variance components using REML are provided in the S1 File (see function fitREML).
Partial least squares
We also estimated the proportion of variance of X explained by W (and the reciprocal when needed) by regressing X on W using the pls R-package [34].
The ability of the pls regression to fit X depends on the number of components used. To determine the number of components, we first fitted the PLS regressions with 1, 2, …, 100 components in 10-fold cross-validation and evaluated the cross-validation prediction mean-square error of each of the resulting models. We then selected the number of components that led to the smallest mean-squared prediction error and fitted a PLS regression with that number of components to the entire data set. The R2 of the fitted model in the training data was used as an estimate of the proportion of variance of X that could be explained by W. The fitPLS function provided in the script provides a wrapper to the plsr function which implements the procedure described above.
Simulations
Both simulations were implemented using genotypes from the wheat data set.
Simulation 1.
In the first simulation setting the input set was the wheat genotypes W599×1,279 = {w1,…,w1,279}, and X = {x1,…,x1,279} was a noisy version of W obtained by adding Gaussian iid noise, , to the genotypes, xi = wi+δi where i = 1,…, 1,279. The columns of W were standardized to unit variance, and the noise variance (
) was set such that the proportion of variance of xi explained by wi was equal to 0.1, 0.3, 0.5, 0.8, 0.9 and 1:
We also consider a scenario where
(i.e., X was purely random noise).
We conducted 1,000 MC simulations (the input set (W) did not change across MC samples; however, the output set (X) varied across MC replicates due to the noise term), and, for each simulated data set, estimated the proportion of variance of X explained by regression on W using MC-ANOVA, Eigen-ANOVA, and the PLS method.
Simulation 2.
We designed a second simulation to consider the case where one of the sets (X) was included in the other set (W). In this setting X599×p was generated by including p (≤1,279) of the 1,279 DNA-markers; we used values of p that led to the inclusion of 5%, 10%, 30%, 50%, 80%, 90%, and 95% of all the available DNA markers. Subsequently, W was formed by combining X with (p-1,279) columns filled with iid Gaussian random variables (Z): W599×1,279 = [X599×p,Z599×(p−1,279)].
The columns of X and W were all centered and scaled to unit variance. Since Z is independent of X, the proportion of variance of W explained by X equals p/1,279. On the other hand, the proportion of variance of X explained by W is one because X is included in W. We conducted 1,000 MC simulations and, for each simulated data sets, we estimated and
by regressing X on W and W on X, respectively, using MC-ANOVA, Eigen-ANOVA, and the PLS method.
Supporting information
S1 File. Contains the scripts used to carry out the simulations and data analyses.
https://doi.org/10.1371/journal.pone.0243251.s001
(HTML)
Acknowledgments
We thank The Cancer Genome Atlas Research Network (https://www.cancer.gov/tcga), CIMMYT, and Lohmann Tierzucht GmbH for making their data available.
References
- 1.
Mardia K V, T. KJ, Bibby JM. Multivariate Analysis. Academic Press; 1979.
- 2.
Rencher AC, Christensen WF. Methods of multivariate analysis [Internet]. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2012. https://doi.org/10.1002/9781118391686
- 3. Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. Elsevier; 2001;58: 109–130.
- 4.
Krzanowski WJ, J. W. Principles of multivariate analysis: a user’s perspective [Internet]. Clarendon Press; 1988. Available: https://dl.acm.org/citation.cfm?id=59560
- 5. Izenman AJ. Reduced-rank regression for the multivariate linear model. J Multivar Anal. Academic Press; 1975;5: 248–264.
- 6. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10: 515–534. pmid:19377034
- 7. Patterson HD, Thompson R. Recovery of Inter-Block Information When Block Sizes are Unequal. Biometrika. 1971;58: 545–554.
- 8. Boichard D, Chung H, Dassonneville R, David X, Eggen A, Fritz S, et al. Design of a Bovine Low-Density SNP Array Optimized for Imputation. Liu Z, editor. PLoS One. 2012;7: e34130. pmid:22470530
- 9. Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, et al. Development and Characterization of a High Density SNP Genotyping Assay for Cattle. Toland AE, editor. PLoS One. 2009;4: e5350. pmid:19390634
- 10. Kranis A, Gheyas AA, Boschiero C, Turner F, Yu L, Smith S, et al. Development of a high density 600K SNP genotyping array for chicken. BMC Genomics. 2013;14: 59. pmid:23356797
- 11. Unterseer S, Bauer E, Haberer G, Seidel M, Knaak C, Ouzunova M, et al. A powerful tool for genome analysis in maize: Development and evaluation of the high density 600 k SNP genotyping array. BMC Genomics. BioMed Central Ltd.; 2014;15: 823. pmid:25266061
- 12. Erbe M, Gredler B, Seefried FR, Bapst B, Simianer H. A Function Accounting for Training Set Size and Marker Density to Model the Average Accuracy of Genomic Prediction. Liu Z, editor. PLoS One. Public Library of Science; 2013;8: e81046. pmid:24339895
- 13. Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, et al. Using Whole-Genome Sequence Data to Predict Quantitative Trait Phenotypes in Drosophila melanogaster. Wray NR, editor. PLoS Genet. Public Library of Science; 2012;8: e1002685. pmid:22570636
- 14. Vazquez AI, Rosa GJM, Weigel KA, de los Campos G, Gianola D, Allison DB. Predictive ability of subsets of single nucleotide polymorphisms with and without parent average in US Holsteins. J Dairy Sci. 2010;93. pmid:21094768
- 15. Weigel KA, de los Campos G, Vazquez AI, Rosa GJM, Gianola D, Van Tassell CP. Accuracy of direct genomic values derived from imputed single nucleotide polymorphism genotypes in Jersey cattle. J Dairy Sci. 2010;93. pmid:20965358
- 16. Kranis A, Gheyas AA, Boschiero C, Turner F, Yu L, Smith S, et al. Development of a high density 600K SNP genotyping array for chicken. BMC Genomics. BioMed Central; 2013;14: 59. pmid:23356797
- 17. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339: 1546–58. pmid:23539594
- 18. Witte T, Plass C, Gerhauser C. Pan-cancer patterns of DNA methylation. 2014; 1–18.
- 19. Misztal I. Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics. Genetics; 2016;202: 401–409. pmid:26584903
- 20. Pocrnic I, Lourenco DAL, Masuda Y, Legarra A, Misztal I. The dimensionality of genomic information and its effect on genomic prediction. Genetics. Genetics; 2016;203: 573–581. pmid:26944916
- 21. Martini JWR, Gao N, Cardoso DF, Wimmer V, Erbe M, Cantet RJC, et al. Genomic prediction with epistasis models: on the marker-coding-dependent performance of the extended GBLUP and properties of the categorical epistasis model (CE). BMC Bioinformatics. BioMed Central Ltd.; 2017;18: 3. pmid:28049412
- 22. de los Campos G, Gianola D, Rosa GJM. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J Anim Sci. 2009;87: 1883–1887. pmid:19213705
- 23. de los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J. Semi-parametric Genomic-Enabled Prediction of Genetic Values Using Reproducing Kernel {H}ilbert Spaces Methods. Genet Res (Camb). Cambridge University Press; 2010;92: 295–308. pmid:20943010
- 24. Crossa J, de los Campos G, Perez P, Gianola D, Burgueo J, Araus JL, et al. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics. 2010;186: 713–724. pmid:20813882
- 25. Pérez P, de los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. Genetics; 2014;198: 483–95. pmid:25009151
- 26. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. Elsevier; 2009;84: 210–223. pmid:19200528
- 27. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet. 2012;44: 955–959. pmid:22820512
- 28. Ni G, Strom TM, Pausch H, Reimer C, Preisinger R, Simianer H, et al. Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken. BMC Genomics. 2015;16: 824. pmid:26486989
- 29. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. BioMed Central; 2008;9: 559. pmid:19114008
- 30. Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, Hou L, et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. BioMed Central; 2010;11: 587. pmid:21118553
- 31. Hastie T, Tibshirani R, Narasimhan B, Chu G. Impute: Imputation for microarray data. 2016;17: 520–525.
- 32. Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, et al. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform. Oxford University Press; 2013;14: 469–490. pmid:22851511
- 33. Bates D, Mullen KM, Nash JC, Varadhan R. minqa: Derivative-free optimization algorithms by quadratic approximation [Internet]. 2014. Available: https://cran.r-project.org/package=minqa
- 34. Mevik B-H, Wehrens R, Hovde K. pls: Partial Least Squares and Principal Component Regression. R Packag version 27–2}. 2019; https://CRAN.R-project.org/package=pls