Restoring the Duality between Principal Components of a Distance Matrix and Linear Combinations of Predictors, with Application to Studies of the Microbiome

Appreciation of the importance of the microbiome is increasing, as sequencing technology has made it possible to ascertain the microbial content of a variety of samples. Studies that sequence the 16S rRNA gene, ubiquitous in and nearly exclusive to bacteria, have proliferated in the medical literature. After sequences are binned into operational taxonomic units (OTUs) or species, data from these studies are summarized in a data matrix with the observed counts from each OTU for each sample. Analysis often reduces these data further to a matrix of pairwise distances or dissimilarities; plotting the first two or three principal components (PCs) of this distance matrix often reveals meaningful groupings in the data. However, once the distance matrix is calculated, it is no longer clear which OTUs or species are important to the observed clustering; further, the PCs are hard to interpret and cannot be calculated for subsequent observations. We show how to construct approximate decompositions of the data matrix that pair PCs with linear combinations of OTU or species frequencies, and show how these decompositions can be used to construct biplots, select important OTUs and partition the variability in the data matrix into contributions corresponding to PCs of an arbitrary distance or dissimilarity matrix. To illustrate our approach, we conduct an analysis of the bacteria found in 45 smokeless tobacco samples.


Introduction
Advances in sequencing technology have revolutionized our view of the microbiome, the microbial communities that exist in almost every environment including within humans and other animals. In the past, study of the microbiome was limited to what grows in culture. The advent of sequencing studies has removed this restriction. By sequencing the 16S rRNA gene, present in all bacteria and almost exclusive to bacteria, it is possible to survey the bacterial composition of samples irrespective of whether they grow easily in culture. The large number a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 of sequences obtained by modern genotyping methods means that bacteria present at very low prevalence can be observed. The resulting data on bacterial abundance are highly complex and analyses often require dimension reduction before important features can be found (in a microbiome study, the OTU counts or frequencies play the role of 'features' in a general machine learning context).
In a microbiome study, sequences are typically grouped into operational taxonomic units (OTUs) based on similarity using a bioinformatic pipeline such as QIIME [1] or Mothur [2]. These pipelines produce OTU counts (abundances) that can be summarized in a data matrix X; here we take the rows to correspond to observations and the columns to species or OTUs. Since in a microbiome experiment the number of species or OTUs will typically far exceed the number of observations, some sort of dimension reduction is required. As with other studies in ecology, it is common practice to use the species (OTU) abundance data in X to calculate a distance or dissimilarity matrix Δ with Δ ij denoting the distance between the ith and jth observation. The distance matrix can be a highly nonlinear function of the data in X and may in fact require external data for calculation. For example, the Unifrac distance [3,4], commonly used in microbiome studies, is a functional of the phylogenetic tree that summarizes the genetic distance between the OTUs, and thus requires genetic sequence data to calculate. Here we do not distinguish between dissimilarity measures that are or are not distance metrics, and generically refer to all dissimilarities as 'distances. ' The distance measures used by Ecologists (see [5] for an exhaustive discussion) are often very successful at describing the observations in the sense that the first few principal components (PCs) of the (appropriately centered and scaled) distance matrix allow visual separation of the data into meaningful groups. While this separation is useful in showing that OTUs vary systematically across groups, investigators often wish to know which OTUs contribute most to this separation. However, once a distance is calculated, it is difficult to know which species or OTUs contribute to the observed distances, or to place future observations in an ordination plot to see if they cluster with the 'correct' group.
In high-dimensional data, important linear combinations of features are frequently obtained by calculating the PCs of X T X, the correlation or covariance matrix of the data, depending on how X is scaled. These PCs can also be obtained from a singular value decomposition (SVD) of the data matrix X, which yields a set of singular vectors for observations and a set of singular vectors for features (here, OTUs or species). This approach has the advantage that there is a 'duality' between the two sets of singular vectors, so that if one set of vectors is known, the other set can be immediately obtained. This duality has useful consequences; the 'factor loadings' (coefficients of the corresponding singular vector in feature space) can be obtained for each component in observation space to see which features contribute most, or a biplot can be constructed. In addition, the singular vectors in observation space can be used as predictors in a model, because the duality assures we can interpret and calculate them for future observations. However, the cost is that we are implicitly using XX T to measure similarity, since the singular vectors for observations are eigenvectors of XX T . The goal of this paper is to restore the duality between the set of eigenvectors for an arbitrary choice of distance matrix Δ, and a set of vectors in feature space, to the largest extent possible.
A motivating example is an analysis of the bacteria found in 45 samples from three types of smokeless tobacco products (dry, moist, and brown toombak) reported elsewhere [6]. Using sequence data from the V4 region of the 16S rRNA gene, we used the QIIME pipeline [1] to categorize the 3,738,578 observed sequences into 5345 OTUs. After applying a thresholding criterion [6], we reduced the number of OTUs to 271 while retaining 3,555,575 (95%) sequences. Tyx et al. [6] found that the first three principal components of the (weighted) Unifrac distance matrix were very successful at differentiating the tobacco types, while also showing that replicates of the same product were closely clustered. However, we cannot know which OTUs are influential in this result. Further, we are unable to use the OTU frequencies of subsequent samples to see if their predicted type (as determined by their placement on the plot of PCs) are consistent with our original analysis. Finally, we cannot make a biplot that uses the ordination obtained using the UniFrac PCs to visualize which OTUs are influential in predicting tobacco type.
The approach we take here is to construct approximate decompositions of the data matrix that mimic the SVD. We first recall how the singular value decomposition (SVD) ensures a connection between eigenvectors of observations and OTUs when the data matrix is decomposed using a SVD, and then present approximate SVD-like decompositions that use the eigenvectors of an arbitrary distance matrix such as the Bray-Curtis or UniFrac distance in the role of the singular vectors for observations. We then show how these SVD-like decompositions can be used to partition the total sum of squares in the data, to aid in choosing the number of components to use and to determine the amount of variability explained by each OTU. In the results, we analyze the tobacco bacteria data to evaluate the performance of the methods we are proposing. We then discuss rarefaction and a kind of weighted analysis that connects two of the approaches we consider. Finally we conclude with a brief discussion.

Duality and the Singular Value Decomposition
Data from a 16S rRNA microbiome experiment can be summarized in a n × p-dimensional data matrix X where n is the number of observations and p is the number of species or OTUs. The elements of X count the number of reads in observation i that fall into OTU j. The row sums, referred to as the library size, are thought to be largely ancillary; thus, the count data in X is often converted to OTU frequencies by dividing the counts in each row by the corresponding library size (to put each row on the same scale) and then data for each OTU is centered by subtracting the mean OTU frequency. An interesting property of count data scaled and centered in this way is that both row and column sums are zero. Whatever scaling and centering is applied, the data matrix X can always be written using the singular value decomposition (SVD) as where L is a n × q matrix with orthonormal columns, S is a q × q diagonal matrix having positive entries, and R is a p × q matrix with orthonormal columns, where q is rank of X. If we are willing to measure similarity between observations using Δ = XX T , then the columns of L comprise the coordinates of the observations in a principal components analysis, or a principal coordinates analysis (PCoA) if count data in X have been scaled and centered as described above, since the columns of L are also the principal components of XX T . Equivalently, we can first calculate the PCs of X T X to obtain R, the PCs of the covariance (or correlation, depending on scaling) matrix of OTUs. Ecologists refer to the representation of observations by coordinates in a low-dimensional (typically in 2 or 3) space as ordination. The SVD is also the starting point for constructing a biplot of the data. If the data from each observation is standardized and we use Δ = I − XX T to measure distance then using Eq (1) we see that the eigenvectors of Δ are given by the columns of L. In this situation, given only the kth PC of Δ (i.e., L k , the kth column of L) we could use Eq (1) to obtain the 'factor loadings' R k (i.e., the kth column of R) by rewriting Eq (1) as The constant of proportionality (S kk ) can be determined by normalizing R k . The factor loadings from Eq (2) contain information on which OTUs are important predictors of the kth PC. Conversely, given the matrix of factor loadings R and diagonal matrix of constants of proportionality S, the eigenvectors of Δ (i.e., the columns of L) could be reconstructed by rewriting Eq (1) as Representation Eq (3) allows us to use observed OTU frequencies for a new observation to see where it falls in an ordination plot of existing data. Of course, (Eqs (2) and (3)) are immediate consequences of the SVD and coordinates for observations L, factor loadings R and constants S can be calculated simultaneously. If we wish to use an arbitrary distance matrix Δ, then the eigenvectors of Δ will not correspond to the left singular vectors of X. As a result, Eq (2) cannot be used to express the eigenvectors of Δ as linear combinations of OTU frequencies and Eq (3) cannot be used to determine the PCs of a new observation. Because Δ is real and symmetric, we can always write Δ = BEB T where B is orthogonal and E is diagonal; however, the elements of E may not all be positive unless Δ is Euclidean.
We can attempt to restore the relationship between the eigenvectors of Δ and linear combinations of the rows of X in two ways, either using the singular value decomposition of X as our guide, or using prediction of the left singular vectors of X (that are used for ordination) as our guide. In the first case, we can seek a decomposition of X that looks like the SVD, but uses B in place of the left singular vectors. Specifically, we can seek a matrix V with normalized columns and a diagonal matrix D with nonnegative elements that minimize the objective function ij is the Frobenius matrix norm used for least-squares problems posed in terms of matrices. For identifiability we insist that the elements of D are nonnegative. We refer to this as the 'decomposition' approach. Note that if we are only interested in a subset of the columns of B, we can replace B by B d , the n × d matrix that contains the d columns of interest. For notational simplicity, we suppress the subscript d here.
Alternatively, we can use Eq (3) as our starting point, and seek a matrix V with normalized columns and a diagonal matrix D having nonnegative entries that minimize the objective function where M Ák denotes the kth column of M and where ||C|| 2 is the Euclidean (L 2 ) norm. We refer to this as the 'regression' approach. Note that, unless constraints are added to the problem that mix information from the columns of V, the regression approach naturally separates into univariate regressions, one for each column of B that we are fitting. The requirement that V have normalized columns corresponds to Diag(V T V) = I d .

Unconstrained Solutions to the Decomposition and Regression Approaches
If the only constraint on V is that Diag(V T V) = I, the matrices V and D that minimize Eqs (4) and (5) can be easily found. We first note a lemma governing minimizers of Eq (4): where X has rank q, B has dimension n × d and X = LSR T is the singular value decomposition given in Eq (1).
The proof of Lemma 1 can be found in the appendix. Note that Lemma 1 implies that minimization of Eq (4) is equivalent to minimization of jjLS À BQ T jj 2 F for q × d-dimensional matrix Q, which implies we can find a unique minimizer even when p > n since q min(p, n). By direct optimization we find that if the columns of B are orthogonal, the minimizer of Eq (4) is given W du , D du and V du are determined by the norms of the columns of W du . Unlike Eq (4), optimization of Eq (5) produces a family of solutions. The general solution can be written as where the subscript r denotes regression. Using Eq (7) in Eq (5) we find where M − denotes the Moore-Penrose inverse of M. Eq (5) gives no information on A ru ; however, if we choose A ru = 0 then Lemma 1 shows the resulting choice of V ru will give the best decomposition (in the sense of minimizing Eq (4)) among all choices in the family Eq (7). Thus, we choose A ru = 0, to obtain the particular solution As before, V ru and D ru are determined by the norms of Z ru . Note that in general V du obtained by minimizing Eq (4) differs from V ru obtained by minimizing Eq (5). Because the unconstrained decomposition and regression approaches differ, it is not clear that either is adequate for our dual goal of predicting B for future observations and describing X for ordination and biplot construction. Thus, XV du D À 1 du may give poor prediction of B in the sense that Eq (5) is large, while BD ru V T ru may be a poor approximation to X in the sense that Eq (4) is large.
Because V ru 6 ¼ V du , OTUs selected as important for regression may not correspond to important variables for decomposition, or vice versa. We explore these issues further using the Tobacco data in the next section. Since both regression and decomposition are important, we next consider minimizing Eqs (4) and (5) subject to the constraint that V has orthonormal columns. We will see in our analysis of the tobacco data that this has the effect of ensuring that the V that is selected performs well for both regression and decomposition.

Orthogonal Solutions to the Decomposition and Regression Approaches
The easy connection between Eqs (1), (2) and (3) when using measuring similarity using XX T occurs because the columns of R are orthogonal. In order to ensure that the OTUs selected are important for both regression and decomposition, we next consider minimizing Eqs (4) or (5) subject to the constraint Unless all the singular values of X are equal (i.e., if X has been standardized by right-multiplication by ðX T XÞ À 1 2 ), it is easy to see that neither W du nor Z ru have orthogonal columns. As a result, we seek V do and D do , the minimizers of Eq (4) subject to constraint Eq (9) and V ro and D ro , the minimizers of Eq (5) subject to constraint Eq (9), where the subscript o refers to orthogonal.
Finding V do and D do is related to the orthogonal but not orthonormal Procrustes problem [7]. Because the minimizer of jjX À BDV T jj 2 F with respect to either D or V subject to Eq (9) is available in closed form, Everson [7] suggests the Tandem algorithm, an alternating approach in which first V then D is updated, until convergence. Finding the optimal V given D is not difficult, requiring only the calculation of a single SVD, while the optimal D given V can be expressed in closed form (see the proof of Lemma 2 in the appendix). Further, Lemma 1 implies V = RQ while Eq (9) implies Q T Q = I.
Finding V ro and D ro is much harder, even after using Eq (7), because the closed form of V that minimizes the Frobenius norm subject to orthogonality constraint Eq (9) is not known even when D is assumed to be known. We know of three ways to numerically optimize Eq (5) subject to constraint Eq (9); none of the methods outperform the others in all cases. First, [7] gives a representation of V in terms of an initial matrix V 0 that satisfies Eq (9) and d 2 À Á Givens rotation matrices; this enables brute-force minimization of Eq (5) subject to Eq (9) using a derivative-free optimizer. A similar representation for the derivatives of Eq (5) w.r.t. the Givens rotation angles is possible as well. Second, an approximate quadratic programming algorithm by Watson [8], described in [7], can be used. This approach requires solving a d 2 À Á -dimensional linear system for each step. Finally an approach described by Gower and Dijksterhuis ( [9], pp98-100) using an algorithm by Koschat and Swayne [10] for finding V satisfying Eqs (5) and (9) for fixed D can be used. If only the first few columns of V ro are needed, the brute force approach works well. In this situation, we can optimize Eq (5) using only d columns of B, and systematically increase d until the needed components stabilize. This approach assumes the first few columns of B explain the majority of variability; the values of D do can be used as a guide to ensure that the important columns of B are being used.
Decomposing The Variability in the Data Matrix X Once we have obtained an estimate of V, it can be used either for predicting B (e.g., for future observations) or describing variability in X (e.g., for constructing biplots). Since both goals are important, we need to evaluate the performance of each method for regression and decomposition. Regression performance is easily summarized by R 2 , the correlation between the predicted and observed columns of B; note this measure is independent of D. Decomposition performance is a bit more complicated, since the natural quantity jjX À BDV T jj 2 F depends explicitly on D. To avoid penalizing the regression approaches just because of the scale choice, for assessing the performance of V ro in explaining variability in X, we replace D ro byD ro , the minimizer of Eq (4) when V = V ro . This change is unnecessary for D ru since it is easy to show that it already minimizes Eq (4). We now show the following lemma that governs partitioning the total sum of squares jjXjj 2 F into a model sum of squares jjBDV T jj 2 F and a residual sum of squares jjX À BDV T jj 2 F . If X is centered, then the total sum of squares is proportional to the variance of the X ij s. Lemma 2. Let B be a n × d-dimensional matrix with orthonormal columns and let D be a d × d-dimensional diagonal matrix chosen to minimize jjX À BDV T jj 2 F . Then jjXjj Proof of Lemma 2 can be found in the Appendix. Further, as long as B is orthogonal, we can decompose the model sum of squares either as (10)  j , gives us another method to evaluate the performance of each method. Finally we note that Eqs (10) and (11) holds for any choice of d; we may wish to reserve the term 'residual sum of squares' for the value of jjX À BDV T jj 2 F that is attained when the maximum value of d is used. In this case we can partition the 'model' sum of squares into a part corresponding to components actually used (typically, the first d components) and a part corresponding to the unused (truncated) components. From Eqs (10) and (11), it is easily seen that the sum of squares corresponding to truncated components can be written either as P d max j¼dþ1 D 2 j or as

Analysis of Bacteria found in Smokeless Tobacco Products
To illustrate the approaches developed here, we applied the decomposition and regression approaches, with and without the orthogonality constraint, to 16S rRNA data on 15 smokeless tobacco products; 6 dry snuffs, 7 moist snuffs, and 2 toombak samples from Sudan. Three separate (replicate) observations (starting with sample preparation) were made of each product, so that in total 45 observations are available. Our goal in analyzing these data are both to find important OTUs that describe the variability in the microbial communities in these products, and to develop insight on how well each approach performs in a variety of measures. We measured distance Δ between samples using the (weighted) unifrac distance. To account for differences in read count across samples, we sub-sampled reads so that each sample had the same number of reads before calculating the distance. We repeated this subsampling 1,000 times and averaged over replicates to obtain a final matrix Δ. After centering rows and columns of the the matrix having elements D 2 ij as described by Gower [11], we obtained the matrix B by spectral decomposition of the resulting matrix. We additionally converted the rows of X to percent abundances to eliminate differences in scale, and then centered the rows and columns to sum to zero. All calculations were carried out using R.
The Tandem algorithm [7] applied to these data converges almost instantly even when all 44 columns of B are used in the decomposition. We found it much harder to find V ro for all 44 components, as there are apparently local minima. The computation time was measured in hours or days, not seconds like the Tandem algorithm. A modification of the Watson [8] algorithm that used a line search to choose the step size gave the solution having the smallest value of Eq (4) that we present here.
In Table 1 we compare the performance of the four methods in terms of their ability to explain X and their ability to predict B. Results in Table 1 are based on estimating d = 44 components, the maximum number for these data. The most surprising result in Table 1 is the remarkably small proportion (0.8%) of the data matrix X that is explained by using V ru chosen by unconstrained regression, even though V ru predicts the columns of B perfectly. Although V du predicts 100% of the variability in X, its performance in predicting the columns of B is the worst of the four approaches. Overall V do seems to perform best, explaining almost 90% of the variability in X while also predicting the important columns of B well. The performance of V ro in predicting the columns of B was also good but it only explained about 75% of the variability in X. Thus, even if prediction of B is the primary goal, the small improvements in R 2 do not seem to warrant the computational effort required to obtain V ro .
In Fig 1 we plot the (square of the) diagonal elements of D for the four methods considered here: the unconstrained regression and decomposition approaches, and the orthogonal regression and decomposition approaches. For orthogonal regression we plot the (square of the) diagonal elements ofD ro selected for the orthogonal regression approach when used as a decomposition method. Like a typical SVD, the 'scree plot' shows that for each decomposition only a few components are important. This is reassuring as we are most interested in truncated versions of the SVD-like decompositions. From Fig 1 we are assured that a biplot in 2 or 3 dimensions will capture much of the variability in the data. Note that the components are sorted by the eigenvalues of B, not the magnitude of D, so that the generally monotonic decrease in D values indicates directions that are important in describing B are also important in describing X. If this were not the case, it may be worth choosing another measure of distance for calculating B.
In Fig 2 we plot the sorted values of w 2 j for each method considered here; note that the order of OTUs may be different in each panel. To see how similar the orderings OTU influence (as measured by w 2 j ) are across methods, we calculated the variance-covariance matrix of the w 2 j values from the four approaches (Table 2). These correlations are high except when calculating w 2 j using unconstrained regression, indicating that the ordering of OTUs in Fig 2 is similar for all the methods except unconstrained regression. Since Tyx et al. [6] found 3 principal components were necessary to separate these three groups, we used d = 3 when calculating w 2 j . In Table 3 we show the 11 OTUs that were selected to be on the list of the top 5 OTUs for each method (along with the variability explained by that OTU and its rank by each method). There is good agreement between both decomposition approaches and the orthogonal regression approach, while none of the OTUs selected by unconstrained regression appear on the top 5 list for any other method. The OTUs selected by unconstrained regression are biologically distant as well, with only one OTU selected by unconstrained regression sharing a family (Staphylococcaceae) with any OTU selected by one of the other methods.
The effect of each OTU can be displayed in a biplot. In Fig 3 we show a 2-dimensional biplot based on the orthogonal decomposition method, showing the second and third   components (which had the two highest values of both D du and D do ). It is clear the ordination of these data using the first 3 PCs of the (weighted) Unifrac matrix are fairly successful at separating the different types (dry, moist and toombak). Further, the replicates corresponding to the same product are tightly clustered. We also show arrows corresponding to the top five OTUs calculated using orthogonal decomposition. To construct this biplot, we note that the orthogonal decomposition implies where B i denotes the ith row of B and W j denotes the jth column of W = V Á D and (A, B) denotes the Euclidean inner product. Since the elements of B i are the coordinates of the ith observation and W j is the vector whose norm determines the influence of OTU j in explaining the model sum of squares, it is natural to represent OTUs by plotting W j . Further, the magnitude of X ij is represented by the dot product of B i and W j , so that if W j for an OTU 'points towards' a certain group of samples, we can expect that the values of X ij are relatively large for these samples. To create a low-dimensional plot, we typically sum k in Eq (12) over two or three dimensions; for Fig 3 we sum k from 2 to 3. By examining the biplot in Fig 3, we see that Sudanese toombak is characterized by elevated levels of OTU 810425, assigned to the Corynebacteriaceae family, largely absent from all other types. The OTU 4379247 (Lactobacillaceae) appears elevated in some dry snuff samples; whereas, OTUs 29012 (Enterococcaceae), 4312974 (Staphylococcaceae) and 52399 (Aerococcaceae) appear elevated in moist snuff samples.

Additional Considerations
In this section, we show that the results we have obtained can be applied directly to some simple but important generalizations. In particular, we show how to incorporate rarefaction into our decomposition approach, and indicate where it may not be necessary. We also consider a weighted regression approach that gives a connection between the regression and decomposition approaches. Rarefaction is a commonly used (but still controversial, see e.g. [12]) approach to processing microbiome data to account for differences in library size. In our analysis of the tobacco data, we averaged over rarefactions when calculating the distance matrix; here we address the question of how to incorporate averaging over rarefactions of the data matrix into our orthogonal decomposition. Computing a separate decomposition for each rarefaction is not tenable as it is unclear how we would combine the decompositions obtained for each replicate. Instead, we propose finding D and V that minimize the objective function subject to D ! 0 and the desired constraints on V, where X r is the r th rarefied data matrix. However, since we see that we can instead optimize jjX À BDV T jj where X is the average of the data matrix over rarefactions. Thus, if X contains the untransformed counts (or even if the data matrix is scaled by the library size for each observation), in the limit this corresponds to using we scale the rows of X by the library sizes), where π ij is the frequency of the jth OTU in the ith sample and M is the number of reads selected in each rarefaction.
Since centering for PCoA is also linear in the elements of X, this argument suggests that using the empirical frequencies without rarefaction, at least for the decomposition approaches, is warranted.
Turning now to the relationship between the orthogonal regression and decomposition approaches, the objective function for the orthogonal regression approach given in Eq (5) assigns equal importance (weight) to the prediction of each column of B. If we choose to weight the prediction of the jth column of B by D 2 jj , a measure of the importance of the jth column, then it is easy to show that minimizing the resulting objective function jjðXVD À 1 À BÞDjj 2 F ¼ jjXV À BDjj 2 F yields the same values of V and D as minimizing the objective function for orthogonal decomposition. Thus, orthogonal decomposition also has an interpretation as a weighted regression, where the weight assigned to the prediction of each column is proportional to the variance of X explained by that column in the decomposition.

Discussion
The principal components of a distance matrix Δ can be very useful in ordination, the representation of observations in an Ecology or microbiome study as points in a low-dimensional space. Meaningful groupings in the data are often apparent in an ordination plot. When the correlation matrix is used to measure similarity, there is a natural duality that enables us to express the eigenvectors of Δ as linear combinations of the species or OTU frequencies. This duality allows construction of a biplot, in which both observations and OTUs can be simultaneously represented graphically. When an arbitrary distance is used, we have developed methods to restore this duality, at least approximately. We evaluated these approaches within the context of an analysis of the bacterial species found in smokeless tobacco products [6].
In our analysis of the bacteria found in smokeless tobacco products, we found that the orthogonality constraint results in linear combinations that perform well both in explaining the variability in the data matrix X as well as predictors in a regression. This is reasonable as orthogonality is the same principle connecting the regression and decomposition approaches in a SVD of X. We also found that the orthogonal regression and orthogonal decomposition approaches gave similar results, which were also fairly close to the unconstrained decomposition approach. Finally, given the difficulties in obtaining more than a few components of the orthogonal regression approach, and the interpretation of the orthogonal decomposition approach as a weighted version of the orthogonal regression approach given in the previous section, it seems that the orthogonal decomposition approach is the most appealing approach.
We also showed that the approaches we presented have a variance partitioning property in which the total sum of squares represented by jjXjj 2 F can be partitioned into residual sums of squares and model sums of squares. We further showed that, even when we choose a set of linear combinations V that are not orthogonal, the model sum of squares can be partitioned in two ways; one in which we sum over the contributions of each component, another in which we sum over the contributions of each variable (OTU). The first partition can be used to justify a truncated decomposition; the second can be used to find important variables, especially for making biplots. We found that both orthogonal approaches and unconstrained decomposition were in broad agreement (similar model sums of squares, similar OTUs identified as important) while unconstrained regression behaved very differently, identifying very different OTUs as important and having a small model sum of squares. This may be because a certain set of OTUs may allow good prediction of a columns of B even if these OTUs do not explain much of the overall variability in the OTU table (e.g., if they are rare). Since the decomposition approaches also give good prediction of at least those columns of B that explain most of the variability (at least in the tobacco data we considered) it seems that unconstrained regression can miss important large-scale features in favor of small-scale features that happen to be good predictors of B, in some sense failing to see the forest through the trees.
We have considered here only decompositions of the data matrix X. The results here thus can be considered 'unsupervised learning.' In further work, we plan to consider extensions of this approach to 'supervised learning' where we have additional variables that we wish to incorporate into the choice of linear combinations. For example, we may wish to find linear combinations of OTUs that optimally explain group membership (e.g., tobacco type in the tobacco data considered here).

Appendix: Proofs of the Lemmas
Proof of Lemma 1. Because q is the rank of X, p ! q. If p = q the result is trivial, since the columns of R span R p and so the columns of any p × d-dimensional matrix W can be expressed as a linear combination of columns of R, which establishes the result. For p > q let R ? be a (p − q) × p-dimensional matrix having orthonormal columns that span the orthogonal compliment of the space spanned by the columns of R. Because the columns of any p × d-dimensional matrix can we written in terms of the basis given by the columns of R and R ? we have W = RQ + R ? A for matrices Q and A. Inserting this form into f(W) and using R T R ? = 0 and XR ? = 0 we find f ðWÞ ¼ jjX À BQ T R T jj 2 F þ jjA T Ajj 2 F . Since A T A is a real symmetric matrix, f(W) is minimized when A = 0, i.e. when W = QR.
Proof of Lemma 2: By direct calculation, each element D jj satisfies (B T XV) jj = Djj(V T V) jj . The lemma holds because Tr½VDB T ðX À BDV T Þ ¼ TrðDB T XV À D 2 V T VÞ ¼ P d j¼1 D jj ðB T DVÞ jj À D 2 jj ðV T VÞ jj ¼ 0 elementwise. Finally, note if D jj < 0 we can replace D jj by −D jj while replacing V kj by −V kj 8k and the lemma still holds.