A Higher-Order Generalized Singular Value Decomposition for Comparison of Global mRNA Expression from Multiple Organisms

The number of high-dimensional datasets recording multiple aspects of a single phenomenon is increasing in many areas of science, accompanied by a need for mathematical frameworks that can compare multiple large-scale matrices with different row dimensions. The only such framework to date, the generalized singular value decomposition (GSVD), is limited to two matrices. We mathematically define a higher-order GSVD (HO GSVD) for N≥2 matrices , each with full column rank. Each matrix is exactly factored as Di = UiΣiVT, where V, identical in all factorizations, is obtained from the eigensystem SV = VΛ of the arithmetic mean S of all pairwise quotients of the matrices , i≠j. We prove that this decomposition extends to higher orders almost all of the mathematical properties of the GSVD. The matrix S is nondefective with V and Λ real. Its eigenvalues satisfy λk≥1. Equality holds if and only if the corresponding eigenvector vk is a right basis vector of equal significance in all matrices Di and Dj, that is σi,k/σj,k = 1 for all i and j, and the corresponding left basis vector ui,k is orthogonal to all other vectors in Ui for all i. The eigenvalues λk = 1, therefore, define the “common HO GSVD subspace.” We illustrate the HO GSVD with a comparison of genome-scale cell-cycle mRNA expression from S. pombe, S. cerevisiae and human. Unlike existing algorithms, a mapping among the genes of these disparate organisms is not required. We find that the approximately common HO GSVD subspace represents the cell-cycle mRNA expression oscillations, which are similar among the datasets. Simultaneous reconstruction in the common subspace, therefore, removes the experimental artifacts, which are dissimilar, from the datasets. In the simultaneous sequence-independent classification of the genes of the three organisms in this common subspace, genes of highly conserved sequences but significantly different cell-cycle peak times are correctly classified.


Mathematical Framework: HO GSVD
Comparative analyses of large-scale datasets promise to enhance our fundamental understanding of the data by distinguishing the similar from the dissimilar among these data.
For example, comparative analyses of global mRNA expression from multiple model organisms promise to enhance fundamental understanding of the universality and specialization of molecular biological mechanisms, and may prove useful in medical diagnosis, treatment and drug design [1]. Existing algorithms limit analyses to subsets of homologous genes among the different organisms, effectively introducing into the analysis the assumption that sequence and functional similarities are equivalent [2]. For sequence-independent comparisons [3], mathematical frameworks are required that can distinguish the similar from the dissimilar among multiple large-scale datasets tabulated as matrices with the same column dimension and different row dimensions, corresponding to the different sets of genes of the different organisms. The only such framework to date, that of the generalized singular value decomposition (GSVD) [4][5][6] is limited to two matrices.
Recently we showed that the GSVD provides a comparative mathematical framework for global mRNA expression datasets from two different organisms, tabulated as two matrices with the same column dimension and different row dimensions, where the mathematical variables and operations represent biological reality [7]. In this application, one matrix tabulates DNA microarraymeasured genome-scale mRNA expression from the yeast S. cerevisiae, sampled at n time points at equal time intervals during the cell-cycle program. This matrix is of size m 1 -S. cerevisiae genes × n-DNA microarrays. The second matrix tabulates data from the HeLa human cell line, sampled at the same number of time points, also at equal time intervals, and is of size m 2 -human genes × n-arrays. The underlying assumption of the GSVD as a comparative mathematical framework for the two matrices is that there exists a one-to-one mapping among the columns of the matrices, but not necessarily among their rows. The GSVD factors each matrix into a product of an organism-specific matrix of size m 1 -S. cerevisiae genes or m 2 -human genes × n-"arraylets," i.e., left basis vectors, an organism-specific diagonal matrix of size n-arraylets × n-"genelets," i.e., right basis vectors, and a shared matrix of size n-genelets × n-arrays.
We showed that the mathematical variables of the GSVD, i.e., the patterns of the genelets and the two sets of arraylets, represent either the similar or the dissimilar among the biological programs that compose the S. cerevisiae and human datasets. Genelets of common significance in both datasets, and the corresponding arraylets, represent cell-cycle checkpoints that are common to S. cerevisiae and human. Simultaneous reconstruction and classification of both the S. cerevisiae and human data in the common subspace that these patterns span outlines the biological similarity in the regulation of their cell-cycle programs. Patterns almost exclusive to either dataset correlate with either the S. cerevisiae or the human exclusive synchronization responses. Reconstruction of either dataset in the subspaces of the common vs. exclusive patterns represents differential gene expression in the S. cerevisiae and human conserved cell-cycle programs vs. their unique synchronization-response programs, respectively. Notably, relations such as these between the expression profiles of the S. cerevisiae genes KAR4 and CIK1, which are known to be correlated in response to the synchronizing agent, the α-factor pheromone, yet anticorrelated during cell division, are correctly depicted.
We now define a higher-order GSVD (HO GSVD) of N ≥ 2 datasets, tabulated as N real matrices D i with the same column dimension and, in general, different row dimensions. In our form of the HO GSVD, each data matrix D i is assumed to have full column rank and is factored as the product D i = U i Σ i V T , where U i is the same shape as D i (rectangular), Σ i is diagonal and positive definite, and V is square and nonsingular. The columns of U i have unit length; we call them the "left basis vectors" for D i (a different set for each i). The columns of V also have unit length; we call them "right basis vectors," and they are the same in all factorizations. In the application of the HO GSVD to a comparison of global mRNA expression from N ≥ 2 organisms, the right basis vectors are the genelets and the N sets of left basis vectors are the N sets of arraylets. We use the notation: We call {σ i,k } the "higher-order generalized singular value set." They are weights in the following sums of rank-one matrices of unit norm: Hence we regard the kth values σ i,k as indicating the significance of the kth right basis vector v k in the matrices D i (reflecting the overall information that v k captures in each D i in turn). To obtain the factorizations, we work with the matrices A i = D T i D i and the matrix sum S, defined as the arithmetic mean of all pairwise quotients The eigensystem SV = V Λ is used to define V , and the factors U i and Σ i are computed from D i and V .
To clarify our choice of S, we note that in the GSVD, defined by Van Loan [4], the matrix V can be formed from the eigenvectors of the unbalanced quotient A 1 A −1 2 (Supplementary Section 1.1). We observe that this V can also be formed from the eigenvectors of the balanced arithmetic mean S 12 = 1 2 (A 1 A −1 2 + A 2 A −1 1 ). We prove that in the case of N = 2, our definition of V by using the eigensystem of S ≡ S 12 = 1 2 (A 1 A −1 2 + A 2 A −1 1 ) leads algebraically to the GSVD and therefore, as Paige and   Figure S1. The GSVD of two matrices D 1 and D 2 is reformulated as a linear transformation of the two matrices from the two rows × columns spaces to two reduced and diagonalized left basis vectors × right basis vectors spaces. The right basis vectors are shared by both datasets. Each right basis vector corresponds to two left basis vectors.
Saunders showed [5], can be computed in a stable way. We also note that in the GSVD, the matrix V is invariant under the exchange of the two matrices D 1 and D 2 . Therefore, we define our HO GSVD for N ≥ 2 matrices by using the balanced arithmetic mean S of all pairwise arithmetic means S ij , each of which defines the GSVD of the corresponding pair of matrices D i and D j , noting that S is invariant under the exchange of any two matrices D i and D j .
We also show that the existing SVD and GSVD decompositions are in some sense special cases of our HO GSVD (Supplementary Section 1.2). Finally, we conjecture a role for our exact HO GSVD in iterative approximation algorithms (Supplementary Section 1.3).
1.1. The Matrix GSVD 1.1.1. Construction of the matrix GSVD. Suppose we have two real matrices D 1 ∈ R m1×n and D 2 ∈ R m2×n each with full column rank. Van Loan [4] defined the GSVD of D 1 and D 2 as where each U i ∈ R mi×n has orthonormal columns, X ∈ R n×n is nonsingular, and the Σ i = diag(σ i,k ) ∈ R n×n are diagonal with σ i,k > 0 (i = 1, 2). Paige and Saunders [5] showed that the GSVD can be computed in a stable way by orthogonal transformations. In the full column-rank case it takes the form where U 1 , U 2 and » Σ1 Σ2 have orthonormal columns [6], and V is square and nonsingular.
We work with the form of Supplementary Equation (6), but we find it useful and more similar to the standard SVD [6] if we assume that the columns of V are scaled to have unit length, with the columns of Σ i scaled accordingly. Note that the ratios σ 1,k /σ 2,k are not altered by the scaling.
In place of the methods of Van Loan [4] and Paige and Saunders [5], we construct the GSVD of Supplementary Equation (6) as follows. We obtain V from the eigensystem of S, the arithmetic mean of the quotients A 1 A −1 2 and A 2 A −1 1 of the matrices A 1 = D T 1 D 1 and A 2 = D T 2 D 2 : with v k = 1. Given V , we compute matrices B i by solving two linear systems and we construct Σ i and U i = (u i,1 . . . u i,n ) by normalizing the columns of B i : We prove below that S is nondefective (it has n independent eigenvectors) and its eigensystem is real. From Supplementary Equations (8) and (9) we have . We see that the rows of both D 1 and D 2 are superpositions of the same right basis vectors, the columns of V (Supplementary Figure S1). This is the construction that we generalize in Equations (1)-(4) to compute our HO GSVD.

Interpretation of the GSVD construction.
In our GSVD comparison of two matrices, we interpreted the kth diagonals of Σ 1 and Σ 2 , i.e., the "generalized singular value pair" (σ 1,k , σ 2,k ), as indicating the significance of the kth right basis vector v k in the matrices D 1 and D 2 , and reflecting the overall information that v k captures in D 1 and D 2 respectively [7]. The ratio σ 1,k /σ 2,k indicates the significance of v k in D 1 relative to its significance in D 2 .
A ratio of σ 1,k /σ 2,k = 1 corresponds to a basis vector v k of equal significance in D 1 and D 2 . GSVD comparisons of two matrices showed that right basis vectors of approximately equal significance in both matrices reflect themes that are common to the two matrices under comparison [7].
A ratio of σ 1,k /σ 2,k 1 corresponds to a basis vector v k of almost negligible significance in D 2 relative to its significance in D 1 . Likewise, a ratio of σ 1,k /σ 2,k 1 indicates a basis vector v k of almost negligible significance in D 1 relative to its significance in D 2 . GSVD comparisons of two matrices showed that right basis vectors of negligible significance in one matrix reflect themes that are exclusive to the other matrix.
1.1.3. Mathematical properties of Λ and V in the GSVD construction. Note that our GSVD construction in Supplementary Equations (7)-(9) is well defined for any square nonsingular V . We now show that our particular V leads algebraically to the GSVD of Supplementary Equation (6), ignoring the rescaled columns of In practice we would prefer not to form A i or S directly. Instead we may work with the QR factorizations D i = Q i R i , where Q T i Q i = I and R i is upper triangular and nonsingular. Define another triangular matrix R and its SVD to be where U and V are square and orthogonal, and Σ = diag( σ k ), σ k > 0. We then have where the diagonal matrix D normalizes R T 1 U so that V has columns of unit length.
Supplementary Theorem S1. The matrices U 1 and U 2 constructed in Supplementary Equation (9) with help from Supplementary Equations (10), (12) and (13). We see that both B T i B i are diagonal, and the quantities in Supplementary Equation (9) must be with U i column-wise orthonormal.
Supplementary Theorem S2. The eigenvalues of S satisfy λ k ≥ 1, and S has a full set of n eigenvectors (it is nondefective). Also, the eigenvectors are real.
Proof. The equivalence transformation of Supplementary Equation (11) shows that S has the same eigenvalues as U Λ U T , namely Λ. From the definition of Λ and Σ, the eigenvalues are λ k = 1 2 ( σ 2 k + 1/ σ 2 k ) ≥ 1. Also, in the eigensystem of S in Supplementary Equation (13), V = R T 1 U D is a product of real nonsingular matrices and hence is real and nonsingular.
Supplementary Theorem S3. An eigenvalue λ k = 1 corresponds to a right basis vector v k of equal significance in both matrices D 1 and D 2 . That is, σ 1,k /σ 2,k = 1. Figure S2. The higher-order GSVD (HO GSVD) of three matrices D 1 , D 2 , and D 3 is a linear transformation of the three matrices from the three rows × columns spaces to three reduced and diagonalized left basis vectors × right basis vectors spaces. The right basis vectors are shared by all three datasets. Each right basis vector corresponds to three left basis vectors.
Note that the GSVD is a generalization of the SVD in that if one of the matrices is the identity matrix, the GSVD reduces to the SVD of the other matrix.
In Equations (1)-(4), we now define a HO GSVD and in Theorems 1-3 and Corollary 1 we show that this new decomposition extends to higher orders all of the mathematical properties of the GSVD except for complete column-wise orthogonality of the left basis vectors that form the matrices U i for all i. We proceed in the same way as in Supplementary Equations (7)-(9).

The Matrix GSVD and SVD as Special Cases
Let us now show that the GSVD and the standard SVD are special cases of our HO GSVD.
Supplementary Theorem S4. Suppose matrices D and D j are real and have full column rank. The HO GSVD of N matrices satisfying D i = D for all i = j reduces to the GSVD of the two matrices D and D j .
The eigenvectors of S are the same as the eigenvectors V of 1 2 (AA −1 j + A j A −1 ), and Supplementary Theorem S2 shows that V exists. Solving two linear systems for B and B j and normalizing the solutions, reduces the HO GSVD of Equation (1) to the GSVD of D and D j of Supplementary Equation (6): Supplementary Theorem S1 shows that the columns of U and U j are orthonormal.
Supplementary Theorem S5. Suppose the matrix D j is real and has full column rank. The HO GSVD of N matrices satisfying D i = I for all i = j reduces to the SVD of D j .
Proof. Substituting D i = I for all i = j in Supplementary Equation (2), we obtain the matrix sum S = 1 The symmetry of S implies that its eigenvectors V exist and are orthonormal. Computing the matrix B j from gives D j = U j Σ j V T , and Supplementary Theorem S1 shows that the columns of U j are orthonormal. Hence the factorization must be the SVD of D j [6].

Role in Approximation Algorithms
Recent research showed that several higher-order generalizations are possible for a given matrix decomposition, each preserving only some but not all of the properties of the matrix decomposition [12,13]. Our HO GSVD preserves the exactness as well as the diagonality of the matrix GSVD, i.e., all N matrix factorizations in Equation (1) are exact and all N matrices Σ i are diagonal. In general, our HO GSVD does not preserve the orthogonality of the matrix GSVD, i.e., the matrices U i in Equation (1) are not necessarily columnwise orthonormal. For some applications, however, one might want to preserve the orthogonality instead of the exactness of the matrix GSVD. An iterative approximation algorithm might be used to compute for a set of N > 2 real matrices D i ∈ R mi×n , each with full column rank, an approximate decomposition where each U i ∈ R mi×n is composed of orthonormal columns, each Σ i = diag(σ i,k ) ∈ R n×n is diagonal with σ i,k > 0, and V is identical in all N matrix factorizations. If there exist an exact decomposition of Equation (1) where the matrices U i are column-wise orthonormal, then it is reasonable to expect that the iterative approximation algorithm will converge to that exact decomposition. More than that, when the iterative approximation algorithm is initialized with the exact decomposition, it is reasonable to expect convergence in just one iteration. We show below that if there exist an exact decomposition of Equation (1) in which the matrices U i are column-wise orthonormal, our HO GSVD leads algebraically to that exact decomposition. (1) where the matrices U i are column-wise orthonormal, then our particular V leads algebraically to the exact decomposition.

Supplementary Theorem S6. If there exist an exact HO GSVD of Equation
Proof. If there exist an exact decomposition with columnwise orthonormal U i for all i, then A i = D T i D i = V Σ 2 i V T and the right basis vectors, i.e., the columns of V , simultaneously diagonalize all pairwise quotients as well as their arithmetic mean SV = V Λ. Therefore, the V of the eigensystem of S in Equation (2) is equivalent to the V of the HO GSVD of Equation (1) with column-wise orthonormal U i for all i.
We conjecture, therefore, the following role for our exact HO GSVD in iterative approximation algorithms.
Supplementary Conjecture S1. An iterative approximation algorithm will converge to the optimal approximate decomposition of Supplementary Equation (16) in a significantly reduced number of iterations when initialized with our exact HO GSVD, rather than with random U i , Σ i and V . Schizosaccharomyces pombe over about two cell-cycle periods, in a culture synchronized initially by the cdc25-22 block-release late in the cell-cycle phase G 2 , relative to reference mRNA from an asynchronous culture, at 15min intervals for 240min. The S. pombe dataset we analyze (Supplementary Dataset S1) tabulates the ratios of gene expression levels for the m 1 =3167 gene clones with no missing data in at least 14 of the n=17 arrays. Of these, the mRNA expression of 380 gene clones was classified as cell cycle-regulated by Rustici et al. or Oliva et al. [16].
Spellman et al.
[17] monitored mRNA expression in the yeast Saccharomyces cerevisiae over about two cellcycle periods, in a culture synchronized initially by the α-factor pheromone in the cell-cycle phase M/G 1 , relative to reference mRNA from an asynchronous culture, at 7min intervals for 112min. The S. cerevisiae dataset we analyze (Supplementary Dataset S2) tabulates the ratios of gene expression levels for the m 2 =4772 open reading frames (ORFs), or genes, with no missing data in at least 14 of the n=17 arrays. Of these, the mRNA expression of 641 ORFs was traditionally or microarray-classified as cell cycle-regulated.  Figure S3. Correlations among the n=17 arraylets in each organism. Raster displays of U T i U i , with correlations ≥ = 0.33 (red), ≤ − (green) and ∈ (− , ) (black), show that the arraylets u i,k with k = 13, . . . , 17 that correspond to 1 λ k 2, are = 0.33-orthonormal to all other arraylets in each dataset. The corresponding five genelets v k are approximately equally significant with σ 1,k : σ 2,k : σ 3,k ∼ 1 : 1 : 1 in the S. pombe, S. cerevisiae and human datasets, respectively ( Figure 2). Following Theorem 3, therefore, these genelets span the approximately "common HO GSVD subspace" for the three datasets.
Whitfield et al.
[18] monitored mRNA levels in the human HeLa cell line over about two cell-cycle periods, in a culture synchronized initially by a double thymidineblock in S-phase, relative to reference mRNA from an asynchronous culture, at 2hr intervals for 34hr. The human dataset we analyze (Supplementary Dataset S3) tabulates the ratios of gene expression levels for the m 3 =13,068 gene clones with no missing data in at least 14 of the n=17 arrays. Of these, the mRNA expression of 787 gene clones was classified as cell cycle-regulated.
Of the 53,839, 81,124 and 222,156 elements in the S. pombe, S. cerevisiae and human data matrices, 2420, 2936 and 14,680 elements, respectively, i.e., ∼5%, are missing valid data. SVD [6] is used to estimate the missing data as described [7]. In each of the data matrices, SVD of the expression patterns of the genes with no missing data uncovered 17 orthogonal patterns of gene expression, i.e., "eigengenes." The five most significant of these patterns, in terms of the fraction of the mRNA expression that they capture, are used to estimate the missing data in the remaining genes. For each of the three data matrices, the five most significant eigengenes and their corresponding fractions are almost identical to those computed after the missing data are estimated, with the corresponding correlations >0.95 (Supplementary Mathematica Notebooks S1 and S2). This suggests that the five most significant eigengenes, as computed for the genes with no missing data, are valid patterns for estimation of missing data. This also indicates that this SVD estimation of missing data is robust to variations in the data selection cutoffs.
We compute the HO GSVD of the three data matrices after missing data estimation by using Equations (1)-(4).
Following Theorem 3, we find that the approximately common HO GSVD subspace of the three datasets is spanned by the five genelets v k k = 13, . . . , 17 (Figure 2 and Supplementary Figure S3).

Common Subspace Interpretation
In analogy with the GSVD [7], a common HO GSVD genelet and the N = 3 corresponding arraylets, which are approximately orthonormal to all other arraylets in the corresponding datasets (Theorem 3), are inferred to represent a biological process common to the three organisms and the corresponding cellular states when a consistent biological or experimental theme is reflected in the interpretations of the patterns of the genelet and the arraylets.
A genelet v k is associated with a biological process or an experimental artifact when its pattern of expression variation across the arrays, i.e., across time, is interpretable. An arraylet u i,k is parallel-and antiparallelassociated with the most likely parallel and antiparallel cellular states according to the annotations of the two groups of m genes each, with largest and smallest levels of expression in this arraylet among all m i genes, respectively. The P -value of a given association, i.e., P (y; m, m i , z), is calculated assuming hypergeometric probability distribution of the z annotations among the m i genes, and of the subset of y ⊆ z annotations among the subset of m genes, as described [39],  Figure S4. The three-dimensional least-squares approximation of the five-dimensional approximately common HO GSVD subspace. Line-joined graphs of the first (red), second (blue) and third (green) most significant orthonormal vectors in the least squares approximation of the genelets v k with k = 13, . . . , 17, which span the common HO GSVD subspace. We approximate this five-dimensional subspace with the two orthonormal vectors x (green) and y (red), which fit normalized cosine functions of two periods, and 0-and −π/2-initial phases, i.e., normalized zero-phase cosine and sine functions of two periods, respectively.
We find that the approximately common HO GSVD subspace represents cell-cycle mRNA expression in the three disparate organisms ( Figure 2 and Table 1).

HO GSVD Data Reconstruction
The decoupling of the HO GSVD genelets and N sets of arraylets, i.e., the diagonality of the matrices Σ i in Equation (1), allows simultaneous reconstruction of the N = 3 datasets in the common HO GSVD subspace without eliminating genes or arrays.
In analogy with the GSVD [7], given a common HO GSVD subspace that is spanned by the K genelets {v k } where k = n − K + 1, . . . , n, we reconstruct each dataset in terms of only these genelets and the corresponding {u i,k } arraylets, Note that this reconstruction is mathematically equiv-alent to setting to zero the higher-order generalized singular values {σ i,k } in Σ i for all k = n − K + 1, . . . , n, and then multiplying the matrices U i Σ i V T to obtain the reconstructed D i .

Simultaneous HO GSVD Classification
Identifying the subset of genelets and the corresponding arraylets that span the approximately common HO GSVD subspace allows simultaneous classification of the genes and arrays of the three datasets by similarity in their expression of these genelets or corresponding arraylets, respectively, rather than by their overall expression, as described [7]. We least squares-approximate the K=5-dimensional subspace spanned by the five genelets v k with k = 13, . . . , 17, with the two-dimensional space spanned by two of the three orthonormal vectors x, y and z ∈ R n that maximize the norm Figure S4). Since the common HO GSVD subspace represents cell-cycle mRNA expression, the two vectors that we select to approximate the common subspace, x and y, describe expression oscillations of two periods in the three time courses.
We plot the projection of each gene of the dataset D i from the K-genelets subspace onto y, i.e., e T m n k=n−K+1 (σ i,k u i,k v T k )y/a m where e mi is a unit m ivector, along the y-axis vs. that onto x along the xaxis, normalized by its ideal amplitude a m , where the contribution of each genelet to the overall projected expression of the gene adds up rather than cancels out, a 2 m = k j |σ i,k σ i,j (e T mi u i,k u T i,j e mi )v T k (xx T + yy T )v j |. In this plot, the distance of each gene from the origin is the amplitude of its normalized projection. A unit amplitude indicates that the genelets add up; a zero amplitude indicates that they cancel out. The angular distance of each gene from the x-axis is its phase in the progression of expression across the genes from x to y and back to x, going through the projection of each genelet v k in this subspace, i.e., (xx T + yy T )v k . Sorting the genes according to these angular distances gives the angular order, or classification, of the genes.
Similarly, we plot the projection of each array from the K-arraylets subspace onto n k=n−K+1 (u i,k v T k )y, i.e., y T n k=n−K+1 (σ i,k v k v T k )e n where e n is a unit n-vector, along the y-axis vs. that onto n k=n−K+1 (u i,k v T k )x along the x-axis, normalized by its ideal amplitude a n , where the contribution of each arraylet to the overall projected expression of the array adds up rather than cancels out, We sort the arrays according to their angular distances from the x-axis.
For classification, we set to zero the arithmetic mean of each genelet across the arrays, i.e., time, and that of each arraylet across the genes, such that the expression of each gene and array is centered at its time-or gene-invariant level, respectively.  (c) Line-joined graphs of the 13th (red), 14th (blue), 15th (green), 16th (orange) and 17th (violet) arraylets fit one-period cosines with initial phases similar to those of the corresponding genelets ( Figure 2).
With all 3167 S. pombe, 4772 S. cerevisiae and 13,068 human genes sorted, the expression variations of the five k = 13, . . . , 17 arraylets from each organism approximately fit one period cosines, with the initial phase of each arraylet similar to that of its corresponding genelet. The global mRNA expression of each organism, reconstructed in the common HO GSVD subspace, approximately fits a traveling wave, oscillating across time and across the genes (Supplementary Figures  S5-S7).

Sequence-Independence of the Classification
Our new HO GSVD provides a comparative mathematical framework for N ≥ 2 large-scale DNA microarray datasets from N organisms tabulated as N matrices that does not require a one-to-one mapping between the genes of the different organisms. The HO GSVD, therefore, can be used to identify genes of common function across dif-ferent organisms independently of the sequence similarity among these genes, and to study, e.g., nonorthologous gene displacement [3]. The HO GSVD can also be used to identify homologous genes, of similar DNA or protein sequences in one organism or across multiple organisms, that have different cellular functions. We examine, for example, the HO GSVD classifications of genes of significantly different cell-cycle peak times [19] but highly conserved sequences [20,21]. We consider three subsets of genes, the closest S. pombe, S. cerevisiae and human homologs of (i) the S. pombe gene BFR1, which belongs to the evolutionarily highly conserved ATP-binding cassette (ABC) transporter superfamily [22][23][24][25][26][27][28], (ii) the S. cerevisiae phospholipase B-encoding gene PLB1 [29,30], and (iii) the S. pombe strongly regulated S-phase cyclin-encoding gene CIG2 [31,32] (Supplementary Table S1). We find, notably, that these genes are correctly classified (Figure 4).  Figure S6. S. cerevisiae global mRNA expression reconstructed in the five-dimensional common HO GSVD subspace with genes sorted according to their phases in the two-dimensional subspace that approximates it. (a) Expression of the sorted 4772 genes in the 17 arrays, centered at their gene-and array-invariant levels, showing a traveling wave of expression. (b) Expression of the sorted genes in the 17 arraylets, centered at their arraylet-invariant levels. Arraylets k = 13, . . . , 17 display the sorting. (c) Line-joined graphs of the 13th (red), 14th (blue), 15th (green), 16th (orange) and 17th (violet) arraylets fit one-period cosines with initial phases similar to those of the corresponding genelets ( Figure 2).