A Higher-Order Generalized Singular Value Decomposition for Comparison of Global mRNA Expression from Multiple Organisms
In this raster display of Equation (1) with overexpression (red), no change in expression (black), and underexpression (green) centered at gene- and array-invariant expression, the S. pombe, S. cerevisiae and human global mRNA expression datasets are tabulated as organism-specific genes × 17-arrays matrices $D_1$, $D_2$ and $D_3$. The underlying assumption is that there exists a one-to-one mapping among the 17 columns of the three matrices but not necessarily among their rows. These matrices are transformed to the reduced diagonalized matrices $\Sigma_1$, $\Sigma_2$ and $\Sigma_3$, each of 17-“arraylets,” i.e., left basis vectors × 17-“genelets,” i.e., right basis vectors, by using the organism-specific genes × 17-arraylets transformation matrices $U_1$, $U_2$ and $U_3$ and the shared 17-genelets × 17-arrays transformation matrix $V^T$. We prove that with our particular $V$ of Equations (2)–(4), this decomposition extends to higher orders all of the mathematical properties of the GSVD except for complete column-wise orthogonality of the arraylets, i.e., left basis vectors that form the matrices $U_1$, $U_2$ and $U_3$. We therefore mathematically define, in analogy with the GSVD, the “common HO GSVD subspace” of the matrices to be the subspace spanned by the genelets, i.e., right basis vectors $v_k$, that correspond to higher-order generalized singular values that are equal, $\sigma_{1,k} = \sigma_{2,k} = \sigma_{3,k}$, where, as we prove, the corresponding arraylets, i.e., the left basis vectors $u_{1,k}$, $u_{2,k}$ and $u_{3,k}$, are orthonormal to all other arraylets in $U_1$, $U_2$ and $U_3$.
We show that, like the GSVD for two organisms, the HO GSVD provides a sequence-independent comparative mathematical framework for datasets from more than two organisms, where the mathematical variables and operations represent biological reality: Genelets of common significance in the multiple datasets, and the corresponding arraylets, represent cell-cycle checkpoints or transitions from one phase to the next, common to S. pombe, S. cerevisiae and human.
Simultaneous reconstruction and classification of the three datasets in the common subspace that these patterns span outline the biological similarity in the regulation of their cell-cycle programs. Notably, genes of significantly different cell-cycle peak times but highly conserved sequences are correctly classified.
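The decomposition and the common-subspace reconstruction described above can be illustrated with a minimal NumPy sketch. It rests on several assumptions: each dataset is a genes × arrays matrix of full column rank, written here as D_i = U_i Σ_i V^T; and, because Equations (2)–(4) are not reproduced in this section, the shared matrix V is taken to be the eigenvectors of the arithmetic mean S of all pairwise quotients A_i A_j^{-1}, with A_i = D_i^T D_i. The function names `ho_gsvd`, `common_subspace_indices` and `reconstruct`, and the tolerance `tol`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def ho_gsvd(datasets):
    """Sketch of an HO GSVD: D_i = U_i @ diag(sigma_i) @ V.T for every dataset.

    Assumes every D_i has the same number of columns and full column rank.
    Returns (U_list, sigma_list, V, eigvals).
    """
    N = len(datasets)
    A = [D.T @ D for D in datasets]            # arrays x arrays Gram matrices
    A_inv = [np.linalg.inv(a) for a in A]
    n = A[0].shape[0]

    # Assumed construction: mean of all pairwise quotients A_i A_j^{-1}.
    S = np.zeros((n, n))
    for i in range(N):
        for j in range(i + 1, N):
            S += A[i] @ A_inv[j] + A[j] @ A_inv[i]
    S /= N * (N - 1)

    # S is not symmetric; its eigenvalues are real in exact arithmetic,
    # so the real part is taken only as a guard against roundoff.
    eigvals, V = np.linalg.eig(S)
    eigvals = eigvals.real
    V = V.real / np.linalg.norm(V.real, axis=0)   # unit-norm genelets

    U_list, sigma_list = [], []
    for D in datasets:
        B = np.linalg.solve(V, D.T).T          # solves V B^T = D^T
        sigma = np.linalg.norm(B, axis=0)      # higher-order generalized singular values
        U_list.append(B / sigma)               # unit-norm arraylets
        sigma_list.append(sigma)
    return U_list, sigma_list, V, eigvals


def common_subspace_indices(eigvals, tol=1e-6):
    """Indices k whose eigenvalue of S is 1 within tol (equal sigma_{i,k} for all i)."""
    return np.flatnonzero(np.abs(eigvals - 1.0) < tol)


def reconstruct(U, sigma, V, idx):
    """Reconstruct one dataset from the selected arraylet/genelet pairs only."""
    return (U[:, idx] * sigma[idx]) @ V[:, idx].T
```

Calling `ho_gsvd([D1, D2, D3])` and then `reconstruct(U_list[i], sigma_list[i], V, idx)` with `idx = common_subspace_indices(eigvals)` gives each dataset restricted to the common subspace. With measured data the eigenvalues are rarely exactly 1, so in practice one would probably loosen `tol` or select the eigenvalues closest to 1; that choice is left open here.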