Dimensionality reduction of longitudinal ’omics data using modern tensor factorizations

Longitudinal ’omics analytical methods are extensively used in the evolving field of precision medicine, by enabling ‘big data’ recording and high-resolution interpretation of complex datasets, driven by individual variations in response to perturbations such as disease pathogenesis, medical treatment or changes in lifestyle. However, inherent technical limitations in biomedical studies often result in the generation of feature-rich and sample-limited datasets. Analyzing such data using conventional modalities often proves to be challenging since the repeated, high-dimensional measurements overload the outlook with inconsequential variations that must be filtered from the data in order to find the true, biologically relevant signal. Tensor methods for the analysis and meaningful representation of multiway data may prove useful to the biological research community by their advertised ability to tackle this challenge. In this study, we present tcam—a new unsupervised tensor factorization method for the analysis of multiway data. Building on top of cutting-edge developments in the field of tensor-tensor algebra, we characterize the unique mathematical properties of our method, namely, 1) preservation of geometric and statistical traits of the data, which enable uncovering information beyond the inter-individual variation that often takes over the focus, especially in human studies. 2) Natural and straightforward out-of-sample extension, making tcam amenable for integration in machine learning workflows. A series of re-analyses of real-world, human experimental datasets showcase these theoretical properties, while providing empirical confirmation of tcam’s utility in the analysis of longitudinal ’omics data.


Introduction
Recent developments in high-throughput methodologies enable the assessment of molecular entities from biological samples on a global scale at steadily decreasing costs, allowing to conduct biological and clinical studies at previously unfeasible magnitude, in terms of the number of biological repetitions and molecules quantified [1]. A consequence of the increased availability of omics methods is the possibility to conduct large-scale longitudinal studies prospectively following participants at a variety of clinical contexts. In particular, longitudinal omics profiling, combined with clinical measurements, enable to detect and understand individual changes from baseline, improving personalized health and medicine by using tailored therapies [2].
Yet, despite the surge of longitudinal multi-omics studies, the toolset for analyses which fully utilize multi-way structures in the data remains limited to date, with only a handful of applicable algorithms and software being suitable for specific tasks [3][4][5].
Higher-order tensors (multi-way arrays of numbers) are arguably the most natural data structures for describing high-dimensional, multi-way data such as longitudinal omics data. Indeed, an impressive adoption of tensor factorization methods for time-series analysis has recently emerged, allowing trajectory analysis for microbiome data [6][7][8] as well as neural dynamics [9]. Generally referred to as tensor component analysis (TCA) [9], these multiway dimensionality reduction methods for omics data provide a view on global-multivariate variations in the data, thus complement methods such as [3][4][5] for univariate time-series analysis. The TCA is most often computed using CANDECOMP/PARAFAC (CP) factorization [10,11], which dramatically limits the ability to apply machine learning (ML) algorithms, as it does not allow for straightforward mapping of unseen data points to the reduced space. In addition, CP-based TCA requires choosing the number of components (dimensions) to be considered, since different choices may result in significantly different transformations of the data, additional uncertainties in analyzing complex information are introduced. Moreover, when alternating least squares (ALS) is used for finding the CP factors, there is no guarantee that the resulting factors correspond to the best low-rank approximation.
In this article, we present TCAM, a method for unsupervised dimensionality reduction method, which provide answer to the unmet need for trajectory analysis of longitudinal omics data (Fig 1). Our method is based on a cutting-edge mathematical framework (the Mproduct between tensors), which allows for a natural generalization of the notion of singular value decomposition (SVD) for matrices (2 nd order tensors) to higher order tensors [12]. In contrast to existing ALS-CP-based methods, TCAM factors do not depend on prior choice of target rank, resulting in consistent outputs between executions. Additionally, given a fitted TCAM factorization, the embedding of a new trajectory is straightforward, unlike CP-based methods in which such computation is untrivial. Moreover, TCAM enjoys provable optimality regarding the variance and distortion of the embeddings. Finally, we stress that unlike other tensor methods (e.g., [6]), that rely on statistical properties of the data and its generative process, TCAM makes no assumptions regarding the structure of the data, making it suitable for any choice of normalization method. Thus, allowing the usage of TCAM with diverse data types (e.g., different 'omics sources) and providing the flexibility needed to answer different kinds of questions.
Following the formal construction of TCAM, we present a series of comparative analyses, using real-world data collected from longitudinal 'omics studies. These experiments, and their TCAM analyses, highlight how TCAM overcomes both the shortcomings of using traditional, matrix-based workflows for the analysis of these data, and those that remain when employing existing, state-of-the-art tensor methods. Given its theoretical guarantees, TCAM should be applicable to any kind of 'omics data (assuming the adequate normalization of the data). Empirical evidence for this claim is given through the re-analysis of a longitudinal proteomics dataset, in which we utilized TCAM to uncover insights that were not disclosed in the original work from which the data was taken (though established in the literature). Finally, we showcase the straightforward application of TCAM to supervised machine learning workflows, where TCAM serves as a 'drop-in' feature-engineering step in the pipeline by the virtue of its out-ofsample extension. Center rhombus describes the typical data produced in a longitudinal experiment, where high-dimensional samples are collected from multiple subjects across various timepoints. Left hand side describes a summarized overview of standard, non-tensor-based workflows, including (top to bottom) ordination plots with repeated measurements, per-timepoint multivariate analysis, and funnels for discovery using univariate time-series analysis. Top right-schematic derivation of TCAM from TSVDM. Bottom right-TCAM's output and its applications, including exploratory analysis of the data through a reduced features space where variation between points reflects differences between high-dimensional temporal trajectories, feature engineering for downstream ML workflows, and feature selection for downstream univariate exploration. https://doi.org/10.1371/journal.pcbi.1010212.g001

Preliminaries and notations
A real tensor of order-N, denoted by A 2 R d 1 �d 2 �����d N , is a multi-dimensional array with real entries indexed by N-tuples. For example, the i 1 , i 2 , . . ., i N entry of A is denoted by A i 1 ;i 2 ;...;i N . In this paper, we consider 3 rd order tensors A 2 R m�p�n holding data from p-dimensional samples, collected from m subjects across n time-points. The size of p is determined by the number of features measured in the 'omics method being used, e.g., it can be the number of observed bacterial species in metagenomics sequencing or the number of genes in transcriptomics. We consider a "subject centered" view of the data tensor, that is, viewing the order-3 tensor as an m-long "list of matrices" where each matrix corresponds to a time-series collected from a single individual, as illustrated in Fig 2a and 2b. We use MATLAB notations for slicing and indexing of tensors, e.g., A i;:;: 2 R 1�p�n denotes the i th horizontal slice of A, which may be considered as a p × n matrix. Fibers of A are obtained by fixing two indices. Of particular interest are the tube-fibers that are the n dimensional vectors A i;j;: .
We briefly define the tensor-tensor product following the construction in [12]. Let M be an n × n invertible matrix, then the mode-3 product the tensor A with the matrix M is denoted by the tensor c A ¼ A� 3 M whose tube-fibers are given by c A i;j;: ¼ MA i;j;: . Note that since M is non-singular, it holds that c A� 3 M À 1 ¼ A. The face-wise product of tensors A 2 R m�p�n and B 2 R p�r�n is denoted by the tensor C ≔ A 4 B 2 R m�r�n where each slice C :;:;i is given by the matrix product A :;:;i B :;:;i 2 R m�r . The ? M tensor-tensor product of A and B is defined  [12]. Briefly, the best t-rank q (multi-rank ρ) approximations of a tensor A are obtained by t-rank q (multirank ρ) truncation of A's TSVDM. In this work, we further generalize these results to a novel notion of explicit rank truncation (see S1 Text). To maintain consistency, throughout all

PLOS COMPUTATIONAL BIOLOGY
demonstrations in this study, we considered M defined by the (scaled) Discrete Cosine Transform (DCT-II). However, we encourage TCAM users to experiment with different transforms, e.g. the software implementation allows for sampling a random M 2 O(n) at uniform (Haar) distribution.

Proposed method: TCAM
The mean sample of a tensor A, is defined as The TCAM of A is defined by a scores matrix Z 2 R m�pn whose entries are and a np × p loadings matrix V with entries V h;j ¼ b V r h;1 ;j;r h;2 . Algorithm 1 for obtaining TCAM highlights that most of the computational effort is due to the TSVDM, which requires Oðnðpm 2 þ m 3 ÞÞ arithmetic operations, omitting the overhead involved with applying the × 3 M operations that are of negligent cost compared to that of the SVD computation. The explicit rank-q truncated TCAM is obtained by keeping the first q columns and rows of Z and V respectively. Each row of the factors matrix represents the p-dimensional time-series (trajectory) of each subject, while the loadings matrix measures the contribution-magnitude and direction-of each of the p 'omics features to each of the TCAM factors across samples. Given a time-series of samples X 2 R 1�p�n , not necessarily included in the horizontal slices of A, the transformation of X to the reduced dimensional space defined by the TCAM fitted using A is given by Algorithm 2 and is equal to This ability to transform new data points to the reduced features space makes TCAM amenable for use as a feature-engineering step in supervised ML workflows.

Algorithm 1 TCAM construction
Transformation parameters: r; V; � A

Algorithm 2 TCAM projection
Input: X 2 R 1�p�n a single (multivariate) time-series Parameters: TCAM transform parameters: The transformation in Eq 3 (Fig 4), is a member of the family of pseudo ? M -orthogonal, explicit rank-q truncated tensor-to-vector mappings (see S1 Text). As such, it enjoys two fundamental properties: variance maximization and minimization of the distortion. Formally, let Z ¼ QðAÞ 2 R m�q denote the image of a tensor A 2 R m�p�n (in MDF) under a pseudo ? Morthogonal explicit rank-q truncated tensor-to-vector mapping Q, then 1. The sample-variance of the image, Tr(Z T Z)/(m − 1), is maximized when Q is defined by the explicit rank-q truncated TCAM.
2. The distortion in the configuration caused by Q, kAA T − ZZ T k�, where A 2 R m�np is the matrix obtained by mode-1 unfolding of A, is minimized when Q is defined by the explicit rank-q truncated TCAM.
Combined, these properties provides a formal justification for applying multivariate hypothesis testing methods such as PERMANOVA [13] to the reduced space representation of the data, in addition to meaningful and intuitive interpretation for the results of such methods. Proofs for these statements, along with additional details, are given in S1 Text.
We now remark regarding the choice of q. The dimension of the reduced space depends on the purpose of the analysis. For example, one might choose q based on the traditional considerations for choosing number of principal components, such as signal-noise ratio assumptions or based on scree plots. Regardless of the considerations for choosing q, it should be noted that unlike CP, where the target rank of the approximation affects the resulting embedding, the TCAM factors are pre-determined by the data (and M), up to a variance and distortion invariant multiplication by a unitary tensor. Thus, the configuration reflected by the first i TCAM components remains unchanged when taking any q � i.

TCAM reveals information beyond inter-individual variations
In this first experiment, we wish to exemplify the shortcomings of traditional, matrix-based, dimensionality reduction methods when applied to longitudinal 'omics data. More often then not, longitudinal 'omics data is characterized by high inter-individual differences, regardless of time or state in which samples were taken. Traditional matrix-based dimensionality reduction methods such as PCA (which attempt to find a representation in a reduced dimensional space in which the variance is maximized) are prone to mask interesting temporal variations by highlighting the prominent inter-individual differences. To demonstrate this point, we use data from a work by Suez et. al. [14], investigating the reconstitution of the gut microbiome in healthy individuals following antibiotic administration. In the original study, participants were split into three study arms-21 day-long probiotics supplementation (PBX), autologous fecal microbiome transplantation (aFMT) derived from a pre-antibiotics treated sample, or spontaneous recovery (CTR). Stool samples were collected at baseline (BAS, days 0 to 6), during antibiotics treatment (ABX, days 7 to 13) and the intervention phase (INT, days 14 to 42).
Indeed, PCA of all time points resulted in a representation that mainly reflects inter-individual differences (Fig 5a and S1(a) Fig), while temporal intra-individual variations remained obscured. The high correlation between "baseline configuration" and "complete configuration" (Pearson, ρ = 0.72, p < 10 −10 , Fig 5b) implies that the repeated samples add very little information to the ordination. Similarly, a per-phase perspective of the data did not capture changes in microbiome composition, but a mere snapshot of temporal trends (S1(b), S1(c), S1(d) and S1(e) Fig). These results demonstrate that even in the presence of a temporal perturbations as substantial as the effect of antibiotics treatment on the microbiome, PCA is unable to utilize the longitudinal sampling for picking up signals, which were attainable in the case of single timepoint study design. In contrast, application of TCAM to the data following log-folds baseline normalization (LFB, S1 Text), generated a coherent representation of the data (Fig 5c) where each point in the reduced space represents a complete temporal trajectory of a subject throughout the entire experiment. Additionally, the TCAM scores approximate the true distances between input trajectories (S1 Text) providing clear and accurate interpretation of the configuration in the resulting embedding and proportion of variation explained by each ordination axis. Multivariate hypothesis testing revealed significant differences between trajectories of the FMT group and those of PBX (Fig 5c; PERMANOVA, P < 0.05). In light of the normalization scheme used, these result renders that courses of change in gut microbiome composition of subjects supplemented with probiotics following antibiotics administration, significantly differed from those of individuals who underwent autologous FMT-in agreement with the findings of Suez et. al.
To further investigate the sources of variation between the groups, we considered the highest magnitude TCAM loadings associated with F 1 (Fig 5d), which accounts for � 11% of the variation in the data by itself, and exhibits significant differences between groups (ANOVA, P < 0.05). Inspection of the top 5% contributing features to the variation in F 1 , highlighted five probiotic species of large-positive magnitude, namely B.bifidum, L.aciduphilus, L.rhamnosus

PLOS COMPUTATIONAL BIOLOGY
and L.lactis, meaning that higher values of these species in the LFB normalized data are "pushing" F 1 scores upwards (or, to the right hand side of Fig 5c). Thus, we conclude that these probiotic species which were in fact consumed by the PBX group [14], may play a major role in separating the PBX group from the rest of the cohort, providing a strong demonstration that results obtained using TCAM have meaningful biological interpretation.
Routine exploration for features exhibiting differential trends between groups across time typically involves univariate time-series hypothesis testing such as repeated measures ANOVA and linear mixed effect model (lmer), followed by false discovery rate correction necessitated by the number of features being tested. Considering the typically large number of features in 'omics data, combined with the assumption that only a few of these features actually differ between the groups, this strategy may prove too stringent, discarding true signals due to an arbitrary large initial number of features. Indeed, applying lmer to each species in the initial dataset (482 features), resulted in 128 taxa showing different temporal trends between groups, reflected by the statistically significant estimation of the interaction term (S1 Text), with only 11 bacteria maintaining statistically significant values after correction for multiple testing, none of which is a probiotic species (S2(a) and S2(b) Fig). In contrast, when TCAM was used as a pruning strategy, by considering only top TCAM loadings (S1 Text) to reduce the number of initial features being tested, the final set of species presenting significantly different temporal trends between groups contained twenty-three new bacteria (two of these are probiotic species, To demonstrate the application of TCAM to the exploratory analysis of dense longitudinal time-series, we used the ECAM dataset [15], which contains stool microbiome collected in high frequency during the first two years of life for 43 infants. In this example, we put our focus on the well characterized differences between development course of microbiomes of infants that were vaginally delivered (V), and those of infants delivered by cesarean section (C). Observing the leading TCAM factors of the data, we notice significant differences between microbiome trajectories of the two modes, highlighted both by univariate test and a (linear kernel) support vector machine classifier (SVC) (S4(a) Fig; PERMANOVA; P < 0.01). To find the bacteria of highest contribution to the separation between the temporal trajectories of the two modes of birth, we computed the magnitude of the feature loadings when projected onto the normal direction to the boundary of decision (S4(b) Fig). Indeed, among the top 1% contributing features, we were able to identify bacteria such as Bacteroides, Bifidobacterium and Enterobacteriaceae that were previously found to exhibit differential behavior across time between the birth modes. Applying univariate tests following the same pruning strategy as in the post-antibiotics reconstitution example, resulted in a fine-grained, coherent view of the discriminatory features (S4(c) and S4(d) Fig).

Comparison with existing tensor-based methods
Next, we evaluated TCAM's performance in comparison to a state-of-the-art tensor factorization method: Context-aware Tensor Factorization (CTF) [6]. Unlike TCAM, CTF is designed specifically for 16S metagenomic sequencing data, and substantially utilizes the compositionality and sparsity of the data by finding a CP decomposition which best approximates the non-zero values. For this demonstration, we chose the RS4 interventional dataset, comparing the effect of four different types of RS4 fiber administration (tapioca, maize, corn and potato) on the microbiome composition measured using 16S sequencing of stool samples collected each week during a five weeks long trial [16]. The four arms of this experiment were defined by the source of fibers: tapioca and maize groups represent sources of fermentable fibers, while potato and corn groups mostly contain fibers that are inaccessible for microbiome degradation, thus they are considered control groups. In the original paper, the authors noticed significant differences in specific time-points in the tapioca and maize groups, but did not report any results of timeseries analysis.
CTF was applied to the count data while considering five components (S1 Text), resulting in significant differences noted between trajectories of participants in the maize group from those of individuals supplemented with corn and potato (PERMANOVA; P < 0.05, Fig 6a), with significantly different factor scores between the groups obtained for the fourth CTF component (accounting for 0.11% of the squared sum of singular values, ANOVA; P = 0.005). However, when considering the features of highest contribution to variability of the scores on the fourth factor for downstream-univariate time-series analysis, none of the selected features have shown a significantly different temporal pattern between groups. In contrast to CTF, in which the factorization is inseparable from the robust centered log-ratio (rclr) normalization scheme, TCAM makes very few assumptions regarding the data and its generative model, thus allowing higher degree of flexibility when normalizing the data. First, on top of (matrix) rclr normalization [17] (S1 Text), we further employed deviation from baseline transformation (DFB, S1 Text) before applying TCAM to the data, in order to have a view of the data that is centered at each individual's deviation from their personal baseline. Indeed, truncated TCAM resulted in significant differences between maize group to the remaining study groups, in addition to observed differences between corn and tapioca groups (PERMANOVA; S5(a) Fig).
Leading TCAM loadings associated with factors differentiating between groups were considered for time-series analysis (S1 Text), which highlighted the differences in the relative abundance levels of P.distasonis and Enterobacteriaceae between the groups (lmer; P < 0.05, Q < 0.1, S5(b) and S5(c) Fig). To further demonstrate TCAM's flexibility, we applied the factorization to the same dataset following LFB normalization (S1 Text). The alternative normalization uncovered significantly different temporal trends between the maize group to all of the remaining groups in the cohort (PERMANOVA; P < 0.05 , Fig 6b), and the pruning strategy revealed seven bacteria, including the above mentioned taxa, featuring a statistically significant trend throughout time (lmer; P < 0.05, Q < 0.05, Fig 6c and 6d S5(c) Fig). Moreover, using the top loadings associated with F 3 , we highlighted additional features demonstrating patterns of increasing bacteria in the form of Lachnospiraceae in the maize group and P. distasonis in the tapioca group (Fig 6e).

Universal applicability to 'omics data
To assess TCAM's applicability to 'omics data other than metagenomics, we use the proteomics data set from Sailani et. al. [18] concerning seasonal patterns of the human microbiome, transcriptome, metabolome and proteome. The original study cohort contains data collected from 105 individuals, where each participant donated about twelve samples during a three-year period (one sample every three months). Here, we set the focus on a subset of the data which contains proteomics samples of individuals featuring insulin sensitivity (IS) or insulin resistance (IR), and addressed the differences between proteomic trajectories of these two groups throughout time.
Using TCAM following DFB normalization of the data (see S1 Text), we detected a significant separation between IR and IS groups based on first factor's scores (t-test; P < 0.05), suggesting that a considerable portion of the data's variability is explained by differences between these groups (Fig 7a). Similar to our analysis framework above, we turned to the top loadings associated with the first factor (Fig 7b). Among the top ranked proteins, we could easily notice: angiotensinogen (AGT), which was previously associated with IS condition [19]; paraoxonase-1 (PON1), that has been found to down-regulate insulin resistance in mice [20]; apolipoprotein-3 (APOC3), highly associated with IR [21]; and increasing a 2 -HS-glycoprotein (AHSG), which is indeed tightly associated with IS [22]. The high level of consistency between these

PLOS COMPUTATIONAL BIOLOGY
TCAM-produced findings and the existing literature convinces that our method is generally applicable to 'omics data.

Application to supervised ML
In the context of standard supervised-ML, classification or regression models are trained using a labeled training set, and are expected to generally apply to (unlabeled) inputs. Note that TCAM is essentially an unsupervised dimensionality reduction method, unlike label aware methods such as Avocado [8], which takes into account phenotypic information for computation of the embedding, TCAM's is computed without any label information. Thanks to its natural out-ofsample extension (Eq 3), TCAM is amenable for seamless integration as a feature engineering step in supervised-ML workflows. To demonstrate this ability, we utilized a 16S rRNA metagenomics dataset from a study by Schirmer et al., which contains stool samples collected from pediatric UC patients monitored for 52 weeks under three different treatments. The original study characterized microbial dynamics along disease course, in light of host response to each of the applied treatments [23]. Here, we construct a supervised ML model which uses longitudinal metagenomics data to classify remission status labeled by flare (FLR) and remission (REM).

PLOS COMPUTATIONAL BIOLOGY
We constructed an ML pipeline which, consisted of filtration, baseline normalization, TCAM based feature engineering step, followed by Multi Layer Perceptron (MLP) classifier (S1 Text).
First, an inspection of the complete dataset using TCAM, as well as PCA for the inspection of each time-point separately, failed to reveal differences between microbiome trajectories of FLR and REM groups (PERMANOVA; P > 0.05, S6(a), S6(b) and S6(c) Fig). This lack of clear structures in the data with respect to the labels, suggests that the task of modelling remission status using temporal microbiome data is highly challenging. Yet remarkably, the TCAM based MLP classifier managed to achieve noticeable scores for the prediction (AUROC = 0.78, Fig  8a) when evaluated in five-fold cross-validation iterations (S1 Text). For the sake of comparison, equivalent classifiers, which take a single time-point as input (S1 Text), In addition, the straightforward out-of-sample extension of TCAM enabled downstream analysis of the classification model's decision making process. Using standard feature importance utilities, we were able to evaluate the contribution of features in the original space (OTUs) to the model's performance (Fig 8b). For example, we were able to pinpoint Anaerococcus and B.fragilis whose trajectories contribution to the decision making process were among the highest of all taxa, and additional OTUs annotated with F.prausnitzii, which was linked to the differences between the groups in the original paper.
To further validate the TCAM-based feature importance results, we re-applied TCAM to a reduced dataset containing features whose importance score is in the top 5%. Focusing on these top ranked features, we were able to identify clustering according to remission states (Fig  8c), confirming that TCAM-based feature engineering truthfully preserves existing structures in the data (S6(g) Fig).
Collectively, these demonstrations highlight TCAM's ability to uncover predictive aspects underlying longitudinal data, while enabling seamless integration as a drop-in feature-engineering step in supervised ML workflows. These capacities enable to extend the transformation to samples outside the training dataset, and maintain traceable and meaningful relationships with the original features space. While it may be that other pre-processing schemes would result in better performances, especially for the Rocket based classifier, which is considered state-of-the-art, this case highlights that TCAM based classifier requires minimal pre-processing of the data in order to perform reasonably well, thus putting TCAM as a powerful method for feature engineering.

Discussion
In this work, we present TCAM, a dimensionality reduction method for longitudinal 'omics data analysis, constructed on the premise of recent tensor-tensor algebra innovations. We demonstrate that TCAM outperforms traditional and state-of-the-art methods for longitudinal analysis dimensionality reduction, both in terms of signature detection and by pruning for meaningful features. In addition, we show that TCAM is applicable to diverse omics types, including amplicon and shotgun sequencing as well as proteomics. Furthermore, unlike other tensor factorization methods, TCAM entertains a natural out-of-sample extension formula, making it suitable for prediction tasks in complex experimental designs as a drop-in feature engineering utility within ML workflows. We demonstrate that we can preserve the feature importance contribution of the original features, even when TCAM is applied.
To our knowledge, TCAM is the first tensor component analysis framework that is guaranteed, within the specific choice of domain transformation, to maximize the variance of the latent representation while keeping the distortion minimal. While distinct choices of M would generally result in different TCAM embeddings (and transformations), the explicit rank-q truncation of each resulting TCAM makes the q dimensional transform maximizing the variance and minimizing distortion in the algebraic framework defined by each M. One possible way to define the 'best' M for a given dataset, is the M for which the implicit rank of the (un-truncated) decomposition of the data is minimized. We consider this choice as the 'best' option as it surely provides the compressed most possible representation of the data. Alternatively, considering that any TSVDM (thus, any TCAM) may be written as approximation in CP form [12,Section 6.C], we get that the implicit rank under ? M of the data equals, by definition, to its tensor-rank, making the task of finding the 'best' M equivalent to finding the tensor-rank of the data. Since tensor-rank computation constitutes a difficult problem in general [25], we conclude that our definition for the 'best' M is unhelpful as it is generally impossible to lay one's hands on. Yet, we have shown that when dealing with time-series data, taking M as the discrete cosine transformation, TCAM is amenable for traditional downstream applications often used in biological data analysis, such as multivariate hypothesis testing and ML workflows.
While TCAM proves to constitute a useful tool for the analysis of longitudinal experimental designs, it relies on a fully sampled cohorts, i.e., where all participants provide the same number of samples corresponding to similar time points. Even though missing data imputation is a classic use-case for low-rank approximations in general [6,9] and the recent progress made in the applications of TSVDM to incomplete data [26], the accuracy and reliability of reconstructed data generally depend on assumptions regarding the generative process of the data, the frequency of observed values or their distribution across subjects, features and timepoints.
Maintaining TCAM's universality to all kinds of 'omics data necessitates a detachment of the factorization from imputation and normalization of the data. Currently, prior to applying TCAM, it is up to the user to impute missing samples and normalize the data by any method that is appropriate to the data of interest. In this work, we have demonstrated TCAM's power in longitudinal 'omics data analysis while considering naïve and straight-forward schemes for normalization and imputation (S1 Text).
Looking forward, the mathematical properties of TCAM may enable to not only perform a trajectory analysis across time, but to also harness spatial patterns of data collected across different body sites. Future TCAM versions would enable the factorization of higher order tensors, allowing for better understanding of even more complex experimental designs, such as concomitant incorporation of space and time. A probabilistic formulation for TCAM, i.e., as a model for generating high dimensional time-series data that is subjected to some prior assumptions imposed by the type of data, may also be useful in order to handle missing data and determination of uncertainty in estimates and predictions. Moreover, the tensor-tensor M product framework [12], which is the theoretical foundation underlying TCAM, may be further utilized to produce additional factorization schemes, such as the decomposition of dissimilarity tensors for microbiome applications, non-negative factorizations intended for count data and more.
To conclude, the presented approach may address an important, previously unmet need for longitudinal 'omics data analysis by introducing a toolkit that enables trajectory analysis, which we make available to the wide community as a simple, one-stop-shop Python implementation (https://github.com/UriaMorP/mprod_package), that is compatible with the highly popular scikit-learn package. We believe that the application of TCAM would help derive deep insights from large-scale, longitudinal and multi-omics data, while facilitating personalized medicine-based data mining and interpretation, thereby leading to the development of tailored treatments and preventive strategies for human diseases.