Abstract
The rapid growth of high-dimensional biological data has necessitated advanced data fusion techniques to integrate and interpret complex multi-omics and longitudinal datasets. Shared and unshared structure across such datasets can be identified in an unsupervised manner with Advanced Coupled Matrix and Tensor Factorization (ACMTF), but this cannot be related to an outcome. Conversely, N-way Partial Least Squares (NPLS) is supervised and captures outcome-associated variation but cannot identify shared and unshared structure. To bridge the gap between data exploration and prediction, we introduce ACMTF-Regression (ACMTF-R), an extension of ACMTF that incorporates a regression step, allowing for the simultaneous decomposition of multi-way data while explicitly capturing variation associated with a dependent variable. We present a detailed mathematical formulation of ACMTF-R, including its optimisation algorithm and implementation. Through extensive simulations, we systematically evaluate its ability to recover a small $\mathbf{y}$-related component shared between multiple blocks, its robustness to noise, and the impact of the tuning parameter ($\pi$), which controls the balance between data exploration and outcome prediction. Our results demonstrate that ACMTF-R can robustly identify the $\mathbf{y}$-related component, correctly identifying outcome-associated shared and distinct variation, distinguishing it from existing approaches such as NPLS and ACMTF. The development of ACMTF-R was motivated by a real-world dataset investigating how maternal pre-pregnancy BMI affects the human milk microbiome, human milk metabolome, and infant faecal microbiome. Emerging evidence suggests that inter-generational transfer of maternal obesity may affect multiple omics layers, highlighting the need to identify outcome-associated variation. The applicability of ACMTF-R is therefore validated by applying it to this multi-omics dataset. ACMTF-R successfully identifies novel mother-infant relationships associated with maternal pre-pregnancy BMI, underscoring its utility in multi-omics research. Our findings establish ACMTF-R as a versatile tool for multi-way data fusion, offering new insights into complex biological systems by integrating common, local, and distinct variation in the context of a dependent variable.
Citation: van der Ploeg GR, White FTG, Jakobsen RR, Westerhuis JA, Heintz-Buschart A, Smilde AK (2026) ACMTF-R: Supervised multi-omics data integration uncovering shared and distinct outcome-associated variation. PLoS One 21(1): e0339650. https://doi.org/10.1371/journal.pone.0339650
Editor: Sefki Kolozali, University of Essex Faculty of Science and Engineering, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: September 4, 2025; Accepted: December 9, 2025; Published: January 12, 2026
Copyright: © 2026 van der Ploeg et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The CMTFtoolbox R package is available on CRAN, with a development version on GitHub (https://doi.org/10.5281/zenodo.16633119). The underlying code for this paper is supplied in a GitHub repository at https://doi.org/10.5281/zenodo.16633167.
Funding: GRvdP was funded by a grant from the University of Amsterdam, Research Priority Area on Personal Microbiome Health. FW was funded by a grant from the University of Amsterdam, Data Science Centre.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The advent of high-throughput omics technologies has led to a rapid growth of complex, high-dimensional biological datasets, including genomics [1–3], transcriptomics [4], proteomics [5], metabolomics [4], and microbiome data [6]. When measured on the same samples, these multi-omics datasets give researchers the opportunity to obtain a systems-level understanding of biological processes [5]. However, their integration and analysis pose significant challenges due to data heterogeneity [5,7], high dimensionality [4,6], and complex interactions between different omics datasets [8,9]. Advanced data analytical methods are required to extract meaningful biological insights by distinguishing variation that is shared across datasets from variation that is dataset-specific [10–13].
A critical need in the analysis of multi-omics datasets is the identification of common, local and distinct sources of variation (Fig 1; [8–10,14]). Common variation is shared across all datasets, local variation is shared between a subset of datasets, and distinct variation is unique to a dataset. Identifying the Common, Local and Distinct (CLD) structure of a multi-omics dataset elucidates the complex biological processes that occur within the system.
Fig 1. Common variation $C$ is shared between all datasets $\underline{\mathbf{X}}^{(1)}$, $\underline{\mathbf{X}}^{(2)}$ and $\underline{\mathbf{X}}^{(3)}$. Local variation is shared between some of the datasets. Hence $L_{12}$ is shared between $\underline{\mathbf{X}}^{(1)}$ and $\underline{\mathbf{X}}^{(2)}$, but not $\underline{\mathbf{X}}^{(3)}$. Similarly, $L_{13}$ is shared between $\underline{\mathbf{X}}^{(1)}$ and $\underline{\mathbf{X}}^{(3)}$, but not $\underline{\mathbf{X}}^{(2)}$, and $L_{23}$ is shared between $\underline{\mathbf{X}}^{(2)}$ and $\underline{\mathbf{X}}^{(3)}$, but not $\underline{\mathbf{X}}^{(1)}$. Distinct variation is unique to one dataset: $D_1$ for block $\underline{\mathbf{X}}^{(1)}$, $D_2$ for block $\underline{\mathbf{X}}^{(2)}$, and $D_3$ for block $\underline{\mathbf{X}}^{(3)}$. Together, they make up the Common, Local and Distinct (CLD) structure of the data.
Advanced Coupled Matrix and Tensor Factorization (ACMTF) has emerged as a powerful multi-way data integration approach capable of uncovering common, local, and distinct sources of variation across datasets [10,15–18]. However, ACMTF is unsupervised and therefore does not allow for the identification of shared and distinct variation related to a dependent variable. N-way Partial Least Squares (NPLS) [19] is a supervised method that identifies variation related to a dependent variable, but is limited to analysing one dataset at a time and thus cannot simultaneously identify common, local, and distinct variation across multiple datasets.
This work introduces Advanced Coupled Matrix and Tensor Factorization Regression (ACMTF-R), building on four strands of prior work: (i) the separation of joint and individual structures in matrices [14,20] and tensors [10], (ii) relating latent variables to an outcome [19,21], (iii) performing a coupled decomposition for any combination of matrices and tensors [15,22,23], and (iv) a tuning parameter ($\pi$) that balances data reconstruction and prediction [24–26]. This approach enables unified extraction of common, local, and distinct variation across multiple data blocks while modelling outcome-associated structure.
In this study, we first provide a detailed overview of the ACMTF framework and the methodological development represented by ACMTF-R. We then present a simulation-based approach to systematically evaluate the ability of ACMTF-R to capture a small, hidden, $\mathbf{y}$-related component under different noise conditions and tuning parameter values. Finally, we apply ACMTF-R to a real-world dataset, integrating human milk (HM) microbiome, HM metabolome, and infant gut microbiome data to study how maternal pre-pregnancy body mass index (ppBMI) is associated with these data. Our findings highlight the potential of ACMTF-R as a versatile tool for multi-omics data integration, facilitating the discovery of biologically meaningful common, local and distinct variation, while accommodating both exploratory and predictive research objectives.
Methods
Notation and definitions
We briefly define the mathematical notation used throughout this paper. The notation proposed by Kiers [27] is followed with some minor extensions for multi-omics. Scalars are denoted by lowercase letters (e.g., $x$). Column vectors are denoted by boldface lowercase letters (e.g., $\mathbf{x}$). Matrices are denoted by boldface capital letters (e.g., $\mathbf{X}$). Three-way arrays are denoted by underlined boldface capital letters (e.g., $\underline{\mathbf{X}}$, $\underline{\mathbf{Y}}$, $\underline{\mathbf{Z}}$). While $\underline{\mathbf{X}}$ can more generally be used to indicate four-way or higher-order arrays, this paper does not discuss such data.
The characters $I$, $J$ and $K$ are reserved to indicate the first, second and third mode of an array. A two-way array $\mathbf{X}$ will be assumed to be of size $I \times J$, while a three-way array $\underline{\mathbf{X}}$ will be assumed to be of size $I \times J \times K$. Similarly, the lowercase characters $i$, $j$, and $k$ will be used as indices for the first, second, and third modes, respectively. Columns of a matrix are vectors with a subscript of their index (e.g., $\mathbf{a}_r$ for some matrix $\mathbf{A}$). Similarly, the $i$-th row and $j$-th column entry is denoted as a scalar with subscripts of their indices (e.g., $x_{ij}$). For three-way arrays, the entries per mode are given as subscripts (e.g., $x_{ijk}$).
The index $m$ is used to indicate the block number, which is denoted by a superscript (e.g., $\underline{\mathbf{X}}^{(m)}$), and the size is indicated likewise (e.g., the size of $\underline{\mathbf{X}}^{(m)}$ is $I \times J^{(m)} \times K^{(m)}$). In ACMTF and ACMTF-R, the blocks can be any mixture of two-way or three-way arrays, but this study focuses only on the case where all blocks are three-way arrays with a shared subject mode. Hence the superscript $(m)$ is not used to describe the size of the first mode $I$.
Mathematically it is convenient to matricise a three-way array $\underline{\mathbf{X}}$, as this transformation facilitates the description of multi-way models using matrix notation [27–29]. There are three different variants of matricisation, depending on which mode of the three-way array is preserved as rows in the resulting matrix. Hence, first mode matricisation yields a matrix $\mathbf{X}_{(1)}$ of size $I \times JK$. Similarly, second mode matricisation produces a matrix $\mathbf{X}_{(2)}$ of size $J \times IK$, while third mode matricisation generates a matrix $\mathbf{X}_{(3)}$ of size $K \times IJ$.
The vectorisation of an $I \times J$ matrix $\mathbf{X}$ is denoted $\mathrm{vec}(\mathbf{X})$, such that $\mathrm{vec}(\mathbf{X})$ has size $IJ \times 1$. The Hadamard product [30] of two equally sized arrays $\mathbf{X}$ and $\mathbf{Y}$ is denoted $\mathbf{X} \ast \mathbf{Y}$, such that $(\mathbf{X} \ast \mathbf{Y})_{ij} = x_{ij}\,y_{ij}$. The Kronecker product, denoted $\otimes$, of two matrices $\mathbf{A}$ with size $I \times J$ and $\mathbf{B}$ with size $K \times L$ produces a block matrix containing all pairwise products of their entries. Thus, we have:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I1}\mathbf{B} & \cdots & a_{IJ}\mathbf{B} \end{bmatrix},$$

where the outcome of $\mathbf{A} \otimes \mathbf{B}$ is of size $IK \times JL$.
The Khatri-Rao product, denoted $\odot$, of two matrices $\mathbf{A}$ of size $I \times R$ and $\mathbf{B}$ of size $J \times R$ is defined as the column-wise Kronecker product [31,32]:

$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_R \otimes \mathbf{b}_R \end{bmatrix},$$

where the outcome of $\mathbf{A} \odot \mathbf{B}$ has size $IJ \times R$.
The Frobenius (or Euclidean) norm of a three-way array is defined as:

$$\left\|\underline{\mathbf{X}}\right\| = \sqrt{\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} x_{ijk}^2},$$

and the notation $\|\mathbf{X}\|$ and $\|\mathbf{x}\|$ will also be used to refer to the Frobenius norm for matrices and vectors, respectively [28]. The Moore-Penrose inverse [33] of $\mathbf{X}$ is denoted $\mathbf{X}^{+}$.
In this paper we follow the canonical polyadic form for defining multi-way arrays as the sum of rank-one arrays with normalised factor matrices and a scalar to capture the magnitude per term [15,34–36]. The notation $[\![\boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$ is used to describe the model of the three-way array $\underline{\mathbf{X}}$, defined elementwise as

$$x_{ijk} = \sum_{r=1}^{R} \lambda_r\, a_{ir}\, b_{jr}\, c_{kr},$$

where the columns of the loading matrices are norm 1 and the column vector $\boldsymbol{\lambda}$ is used to modify the size of each identified component $r$ to norm $\lambda_r$. For a two-way array, this model reduces to $\mathbf{X} = \mathbf{A}\mathbf{B}^{\mathsf{T}}$, assuming that $\mathbf{A}$ has size $I \times R$ and that the singular values are multiplied into $\mathbf{B}$ to stay in line with common practice.
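As a sanity check on this notation, the following hedged NumPy sketch builds $[\![\boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$ elementwise and verifies the standard first-mode matricisation identity $\mathbf{X}_{(1)} = \mathbf{A}\,\mathrm{diag}(\boldsymbol{\lambda})\,(\mathbf{C} \odot \mathbf{B})^{\mathsf{T}}$; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3

# Norm-1 loading matrices; the component magnitudes are carried by lambda.
A = rng.standard_normal((I, R)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((J, R)); B /= np.linalg.norm(B, axis=0)
C = rng.standard_normal((K, R)); C /= np.linalg.norm(C, axis=0)
lam = np.array([3.0, 2.0, 1.0])

# Elementwise definition: x_ijk = sum_r lambda_r a_ir b_jr c_kr
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

# First-mode matricisation (I x JK) via the Khatri-Rao product C (Khatri-Rao) B.
khatri_rao = np.column_stack([np.kron(C[:, r], B[:, r]) for r in range(R)])
X1 = A @ np.diag(lam) @ khatri_rao.T
```

The identity holds because each column of $\mathbf{C} \odot \mathbf{B}$ vectorises one rank-one slab of the model.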
Advanced Coupled Matrix and Tensor Factorization (ACMTF)
Advanced Coupled Matrix and Tensor Factorization is an all-at-once joint factorisation approach that is applicable to mixtures of matrices and tensors [10,15,16,37]. The method seeks a decomposition of the datasets as

$$\underline{\mathbf{X}}^{(m)} \approx [\![\boldsymbol{\lambda}^{(m)}; \mathbf{A}^{(m)}, \mathbf{B}^{(m)}, \mathbf{C}^{(m)}]\!],$$

such that loading matrices corresponding to shared modes are equal between the blocks. In this paper, we will assume that only the first mode is shared between all blocks. In that case, the matrix containing the first mode loadings $\mathbf{A}^{(m)}$ will be the same for every block $m$, and the superscript can be left out.
The loss function for ACMTF is defined as

$$f = \sum_{m=1}^{M}\left\|\underline{\mathbf{X}}^{(m)} - [\![\boldsymbol{\lambda}^{(m)}; \mathbf{A}, \mathbf{B}^{(m)}, \mathbf{C}^{(m)}]\!]\right\|^{2} + \gamma\sum_{m=1}^{M}\sum_{r=1}^{R}\left[\left(\|\mathbf{a}_r\|-1\right)^{2} + \left(\|\mathbf{b}_r^{(m)}\|-1\right)^{2} + \left(\|\mathbf{c}_r^{(m)}\|-1\right)^{2}\right] + \beta\sum_{m=1}^{M}\left\|\boldsymbol{\lambda}^{(m)}\right\|_{1},$$

where the matrix $\boldsymbol{\Lambda}$ (whose rows are the vectors $\boldsymbol{\lambda}^{(m)}$) scales the factors per component and per data block, $\gamma$ is the penalty setting for norm-1 components, $\beta$ is the sparsity penalty setting for $\boldsymbol{\Lambda}$, $R$ is the number of components, and $m$ is the block index. Here the 1-norm loss term for the elements $\lambda$ in $\boldsymbol{\Lambda}$ is replaced with a differentiable approximation $\sqrt{\lambda^{2} + \epsilon}$ for a sufficiently small $\epsilon$ [37,38]. The originally suggested default settings for $\gamma$, $\beta$ and $\epsilon$ are used [10]. The gradient of the loss function is reported in Supplementary Methods in S1 File.
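To make the structure of this loss concrete, the following is a minimal NumPy sketch of evaluating a loss of this form. The function names, the quadratic form of the norm-1 penalty, and the default values of `gamma`, `beta` and `eps` are illustrative assumptions, not the package's implementation:

```python
import numpy as np

def cp_tensor(lam, A, B, C):
    # [[lambda; A, B, C]]: x_ijk = sum_r lambda_r a_ir b_jr c_kr
    return np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

def acmtf_loss(blocks, lams, A, Bs, Cs, gamma=1.0, beta=1e-3, eps=1e-8):
    # Reconstruction error per block, quadratic penalties pushing each loading
    # column towards norm 1, and a smooth l1 penalty on the lambda entries.
    f = 0.0
    for m, X in enumerate(blocks):
        f += np.sum((X - cp_tensor(lams[m], A, Bs[m], Cs[m]))**2)
        for M_ in (A, Bs[m], Cs[m]):
            f += gamma * np.sum((np.linalg.norm(M_, axis=0) - 1.0)**2)
        f += beta * np.sum(np.sqrt(np.asarray(lams[m])**2 + eps))
    return f

# Noiseless one-block example: unit-norm loadings, magnitudes carried by lambda,
# so only the (smoothed) sparsity penalty contributes to the loss.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 2)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((5, 2)); B /= np.linalg.norm(B, axis=0)
C = rng.standard_normal((3, 2)); C /= np.linalg.norm(C, axis=0)
lam = np.array([2.0, 1.0])
X = cp_tensor(lam, A, B, C)
f = acmtf_loss([X], [lam], A, [B], [C])
```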
The ACMTF framework of putting the norm of a component into $\boldsymbol{\Lambda}$ and constraining the loading vectors to become norm one allows the $\boldsymbol{\Lambda}$ matrix to encode the Common, Local, and Distinct (CLD) structure of the data (Fig 1). For example, the CLD-structure of a three-block case $(\underline{\mathbf{X}}^{(1)}, \underline{\mathbf{X}}^{(2)}, \underline{\mathbf{X}}^{(3)})$ can be encoded in $\boldsymbol{\Lambda}$ through

$$\boldsymbol{\Lambda} = \begin{array}{c|ccccccc}
 & C & L_{12} & L_{13} & L_{23} & D_1 & D_2 & D_3 \\ \hline
\underline{\mathbf{X}}^{(1)} & \times & \times & \times & 0 & \times & 0 & 0 \\
\underline{\mathbf{X}}^{(2)} & \times & \times & 0 & \times & 0 & \times & 0 \\
\underline{\mathbf{X}}^{(3)} & \times & 0 & \times & \times & 0 & 0 & \times
\end{array}\,,$$

where $\times$ corresponds to data block $m$ contributing to component $r$, and 0 corresponds to the data block not contributing to a component. Here, $C$ indicates a common component shared between all blocks; $L_{12}$, $L_{13}$ and $L_{23}$ indicate local components shared between some blocks; and $D_1$, $D_2$ and $D_3$ indicate distinct components unique to one block. Sparsity is then achieved by applying an $L_1$ penalty on the $\boldsymbol{\Lambda}$-matrix. While this example uses 7 components to showcase the entire CLD-structure, multiple components may be needed to fully describe a specific type of variation in real data (e.g., two components for $L_{12}$), and some types of variation may be absent altogether.
Advanced Coupled Matrix and Tensor Factorization Regression (ACMTF-R)
We present Advanced Coupled Matrix and Tensor Factorization Regression (ACMTF-R) to describe common, local, and distinct variation of interest in the data blocks in the context of a dependent variable by adding a regression term and a tuning parameter $\pi$ to the loss function. The introduction of a tuning parameter has been used successfully in other approaches such as Principal Covariates Regression (PCovR) [24,26] and Multiway Covariates Regression (MCovR) [25] to steer the solution towards explaining the data blocks or towards predicting $\mathbf{y}$. This gives the user more information about how the addition of $\mathbf{y}$ affects the model of the data. We only define ACMTF-R in the case where the first mode is shared across all data blocks and $\mathbf{y}$ is univariate. Hence $\mathbf{y}$ is a column vector of size $I \times 1$. The loss function of ACMTF-R is then defined as

$$f = (1-\pi)\sum_{m=1}^{M}\left\|\underline{\mathbf{X}}^{(m)} - [\![\boldsymbol{\lambda}^{(m)}; \mathbf{A}, \mathbf{B}^{(m)}, \mathbf{C}^{(m)}]\!]\right\|^{2} + \pi\left\|\mathbf{y} - \mathbf{A}\boldsymbol{\rho}\right\|^{2} + \gamma\sum_{m=1}^{M}\sum_{r=1}^{R}\left[\left(\|\mathbf{a}_r\|-1\right)^{2} + \left(\|\mathbf{b}_r^{(m)}\|-1\right)^{2} + \left(\|\mathbf{c}_r^{(m)}\|-1\right)^{2}\right] + \beta\sum_{m=1}^{M}\left\|\boldsymbol{\lambda}^{(m)}\right\|_{1},$$

where $\pi$ is a tuning parameter used to focus the model on explaining the data blocks $\underline{\mathbf{X}}^{(1)}, \underline{\mathbf{X}}^{(2)}, \ldots, \underline{\mathbf{X}}^{(M)}$ versus predicting $\mathbf{y}$, and $\boldsymbol{\rho}$ is the $R \times 1$ vector of regression coefficients of the factor matrix of the first mode $\mathbf{A}$ onto $\mathbf{y}$. Here $\boldsymbol{\rho}$ is given when $\mathbf{A}$ is defined, through

$$\boldsymbol{\rho} = \mathbf{A}^{+}\mathbf{y}.$$

The gradient of the loss function is reported in Supplementary Methods in S1 File.
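The weighting idea behind this loss can be sketched in a few lines of NumPy. This is a hedged illustration of the $(1-\pi)$/$\pi$ trade-off with the closed-form $\boldsymbol{\rho} = \mathbf{A}^{+}\mathbf{y}$; the function names are invented for the example and do not come from the CMTFtoolbox package:

```python
import numpy as np

def regression_coefficients(A, y):
    # Closed-form least squares solution of y ~ A rho via the Moore-Penrose inverse
    return np.linalg.pinv(A) @ y

def acmtfr_objective(blocks, models, A, y, pi_=0.5):
    # `models` holds the reconstructed CP model per block (same shapes as `blocks`);
    # the norm and sparsity penalties of the full loss are omitted for brevity.
    rho = regression_coefficients(A, y)
    data_term = sum(np.sum((X - M)**2) for X, M in zip(blocks, models))
    pred_term = np.sum((y - A @ rho)**2)
    return (1.0 - pi_) * data_term + pi_ * pred_term

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
y = A @ np.array([1.0, -2.0])        # y lies exactly in the column space of A
blocks = [np.ones((2, 3, 2))]
models = [np.zeros((2, 3, 2))]       # data misfit of 12 (twelve unit entries)
f = acmtfr_objective(blocks, models, A, y, pi_=0.25)
```

Because $\mathbf{y}$ is perfectly predicted here, the objective reduces to $(1-\pi)$ times the data misfit, illustrating how $\pi$ shifts weight between reconstruction and prediction.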
Prediction of $\mathbf{y}$ using a new sample
A fitted ACMTF-R model can be used to predict $\hat{y}$ for a new sample using a similar procedure as Latent Root Regression [39,40]. First, this requires a brief description of obtaining the subject mode loadings for a new sample in CANDECOMP/PARAFAC (CP) [34,36,41,42]. Subsequently, this result is extended to ACMTF-R, after which the prediction step is described.

In the CP case, finding the subject mode loadings $\mathbf{a}_{\mathrm{new}}$ with size $1 \times R$ for a sample $\mathbf{X}_{\mathrm{new}}$ with size $J \times K$ requires solving the problem:

$$\min_{\mathbf{a}_{\mathrm{new}}}\left\|\mathrm{vec}(\mathbf{X}_{\mathrm{new}}) - (\mathbf{C} \odot \mathbf{B})\,\mathbf{a}_{\mathrm{new}}^{\mathsf{T}}\right\|^{2},$$

where $\mathbf{B}$ and $\mathbf{C}$ are the feature and time mode loadings of a previously fitted model [36]. This problem statement comes down to the least squares solution:

$$\mathbf{a}_{\mathrm{new}}^{\mathsf{T}} = \mathbf{Z}^{+}\,\mathrm{vec}(\mathbf{X}_{\mathrm{new}}),$$

where the block matrix $\mathbf{Z} = \mathbf{C} \odot \mathbf{B}$ with size $JK \times R$ is used for convenience and where $\mathbf{Z}^{+}$ is the Moore-Penrose inverse of $\mathbf{Z}$.
In the case of ACMTF-R, the shared subject mode loadings across the blocks need to be identified. This is done by finding the block matrix $\mathbf{Z}^{(m)}$ with size $J^{(m)}K^{(m)} \times R$ through

$$\mathbf{Z}^{(m)} = \left(\mathbf{C}^{(m)} \odot \mathbf{B}^{(m)}\right)\mathrm{diag}\!\left(\boldsymbol{\lambda}^{(m)}\right),$$

which can be concatenated across all blocks into one matrix for all data blocks simultaneously as

$$\mathbf{Z} = \begin{bmatrix}\mathbf{Z}^{(1)} \\ \mathbf{Z}^{(2)} \\ \vdots \\ \mathbf{Z}^{(M)}\end{bmatrix},$$

where $\mathbf{Z}$ is of size $\left(\sum_{m=1}^{M} J^{(m)}K^{(m)}\right) \times R$. The shared subject mode loadings $\mathbf{a}_{\mathrm{new}}$ with size $1 \times R$ for a new sample can then be found by reusing the least squares solution above with the vectorised blocks stacked accordingly.

Finally, the prediction of $\hat{y}_{\mathrm{new}}$ is done using the ACMTF-R regression coefficients $\boldsymbol{\rho}$ through

$$\hat{y}_{\mathrm{new}} = \mathbf{a}_{\mathrm{new}}\,\boldsymbol{\rho}.$$

When multiple new samples are obtained, this procedure is performed per sample. The procedure is performed by the npred() function in the supplied R package.
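The projection-then-predict procedure above can be sketched for a single block as follows. This hedged NumPy illustration (names such as `Z`, `a_new` and `rho` follow the text; the code itself is not from the R package) simulates a noiseless new sample and recovers its subject loadings and prediction:

```python
import numpy as np

rng = np.random.default_rng(3)
J, K, R = 5, 4, 2

# Previously "fitted" feature and time loadings with unit-norm columns.
B = rng.standard_normal((J, R)); B /= np.linalg.norm(B, axis=0)
C = rng.standard_normal((K, R)); C /= np.linalg.norm(C, axis=0)
lam = np.array([2.0, 1.0])
rho = np.array([0.5, -1.0])

# Block matrix Z = (C Khatri-Rao B) diag(lambda), size JK x R.
Z = np.column_stack([np.kron(C[:, r], B[:, r]) for r in range(R)]) @ np.diag(lam)

# A new sample that exactly follows the model with subject loadings a_true.
a_true = np.array([1.5, -0.5])
x_new_vec = Z @ a_true

# Least squares subject loadings for the new sample: a_new = Z^+ vec(X_new).
a_new = np.linalg.pinv(Z) @ x_new_vec

# Predicted outcome: y_hat = a_new @ rho.
y_hat = a_new @ rho
```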
Implementation and stopping criteria
The decomposition of ACMTF is achieved through an all-at-once optimization algorithm, originally developed in [15] for MATLAB, and now available in the CMTFtoolbox package for R on CRAN (development version on GitHub and https://doi.org/10.5281/zenodo.16633119). In ACMTF-R, this algorithm is modified in two ways: (1) the loss and gradient of the all-at-once optimization step are modified to include the tuning between data reconstruction and prediction of $\mathbf{y}$ (Supplementary Methods in S1 File), and (2) the regression coefficients ($\boldsymbol{\rho}$) are found using a closed-form solution in a separate step. The all-at-once optimisation is achieved by defining the loss and gradient function and subsequently using the nonlinear conjugate gradient (NCG) method with Hestenes-Stiefel updates and the Moré-Thuente line search, as originally suggested [15,18]. For the provided R package, this was implemented using the mize package (v0.2.4, [43]). The provided package also supports the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm to speed up the line search as an experimental feature, but this setting is not used for any results in this paper [44–46].
ACMTF-R uses the same default values as ACMTF for the stopping criteria based on the minimum value of the loss function and the minimum change in loss function between two function evaluations [37]. The all-at-once optimization implementation through the mize package also defines four other stopping criteria: the maximum number of iterations (default 10 000), the maximum number of function evaluations (default 10 000), the minimum l2-norm of the gradient vector, and the absolute value of the size of the parameter update. Fitting an ACMTF-R model for any of the datasets in this paper takes less than a minute of computational time on an average computer (Microsoft Windows 11 Home v10.0.22631 with a 12th Gen Intel® Core™ i5-12400F running at 2500 MHz with 6 cores or 12 logical processors and 16 GB RAM) and is comparable to the computational cost of ACMTF due to the closed-form regression step. Multi-core parallelisation has been implemented as part of the provided R package to allow many randomly initialised models to be fitted simultaneously. This is needed to efficiently find the appropriate number of components through cross-validation, as well as to find the global minimum for a given number of components.
Simulation approach
We demonstrate the effect of the ACMTF-R model parameters using various simulations (Fig 2; code supplied at https://doi.org/10.5281/zenodo.16633167). We follow the guidelines provided in previous work [10]. We generate factor matrices $\mathbf{A}$, $\mathbf{B}^{(m)}$ and $\mathbf{C}^{(m)}$ with random entries drawn from the standard normal distribution $\mathcal{N}(0,1)$. The columns of the factor matrices are normalized to unit norm. The mode sizes are set to resemble a typical pre-processed dataset (for example, a mixture of microbiome and metabolomics data). The factor matrices are used to create three third-order tensors through

$$\underline{\mathbf{X}}^{(m)} = [\![\boldsymbol{\lambda}^{(m)}; \mathbf{A}, \mathbf{B}^{(m)}, \mathbf{C}^{(m)}]\!], \quad m = 1, 2, 3,$$

where we encode the required CLD-structure into $\boldsymbol{\Lambda}$, consisting of eight components in total: two common components shared across all blocks, three local components, and a distinct component for each block. This corresponds to a $\boldsymbol{\Lambda}$-matrix in which each contributing block has an entry of 1 in the corresponding column; all components are norm 1, except for the second local component, whose size varies per simulation.

Fig 2. All simulations contain a CLD-structure of 2 common components shared between all blocks, 3 local components, and a distinct component unique to each block. This yields 8 components in total for ACMTF-R to identify. All components are norm 1, except for the second local component, whose norm varies per simulation. $\mathbf{y}$ is equal to the subject mode loadings of the second local component. Noise and terms corresponding to components that have no contribution to a block have been removed for visual clarity. Subject mode loadings (black lines) are equal between the data blocks per component but are different between components.

Next, randomly distributed noise is added to the third-order tensors through

$$\underline{\mathbf{X}}^{(m)}_{\mathrm{noisy}} = \underline{\mathbf{X}}^{(m)} + \eta\,\frac{\left\|\underline{\mathbf{X}}^{(m)}\right\|}{\left\|\underline{\mathbf{N}}^{(m)}\right\|}\,\underline{\mathbf{N}}^{(m)},$$

where $\eta$ indicates the noise level, which is the same across all blocks, and $\underline{\mathbf{N}}^{(m)}$ contains standard normally distributed noise. We generate $\mathbf{y}$ equal to the subject mode loadings of the second local component, and some noise is added through

$$\mathbf{y}_{\mathrm{noisy}} = \mathbf{y} + \eta_{y}\,\frac{\|\mathbf{y}\|}{\|\boldsymbol{\epsilon}\|}\,\boldsymbol{\epsilon},$$

where $\eta_{y}$ indicates the noise level on $\mathbf{y}$ and $\boldsymbol{\epsilon}$ contains standard normally distributed noise. The outcome $\mathbf{y}$ is then normalized to norm 1.
Simulation 1: ACMTF-R can detect a small, hidden component by leveraging $\mathbf{y}$

In the first simulation, the ability of ACMTF-R to detect a small, hidden component by leveraging $\mathbf{y}$ was examined. This was done using a three-block simulation with a small norm for the hidden local component, 40% noise on the data blocks ($\eta = 0.4$), and 5% noise on $\mathbf{y}$ ($\eta_y = 0.05$). The data were then block scaled to Frobenius norm 1. One hundred randomly initialised eight-component ACMTF models were compared with one hundred randomly initialised eight-component ACMTF-R models using different values of $\pi$, keeping all other parameter and convergence settings at their default values. Subsequently, factor recovery was assessed using the Factor Match Score and the Tucker Congruence Coefficient. The line search algorithm in most ACMTF-R models was terminated due to the stopping criterion on the absolute size of the parameter update, while ACMTF was terminated only due to the minimum value of the loss function being reached (Supplementary Figure 1 in S1 File).
Simulation 2: The benefit of supervision using $\mathbf{y}$ versus the size of the hidden component

In the second simulation, the improvement of ACMTF-R compared to ACMTF was assessed by changing the size of the small, hidden component. This was done using a range of values for the norm of the hidden local component. The noise on the data blocks was set at 40% ($\eta = 0.4$), and the noise on $\mathbf{y}$ was 5% ($\eta_y = 0.05$). The data were then block scaled to Frobenius norm 1. One hundred randomly initialised eight-component ACMTF models were compared with one hundred randomly initialised eight-component ACMTF-R models using different values of $\pi$ for each case, keeping all other parameter and convergence settings at their default values. Subsequently, factor recovery was assessed using the Factor Match Score and the Tucker Congruence Coefficient. The line search algorithm in most ACMTF-R models was terminated due to the stopping criterion on the absolute size of the parameter update, while most ACMTF models were terminated only due to the minimum value of the loss function being reached (Supplementary Figure 2 in S1 File).
Simulation 3: The benefit of supervision using $\mathbf{y}$ versus noise on $\underline{\mathbf{X}}$

In the third simulation, the improvement of ACMTF-R compared to ACMTF was assessed by changing the amount of noise on the data blocks. This was done using a fixed norm for the hidden local component and a variable amount of noise on the data blocks, while the noise on $\mathbf{y}$ was kept at 5% ($\eta_y = 0.05$). The data were then block scaled to Frobenius norm 1. One hundred randomly initialised eight-component ACMTF models were compared with one hundred randomly initialised eight-component ACMTF-R models using different values of $\pi$ for each case, keeping all other parameter and convergence settings at their default values. Subsequently, factor recovery was assessed using the Factor Match Score and the Tucker Congruence Coefficient. For most noise levels, the line search algorithm in most ACMTF-R models was terminated due to the stopping criterion on the absolute size of the parameter update, while most ACMTF models were terminated only due to the minimum value of the loss function being reached (Supplementary Figure 3 in S1 File). At higher noise levels, most ACMTF models were terminated due to the relative change in the loss function being reached.
Simulation 4: The benefit of supervision using $\mathbf{y}$ versus noise on $\mathbf{y}$

In the fourth and final simulation, the improvement of ACMTF-R compared to ACMTF was assessed by changing the amount of noise on $\mathbf{y}$. This was done using the three-block simulation with a small norm for the hidden local component, 40% noise on the data blocks ($\eta = 0.4$), and a variable amount of noise on $\mathbf{y}$. The data were then block scaled to Frobenius norm 1. One hundred randomly initialised eight-component ACMTF models were compared with one hundred randomly initialised eight-component ACMTF-R models using different values of $\pi$ for each case, keeping all other parameter and convergence settings at their default values. Subsequently, factor recovery was assessed using the Factor Match Score and the Tucker Congruence Coefficient. The line search algorithm in most ACMTF-R models was terminated due to the stopping criterion on the absolute size of the parameter update, while ACMTF was terminated only due to the minimum value of the loss function being reached (Supplementary Figure 4 in S1 File).
Simulation evaluation – Tucker Congruence Coefficient
In this paper, the recovered factors are compared against the input factors of the simulation using the absolute Tucker Congruence Coefficient (TCC) [47,48] through

$$\mathrm{TCC}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^{\mathsf{T}}\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|},$$

where $\mathbf{x}$ and $\mathbf{y}$ are the recovered and true factors, respectively. The TCC can be interpreted as the cosine of the angle between $\mathbf{x}$ and $\mathbf{y}$. A TCC value in the range of 0.85–0.94 suggests that the recovered and input factors are fairly similar, whereas a value of 0.95–1 means that they are essentially equal [49]. Although the TCC is insensitive to scalar multiplication of the vectors (that is, $\mathrm{TCC}(\alpha\mathbf{x}, \beta\mathbf{y}) = \mathrm{TCC}(\mathbf{x}, \mathbf{y})$ for any non-zero scalars $\alpha$ and $\beta$), it is sensitive to sign flips between $\mathbf{x}$ and $\mathbf{y}$ [49]. For this reason, we use the absolute value instead.
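The TCC and its sign-insensitive variant are one-liners; a hedged NumPy sketch (names are illustrative):

```python
import numpy as np

def tcc(x, y):
    # Cosine of the angle between x and y; sensitive to sign flips.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def abs_tcc(x, y):
    # Absolute TCC, insensitive to sign flips, as used in the paper.
    return abs(tcc(x, y))

x = np.array([1.0, 2.0, 3.0])
```

For example, `tcc(x, -x)` is exactly -1, while `abs_tcc(x, -x)` is 1, which is why the absolute value is used when comparing recovered and true factors.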
Simulation evaluation – Factor Match Score
The Factor Match Score (FMS) is an adaptation of the Tucker Congruence Coefficient and indicates the similarity of two multi-way models across all components. For model evaluation, the FMS was used to compare the decomposition against the input factors. This depends on the CLD-type of the relevant factor. For the hidden local component $r$ in the simulations, $\mathrm{FMS}_r$ is defined as

$$\mathrm{FMS}_r = \frac{\left|\mathbf{a}_r^{\mathsf{T}}\hat{\mathbf{a}}_r\right|}{\|\mathbf{a}_r\|\,\|\hat{\mathbf{a}}_r\|} \cdot \frac{\left|\mathbf{b}_r^{\mathsf{T}}\hat{\mathbf{b}}_r\right|}{\|\mathbf{b}_r\|\,\|\hat{\mathbf{b}}_r\|} \cdot \frac{\left|\mathbf{c}_r^{\mathsf{T}}\hat{\mathbf{c}}_r\right|}{\|\mathbf{c}_r\|\,\|\hat{\mathbf{c}}_r\|},$$

where $\mathbf{a}_r$, $\mathbf{b}_r$ and $\mathbf{c}_r$ contain the true subject, feature and time loadings of component $r$, and $\hat{\mathbf{a}}_r$, $\hat{\mathbf{b}}_r$ and $\hat{\mathbf{c}}_r$ are the best matching subject, feature and time loadings in the fitted model. The FMS of the other components is defined equivalently. In the simulations, the best matching component is found using the Hungarian algorithm [50,51] for all pairwise combinations of components such that the FMS is maximised and the expected CLD structure is recovered (Supplementary Methods in S1 File). The FMS is a value in the range [0, 1], where $\mathrm{FMS}_r = 1$ corresponds to the model correctly recovering the input loadings.

The recovery of all input loadings can be quantified through the average FMS over all components, defined as

$$\overline{\mathrm{FMS}} = \frac{1}{R}\sum_{r=1}^{R}\mathrm{FMS}_r,$$

where $\overline{\mathrm{FMS}} = 1$ corresponds to all input components being recovered correctly. Recovery of the CLD structure is assessed using the Lambda Similarity Index, which is defined and reported in the Supplementary Materials in S1 File.
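The matched-FMS idea can be sketched compactly. The following hedged NumPy illustration brute-forces the component matching with `itertools.permutations` instead of the Hungarian algorithm (equivalent for the small component counts used here; all names are illustrative):

```python
import itertools
import numpy as np

def congruence(u, v):
    # Absolute Tucker congruence between two loading vectors.
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def fms(true, est):
    # true/est: tuples (A, B, C) of loading matrices with R columns each.
    # Returns the best average per-component FMS over all component matchings.
    R = true[0].shape[1]
    best = 0.0
    for perm in itertools.permutations(range(R)):
        scores = [
            np.prod([congruence(T[:, r], E[:, p]) for T, E in zip(true, est)])
            for r, p in zip(range(R), perm)
        ]
        best = max(best, float(np.mean(scores)))
    return best

# A permuted, sign-flipped copy of the same model should match perfectly.
rng = np.random.default_rng(7)
A = rng.standard_normal((6, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((4, 2))
est = (A[:, ::-1] * np.array([1.0, -1.0]), B[:, ::-1], C[:, ::-1])
score = fms((A, B, C), est)
```

The brute-force search is $O(R!)$; the Hungarian algorithm used in the paper solves the same assignment problem in polynomial time.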
Example dataset: Jakobsen2025
We include an application study of ACMTF-R on longitudinally measured human milk (HM) microbiome, HM metabolomics and infant faecal microbiome data. Full details on the preprocessing of the data are outlined in the original paper [52], and preprocessing of longitudinal compositional data is detailed in [53].
Briefly, the zOTU data and taxonomic information of the infant faecal microbiome and human milk microbiome were processed in R (v4.4.1, [54]). For the infant faecal microbiome data, features were kept if they had ≤ 75% sparsity across the dataset. This step resulted in 93 out of 565 features being selected. For the human milk microbiome data, features were kept if they had ≤ 85% sparsity across the dataset. This step resulted in 115 out of 707 features being selected. We performed a centred log-ratio transformation with a pseudo-count of 1 to correct for compositionality [55,56]. Subsequently, the data were converted to three-way tensors, keeping missing samples as rows of NAs. This resulted in three-way arrays of size 160 subjects × 93 microbial taxa × 3 time points and 169 subjects × 115 microbial taxa × 4 time points containing microbial abundances for the infant faecal and human milk microbiomes, respectively.
The human milk (HM) metabolomics data were processed using R (v4.4.1, [54]). Values below the detection limit were imputed with a random value between 0 and the detection limit per metabolite to preserve their distribution. Next, the dataset was (natural) log transformed to stabilise the variance. The dataset was then converted to a three-way tensor of size 165 subjects × 70 metabolites × 4 time points.
Since ACMTF-R assumes that the subject mode is shared between all data blocks, we took the intersection of the subject identifiers across all three datasets and removed the other subjects. This homogenised the subject mode size across all data blocks to 158 subjects. The data were then centred across the subject mode such that variation per time point is only due to between-subject variation, and scaled within the feature mode to make all features equally important for the modelling procedure [57,58]. As a result, the models will focus on inter-subject differences and the features involved, while the time mode loadings indicate when these differences are largest.
After centring and scaling, the datasets were block scaled to Frobenius norm 1 to equalise the variance across the blocks. The response variable $\mathbf{y}$ of size $158 \times 1$ contained the maternal pre-pregnancy BMI for all mother-infant dyads, which was centred and subsequently scaled to norm 1 prior to modelling. ACMTF-R models were then created using the appropriate number of components and regressed on maternal pre-pregnancy BMI.
Selecting the appropriate number of ACMTF components
A single optimal number of components does not exist in real multi-omics applications, since this depends on data idiosyncrasies as well as the biological interpretability of the model. Therefore, this paper presents a data analyst guided triangulation approach based on model stability, reproducibility, fit, degeneracy, and biological interpretability to identify the appropriate number of components for ACMTF. This is done through a K-fold cross-validation scheme, as well as a random initialization scheme. The full procedure is available as the ACMTF_modelSelection() function in the supplied R package.
In the K-fold cross-validation scheme, one fold of the subjects is withheld for all blocks as test set data, while randomly initialised ACMTF models are fitted on the training data using $R$ components, keeping all other model parameters ($\gamma$, $\beta$, $\epsilon$) fixed. To avoid data leakage, the means, standard deviations, and norms that are removed from the training data are also removed from the test set data. Model stability is then reported by pairwise comparing the models with the lowest loss per fold and calculating the cross-validated FMS, defined as

$$\mathrm{FMS}_{\mathrm{CV}} = \frac{1}{R}\sum_{r=1}^{R}\frac{\left|\mathbf{b}_r^{\mathsf{T}}\hat{\mathbf{b}}_r\right|}{\|\mathbf{b}_r\|\,\|\hat{\mathbf{b}}_r\|} \cdot \frac{\left|\mathbf{c}_r^{\mathsf{T}}\hat{\mathbf{c}}_r\right|}{\|\mathbf{c}_r\|\,\|\hat{\mathbf{c}}_r\|},$$

where $\mathbf{b}_r$ and $\mathbf{c}_r$ contain the feature and time loadings for component $r$ in one model, $\hat{\mathbf{b}}_r$ and $\hat{\mathbf{c}}_r$ are the best matching feature and time loadings for component $r$ in a second model, and the subject mode term is eliminated from the calculation due to not being comparable between CV folds [15,59]. In the application, the matching of components is achieved using the Hungarian algorithm [50,51] for all pairwise combinations of components such that the FMS is maximised (Supplementary Methods in S1 File). Hence, the distribution of FMS values across all pairwise comparisons of CV (or jack-knifed) folds gives an indication of the stability of the decomposition for a given number of components $R$. Previous work has established FMS thresholds that 95% of folds should exceed (one for three-way data blocks and one for two-way data blocks in the ACMTF case) for the decomposition to be considered replicable [59]. The appropriate number of components is expected to have a stable $\mathrm{FMS}_{\mathrm{CV}}$ near one across all comparisons.
In the second arm of the approach, randomly initialised ACMTF models are fitted to the full data using $R$ components. Model sensitivity to the starting values and to local minima is then assessed by pairwise comparing the models and calculating the FMS through

$$\mathrm{FMS} = \frac{1}{R}\sum_{r=1}^{R}\frac{\left|\mathbf{a}_r^{\mathsf{T}}\hat{\mathbf{a}}_r\right|}{\|\mathbf{a}_r\|\,\|\hat{\mathbf{a}}_r\|} \cdot \frac{\left|\mathbf{b}_r^{\mathsf{T}}\hat{\mathbf{b}}_r\right|}{\|\mathbf{b}_r\|\,\|\hat{\mathbf{b}}_r\|} \cdot \frac{\left|\mathbf{c}_r^{\mathsf{T}}\hat{\mathbf{c}}_r\right|}{\|\mathbf{c}_r\|\,\|\hat{\mathbf{c}}_r\|},$$

where $\mathbf{a}_r$, $\mathbf{b}_r$ and $\mathbf{c}_r$ contain the subject, feature and time loadings for component $r$ in one model, and $\hat{\mathbf{a}}_r$, $\hat{\mathbf{b}}_r$ and $\hat{\mathbf{c}}_r$ are the best matching subject, feature and time loadings for component $r$ in a second model [15,59]. For the application, the matching of components is again achieved using the Hungarian algorithm [50,51] for all pairwise combinations of components such that the FMS is maximised (Supplementary Methods in S1 File). The distribution of FMS values across all pairwise comparisons of randomly initialised models for a given number of components $R$ gives an indication of the stability of the decomposition. The appropriate number of components is expected to have a stable FMS near one across all randomly initialised models.
Additionally, it may be possible for ACMTF and ACMTF-R to split up common or local components into distinct components with highly similar subject mode loadings when too many components are chosen. This phenomenon can be detected by pairwise comparing all columns in the subject mode per model and calculating the maximum absolute Tucker Congruence Coefficient [47–49]. We define this as the Degeneracy Score (DS), which is computed through

\[ \mathrm{DS} = \max_{r \neq s} \frac{|\mathbf{a}_r^\top \mathbf{a}_s|}{\|\mathbf{a}_r\|\,\|\mathbf{a}_s\|}, \]

where \(r\) and \(s\) indicate component numbers of the subject mode within the same model, and \(r \neq s\). The Degeneracy Score is in [0, 1], will be close to one when a factor is split into two highly collinear factors, and will have intermediate values otherwise. This procedure is only performed for the randomly initialised ACMTF models since the complete subject mode is needed. The appropriate number of components is expected to have a low Degeneracy Score across all randomly initialised models.
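The Degeneracy Score definition above translates directly into a few lines of NumPy; this sketch (the function name is ours, not part of the supplied R package) takes the subject mode loading matrix with one component per column.

```python
import numpy as np

def degeneracy_score(A):
    """Maximum absolute Tucker congruence over all pairs of distinct
    components (columns) of the subject mode loading matrix A."""
    An = A / np.linalg.norm(A, axis=0)  # normalise each component to unit norm
    G = np.abs(An.T @ An)               # pairwise congruence matrix
    np.fill_diagonal(G, 0.0)            # exclude the trivial r == s pairs
    return G.max()
```

Orthogonal components yield a score of zero, while a component duplicated across two columns drives the score to one.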
Selecting the appropriate number of ACMTF-R components
For ACMTF-R, the suggested triangulation approach incorporates a few additional metrics related to the prediction and fit of y for selecting the correct number of components. The full procedure is available as the ACMTFR_modelSelection() function in the supplied R package.
In the cross-validation procedure, the root mean squared error of cross-validation (RMSECV) is calculated by using the models fitted on the training set to predict y in the test set. The appropriate number of components is expected to minimise the RMSECV while having a stable FMS near one across all comparisons.
In the random initialisation procedure, the root mean squared error (RMSE) of y and the variance explained in y are reported using the ACMTF-R model with the lowest loss. The appropriate number of components is expected to minimise the RMSE while having a low Degeneracy Score and a stable FMS near one across all comparisons.
Results
Simulation 1: ACMTF-R can detect a small, hidden component by leveraging y
We evaluated how the tuning parameter π influences the recovery of a local component that is weakly present in some data blocks and strongly reflected in y. This was done by applying ACMTF-R to a simulated three-tensor dataset with a known structure containing 2 common, 3 local, and 3 distinct components and 40% noise. All components were norm 1, except for the second local component, which was present across X1 and X2 with a much smaller norm. Since this component was equal to y up to 5% noise, we expected that lowering π would enhance its detection. By systematically varying the tuning parameter π from 1 (fully focused on explaining the data blocks) to 0 (fully focused on predicting y), we examined whether outcome information could reveal a latent structure that would otherwise remain hidden (Fig 3). One hundred randomly initialised eight-component ACMTF-R and ACMTF models were created for each setting of π.
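The data-generating design above can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' simulation code: the dimensions, the seed, and the hidden-component norm of 0.05 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 50, 100, 8, 4          # subjects, features, time points, components

def unit(n):
    """A random loading vector scaled to unit norm."""
    v = rng.normal(size=n)
    return v / np.linalg.norm(v)

A = np.column_stack([unit(I) for _ in range(R)])   # subject mode loadings
B = np.column_stack([unit(J) for _ in range(R)])   # feature mode loadings
C = np.column_stack([unit(K) for _ in range(R)])   # time mode loadings

norms = np.ones(R)
norms[-1] = 0.05                                   # hidden component is much smaller
X = np.einsum('r,ir,jr,kr->ijk', norms, A, B, C)   # low-rank three-way block

noise = rng.normal(size=X.shape)
X += 0.4 * np.linalg.norm(X) * noise / np.linalg.norm(noise)   # 40% noise on X

e = rng.normal(size=I)
a = A[:, -1]                                       # hidden component subject scores
y = a + 0.05 * np.linalg.norm(a) * e / np.linalg.norm(e)       # y = scores + 5% noise
```

Because the hidden component contributes little to X but defines y almost exactly, a supervised decomposition can use y to pull it out of the noise floor.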
One hundred randomly initialised ACMTF-R models were fitted per setting of tuning parameter π. These were compared with one hundred randomly initialised ACMTF models. (A) Overall factor recovery, measured using the FMS, was higher in ACMTF-R using π < 1 compared to ACMTF. (B) Likewise, the maximum congruence with the hidden component was higher for ACMTF-R than ACMTF. Red data points in (A,B) indicate the model with the lowest loss, which were compared in (C-H). The absolute Tucker Congruence Coefficient (TCC) between the recovered loadings and the input (C) subject mode loadings, (D) feature mode loadings of X1, (E) time mode loadings of X1, (F) feature mode loadings of X2, and (G) time mode loadings of X2 for the hidden component. For all modes, factor recovery using ACMTF-R with intermediate values of π was higher compared to ACMTF.
As assessed by the Factor Match Score, the lowest-loss ACMTF-R models using π < 1 showed an improvement in factor recovery for all components compared to the lowest-loss ACMTF model (Fig 3A, Supplementary Figure 5 in S1 File). This was also the case for factor recovery of the hidden component (Fig 3B, Supplementary Figure 6 in S1 File). Meanwhile, the fraction of models that correctly identified the input CLD structure was comparable to ACMTF (Supplementary Figures 7-8 in S1 File).
Next, the lowest-loss ACMTF-R models for each π setting were selected for comparison with the lowest-loss ACMTF model (Fig 3C-3H). The absolute Tucker Congruence Coefficient (TCC) was used to compare the best matching recovered factor loadings with the input loadings of the hidden component per mode (Supplementary Methods in S1 File). This analysis revealed that the recovered shared subject mode loadings (Fig 3C) and the time mode loadings of X1 (Fig 3E) and X2 (Fig 3G) were almost identical to the input loadings in the ACMTF-R models for intermediate values of π, while ACMTF failed to identify them. While the recovered feature mode loadings in X1 (Fig 3D) and X2 (Fig 3F) were not exactly equal to the input factor of the hidden component, their order largely matched the input (Kendall's τ; Supplementary Table 1 in S1 File).
Taken together, ACMTF-R recovered the hidden component across intermediate values of π by leveraging y, while ACMTF did not.
Simulation 2: The benefit of supervision using y depends on the size of the hidden component
We next evaluated how the size of the hidden component affected the improvement of ACMTF-R compared to ACMTF. This was done using the same three-tensor setup as in the first simulation, except that the norm of the second local component was varied. We expected the benefit of supervision to decrease as the size of the hidden component increased. As before, the tuning parameter π was varied from 0.1 to 0.9 and one hundred randomly initialised eight-component ACMTF-R and ACMTF models were created for each setting (Fig 4).
One hundred randomly initialised eight-component ACMTF-R and ACMTF models were fitted for each combination of hidden component size (norm) and tuning parameter (π), after which the lowest-loss models were compared. The improvement in factor recovery is computed by subtracting ACMTF performance from ACMTF-R performance in all panels. Improvement per setting for (A) overall factor recovery, (B) factor recovery of the hidden component, (C) shared subject mode recovery of the hidden component, (D) feature mode recovery of X1 for the hidden component, (E) time mode recovery of X1 for the hidden component, (F) feature mode recovery of X2 for the hidden component, (G) time mode recovery of X2 for the hidden component. Improvement compared to ACMTF is shown in green, while deterioration is shown in purple. A version of this figure without subtraction of ACMTF performance is shown in Supplementary Figure 9 in S1 File.
Input factor recovery across all components improved in ACMTF-R compared to ACMTF at intermediate values of π when the hidden component was small, while it deteriorated at lower values of π regardless of hidden component size due to overfitting (Fig 4A). Meanwhile, factor recovery of the hidden component improved in ACMTF-R at intermediate values of π only at intermediate component sizes (Fig 4B). At larger component sizes, ACMTF and ACMTF-R were both able to identify the hidden component, while at the smallest sizes neither method could recover it (Supplementary Figure 9 in S1 File).
The recovered shared subject mode loadings of the hidden component showed a consistent recovery at small component sizes (0.01–0.05), regardless of π (Fig 4C). Meanwhile, the recovery of the feature and time mode loadings improved at similar component sizes only for intermediate values of π (0.4–0.9) (Fig 4D-4G). Taken together, the simulation revealed that a hidden component that is roughly 1/20th the size of the other components can be successfully recovered by supervising with ACMTF-R using intermediate values of π.
Simulation 3: The benefit of supervision using y is consistent for biologically realistic noise levels
We evaluated how the amount of noise on the independent data blocks X1, X2, and X3 affected the improvement in recovering the hidden component using ACMTF-R compared to ACMTF. This was done using the same three-tensor setup as in the first simulation, except that the amount of noise on the independent data blocks was varied, ranging from the noiseless case to high noise levels. We expected the benefit of supervision to increase as the amount of noise increased. As before, the tuning parameter π was varied from 0.1 to 0.9 and one hundred randomly initialised eight-component ACMTF-R and ACMTF models were created for each setting (Fig 5).
One hundred randomly initialised eight-component ACMTF-R and ACMTF models were fitted for each combination of noise on X (%) and tuning parameter (π), after which the lowest-loss models were compared. The improvement in factor recovery is computed by subtracting ACMTF performance from ACMTF-R performance in all panels. Improvement per setting for (A) overall factor recovery, (B) factor recovery of the hidden component, (C) shared subject mode recovery of the hidden component, (D) feature mode recovery of X1 for the hidden component, (E) time mode recovery of X1 for the hidden component, (F) feature mode recovery of X2 for the hidden component, (G) time mode recovery of X2 for the hidden component. Improvement compared to ACMTF is shown in green, while deterioration is shown in purple. A version of this figure without subtraction of ACMTF performance is shown in Supplementary Figure 10 in S1 File.
Input factor recovery across all components improved in ACMTF-R compared to ACMTF at most π values (0.3–0.9) at 0−50% noise (Fig 5A). However, at 10−30% noise, the improvement was low due to both ACMTF and ACMTF-R recovering the hidden component (Supplementary Figure 10 in S1 File). Likewise, factor recovery of the hidden component improved only at intermediate noise levels (30−50%) and at intermediate π values (0.6–0.9) (Fig 5B). When no noise was present, the improvement in factor recovery of the hidden component was zero because both methods were unable to identify it (Supplementary Figure 10 in S1 File).
The recovered shared subject mode loadings of the hidden component showed a consistent recovery at all noise levels and values of π (Fig 5C). The recovery of the feature mode loadings in block X1 improved only at intermediate noise levels (20−50%) and intermediate π values (0.6–0.9), while the recovery of the feature mode loadings in block X2 was largely independent of the noise level (Fig 5D, 5F). Surprisingly, the recovery of the time mode loadings in block X1 deteriorated at high noise levels (50+%), while the recovery of the time mode loadings in block X2 improved at those levels (Fig 5E, 5G). Taken together, this simulation revealed that factor recovery by ACMTF-R improves at biologically realistic noise levels (>30%).
Simulation 4: The benefit of supervision using y is largely independent of the amount of noise on y
We evaluated how the amount of noise on y affects the improvement in recovering the hidden component using ACMTF-R compared to ACMTF. This was done using the same three-tensor setup as in the first simulation, except that the amount of noise on y was varied, ranging from the noiseless case to very high noise levels. We expected the benefit of supervision to decrease as the amount of noise on y increased. As before, the tuning parameter π was varied from 0.1 to 0.9 and one hundred randomly initialised eight-component ACMTF-R and ACMTF models were created for each setting (Fig 6).
One hundred randomly initialised eight-component ACMTF-R and ACMTF models were fitted for each combination of noise on y (%) and tuning parameter (π), after which the lowest-loss models were compared. The improvement in factor recovery is computed by subtracting ACMTF performance from ACMTF-R performance in all panels. Improvement per setting for (A) overall factor recovery, (B) factor recovery of the hidden component, (C) shared subject mode recovery of the hidden component, (D) feature mode recovery of X1 for the hidden component, (E) time mode recovery of X1 for the hidden component, (F) feature mode recovery of X2 for the hidden component, (G) time mode recovery of X2 for the hidden component. Improvement compared to ACMTF is shown in green, while deterioration is shown in purple. A version of this figure without subtraction of ACMTF performance is shown in Supplementary Figure 11 in S1 File.
Input factor recovery across all components improved in ACMTF-R compared to ACMTF at intermediate π values (0.4–0.9) for most noise levels (0–100%) (Fig 6A). Likewise, factor recovery of the hidden component improved at intermediate π values (0.5–0.9) for most noise levels (0–100%) (Fig 6B). At lower π values, factor recovery in ACMTF-R deteriorated due to the strong focus on predicting y (Supplementary Figure 11 in S1 File). At very high noise levels (100+%), neither ACMTF nor ACMTF-R was able to recover the hidden component.
The recovered shared subject mode loadings of the hidden component showed a consistent recovery at most noise levels (0−100%) and intermediate values of π (0.4–0.9) (Fig 6C). Similarly, recovery of the feature and time mode loadings was consistent at most noise levels (0−100%) and intermediate values of π (0.4–0.9) (Fig 6D-6G). Taken together, the simulation revealed that recovery of the hidden component by ACMTF-R is insensitive to biologically realistic noise levels on y.
Application: ACMTF-R enhances multi-omics integration compared to ACMTF and NPLS
To demonstrate the advantage of ACMTF-R for integrating multi-omics data, we analysed a longitudinal dataset comprising human milk (HM) metabolomics, HM microbiome and infant faecal microbiome data, using pre-pregnancy BMI (ppBMI) as the outcome y. To contextualise performance, we evaluated the results against ACMTF, which focuses only on the CLD structure, and N-way Partial Least Squares (NPLS) models per block [52], which focus only on prediction. We expected ACMTF-R to identify components more strongly associated with maternal ppBMI compared to ACMTF.
A combined random initialisation and cross-validation procedure was used to select the number of components for ACMTF and ACMTF-R (Methods). Four values of the tuning parameter π were investigated. For each π value, 10 randomly initialised models were fitted and a 10-fold cross-validation was performed across models with one to five components (Supplementary Figure 12 in S1 File). This analysis revealed that one component was optimal for all ACMTF-R settings. For ACMTF, a similar procedure indicated that two components were appropriate, each explaining a different source of variation (Supplementary Figure 13 in S1 File). For NPLS, cross-validation indicated that one component minimised the RMSECV for the infant faecal and HM microbiome data, while two components were required for the HM metabolomics data (Supplementary Figure 14 in S1 File).
After selecting the number of components, 100 randomly initialised one-component ACMTF-R models were fitted for each π value. The model with the lowest overall loss in each case was selected, and the variance explained was computed (Table 1). To contextualise performance, the ACMTF model explained 14.34%, 9.24%, and 7.03% of the variation in the infant faecal microbiome, HM microbiome, and HM metabolomics data blocks, respectively, while explaining 9.28% of the variation in y. Additionally, the NPLS models explained 5.78%, 5.41%, and 8.25% of the variation in the infant faecal microbiome, HM microbiome, and HM metabolomics datasets, respectively, while explaining 11.97%, 11.22%, and 35.95% of the variation in y.
The ACMTF-R model with the largest investigated π explained 2.08–8.31% of the variation in the independent data blocks, and 24.08% of the variation in y. At the next-largest π, this dropped to 1.78–5.38% of the variation in the independent data blocks, while 62.06% of the variation in y was explained. At the two smallest π values, the models were most heavily tuned towards predicting y, yielding 1–2% of explained variation in the independent data blocks and capturing 90+% of the variation in y – a clear indication of overfitting.
MLR was performed on the subject mode loadings to test for associations with subject metadata, including maternal pre-pregnancy BMI, infant growth at six months, birth mode, maternal secretor status, and maternal Lewis blood group status (Table 2). Maternal secretor and Lewis status determine the presence of specific fucosylated human milk oligosaccharides, which may influence the composition of the infant faecal microbiome.
The subject mode loadings of the NPLS models all contained at least one component that was associated with maternal pre-pregnancy BMI. While both components of the ACMTF model were associated with both ppBMI and maternal secretor status, the first component was most strongly associated with maternal secretor status and the second was most strongly associated with ppBMI. This was not the case for the ACMTF-R models with the two largest values of π, which both contained subject mode loadings associated exclusively with ppBMI after correction. The subject mode loadings of the ACMTF-R models with lower values of π were strongly associated with ppBMI due to overfitting.
Due to the extraction of a component exclusively related to maternal ppBMI, as well as the trade-off between explaining the independent data and predicting y, the ACMTF-R model with the largest investigated value of π was selected for further interpretation.
The λ-matrix of the selected ACMTF-R model was examined to interpret the common, local, and distinct (CLD) structure of the component. In this model, the λ values of the only available component were equal to 0.34, 0.29, and 0.15 for the infant faecal microbiome, HM microbiome, and HM metabolomics, respectively. Based on the relative magnitudes of the λ values, the component appears to be primarily local to the infant faecal and HM microbiome blocks, with a minor contribution of the HM metabolomics block. This suggests that ppBMI-associated variation was mainly present in the microbiomes of mother and infant.
In the infant faecal microbiome, taxa associated with higher ppBMI included Bifidobacterium bifidum, Escherichia coli, Staphylococcus epidermidis, and B. breve (Fig 7A). Genera enriched in infants of low-ppBMI mothers included Parabacteroides, Enterococcus and Streptococcus. In the HM microbiome, higher ppBMI was associated with increased abundance of Staphylococcus and Streptococcus spp., while taxa associated with lower ppBMI included Clostridium sensu stricto, Collinsella, and Lactobacillus (Fig 7C). In the HM metabolome, higher ppBMI was associated with increased levels of many unknown HMOs, caffeine, and LDFT (Fig 7E). Metabolites with increased levels in low-ppBMI mothers included several amino acids and related compounds such as 2-aminobutyrates, aspartate, isoleucine, and methionine.
Feature mode loadings of the selected ACMTF-R model describing (A) the infant faecal microbiome, (C) the HM microbiome, and (E) the HM metabolomics datasets. Time mode loadings of the same model are shown in the same order in (B,D,F). Signs of the feature mode loadings were adjusted post-hoc such that positive values consistently corresponded to higher ppBMI and negative values to lower ppBMI.
Due to the pre-processing of the data, the largest absolute time loading in the ACMTF-R model indicates the time point at which inter-subject differences related to ppBMI are greatest (Fig 7B, 7D, 7F). Overall, the time loadings were relatively flat, suggesting that ppBMI-associated variation remained stable throughout the sampling period. In the HM microbiome, however, these differences peaked at day 30 and declined thereafter.
Taken together, ACMTF-R was able to identify variation that was strongly and exclusively associated with maternal ppBMI. Additionally, it revealed that this variation was primarily local to the infant faecal and HM microbiome, with a limited contribution from the HM metabolome. This highlights the value of incorporating supervision to isolate outcome-relevant signals in complex multi-omics data.
Discussion
In this study, we presented ACMTF-Regression (ACMTF-R) as an extension of Advanced Coupled Matrix and Tensor Factorization (ACMTF) that incorporates a regression term to model common, local and distinct (CLD) variation related to a dependent variable. Through simulations, we demonstrated its ability to capture a hidden component, its robustness to noise, and its flexibility in tuning between data exploration and outcome prediction. Our application demonstrated how ACMTF-R uncovers novel relationships describing the effect of maternal ppBMI on the HM metabolome, HM microbiome and infant faecal microbiome.
ACMTF-R extends the capabilities of ACMTF by allowing explicit modelling of outcome-associated variation, bridging the gap between exploratory data integration and supervised learning [10]. Compared to NPLS, which is designed for predictive modelling but lacks the ability to separate common, local, and distinct variation across multiple blocks, ACMTF-R provides a more nuanced view of multi-way data structures [19]. Unlike classical ACMTF, which only uncovers underlying data patterns, ACMTF-R allows for targeted analysis of biological variation of interest [37]. The ability to tune between the two makes it particularly useful for cases where a balance between exploration and prediction is necessary.
Our simulations revealed that ACMTF-R reliably recovers the underlying CLD structure of multi-way datasets across a range of settings. We found that the tuning parameter π governs the trade-off between reconstructing the input data and predicting the dependent variable. While π = 1 corresponds to regular ACMTF, decreasing π enhances the identification of factors associated with the dependent variable, even revealing weak structures that would otherwise remain undetected. However, tuning too far towards y (small values of π) leads to overfitting and an inaccurate reconstruction. The observation that intermediate values of π (0.6–0.9) lead to recovery of the hidden local component is largely in agreement with PCovR and MCovR [24,25].
Furthermore, the simulations revealed that the hidden component needed to be sufficiently small for ACMTF to fail in its recovery. However, this finding is largely dependent on the relative sizes of the other components, which we kept at Frobenius norm 1 in our simulations. These norms were intentionally controlled to showcase to the reader how supervision helps recover a small, outcome-related component. Since the true CLD structure of the data is not known to the model, this design is appropriate for addressing factor recovery under a well-specified ground truth. However, further research is needed to investigate the recovery of a hidden component using ACMTF and ACMTF-R in other scenarios.
Finally, the simulations showed that recovery of the hidden component using ACMTF-R was most accurate when a biologically realistic amount of noise was present in the data. The observation that neither ACMTF nor ACMTF-R can identify the hidden component in the noiseless case is surprising but has limited impact on real data applications. We speculate that noise may have a regularising effect on flatter parts of the loss landscape, or that it may reduce factor collinearity in a phenomenon similar to ridge regression [60]. Due to the non-convexity of the optimisation problem, this is difficult to investigate and is left as future work. It is reassuring that ACMTF-R can reliably capture the y-associated subject mode loadings for most tested noise levels (Supplementary Figures 10-11 in S1 File).
Application to a longitudinal multi-omics dataset integrating human milk microbiome, human milk metabolome, and infant gut microbiome data further highlighted the utility of ACMTF-R by recovering an exclusively ppBMI-associated component [52]. In the infant faecal microbiome, elevated abundances of Enterococcus sp., E. coli, Clostridium sp., and Staphylococcus epidermidis in infants born to high-ppBMI mothers are consistent with prior findings linking maternal obesity to pro-inflammatory microbial profiles [61,62]. However, the association of elevated infant gut microbiome abundances of Bifidobacterium breve and Bifidobacterium bifidum with high maternal ppBMI is an unexpected result, since they are considered beneficial to early-life gut function [61–67]. It is possible that associations were confounded by reduced initiation or shorter duration of breastfeeding in overweight mothers, since cessation of breastfeeding strongly correlates with a decrease in bifidobacteria [63,67–70]. Future research should investigate the robustness of these associations using sensitivity analyses, or by using a cohort with homogenised breastfeeding initiation and duration.
Interestingly, the genera enriched in infants from low-ppBMI mothers represent a potential novel finding, as comparable reports are currently lacking, and warrant further investigation. In the HM microbiome, enrichment of Streptococcus and Staphylococcus spp. in high-ppBMI mothers is in agreement with previous research [71–73]. However, comparable data for taxa enriched in low-ppBMI mothers were not available for direct comparison.
In the HM metabolomics data, elevated levels of lactate and multiple unidentified HMOs were observed in high-ppBMI mothers. While the relationship between ppBMI and HMO concentrations has been investigated in several studies, the results remain inconsistent [74,75]. One study did report higher levels of a lactate derivative in the milk of obese mothers [74], suggesting a possible link to altered maternal energy metabolism. Conversely, the observation of increased concentrations of several amino acids in low-ppBMI mothers is supported by previous findings [74], as well as by recent evidence showing enrichment of amino acid biosynthesis pathways in this group [76].
Furthermore, the relatively flat time loadings across the blocks indicate that inter-subject differences remain approximately constant throughout the time series. The exception is the HM microbiome, where a transient increase is observed at day 30, potentially reflecting ppBMI effects on milk maturation, or dietary, or hormonal variation. Further research with more samples during the first month, ideally including information from dietary surveys and hormonal measurements, is needed to investigate this pattern in more detail. Taken together, these results show that ACMTF-R provides additional biological insights compared to ACMTF and NPLS, effectively disentangling common, local and distinct variation related to maternal ppBMI.
Finally, treating y as an additional data block for ACMTF could offer a different approach to incorporating outcome variation into ACMTF-R. By embedding y within the multi-way decomposition framework, rather than optimising it separately, the model might be able to better capture latent structures linking y to the input datasets. However, this approach would lose the predictive aspect of ACMTF-R. Future work should focus on the comparability between ACMTF-R and adding y as a block to ACMTF.
Limitations
While ACMTF-R demonstrates strong performance in identifying biologically meaningful patterns, several challenges remain. The choice of the tuning parameter π is dataset-dependent, requiring careful selection through cross-validation or heuristic approaches. Additionally, in multi-omics applications, different data blocks may contain varying amounts of biologically relevant information. Allowing block-specific π values could improve model flexibility, enabling certain datasets to contribute more strongly to the dependent variable while ensuring others remain focused on data structure. Future research should explore whether adaptive weighting strategies or block-specific hyperparameter tuning can further improve performance.
For ACMTF and ACMTF-R, the single guiding metric for selecting the number of components has often been the stability of the decomposition through the factor match score (FMS) [15,59]. This paper has expanded upon this approach by formalising the FMS for a cross-validation and random initialisation scheme, as well as developing the Degeneracy Score diagnostic to identify overfactorization. However, this approach was computationally intensive for the simulations. Future work should investigate if it could be accelerated through the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm [44–46] and AO-ADMM algorithms [22,23,77].
While the Factor Match Score (FMS) has been a useful diagnostic for model selection, it suffers from the phenomenon, outlined above, of all blocks contributing to all components in ACMTF. Due to the way the structural model of ACMTF is implemented, trilinear components are fitted per data block, even when the rank of that block is lower than the number of components [37]. This causes the creation of superfluous feature and time mode loading vectors that contribute to the model due to having non-zero λ values. As a result, the FMS is negatively impacted by considering these components in the calculation. This problem may be avoided by more strongly penalising the λ-matrix through replacement of the L1 penalty term by a smooth L0 penalty, as presented in Relaxed ACMTF (RACMTF) [78,79]. Future research should investigate if this approach yields more stable ACMTF and ACMTF-R models for multi-omics data.
The multi-omics application presented in this study highlights a critical shortcoming of ACMTF in the identification of common, local, and distinct (CLD) sources of variation by revealing that every data block contributes to all identified components. However, the relative magnitude of the block contributions captured by the λ-matrix still offers a useful interpretation. For example, a λ-matrix threshold could be applied post-hoc to identify CLD structures [37]. Another approach could be to pre-define the expected number of shared and distinct components, as suggested by Common and Discriminative subspace Non-negative Tensor Factorization (CDNTF) [80]. Future research should investigate if either approach, or a combination of both, might yield more interpretable ACMTF models for multi-omics data.
In this work we assume a single shared subject mode across data blocks, as this setting is the most common in multi-omics data analysis. However, other applications may involve different, multiple, or only partially shared modes. Such settings have already been integrated into a well-posed and scalable implementation of CMTF using an Alternating Optimisation – Alternating Direction Method of Multipliers (AO-ADMM) framework [22,23]. This approach is also capable of supporting other data distributions, such as Poisson and Negative Binomial. Ongoing work therefore targets the implementation of ACMTF-R using this framework to support other study designs.
Conclusions
Our results demonstrate that ACMTF-R is a powerful framework for multi-way data integration that extends ACMTF by incorporating outcome-associated variation. Through simulations, we showed that ACMTF-R can recover a small hidden component, is robust to noise, and provides flexible tuning between data exploration and outcome prediction. Its application to multi-omics data highlights how ACMTF-R uncovers novel relationships describing the effect of maternal ppBMI on the HM metabolome, HM microbiome and infant faecal microbiome. Ongoing work includes the implementation of ACMTF-R using an AO-ADMM framework to support other types of coupling.
Supporting information
S1 File. Supplementary Methods, Figures, and Tables.
https://doi.org/10.1371/journal.pone.0339650.s001
(DOCX)
S2 File. Appendix 1: Derivation of the gradient function for ACMTF-R.
https://doi.org/10.1371/journal.pone.0339650.s002
(PDF)
Acknowledgments
The authors wish to thank Daniël Rademaker (RU/UvA) for their useful discussions and suggestions, and Evrim Acar for her feedback regarding the optimization approach and convergence properties.
References
- 1. Capobianco E. Ten challenges for systems medicine. Front Genet. 2012;3:193. pmid:23060899
- 2. Davis-Turak J, Courtney SM, Hazard ES, Glen WB, Da Silveira WA, Wesselman T. Genomics pipelines and data integration: challenges and opportunities in the research setting. Expert Rev Mol Diag. 2017;17(3):225–37.
- 3. Zhao Z, Jin VX, Huang Y, Guda C, Ruan J. Frontiers in integrative genomics and translational bioinformatics. BioMed Res Int. 2015;2015.
- 4. Bordbar A. Interpreting the deluge of omics data: new approaches offer new possibilities. Blood Transfus. 2017;15(2):189–90. pmid:28263179
- 5. Chen C, McGarvey PB, Huang H, Wu CH. Protein bioinformatics infrastructure for the integration and analysis of multiple high-throughput “omics” data. Adv Bioinforma. 2010;2010:1–19.
- 6. Misra BB, Langefeld C, Olivier M, Cox LA. Integrated omics: tools, advances and future approaches. J Mol Endocrinol. 2019;62(1):R21–45. pmid:30006342
- 7. Vitorino R. Transforming clinical research: the power of high-throughput omics integration. Proteomes. 2024;12(3):25. pmid:39311198
- 8. van der Kloet FM, Sebastián-León P, Conesa A, Smilde AK, Westerhuis JA. Separating common from distinctive variation. BMC Bioinformatics. 2016;17 Suppl 5(Suppl 5):195. pmid:27294690
- 9. Måge I, Smilde AK, van der Kloet FM. Performance of methods that separate common and distinct variation in multiple data blocks. J Chemom. 2019;33(1):e3085.
- 10. Acar E, Papalexakis EE, Gürdeniz G, Rasmussen MA, Lawaetz AJ, Nilsson M, et al. Structure-revealing data fusion. BMC Bioinformatics. 2014;15(1):239. pmid:25015427
- 11. Tan CS, Salim A, Ploner A, Lehtiö J, Chia KS, Pawitan Y. Correlating gene and protein expression data using correlated factor analysis. BMC Bioinformatics. 2009;10:272. pmid:19723309
- 12. Xiao X, Moreno-Moral A, Rotival M, Bottolo L, Petretto E. Multi-tissue analysis of co-expression networks by higher-order generalized singular value decomposition identifies functionally coherent transcriptional modules. PLoS Genet. 2014;10(1):e1004006. pmid:24391511
- 13. Berger JA, Hautaniemi S, Mitra SK, Astola J. Jointly analyzing gene expression and copy number data in breast cancer using data reduction models. IEEE/ACM Trans Comput Biol Bioinform. 2006;3(1):2–16. pmid:17048389
- 14. Song Y, Westerhuis JA, Smilde AK. Separating common (global and local) and distinct variation in multiple mixed types data sets. J Chemom. 2020;34(1):e3197.
- 15. Acar E, Kolda TG, Dunlavy DM. All-at-once optimization for coupled matrix and tensor factorizations. arXiv; 2011. http://arxiv.org/abs/1105.3422
- 16. Acar E, Bro R, Smilde A. Data fusion in metabolomics using coupled matrix and tensor factorizations. Proc IEEE. 2015;103:1602.
- 17. Acar E, Bro R, Smilde AK. Data fusion in metabolomics using coupled matrix and tensor factorizations. Proc IEEE. 2015;103(9):1602–20.
- 18. Acar E, Dunlavy DM, Kolda TG, Mørup M. Scalable tensor factorizations for incomplete data. Chemom Intell Lab Syst. 2011;106(1):41–56.
- 19. Bro R. Multiway calibration. Multilinear PLS. J Chemom. 1996;10(1):47–61.
- 20. Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013;7(1):523–42.
- 21. Palzer EF, Wendt CH, Bowler RP, Hersh CP, Safo SE, Lock EF. sJIVE: supervised joint and individual variation explained. Comput Stat Data Anal. 2022;175:107547. pmid:36119152
- 22. Schenker C, Cohen JE, Acar E. An optimization framework for regularized linearly coupled matrix-tensor factorization. In: 2020 28th European Signal Processing Conference (EUSIPCO); 2021. p. 985–9. https://ieeexplore.ieee.org/document/9287459/
- 23. Schenker C, Cohen JE, Acar E. A flexible optimization framework for regularized matrix-tensor factorizations with linear couplings. IEEE J Sel Top Signal Process. 2021;15(3):506–21.
- 24. Vervloet M, Kiers HAL, Van den Noortgate W, Ceulemans E. PCovR: an R package for principal covariates regression. J Stat Softw. 2015;65:1–14.
- 25. Smilde AK, Kiers HAL. Multiway covariates regression models. J Chemom. 1999;13(1):31–48.
- 26. de Jong S, Kiers HAL. Principal covariates regression. Chemometr Intelligent Lab Syst. 1992;14(1–3):155–64.
- 27. Kiers HAL. Towards a standardized notation and terminology in multiway analysis. J Chemom. 2000;14(3).
- 28. Kolda TG. Multilinear operators for higher-order decompositions. Albuquerque, NM and Livermore, CA: Sandia National Laboratories (SNL); 2006. https://www.osti.gov/biblio/923081
- 29. De Lathauwer L, De Moor B, Vandewalle J. A multilinear singular value decomposition. SIAM J Matrix Anal Appl. 2000;21(4):1253–78.
- 30. Styan GPH. Hadamard products and multivariate statistical analysis. Linear Algebra Appl. 1973;6:217–40.
- 31. McDonald RP. A simple comprehensive model for the analysis of covariance structures: some remarks on applications. Br J Math Stat Psychol. 1980;33(2):161–83.
- 32. Rao CR. Generalized inverse of a matrix and its applications. In: Le Cam LM, Neyman J, Scott EL, editors. Theory of statistics. University of California Press; 1972. p. 601–20.
- 33. Ben-Israel A, Greville TN. Generalized inverses: theory and applications. Springer Science and Business Media; 2006.
- 34. Carroll JD, Chang J-J. Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika. 1970;35(3):283–319.
- 35. Harshman RA. Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics. 1970;16:1–84.
- 36. Bro R. Multi-way analysis in the food industry: models, algorithms and applications [PhD thesis]. University of Amsterdam; 1998.
- 37. Acar E, Papalexakis EE, Gürdeniz G, Rasmussen MA, Lawaetz AJ, Nilsson M, et al. Structure-revealing data fusion. BMC Bioinform. 2014;15(1):239. pmid:25015427
- 38. Lee SI, Lee H, Abbeel P, Ng AY. Efficient L1 regularized logistic regression. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2006. p. 401–8. https://cdn.aaai.org/AAAI/2006/AAAI06-064.pdf
- 39. Bertrand D, Qannari EM, Vigneau E. Latent root regression analysis: an alternative method to PLS. Chemom Intell Lab Syst. 2001;58(2):227–34.
- 40. Webster JT, Gunst RF, Mason RL. Latent root regression analysis. Technometrics. 1974;16(4):513–22.
- 41. Harshman RA. Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multimodal factor analysis. 1970. https://www.psychology.uwo.ca/faculty/harshman/wpppfac0.pdf
- 42. Bro R. PARAFAC. Tutorial and applications. Chemom Intell Lab Syst. 1997;38(2):149–71.
- 43. Melville J. Unconstrained numerical optimization algorithms. 2019.
- 44. Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Math Program. 1989;45(1):503–28.
- 45. Malouf R. A comparison of algorithms for maximum entropy parameter estimation. In: Proceedings of the 6th Conference on Natural Language Learning (COLING-02); 2002. p. 1–7. https://doi.org/10.3115/1118853.1118871
- 46. Andrew G, Gao J. Scalable training of L1-regularized log-linear models. In: Proceedings of the 24th International Conference on Machine Learning; 2007. p. 33–40. https://doi.org/10.1145/1273496.1273501
- 47. Burt C. The factorial study of temperamental traits. Br J Psychol. 1948.
- 48. Tucker LR. A method for synthesis of factor analysis studies. Princeton, NJ: Educational Testing Service; 1951. https://apps.dtic.mil/sti/pdfs/AD0047524.pdf
- 49. Lorenzo-Seva U, ten Berge JMF. Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodol Eur J Res Methods Behav Soc Sci. 2006;2(2):57–64.
- 50. Kuhn HW. The Hungarian method for the assignment problem. Nav Res Logist Q. 1955;2(1–2):83–97.
- 51. Kuhn HW. Variants of the Hungarian method for assignment problems. Nav Res Logist Q. 1956;3(4):253–8.
- 52. Jakobsen R, Ploeg GR, Sundekilde U, Astono J, Poulsen K, Fuglsang J. Supervised modelling of longitudinal human milk and infant gut microbiome reveal maternal pre-pregnancy BMI and early life growth interactions. Res Sq. 2025. https://www.researchsquare.com/article/rs-6244750/v1
- 53. van der Ploeg GR, Westerhuis JA, Heintz-Buschart A, Smilde AK. parafac4microbiome: exploratory analysis of longitudinal microbiome data using parallel factor analysis. mSystems. 2025;10(6):e0047225. pmid:40396737
- 54. R Core Team. R: a language and environment for statistical computing. 2013.
- 55. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front Microbiol. 2017;8:2224.
- 56. Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B Methodol. 1982;44(2):139–60.
- 57. Bro R, Smilde AK. Centering and scaling in component analysis. J Chemom. 2003;17(1):16–33.
- 58. van der Ploeg GR, Westerhuis JA, Heintz-Buschart A, Smilde AK. parafac4microbiome: exploratory analysis of longitudinal microbiome data using parallel factor analysis. bioRxiv. Accessed 2024 July 8. http://biorxiv.org/lookup/doi/10.1101/2024.05.02.592191
- 59. Li L, Yan S, Horner D, Rasmussen MA, Smilde AK, Acar E. Revealing static and dynamic biomarkers from postprandial metabolomics data through coupled matrix and tensor factorizations. 2024.
- 60. Schott JR. Matrix analysis for statistics. John Wiley and Sons; 2016.
- 61. Singh SB, Madan J, Coker M, Hoen A, Baker ER, Karagas MR, et al. Does birth mode modify associations of maternal pre-pregnancy BMI and gestational weight gain with the infant gut microbiome? Int J Obes (Lond). 2020;44(1):23–32. pmid:30765892
- 62. Collado MC, Isolauri E, Laitinen K, Salminen S. Effect of mother’s weight on infant’s microbiota acquisition, composition, and activity during early infancy: a prospective follow-up study initiated in early pregnancy. Am J Clin Nutr. 2010;92(5):1023–30. pmid:20844065
- 63. Monteiro POA, Victora CG. Rapid growth in infancy and childhood and obesity in later life--a systematic review. Obes Rev. 2005;6(2):143–54. pmid:15836465
- 64. Duranti S, Lugli GA, Milani C, James K, Mancabelli L, Turroni F, et al. Bifidobacterium bifidum and the infant gut microbiota: an intriguing case of microbe-host co-evolution. Environ Microbiol. 2019;21(10):3683–95. pmid:31172651
- 65. Stuivenberg GA, Burton JP, Bron PA, Reid G. Why are bifidobacteria important for infants? Microorganisms. 2022;10(2):278. pmid:35208736
- 66. Vu K, Lou W, Tun HM, Konya TB, Morales-Lizcano N, Chari RS, et al. From birth to overweight and atopic disease: multiple and common pathways of the infant gut microbiome. Gastroenterology. 2021;160(1):128-144.e10. pmid:32946900
- 67. Dinleyici M, Barbieur J, Dinleyici EC, Vandenplas Y. Functional effects of human milk oligosaccharides (HMOs). Gut Microbes. 2023;15(1):2186115. pmid:36929926
- 68. Amir LH, Donath S. A systematic review of maternal obesity and breastfeeding intention, initiation and duration. BMC Pregnancy Childbirth. 2007;7:9. pmid:17608952
- 69. Ong KK, Loos RJF. Rapid infancy weight gain and subsequent obesity: systematic reviews and hopeful suggestions. Acta Paediatr. 2006;95(8):904–8. pmid:16882560
- 70. Bäckhed F, Roswall J, Peng Y, Feng Q, Jia H, Kovatcheva-Datchary P, et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host Microbe. 2015;17(5):690–703. pmid:25974306
- 71. Lundgren SN, Madan JC, Karagas MR, Morrison HG, Hoen AG, Christensen BC. Microbial communities in human milk relate to measures of maternal weight. Front Microbiol. 2019;10:2886. pmid:31921063
- 72. Cabrera-Rubio R, Collado MC, Laitinen K, Salminen S, Isolauri E, Mira A. The human milk microbiome changes over lactation and is shaped by maternal weight and mode of delivery. Am J Clin Nutr. 2012;96(3):544–51.
- 73. Cortés-Macías E, Selma-Royo M, Rio-Aige K, Bäuerl C, Rodríguez-Lagunas MJ, Martínez-Costa C, et al. Distinct breast milk microbiota, cytokine, and adipokine profiles are associated with infant growth at 12 months: an in vitro host-microbe interaction mechanistic approach. Food Funct. 2023;14(1):148–59. pmid:36472137
- 74. Isganaitis E, Venditti S, Matthews TJ, Lerin C, Demerath EW, Fields DA. Maternal obesity and the human milk metabolome: associations with infant body composition and postnatal weight gain. Am J Clin Nutr. 2019;110(1):111–20. pmid:30968129
- 75. Bardanzellu F, Puddu M, Peroni DG, Fanos V. The human breast milk metabolome in overweight and obese mothers. Front Immunol. 2020;11:1533.
- 76. Sivalogan K, Liang D, Accardi C, Diaz-Artiga A, Hu X, Mollinedo E, et al. Human milk composition is associated with maternal body mass index in a cross-sectional, untargeted metabolomics analysis of human milk from guatemalan mothers. Curr Dev Nutr. 2024;8(5):102144. pmid:38726027
- 77. Roald M. Learning coupled matrix factorizations with Python. SoftwareX. 2023;21:101292.
- 78. Mohimani H, Babaie-Zadeh M, Jutten C. A fast approach for overcomplete sparse decomposition based on smoothed ℓ0 norm. IEEE Trans Signal Process. 2009;57(1):289–301.
- 79. Rivet B, Duda M, Guérin-Dugué A, Jutten C, Comon P. Multimodal approach to estimate the ocular movements during EEG recordings: a coupled tensor factorization method. Annu Int Conf IEEE Eng Med Biol Soc. 2015;2015:6983–6. pmid:26737899
- 80. Liu W, Chan J, Bailey J, Leckie C, Ramamohanarao K. Mining labelled tensors by discovering both their common and discriminative subspaces. In: Proceedings of the 2013 SIAM International Conference on Data Mining; 2013. p. 614–22. https://doi.org/10.1137/1.9781611972832.68