Fig 1.
Omic data–integration methods in machine learning.
Multiview omic data–integration methods can be classified into three main domains. (a) Concatenation-based (early-stage) integration combines all omic data into one large matrix before ML methods are applied to obtain a data-driven model. (b) Transformation-based (intermediate-stage) integration applies data transformations to bring each dataset into a uniform format, which then permits their combination into one fused dataset. (c) Model-based (late-stage) integration trains a separate ML model on each dataset and then combines the outcomes, rather than combining the data before the learning phase. ML, machine learning.
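The three strategies in panels (a) to (c) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration only: the two synthetic "omic" views, the nearest-centroid classifier standing in for an arbitrary ML model, and the linear kernel used as the uniform format in (b). None of it is taken from a specific study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
y = np.repeat([0, 1], n // 2)  # two balanced classes
# Two synthetic "omic" views (hypothetical data); class 1 is shifted in both.
X_a = rng.normal(size=(n, 5)) + 3.0 * y[:, None]
X_b = rng.normal(size=(n, 8)) + 3.0 * y[:, None]

def centroids(X, y):
    """Per-class mean vectors: a deliberately simple stand-in ML model."""
    return np.stack([X[y == k].mean(axis=0) for k in (0, 1)])

def scores(X, C):
    """Positive score: the sample is closer to the class-1 centroid."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return d[:, 0] - d[:, 1]

def accuracy(s, y):
    """Training accuracy of the rule 'predict class 1 if score > 0'."""
    return float(np.mean((s > 0).astype(int) == y))

def linear_kernel(X):
    """Sample-by-sample similarity computed on z-scored features."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ Z.T / X.shape[1]

# (a) Early integration: concatenate the features, then fit one model.
X_early = np.hstack([X_a, X_b])
acc_early = accuracy(scores(X_early, centroids(X_early, y)), y)

# (b) Intermediate integration: transform each view into an n x n
# similarity matrix (a uniform format), fuse by averaging, then fit
# one model on the rows of the fused matrix.
K_fused = (linear_kernel(X_a) + linear_kernel(X_b)) / 2.0
acc_intermediate = accuracy(scores(K_fused, centroids(K_fused, y)), y)

# (c) Late integration: fit one model per view, then combine the outputs.
s_a = scores(X_a, centroids(X_a, y))
s_b = scores(X_b, centroids(X_b, y))
acc_late = accuracy(s_a + s_b, y)

print(acc_early, acc_intermediate, acc_late)
```

The accuracies are computed on the training samples and serve only to show that each route yields a usable predictor. A real comparison would use held-out data and models suited to the omic types involved.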
Fig 2.
Constraint-based data integration and fluxome generation.
(a) Constraint-based metabolic modeling begins with the construction of a manually curated GSMM recording all reactions taking place in the network. (b) Encoded within the structure of a GSMM is the stoichiometric matrix S, which denotes the involvement of metabolites in each reaction. A given metabolic goal is represented as the objective function c, constraints are applied to the model, and linear or quadratic optimization is used to maximize or minimize this objective. The steady-state assumption (Sv = 0) requires the product of the stoichiometric matrix S and the flux vector v to be zero, so that the production of every internal metabolite is balanced by its consumption. (c) To compute a unique flux distribution, the objective function can be regularized by subtracting a concave function from it. In addition to v being restricted between default lower and upper limits (vmin and vmax), external multiomic data θ can be used to further constrain fluxes using the mapping function φ(θ), hence driving the output toward condition-dependent solutions. GSMM, genome-scale metabolic model.
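The optimization in panels (b) and (c) can be sketched as a small linear program. The three-reaction network, the bound values, the expression vector theta, and the form of phi (scaling each upper bound by relative expression) are hypothetical choices made for illustration. The regularization step mentioned in (c) is omitted, and scipy.optimize.linprog stands in for a dedicated metabolic modeling toolbox.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (hypothetical): R1: -> A, R2: A -> B, R3: B -> biomass
S = np.array([[1.0, -1.0, 0.0],    # metabolite A
              [0.0, 1.0, -1.0]])   # metabolite B
c = np.array([0.0, 0.0, 1.0])      # objective: flux through R3
v_min = np.zeros(3)                # default lower limits
v_max = np.array([10.0, 20.0, 20.0])  # default upper limits

def phi(theta, v_max):
    """Hypothetical mapping: scale upper bounds by relative expression."""
    return np.clip(theta, 0.0, 1.0) * v_max

def fba(S, c, v_min, v_max):
    """Maximize c @ v subject to S @ v = 0 and v_min <= v <= v_max."""
    # linprog minimizes, so the objective is negated to maximize.
    res = linprog(-c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=list(zip(v_min, v_max)), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x

# Default bounds only: the uptake limit on R1 caps the objective.
v_default = fba(S, c, v_min, v_max)

# Condition-specific bounds: low expression for R2 tightens its limit.
theta = np.array([1.0, 0.25, 1.0])
v_condition = fba(S, c, v_min, phi(theta, v_max))

print(v_default, v_condition)
```

In this linear chain the steady-state constraint forces all three fluxes to be equal, so the optimum is set by the tightest upper bound: the uptake limit under default bounds, and the expression-scaled limit on R2 under the condition-specific ones.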
Fig 3.
Multiomic data analysis by combination of constraint-based modeling with machine learning.
(a) Fluxomic analysis involves FBA or related techniques performed on a general-purpose GSMM; the resulting flux data can be used as input for unsupervised or supervised machine learning. (b) To improve the accuracy of machine learning predictions, multiomic datasets are obtained using high-throughput analytics, e.g., transcriptomics (DNA microarrays, RNA sequencing), proteomics (2D gel electrophoresis, stable isotope labeling, mass spectrometry), or metabolomics (NMR spectroscopy, isotopic labeling, LC-MS, GC-MS). As these datasets are obtained from different sources, they must undergo several preprocessing stages, such as filtration and normalization, to keep them mutually consistent, account for variance, and reduce noise. Condition-specific knowledge-based models are generated by introducing these multiple datasets into GSMMs to obtain more precise flux estimations, to which machine learning techniques can then be applied to infer biologically relevant patterns in the data. (c) Alternatively, machine learning can be directly applied to single- or multiomic datasets to produce or improve GSMMs or fluxomic data. FBA, flux balance analysis; GC-MS, gas chromatography–mass spectrometry; GSMM, genome-scale metabolic model; LC-MS, liquid chromatography–mass spectrometry; NMR, nuclear magnetic resonance.
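The preprocessing stages named in panel (b), filtration and normalization, could look like the following sketch for a single count-like omic matrix (samples in rows, features in columns). The log2 transform, the variance threshold, and the example values are assumptions chosen for illustration. Real pipelines depend on the platform and data type.

```python
import numpy as np

def preprocess(X, min_variance=1e-3):
    """Log-transform, drop near-constant features, z-score the rest.

    Returns the normalized matrix and a boolean mask of kept features.
    """
    X = np.log2(np.asarray(X, dtype=float) + 1.0)  # assumed transform
    keep = X.var(axis=0) > min_variance            # filtration step
    X = X[:, keep]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # normalization step
    return Z, keep

# Hypothetical counts: 4 samples x 4 features; the third is constant.
counts = np.array([[10, 200, 5, 0],
                   [20, 180, 5, 3],
                   [15, 220, 5, 1],
                   [40, 150, 5, 8]])
Z, keep = preprocess(counts)
print(keep)     # mask of retained features
print(Z.shape)  # (samples, retained features)
```

Putting every dataset through the same kind of transformation places the features on a common scale before they are mapped onto a GSMM or passed to a learning algorithm.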
Table 1.
Overview of previous studies that integrated constraint-based modeling (CBM) and machine learning, grouped by task type.