
Identifying dynamic regulation with machine learning using adversarial surrogates

  • Ron Teichner,

    Roles Formal analysis, Investigation, Writing – original draft

    ron.teichner@ef.technion.ac.il

    Affiliations Viterbi Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel, Network Biology Research Laboratory, Technion - Israel Institute of Technology, Haifa, Israel

  • Naama Brenner,

    Roles Formal analysis, Supervision, Writing – review & editing

    Affiliations Network Biology Research Laboratory, Technion - Israel Institute of Technology, Haifa, Israel, Wolfson Department of Chemical Engineering, Technion - Israel Institute of Technology, Haifa, Israel

  • Ron Meir

    Roles Formal analysis, Supervision, Writing – review & editing

    Affiliations Viterbi Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel, Network Biology Research Laboratory, Technion - Israel Institute of Technology, Haifa, Israel

Abstract

Biological systems maintain stability of their function in spite of external and internal perturbations. An important challenge in studying biological regulation is to identify the control objectives based on empirical data. Very often these objectives are time-varying and require the regulation system to follow a dynamic set-point. For example, the sleep-wake cycle varies according to the 24-hour solar day, inducing oscillatory dynamics on the regulation set-point; nutrient availability fluctuates in the organism, inducing time-varying set-points for metabolism. In this work, we introduce a novel data-driven algorithm capable of identifying internal regulation objectives that are maintained with respect to a dynamic reference value. This builds on a previous algorithm that identified variables regulated with respect to fixed set-point values. The new algorithm requires adding a prediction component that not only identifies the internally regulated variables, but also predicts the dynamic set-point as part of the process. To the best of our knowledge, this is the first algorithm that is able to achieve this. We test the algorithm on simulation data from realistic biological models, demonstrating excellent empirical results.

Introduction

Living systems maintain stability against internal and external perturbations at multiple levels of organization. This phenomenon, known as homeostasis [1–3], is essential for their functioning, and its failure is associated with disease. Typical examples include regulating the temperature and blood sugar level to within certain pre-set limits. Despite their importance, identifying the objectives of homeostatic regulation in complex systems in a data-driven way, based on abundant data without a known model, remains a computational and algorithmic challenge.

Control theory, which provides sophisticated tools to model and control dynamical systems [4], is often invoked to describe homeostasis. Yet, it does not directly enable understanding the exact objectives of homeostatic regulation in biological systems. This is due to several reasons. First, control theory is based on a clear separation between the controlled system (or plant) and the controller. In biological systems, the “plant” and “controller” have developed together through evolution, generating a complex network of both positive and negative feedback interactions and cannot be separated [5]. Second, control design is based on a mathematical model for the plant, and deriving such a model for biological systems is not always possible. This highlights the importance of empirical methods to identify biological regulation in a data-driven way.

A related challenge in Machine Learning (ML) is the data-driven identification of conserved quantities in dynamical systems. ML algorithms that address this problem are applicable in different fields of science. In physics, conserved quantities arise, for example, in the context of Hamiltonian dynamics [6]. Epidemiology models (e.g., the Kermack-McKendrick model), as well as population dynamics models (e.g., the Lotka-Volterra model, or the Monod chemostat model), have constants of motion with significant biological implications (see, e.g., [7, 8]). Dedicated ML algorithms have been developed to detect such conserved quantities [9–12], but their applicability to biological systems is limited. For example, Inverse Optimal Control and Inverse Reinforcement Learning, which are used in a related context, assume a clear separation between a plant and a controller (see [13, 14] for detailed surveys of both fields). Most algorithmic approaches directed at physical systems search for conserved quantities under the assumption of a given dynamical system with fixed parameters, whereas in biology this is not necessarily the case. Therefore, dedicated algorithmic tools are required to identify the regulated quantities in biological systems in a data-driven manner.
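To make the notion of a constant of motion concrete, the following sketch integrates the classical Lotka-Volterra predator-prey equations and checks that the textbook conserved quantity V(x, y) = δx − γ ln x + βy − α ln y stays numerically constant along a trajectory. The parameter values and the RK4 integrator are illustrative choices, not taken from any of the cited works:

```python
import numpy as np

# Lotka-Volterra predator-prey model: dx/dt = a*x - b*x*y, dy/dt = d*x*y - g*y.
# Its known constant of motion is V(x, y) = d*x - g*ln(x) + b*y - a*ln(y).
a, b, g, d = 1.0, 0.5, 1.0, 0.5  # illustrative rate constants

def f(s):
    x, y = s
    return np.array([a * x - b * x * y,
                     d * x * y - g * y])

def V(s):
    x, y = s
    return d * x - g * np.log(x) + b * y - a * np.log(y)

# Integrate with classical RK4; V should stay (numerically) constant.
s = np.array([2.0, 1.0])   # initial prey/predator densities
dt = 1e-3
vals = [V(s)]
for _ in range(20000):
    k1 = f(s)
    k2 = f(s + dt / 2 * k1)
    k3 = f(s + dt / 2 * k2)
    k4 = f(s + dt * k3)
    s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    vals.append(V(s))
drift = max(vals) - min(vals)  # total numerical drift of the "conserved" V
```

Algorithms such as those in [9–12] aim to discover a function like V directly from trajectory data, without being given its closed form.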

Identifying Regulation with Adversarial Surrogates (IRAS) is a data-driven algorithm for detecting regulated quantities [15]. The basic idea is to find a combination of the measured variables that remains stable as a function of time, and for which shuffling the components of the combination maximally harms this stability. Such a combination is presumably made up of co-varying quantities that compensate for one another. The algorithm iteratively solves a min-max optimization problem by dynamically generating adversarial surrogates. IRAS was verified on systems with known ground-truth, and demonstrated impressive empirical results [15]. It discovered the regulated quantity in examples from different fields including protein interactions, ecological systems, a psychophysical experiment in which a stimulus signal was regulated and physical Hamiltonian systems. Recently, several sufficient conditions guaranteeing local convergence of IRAS were analytically obtained [16].

Algorithms for identifying conserved quantities assume that there exists a function of the observed variables that is constant in time (but may vary between different trajectories). Given a set of measurements over time, {z(t)}, they attempt to find a real-valued function g of the observed variables that is maintained around a fixed set-point:

g(z(t)) ≈ c.        (1)

Note that if the function g exists, it is not unique. For example, affine and other transformations of g are also fixed in time. Therefore we can assume, without loss of generality, a convenient normalization of g. Note also that direct optimization can detect trivial combinations that are not useful. For example, optimizing for conservation alone can (and does) lead to quantities such as a constant function g(z) ≡ const, independent of z. The min-max optimization formulated for IRAS [15] guarantees that the algorithm avoids convergence to such trivial results.

The word homeostasis combines the Greek words homoios (“similar”) and stasis (“standing still”), yielding the idea of “staying the same”. However, in many biological systems the regulated variables may follow a time-varying function. This is the case, for example, in systems that are entrained to various rhythms, like the 24-hour solar day [17–20]. A limitation of many ML algorithms, including IRAS, is that they only identify variables that are regulated around a constant, time-independent value. Thus, they are incompatible with time-dependent regulation.

Allowing a dynamically changing reference value, in contrast to (1), is motivated by biological phenomena such as circadian rhythms in body temperature, sleep-wake cycle regulation, and seasonal changes in fur thickness in animals. Well-documented examples include baroreflex control of the cardiovascular system, where slow blood-pressure and heart-rate oscillations are observed [21]; neuronal oscillations in brain activity, which are crucial for cognition and motor control [22]; and menstrual cycles of hormones like estrogen, which regulate ovulation and menstruation [23].

In this work, we consider systems that regulate an unknown function of the observables around an unknown dynamically changing reference value,

g(z(t)) ≈ c(t).        (2)

We do not assume any specific form for either the regulated function g() or the setpoint process c().

The IRAS algorithm searches for a function that is regulated about a fixed set-point (1). When the reference value itself varies in time, as in (2), the task becomes impossible for IRAS to solve. Here we generalize IRAS to allow inferring meaningful dynamic regulatory processes, as is required in many biological settings. We present Identifying Dynamic Regulation with Adversarial Surrogates (IDRAS), a purely data-driven algorithm that simultaneously learns the two separate unknowns: a function g() and the dynamic process c() that it follows. This removes a crucial obstacle to the detection of regulation in biological systems, which exhibit sustained oscillations and other modulations of their homeostatic set-points. We expect this algorithm to be widely applicable to measured biological data at multiple levels of organization, and to contribute to revealing their regulatory logic.

The remainder of this paper is organized as follows. Section ‘Algorithm development’ details the new IDRAS algorithm, and Section ‘Examples’ validates its performance on several simulated datasets with a known control objective. The final section concludes and describes possible directions for further research. Throughout the manuscript, we interchangeably refer to the regulated quantity as either a function of the observables or a combination of the observables. When referring to a function that remains constant over time, we also describe it as a conserved quantity.

Algorithm development

IDRAS is an iterative algorithm consisting of two competing players. The input to the algorithm is an observed sequence of measurements, z(t), and the goal is to identify both a function of these observables, g(z(t)), and a dynamic reference value c(t) (2). Since we assume that g(z(t)) follows the reference c(t), it should be possible, given the observations z(t) for t ≤ t1, to predict a future value g(z(t2)), where t2 > t1. Intuitively, to detect the coupled pair g() and c(), one could straightforwardly minimize the prediction error, but this may lead to a trivial solution such as a constant g. To overcome this difficulty, the IDRAS algorithm iteratively optimizes a quantitative measure that characterizes the sensitivity of the prediction error to destroying the temporal order of the observed time-series. It is expected that a regulatory process which involves co-variation among variables would be sensitive to their temporal order, while a trivial constant would not.

The algorithm iterates between two players, as seen in Fig 1. The first, the Combination player, iteratively minimizes the prediction error, while the other, the adversarial Shuffle player, successively creates more constrained shuffled ensembles of the data. We call these ensembles "adversarial surrogates", since the second player aims to render the task of the first player more difficult by forcing it to extract information about the temporal structure of the data, which is absent from surrogate data created by random time-shuffling. Upon convergence, the algorithm outputs a coupled pair: the function g(z(t)) and the dynamics c(t) that it follows. This makes it possible to assess the significance of the identified function. We next present a detailed mathematical formulation of the problem and the IDRAS algorithm.

Fig 1. IDRAS algorithm outline.

The observation time-series z is permuted according to (8) to create the unconstrained surrogate series. The shuffle player, exposed only to the 1D filtering errors, sets the resampling function used to resample from the surrogate series such that the distributions of filtering errors are identical, (10). Then the combination player, given z and the resampled surrogates, updates the parameters towards minimizing the filtering-error variance ratio, (12). These steps continue to iterate until no further improvement is possible. The block computing the filtering error, based on (5) and whose architecture is detailed in Fig 2, replaces the function g() in Fig 3 of [15].

https://doi.org/10.1371/journal.pone.0325443.g001

Problem formulation

We are interested in identifying empirically, from a set of measured observables, a function of these observables which tightly follows a dynamically changing reference value, where both are unknown. The function could represent an internal quantity of high importance to the system, and the reference value could reflect temporal trends in the environment. The input to the algorithm is a sequence of observations z_1, z_2, …, where typically z_k is the noisy output of a continuous-time dynamical system measured at time t_k. We would like to identify two parametric functions. The first, g(), is such that

c_k = g(z_k)        (3)

is the time-series of the (learned) regulated quantity. We assume that this time-series follows some unknown dynamics, and that it is possible to learn a filter, namely a predictor of the value of c_k based on its previous values (formally, (4) below is a 1-step predictor, but, following [24], we refer to it as a filter). This leads to our second learned function (predictor), formally given by

ĉ_k = F(c_{k-T:k-1}),        (4)

where T > 0 is a hyper-parameter. The structure of the learned filter F, and the meaning of its parameters, is detailed at the end of the present Section. The filtering error is

e_k = c_k − ĉ_k,        (5)

and is a function of the parameter vector

θ = (θ_g, θ_F),        (6)

where θ_g parameterizes the combination g and θ_F parameterizes the filter F. The goal is to converge to a parameter vector θ* such that the error is small, namely e_k ≈ 0 for all k. We assume that the parametric functions g and F are neural networks and θ is the vector of parameters in these networks. We note that a straightforward optimization yields a trivial pair, e.g., a constant g together with a constant filter output, implying the need to formulate a different optimization problem that avoids this trivial solution.

The problem formulated is in fact a generalization of the static problem (1). It is immediate to verify that by fixing the filter to F(c_{k-T:k-1}) = c_{k-1}, we obtain e_k = c_k − c_{k-1}, and thus we optimize to find a function g such that g(z_k) ≈ g(z_{k-1}), namely one regulated around a fixed value. This simpler problem was addressed in [15] and in [16], where several sufficient conditions guaranteeing local convergence of IRAS were derived. The next Section introduces the generalization of this algorithm to time-varying regulation set-points.

Identifying dynamic regulation with adversarial surrogates

IDRAS is based on the probabilistic assumption that the noisy sample series {z_k} is the output of a stationary process and thus admits a probability density function (PDF) f_z, implying that the sequence z_{k-T:k} admits a PDF f_{z^(T)} for all T > 0. We denote the random variable of a length-(T + 1) sequence by

z^(T) = (z_{k-T}, …, z_k),

where z^(T) ~ f_{z^(T)}. Then the filtering error is a scalar projection of z^(T), obtained by passing the sequence through the learned combination g and filter F, whose (scalar) PDF is

f_e(e; θ).        (7)

To denote the variance of the filtering error we will use the notation σ²_e(θ) and, in general, for any PDF of sequences we denote the PDF [variance] of its error projection by f_e [σ²_e].

Straightforward minimization of σ²_e(θ) will usually lead to a trivial solution (a constant g), as mentioned earlier. Therefore, to find a non-trivial θ, IDRAS also uses a surrogate PDF f_{z̃^(T)} that corresponds to a random variable z̃^(T) and is defined by

f_{z̃^(T)}(z_{k-T:k}) = f_{z^(T-1)}(z_{k-T:k-1}) · f_z(z_k).        (8)

On the right-hand side we see that f_{z̃^(T)} is the PDF for the case where a sequence of observations of length T is followed by a random observation drawn from the marginal distribution. Clearly this ‘detached’ observation z_k, or any non-trivial function g(z_k) of it, cannot be predicted from z_{k-T:k-1} with a mean-square-error lower than its marginal variance. Therefore, given the structure of f_{z̃^(T)}, to learn a meaningful pair (g, F) it seems natural to find a vector θ that minimizes the ratio of filtering error variances,

σ²_e(θ) / σ̃²_e(θ).        (9)

While (9) eliminates the trivial solution of a constant g, which will give a ratio of 0/0, it fails to eliminate all trivial solutions. Consider the case of analyzing a time-series of vital physiological signals where the entries of z_k are the values of blood-pressure and heart-rate sampled every second. Let T = 1. Clearly, all the samples z^(1) = (z_{k-1}, z_k), where z^(1) is the concatenation of two sequential samples, are plausible sequential cardiovascular measurements. These samples admit, among other constraints, that the rate of change of heart-rate cannot exceed a maximal physiological value. A sample drawn from the surrogate PDF, where the samples are not sequential, may violate this physiological constraint. Such an artefact, which is an implausible physiological observation, can artificially increase the denominator in (9), in turn leading to an erroneous solution. We refer to such a solution as trivial since it highlights a feature in the data that is a biological constraint rather than the result of a regulatory process (see Discussion for further details on this). To address this difficulty, IDRAS iteratively eliminates non-plausible observations in the surrogate PDF. In what follows we mathematically formalize the case of a trivial, artefact solution.

Remark 1. Assume that there exists a set A of sequences such that z^(T) ∈ A almost surely, while the surrogate z̃^(T) falls outside A with positive probability. Then any vector θ for which the error vanishes for all sequences in A but not for surrogate samples outside A minimizes the ratio in (9), achieving the value zero. This is usually a "trivial", or artifact, solution that reflects constraints in the data; however, a straightforward optimization algorithm minimizing the ratio (9) may converge to such a solution.
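Before the two-player refinement, the plain ratio objective (9) with unconstrained surrogates (8) can be illustrated on a toy dataset that is free of such constraint artefacts. The sketch below is hypothetical and not the published implementation: two observables whose sum is regulated per trajectory, a simple mean-of-the-past stand-in for the filter F, and a grid search over one-parameter linear combinations in place of gradient training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: M trajectories of length L; z1 + z2 is held near a
# per-trajectory set-point C_m, while each component varies freely.
M, L, T = 200, 50, 3
C = rng.normal(10.0, 2.0, size=M)                 # per-trajectory set-points
u = rng.normal(0.0, 1.0, size=(M, L))             # freely varying component
z1 = u
z2 = C[:, None] - u + 0.05 * rng.normal(size=(M, L))
Z = np.stack([z1, z2], axis=-1)                   # (M, L, 2)

def windows_of(Z):
    # all length-(T+1) windows, pooled over trajectories
    W = [Z[:, i:i + T + 1, :] for i in range(L - T)]
    return np.concatenate(W, axis=0)              # (n, T+1, 2)

W = windows_of(Z)
# unconstrained surrogates, eq. (8): detach the last sample of each window
Ws = W.copy()
Ws[:, -1, :] = W[rng.permutation(len(W)), -1, :]

def errs(w, theta):
    # stand-in "filter": predict the last combination value from the mean
    # of the preceding ones (the real F is a learned neural predictor)
    c = w @ theta                                 # (n, T+1)
    return c[:, -1] - c[:, :-1].mean(axis=1)

def ratio(theta):
    # filtering-error variance ratio, eq. (9)
    return errs(W, theta).var() / (errs(Ws, theta).var() + 1e-12)

# grid search over unit-norm combinations (stand-in for training g)
angles = np.linspace(0.0, np.pi, 181)
best = min(angles, key=lambda ang: ratio(np.array([np.cos(ang), np.sin(ang)])))
theta_star = np.array([np.cos(best), np.sin(best)])
```

Minimizing the ratio recovers theta_star proportional to (1, 1), i.e., the regulated sum z1 + z2, rather than a trivial combination.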

IDRAS is an iterative algorithm consisting of two competing players, a generalization of the IRAS algorithm. The players solve a min-max style optimization problem with respect to the ratio of filtering error variances.

Shuffle player: Given the current solution θ, update the surrogate PDF f_{z̃^(T)} to a modified surrogate PDF f_{ẑ^(T)} by

f_{ẑ^(T)}(z) = ψ(e(z; θ)) · f_{z̃^(T)}(z),        (10)

where ψ is a resampling function that is chosen such that the modified surrogate PDF satisfies

f_ê(e; θ) = f_e(e; θ).        (11)

The goal of the shuffle player is thus to transform f_{z̃^(T)} to a PDF f_{ẑ^(T)} such that the statistical properties of the filtering errors of the random variables z^(T) and ẑ^(T) are identical. This overcomes the difficulty described in Remark 1. A closed-form expression for ψ is given in Lemma 1 below.

Combination player: Set the parameter vector θ to minimize the ratio of filtering error variances,

θ ← argmin_θ σ²_e(θ) / σ̂²_e(θ),        (12)

where σ̂²_e is the filtering error variance under the modified surrogate PDF.

The shuffle player is “adversarial” w.r.t. the combination player because (11) implies that the ratio of filtering error variances is driven back to one, in contrast to its minimization in (12). The two players mutually inform each other of their current step results, and the process continues iteratively until the combination player can no longer decrease the ratio in (12). We refer to this algorithm as IDRAS and depict its outline in Fig 1.

Lemma 1. The function ψ guaranteeing that (11) holds is given by

ψ(e) = f_e(e; θ) / f_ẽ(e; θ),        (13)

where f_ẽ is the PDF of the filtering error under the surrogate PDF f_{z̃^(T)}. Furthermore, this choice also guarantees that f_{ẑ^(T)} in (10) is indeed a PDF.

For the proof of Lemma 1 we refer the reader to the proof of Lemma 1 in [16], with the notation adapted to the present setting: the surrogate and filtering-error PDFs defined above replace their static counterparts.
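On finite samples, the resampling step of the shuffle player can be approximated empirically. The sketch below is a hypothetical, histogram-based approximation of (13): it reweights surrogate windows so that the empirical distribution of their filtering errors matches that of the data, then resamples by those importance weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def match_error_distribution(e_real, e_surr, n_bins=30):
    """Resample surrogate indices so the surrogate filtering-error
    distribution approximates the real one (histogram stand-in for (13))."""
    lo = min(e_real.min(), e_surr.min())
    hi = max(e_real.max(), e_surr.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p_real, _ = np.histogram(e_real, bins=bins, density=True)
    p_surr, _ = np.histogram(e_surr, bins=bins, density=True)
    # importance weight of each surrogate error: psi(e) ~ f_e(e) / f_etilde(e)
    idx = np.clip(np.digitize(e_surr, bins) - 1, 0, n_bins - 1)
    w = p_real[idx] / np.maximum(p_surr[idx], 1e-12)
    w /= w.sum()
    # resample surrogate ensemble indices according to the weights
    return rng.choice(len(e_surr), size=len(e_surr), p=w, replace=True)

# demo: surrogate errors are much wider than real errors; after resampling,
# their spread should shrink toward the real one
e_real = rng.normal(0.0, 1.0, 20_000)
e_surr = rng.normal(0.0, 3.0, 20_000)
kept = match_error_distribution(e_real, e_surr)
```

The returned indices select a surrogate sub-ensemble whose error statistics mimic the data, which is exactly what (11) demands of the modified surrogate PDF.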

Filter.

The architecture of the filter within the block in Fig 1 has many degrees of freedom and can be chosen by the user according to prior knowledge regarding the nature of the dynamic reference. To impose a minimal number of constraints on the filter, it can be implemented by a fully connected deep neural network.

Dealing with biological systems, we assume that the dynamics of the reference can be modeled by a continuous-time, time-invariant latent model (see [25, 26] for details on integrating differential equations using neural networks). Our filter-block contains three parts: (i) an encoder E that, given T consecutive values of the reference, c_{k-T:k-1}, infers a latent state h_{k-1} (its dimension is a user-defined hyper-parameter); (ii) a drift function d describing the deterministic term in the dynamics of the latent state, which serves to time-advance the latent state to h_k; (iii) an emission function D that decodes the 1-step predicted value ĉ_k from the latent state h_k. The following set of equations describes the filter (4),

h_{k-1} = E(c_{k-T:k-1}),   h_k = h_{k-1} + Δ · d(h_{k-1}),   ĉ_k = D(h_k),        (14)

depicted as part of the block in Fig 2, where Δ is the integration time-step. Concluding the presentation of IDRAS, we note that by construction (see Eq (8)) IDRAS is deliberately incapable of detecting stationary control objectives, and therefore it is not a substitute for IRAS. The two algorithms search for fundamentally different control objectives and should be evaluated independently, as we demonstrate in the following examples. See the S1 Appendix for a detailed mathematical explanation.
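A minimal numpy sketch of the three-part filter block follows. The weights here are random placeholders (in IDRAS they belong to θ and are trained jointly with g), and the tanh non-linearities and single Euler step of the drift are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_latent, delta = 5, 8, 1.0  # history length, latent dim, Euler step

# random placeholder weights for encoder, drift and emission maps
W_enc = rng.normal(scale=0.3, size=(d_latent, T))
W_drift = rng.normal(scale=0.3, size=(d_latent, d_latent))
w_dec = rng.normal(scale=0.3, size=d_latent)

def filter_step(c_window):
    """1-step prediction c_hat_k from the history c_{k-T:k-1}, as in (14)."""
    h = np.tanh(W_enc @ c_window)            # (i) encoder -> latent state
    h = h + delta * np.tanh(W_drift @ h)     # (ii) Euler step of the drift
    return w_dec @ h                         # (iii) emission (decoder)

c_window = np.sin(np.linspace(0.0, 1.0, T))  # dummy reference history
c_hat = filter_step(c_window)
```

A deeper encoder, multiple integration sub-steps, or a stochastic latent model can replace these placeholders without changing the three-part structure.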

Fig 2. Architecture of the block (5) from Fig 1.

The filtering error e_k is the difference between the learned reference value c_k and its prediction ĉ_k. The filter F, (14), infers a latent state h_{k-1} and time-advances it to the state h_k, from which the estimate ĉ_k is decoded.

https://doi.org/10.1371/journal.pone.0325443.g002

Multiple time-series.

Often, multiple time-series are observed from a system or from similar systems. The algorithm can be applied to multiple time-series without any modifications. In the description above, we simply replace z_k by z_{m,k}, which represents the k-th sample of the m-th time-series and is distributed according to f_z; likewise for c_{m,k}, ĉ_{m,k} and e_{m,k}. We note that the algorithm processes multiple time-series concurrently in a single execution that results in a single solution θ*.

Examples

After presenting the construction of IDRAS, we seek to validate it on datasets with a known control objective, so that the quality of the results can be assessed. To assess the performance, we calculate the Pearson correlation ρ between the known control objective c* and the output of IDRAS, c_k. In addition, we calculate the normalized mean-square-error (NMSE) of the learned filter, in which the prediction error is normalized by the variability of the learned reference.

Below we present two validation examples using models of biological dynamic processes: the kinetics of protein interactions, and the bacterial life cycle. In each model there is a known control objective, detected by IDRAS. In S1 Appendix we present validation on a synthetic example containing two independent (and therefore uncorrelated) control objectives; running IDRAS several times shows that in each evaluation it successfully converges to one of the control objectives, and not to some combination of the two, which would be an undesired result. The architecture of all parameterized functions and the values of all the example model parameters are listed in S1 Appendix, which also gives the corresponding expressions for multiple time-series processing. Code reproducing the results is available at https://github.com/RonTeichner/IRAS.
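The two scores can be computed as follows. The exact NMSE normalization used in the paper is not reproduced here; the sketch assumes the common choice of dividing the mean squared prediction error by the variance of the reference series:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    return np.corrcoef(x, y)[0, 1]

def nmse(c, c_hat):
    """Assumed normalization: mean squared error / variance of the reference."""
    c = np.asarray(c, float)
    c_hat = np.asarray(c_hat, float)
    return np.mean((c - c_hat) ** 2) / np.var(c)

# sanity check: an affine transform of the target has correlation 1
t = np.linspace(0.0, 10.0, 500)
c_true = np.sin(t)
c_learned = 0.5 * np.sin(t) + 0.3
```

Since ρ is invariant to affine transformations of its arguments, it is a natural score for g, which is itself only identifiable up to such transformations.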

A kinetic model of regulated gene expression

Circuits and networks of interacting proteins and other cellular components are thought to take part in control mechanisms inside the cell [27]. Specifically, the regulation of gene expression can be modeled at a coarse-grained level by kinetic equations, where continuous variables represent concentrations of participating proteins, mRNA or other molecules, and their interactions are formulated in mass-action approximation [28].

Our first example focuses on a model that describes the production of two proteins whose sum is maintained near a setpoint by a feedback loop (inspired by [29]). An mRNA molecule M is transcribed at a rate K and degraded at a constant rate. This molecule in turn determines the production rate of two proteins, P and S. The total amount of the two proteins, P + S, feeds back to affect the level of M. Fig 3a illustrates the kinetic interaction model. The mRNA M and the two proteins P and S are linked in a feedback loop; the strength of this negative feedback is given by the rate constant f. In the limit of strong feedback, the combination P + S is maintained around a set-point proportional to K.

Fig 3. Validation of IDRAS on an oscillating setpoint in a kinetic model.

(a) Illustration of the regulated gene expression feedback-loop model. mRNA M, produced at a rate K(t), induces the production of proteins S and P and receives a negative feedback of their sum. Since biochemical interactions are faster than mRNA production, this circuit regulates the sum S + P to follow its time course of production K(t) (see Eq 17). (b,c) Comparison of algorithm outputs (dashed red) to the known regulated combination P + S (black). (b) The IRAS algorithm does not converge, since it searches for a constant regulation set-point that does not exist in this model (the depicted output is not stable and continues to change with further iterations of IRAS). (c) The IDRAS algorithm captures with high precision the control objective and its oscillating trend.

https://doi.org/10.1371/journal.pone.0325443.g003

This model, with a fixed K resulting in a fixed set-point, was used as a validation for the IRAS algorithm in previous work [15]. Here, we incorporate a time-dependent setpoint: we assume that the external environment modulates transcription rate, resulting in an oscillatory setpoint K(t). The model is described by the stochastic differential equations

dM(t) = [K(t) − β M(t) − f (P(t) + S(t))] dt
dP(t) = [a M(t) − γ P(t)] dt + σ dW_P(t)
dS(t) = [a M(t) − γ S(t)] dt + σ dW_S(t)        (15)

where W_P(t) and W_S(t) are Wiener processes that were added to the kinetic scheme to represent various sources of noise in the biological system. The rate K(t) is given by

K(t) = K_0 [1 + A sin(ω t)].        (16)

We further assume that the typical timescale of environmental change is slower than that of the biochemical kinetic interactions. For example, circadian cycles occur over 24 hours, whereas protein production and degradation occur over minutes. Under these conditions of timescale separation, the feedback loop operates faster than the modulations in transcription rate. Small changes in S or P, modeled by increments of the Wiener processes W_P and W_S, induce swift and sharp changes in the transcription of M and maintain the sum P + S around a reference level

P(t) + S(t) ≈ K(t) / f,        (17)

extending the result for a fixed setpoint K by following its slow variations K(t). The indirect interaction between the two proteins is reflected in a negative covariation of S and P, here relative to the slowly-modulated setpoint. Our observables contain the levels of the two proteins P and S,

z_k = (P(t_k), S(t_k)),

where the sampling times t_k are set by the sampling rate.
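An Euler-Maruyama simulation in the spirit of (15)-(16) can be sketched as follows. The rate constants, initial conditions, and the oscillation period of K(t) below are illustrative assumptions consistent with the verbal description, not the published parameter set (which is listed in S1 Appendix):

```python
import numpy as np

rng = np.random.default_rng(0)

dt, n_steps = 1e-3, 40_000
beta, a, gamma, f, sigma = 5.0, 5.0, 1.0, 50.0, 0.05  # illustrative rates

def K(t):
    # slow oscillatory transcription rate (environmental modulation)
    return 100.0 * (1.0 + 0.3 * np.sin(2.0 * np.pi * t / 20.0))

M, P, S = 0.2, 1.0, 1.0          # start near the quasi-steady state
traj = np.empty((n_steps, 3))
for i in range(n_steps):
    t = i * dt
    # mRNA: transcription, degradation, negative feedback from P + S
    dM = (K(t) - beta * M - f * (P + S)) * dt
    # proteins: production from M, degradation, Wiener-process noise
    dP = (a * M - gamma * P) * dt + sigma * np.sqrt(dt) * rng.normal()
    dS = (a * M - gamma * S) * dt + sigma * np.sqrt(dt) * rng.normal()
    M, P, S = M + dM, P + dP, S + dS
    traj[i] = M, P, S

P_plus_S = traj[:, 1] + traj[:, 2]
```

With these (assumed) parameters the relaxation time of the feedback loop is far below the period of K(t), so P + S tracks K(t)/f up to noise, as in (17).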

To demonstrate the performance of IDRAS and compare it to IRAS, we simulated multiple time-series of (15) (for ease of notation we drop the time-series index m throughout the example) and ran both the IRAS and IDRAS algorithms in search of the control objective, obtaining the solutions θ*_IRAS and θ*_IDRAS (each containing the parameters of g, see (6)), respectively. Fig 3b depicts the output c(t) (dashed red) trained by the IRAS algorithm. Due to the lack of a regulated constant combination, IRAS did not converge, and the time-series c* and c have a Pearson correlation value of ρ_IRAS(c*, c) = 0.21. Fig 3c similarly depicts the output of IDRAS, showing that the combination was precisely found despite its oscillating nature, with a Pearson correlation value of ρ_IDRAS(c*, c) = 0.99.

When running IDRAS, the normalized mean-square-error can also be used to assess the quality of prediction. A low NMSE score was obtained, indicating that not only was the correct combination found, but also that its temporal modulation is well predicted.

Bacterial life cycle

The next example we consider is a realistic biological model of bacterial growth homeostasis, where growth and division proceed for many generations with significant variability and statistical stability. This is a problem with a long history that is still the focus of much current research. We apply our algorithm to simulation data, which mimic experimental measurements well but in which the regulation is known.

Most bacteria grow smoothly, with their size accumulating exponentially, and divide abruptly [30–32]. This behavior is consistent with a threshold crossing by some division indicator; specifically, three distinct indicators corresponding to different regulation modes have been studied: cell size (“sizer” control mechanism), added size (“adder” mechanism) and elapsed time (“timer”) [33, 34]. More generally, phenomenological models can interpolate continuously between these three types [32, 35, 36]. It is common practice to identify the regulation mode by empirically observing the correlation of a quantity at the end of the cell cycle - size, added size or time - with its initial value. The intuition behind this heuristic is that if some quantity triggers division when crossing a threshold, its value at division should appear uncorrelated with its initial value.
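The correlation heuristic can be illustrated with a short synthetic check. The values below are illustrative, and a fixed threshold is assumed, which is exactly the condition under which the heuristic is valid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under a FIXED threshold, a "sizer" sets division size by the threshold
# alone, so it is uncorrelated with birth size; under a "timer" (fixed
# growth duration), division size inherits birth-size fluctuations.
n = 50_000
x_b = rng.uniform(0.4, 0.6, n)                 # birth sizes
x_d_sizer = 1.0 + 0.02 * rng.normal(size=n)    # fixed threshold + noise
x_d_timer = x_b * np.exp(1.0 * 0.7)            # grow at rate 1 for time 0.7

rho_sizer = np.corrcoef(x_b, x_d_sizer)[0, 1]  # expected: ~0
rho_timer = np.corrcoef(x_b, x_d_timer)[0, 1]  # expected: ~1
```

As discussed next, this clean separation breaks down once the threshold itself fluctuates in time.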

As in most threshold processes in biology, regardless of the indicator, the threshold for division is not expected to be strictly fixed, but to fluctuate over time. In a recent paper, Luo et al. [37] demonstrated that the heuristic approach based on correlation plots can only uncover the correct mode of regulation under the restricted condition of a fixed threshold. However, an alternative method to identify the division indicator, one that takes threshold dynamics into account, was not offered. Below we simulate the dynamics of bacterial growth and division with a realistic "sizer" mechanism - namely, division occurs when cell size reaches a fluctuating threshold. We show that the IRAS algorithm, designed to detect fixed set-points, performs poorly in the presence of threshold dynamics. In contrast, IDRAS accurately identifies both the cell-division mechanism and the threshold dynamics, resulting in an excellent prediction of the time series.

The threshold u(t) is modeled by a stochastic Ornstein-Uhlenbeck process with a characteristic timescale τ_u,

du(t) = −(1/τ_u)(u(t) − u_0) dt + σ_u dW(t),        (18)

with initial condition u(0) = u_0, where dW(t) is an increment of a Wiener process. We assume that within the cycle the cell size x(t) grows exponentially at a rate α_k, until it reaches the threshold size u(t) and divides by a factor q_k. The equations describing these growth and division processes are:

x(t) = x_b,k e^{α_k (t − t_b,k)},   x_d,k = u(t_d,k),   x_b,k+1 = q_k x_d,k,        (19)

where x_b,k is the birth size and x_d,k is the size at division. The exponential growth rate is found in experiments to be randomly distributed across cycles [31, 38]; therefore, in our model we assume it is a random variable drawn independently each cycle from a Gamma distribution: α_k ~ Γ(γ_shape, γ_scale). Symmetric division with added Gaussian noise implies that q_k, distributed around 1/2, is the division fraction. The initial condition for the simulation is taken as half the threshold set-point.
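The growth-division dynamics (18)-(19) can be simulated as follows. The parameter values here are illustrative assumptions, not the published set from S1 Appendix:

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 1e-3
u0, tau_u, sigma_u = 1.0, 20.0, 0.05     # OU threshold parameters (assumed)
shape, scale = 100.0, 0.01               # Gamma growth rate, mean alpha = 1
sigma_q = 0.02                           # division-fraction noise

u, x = u0, 0.5 * u0                      # start at half the threshold set-point
alpha = rng.gamma(shape, scale)
t, t_birth = 0.0, 0.0
divisions, durations = [], []
while len(divisions) < 500:
    # one Euler-Maruyama step of the OU threshold (18) ...
    u += -(u - u0) / tau_u * dt + sigma_u * np.sqrt(dt) * rng.normal()
    # ... and exponential growth of cell size within the cycle (19)
    x *= np.exp(alpha * dt)
    t += dt
    if x >= u:                           # "sizer": division at threshold crossing
        divisions.append(x)
        durations.append(t - t_birth)
        x *= 0.5 + sigma_q * rng.normal()  # noisy symmetric division
        alpha = rng.gamma(shape, scale)    # new growth rate for the new cycle
        t_birth = t
```

With a mean growth rate of 1 and division roughly in half, the mean cycle duration comes out near ln 2, while division sizes scatter around the slowly drifting threshold rather than a fixed value.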

We simulated these dynamics of cell size over multiple cycles of growth and division and over multiple lineages (for ease of notation, we drop the lineage index m throughout the example). Here division regulation is known and follows the sizer mechanism - the cell divides when its size crosses the dynamic threshold u(t). Fig 4a depicts one such simulated lineage over time - the cell-size x(t) (dashed black) and the stochastic threshold u(t) (blue). Our observables, derived from x(t), contain the initial size, growth rate and the cycle duration, namely

z_k = (x_b,k, α_k, T_k),

Fig 4. Validation of IDRAS on a model of bacterial growth and division.

(a) Bacterial life-cycles. Cell size increases continuously during each cell cycle (black curve) and is divided approximately in half at discrete division events, see (19). Division occurs when cell size crosses the stochastic threshold process u(t), (18) (blue curve). Birth size and division size are indicated by green and red marks, respectively. (b) IRAS does not decouple the cell-size mechanism (black curve) from the threshold trend c(t) and yields a function (dashed red) that represents their mixture. (c) IDRAS captures the division mechanism.

https://doi.org/10.1371/journal.pone.0325443.g004

where T_k is the duration of the k-th cycle. We note that our choice of the observables z_k renders the identification task harder, as now, to correctly detect the sizer mechanism, the network has to learn the function g(z_k) = x_b,k e^{α_k T_k} and not just g(z) = x_d, as would be the case if x_d were an entry of z_k.

We ran both the IRAS and IDRAS algorithms in search of the division control mode. We chose a realistic parameter set for which heuristic identification methods based on data correlations fail [37]. Fig 4b depicts the output c(t) (dashed red) trained by the IRAS algorithm as a function of time over many cycles, together with the ground-truth output, the size threshold (black line). Here IRAS, aiming for a combination regulated around a fixed setpoint (see (1)), outputs a function that fuses together the division indicator and the dynamic threshold u(t), yielding a relatively low correlation between its output and the ground-truth combination: ρ_IRAS(c*, c) = 0.614.

In contrast to other algorithms, which try to fit a conserved quantity to a combination of known functions, IRAS is data-driven and model-independent. Therefore, interpreting the resulting output of the neural network requires some intuition and knowledge about the problem. To interpret the result of IRAS, we note that our model simulations are performed in the regime where threshold fluctuations are significantly slower than the single-cycle time scale, namely . In this regime the threshold is approximately constant along a single cell cycle, for . The birth size, about half the division size of the previous cell, is thus , and the size at division is . These approximations allow us to write the regulated quantity as the difference between final size and threshold, and to express it in terms of size variables:

(20)

a mixture of two terms. The first, , represents the size threshold, while the second, , reflects the dynamics of the threshold process.

IDRAS, in contrast, is designed to decouple the two quantities (division indicator and dynamic threshold) by separately learning the threshold dynamics and a function that is regulated w.r.t. this dynamic reference value (see (2)). Fig 4c depicts , demonstrating precise identification of the division control mode with a Pearson correlation of ρIDRAS(c*,c)=0.97 (for comparison, the correlations of the adder mechanism, for which , and of the timer mechanism, , are 0.74 and 0.1, respectively). The normalized-mean-square-error is , very close to the expected value of 0.25 derived from (18) and (19). To establish the latter value, note that the average cell has a growth rate of (the mean value of a Gamma distribution) and, from simple stability constraints, grows by a factor of 2. Therefore, its mean cell-cycle duration is approximately
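The comparison among sizer, adder and timer candidates can be mimicked on synthetic data. The sketch below is purely illustrative (a toy lineage with hypothetical noise levels, not the paper's simulation): it correlates each candidate combination with a slow ground-truth set-point.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Toy lineage obeying a sizer with a slow dynamic threshold (illustrative only):
u_t = 2.0 + 0.3 * np.sin(np.linspace(0, 8 * np.pi, n))   # slow set-point
x_d = u_t * np.exp(0.02 * rng.standard_normal(n))        # division size tracks u_t
x_b = np.roll(x_d, 1) / 2 * np.exp(0.05 * rng.standard_normal(n))  # noisy half-split
alpha = rng.gamma(20.0, 0.05, n)                         # per-cycle growth rates
tau = np.log(x_d / x_b) / alpha                          # implied cycle durations

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length arrays."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rho_sizer = pearson(x_d, u_t)        # sizer: division size vs. set-point
rho_adder = pearson(x_d - x_b, u_t)  # adder: added size vs. set-point
rho_timer = pearson(tau, u_t)        # timer: cycle duration vs. set-point
```

On such data the sizer candidate tracks the set-point most closely, mirroring the ordering of correlations reported above.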

Consider a cell cycle that begins at time t and has an average duration of . The variance of given u(t) is known to be (see Chapter 4.4.4 in [39]), and since the variance of u(t) is , we conclude that

if indeed IDRAS has converged to the solution

Substituting the model parameters recovers this value to two-digit precision.
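Assuming, as the reference to Chapter 4.4.4 of [39] suggests, that the threshold (18) is an Ornstein-Uhlenbeck process with reversion rate θ and stationary variance σ_u² (these symbols are placeholders for the model's actual parameters), the argument above can be summarized as:

```latex
% Conditional variance of an OU process over one mean cell cycle \bar{\tau}:
\mathrm{Var}\!\left[u(t+\bar{\tau}) \mid u(t)\right]
  = \sigma_u^2 \left(1 - e^{-2\theta\bar{\tau}}\right),
  \qquad \bar{\tau} \approx \frac{\ln 2}{\bar{\alpha}} .
% Normalizing by the stationary variance gives the expected error floor:
\mathrm{NMSE}
  \approx \frac{\mathrm{Var}\!\left[u(t+\bar{\tau}) \mid u(t)\right]}{\sigma_u^2}
  = 1 - e^{-2\theta\bar{\tau}} .
```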

We conclude that, judging by either measure of performance, IDRAS successfully detected the division indicator in this example despite the dynamic threshold. This task is known to be unsolvable by heuristic methods, and no alternative identification method had previously been suggested.

Finally, we comment on relations to real experimental data. Modern experiments that integrate microfluidic techniques and advanced image analysis enable direct measurements of growth and division cycles within a lineage over extended periods, as simulated in this study. Observations and data collection are conducted across various laboratories [31, 32, 38, 40–46]. These datasets provide a valuable test-bed for evaluating different hypotheses regarding regulatory mechanisms. In practical applications of the algorithm to real-world data, predefined hypotheses can be compared to the algorithm’s output, as demonstrated here. Alternatively, a comprehensive analysis of the input-output relationship of the trained network should be conducted as a complementary task following an IDRAS run. This is further elaborated in the Discussion section.

Discussion

Detecting regulated or conserved quantities in dynamic data is a technically challenging problem with many potential applications. Recently, the IRAS algorithm was introduced [15, 16]; it receives raw dynamic measurements as input and provides functions of the observables that are maximally conserved across time. Building on this advance, we presented here IDRAS, an algorithm capable of identifying control objectives that are regulated with respect to a dynamic reference value. This removes a crucial obstacle in applying such identification algorithms to detect regulation in biological systems, which exhibit sustained oscillations and other modulations of their homeostatic set-points.

Using models with known ground-truth control, we presented empirical results demonstrating that IDRAS can simultaneously identify the control objective (a combination of the observed variables that is regulated) and predict its dynamics. Being a purely data-driven method, our approach explains the system’s behavior in the ‘language’ of the observables, without prior assumptions. On the other hand, this approach is obviously constrained by the measurements; the underlying assumption is that, with a large number of measurements, control objectives can be well approximated by a combination of the observables. Upon completion of an IDRAS run, the algorithm provides the normalized-mean-square-error, a performance metric indicating how closely the system adheres to the identified dynamic set-point.
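A normalized-mean-square-error of this kind can be computed as the mean squared deviation of the combination from its predicted set-point, normalized by the variance of the combination. The exact normalization used by IDRAS may differ, so the following is only a plausible sketch:

```python
import numpy as np

def nmse(c, c_hat):
    """Mean squared deviation of a combination c(t) from its predicted
    dynamic set-point c_hat(t), normalized by the variance of c."""
    c, c_hat = np.asarray(c, float), np.asarray(c_hat, float)
    return float(np.mean((c - c_hat) ** 2) / np.var(c))

# A combination tightly tracking its set-point yields a small NMSE,
# while predicting only the mean (no dynamics) yields NMSE = 1 exactly.
t = np.linspace(0.0, 10.0, 1000)
c = np.sin(t) + 0.05 * np.random.default_rng(2).standard_normal(1000)
err_dynamic = nmse(c, np.sin(t))                  # set-point explains the dynamics
err_static = nmse(c, np.full_like(c, c.mean()))   # no dynamic information
```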

The control objective is represented by the function implemented by the trained network. In contrast to other approaches, the network does not confine the result to a pre-defined class of functions but rather covers a vast range of possibilities. The drawback is that the network provides the output function without a closed-form expression or an interpretation of the result. Therefore, analyzing the learned function after an IDRAS run is required in order to provide such an interpretation. This analysis can be performed manually, by fixing certain inputs and examining the effects of others on the output, or automatically, using Symbolic Regression tools, which aim to derive an analytical expression for the network’s functionality [47] (see example B, Eq 14 in [15]).
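The manual analysis described here, fixing certain inputs and examining the effect of others, can be organized as a simple one-at-a-time sensitivity probe. The trained network is stood in for below by a hypothetical closed-form function `g`; nothing here is specific to the IDRAS implementation:

```python
import numpy as np

def sensitivity(f, z0, scale, eps=1e-3):
    """One-at-a-time probe of a black-box function: perturb each input of the
    reference point z0 by eps*scale[i] and record the resulting output slope."""
    base = f(z0)
    slopes = []
    for i in range(len(z0)):
        z = z0.copy()
        z[i] += eps * scale[i]
        slopes.append((f(z) - base) / (eps * scale[i]))
    return np.array(slopes)

# Hypothetical stand-in for the trained network: g(z) = z0 * exp(z1 * z2),
# e.g. division size as a function of (birth size, growth rate, duration).
g = lambda z: z[0] * np.exp(z[1] * z[2])
z0 = np.array([1.0, 1.0, 0.7])       # illustrative operating point
s = sensitivity(g, z0, scale=np.ones(3))
```

Inputs with near-zero slopes can then be ruled out of the combination, while the remaining slopes suggest a functional form to test with symbolic regression.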

The data-driven approach incorporated in IRAS and IDRAS is not limited to a small number of variables; this is a strength relative to other methods, where combinatorics explode and make identification infeasible. However, for very large system dimensionality, interpreting the resulting combination can be difficult. Moreover, it can be argued that tracking of a set-point in a high-dimensional system does not necessarily imply the existence of a control mechanism within the system: a set-point can result from the mutual interactions within the biological environment. Such a phenomenon aligns with the concept of distributed control, where autonomous controllers are dispersed throughout the system [48, 49] and a global control objective emerges. Another possibility is that, in a system with multiple feedback loops, a combination will be detected which is not a control objective per se; it can nevertheless represent a meaningful compound quantity that may help shed light on the system’s functionality.

What can we learn about biological control in cases where dimensionality is too high, or direct interpretation is difficult for other reasons? One can simply plot the output of the learned function across time and observe its characteristic timescale. This observation can provide important information on the system’s behavior, under the assumption that the output encodes a quantity of importance. An additional characteristic of the system that can be determined directly from the IDRAS output is the effective memory time, derived from the hyper-parameter T. As time progresses, the impact of previous states on the current state of the system diminishes. Consequently, increasing T beyond the effective memory time does not improve the detection and inference results. Researchers are advised to conduct multiple iterations of IDRAS, incrementing T until no further improvement in the normalized-mean-square-error is observed. The final value of T then represents the system’s effective memory time.
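The suggested scan over T can be written as a simple stopping rule. `run_idras` below is a hypothetical stand-in that returns the normalized-mean-square-error of a full IDRAS run for a given history length T:

```python
def find_memory_time(run_idras, T_values, tol=1e-3):
    """Increase T until the NMSE stops improving by more than tol;
    the last T that helped estimates the system's effective memory time."""
    best_T, best_err = T_values[0], run_idras(T_values[0])
    for T in T_values[1:]:
        err = run_idras(T)
        if best_err - err < tol:      # no meaningful improvement: stop
            break
        best_T, best_err = T, err
    return best_T, best_err

# Toy stand-in: the error saturates once T exceeds the true memory time (T=5).
errs = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.25, 6: 0.2499, 7: 0.2499}
T_star, e_star = find_memory_time(lambda T: errs[T], list(errs))
```

In practice each call to `run_idras` is a full training run, so the scan is coarse; a small tolerance guards against stopping on noise in the error estimate.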

While we expect our algorithm to be useful for analyzing a broad range of empirical data, we emphasize that such an analysis is not expected to be completely automatic and should always be accompanied by some understanding of the biological system. For example, the result can be compared to a-priori hypotheses generated by other, more heuristic methods; competing hypotheses can be compared to one another to find the one most consistent with the data. In addition, caution should be taken, as spurious correlations between measurements may result in artefactual regulated combinations.

Topics for further research include a rigorous analysis of the algorithm to prove convergence and to characterize its equilibrium points and stability. We expect this algorithm to be widely applicable to experimental biological measurements at multiple levels of organization, and to contribute to revealing their regulatory logic.

Supporting information

S1 Appendix.

Please find the Appendix in the IDRAS_Appendix.pdf file

https://doi.org/10.1371/journal.pone.0325443.s001

(PDF)

Acknowledgments

Helpful discussions with Michael Margaliot are gratefully acknowledged.

References

  1. Billman GE. Homeostasis: the underappreciated and far too often ignored central organizing principle of physiology. Front Physiol. 2020:200.
  2. Hsiao V, Swaminathan A, Murray RM. Control theory for synthetic biology: recent advances in system characterization, control design, and controller implementation for synthetic biology. IEEE Control Syst Mag. 2018;38(3):32–62.
  3. Kotas ME, Medzhitov R. Homeostasis, inflammation, and disease susceptibility. Cell. 2015;160(5):816–27. pmid:25723161
  4. Sontag ED. Mathematical control theory: deterministic finite dimensional systems. 2nd ed. New York: Springer; 1998.
  5. El-Samad H. Biological feedback control-Respect the loops. Cell Syst. 2021;12(6):477–87. pmid:34139160
  6. Goldstein H. Classical mechanics. 2nd ed. Addison-Wesley; 1980.
  7. Murray JD. Mathematical biology: I. An introduction. Springer. 2007.
  8. Kakizoe Y, Morita S, Nakaoka S, Takeuchi Y, Sato K, Miura T, et al. A conservation law for virus infection kinetics in vitro. J Theor Biol. 2015;376:39–47. pmid:25882746
  9. Liu Z, Tegmark M. Machine Learning Conservation Laws from Trajectories. Phys Rev Lett. 2021;126(18):180604. pmid:34018805
  10. Mototake Y-I. Interpretable conservation law estimation by deriving the symmetries of dynamics from trained deep neural networks. Phys Rev E. 2021;103(3–1):033303. pmid:33862698
  11. Greydanus S, Dzamba M, Yosinski J. Hamiltonian neural networks. Adv Neural Inf Process Syst. 2019;32.
  12. Dierkes E, Flaßkamp K. Learning hamiltonian systems considering system symmetries in neural networks. IFAC-PapersOnLine. 2021;54(19):210–6.
  13. Ab Azar N, Shahmansoorian A, Davoudi M. From inverse optimal control to inverse reinforcement learning: a historical review. Annu Rev Control. 2020;50:119–38.
  14. Arora S, Doshi P. A survey of inverse reinforcement learning: challenges, methods and progress. Artif Intell. 2021;297:103500.
  15. Teichner R, Shomar A, Barak O, Brenner N, Marom S, Meir R, et al. Identifying regulation with adversarial surrogates. Proc Natl Acad Sci U S A. 2023;120(12):e2216805120. pmid:36920920
  16. Teichner R, Meir R, Margaliot M. Analysis of the identifying regulation with adversarial surrogates algorithm. IEEE Control Syst Lett. 2024.
  17. Thorsen K, Agafonov O, Selstø CH, Jolma IW, Ni XY, Drengstig T, et al. Robust concentration and frequency control in oscillatory homeostats. PLoS One. 2014;9(9):e107766. pmid:25238410
  18. Krieger DT. The clocks that time us: (physiology of the circadian timing system). Psychosomat Med. 1982;44(6):559–60.
  19. Dunlap JC, Loros JJ, DeCoursey PJ. Chronobiology: biological timekeeping. Sinauer Associates. 2004.
  20. Lloyd D, Rossi EL. Ultradian rhythms in life processes: An inquiry into fundamental principles of chronobiology and psychobiology. Springer. 2012.
  21. Di Rienzo M, Parati G, Radaelli A, Castiglioni P. Baroreflex contribution to blood pressure and heart rate oscillations: time scales, time-variant characteristics and nonlinearities. Philos Trans A Math Phys Eng Sci. 2009;367(1892):1301–18. pmid:19324710
  22. Doelling KB, Assaneo MF. Neural oscillations are a start toward understanding brain activity rather than the end. PLoS Biol. 2021;19(5):e3001234. pmid:33945528
  23. Critchley HO, Maybin JA, Armstrong GM, Williams AR. Physiology of the endometrium and regulation of menstruation. Physiol Rev. 2020.
  24. Anderson BD, Moore JB. Optimal filtering. Courier Corporation. 2012.
  25. Li X, Wong TKL, Chen RTQ, Duvenaud D. Scalable gradients for stochastic differential equations. In: International Conference on Artificial Intelligence and Statistics. 2020.
  26. Kidger P, Foster J, Li X, Oberhauser H, Lyons T. Neural SDEs as infinite-dimensional GANs. In: International Conference on Machine Learning. 2021.
  27. Sontag ED. Molecular systems biology and control. Eur J Control. 2005;11(4–5):396–435.
  28. Hasty J, McMillen D, Isaacs F, Collins JJ. Computational studies of gene regulatory networks: in numero molecular biology. Nat Rev Genet. 2001;2(4):268–79. pmid:11283699
  29. El-Samad H. Biological feedback control-Respect the loops. Cell Syst. 2021;12(6):477–87. pmid:34139160
  30. Wang P, Robert L, Pelletier J, Dang WL, Taddei F, Wright A, et al. Robust growth of Escherichia coli. Curr Biol. 2010;20(12):1099–103. pmid:20537537
  31. Brenner N, Braun E, Yoney A, Susman L, Rotella J, Salman H. Single-cell protein dynamics reproduce universal fluctuations in cell populations. Eur Phys J E. 2015;38:1–9.
  32. Susman L, Kohram M, Vashistha H, Nechleba JT, Salman H, Brenner N. Individuality and slow dynamics in bacterial growth homeostasis. Proc Natl Acad Sci U S A. 2018;115(25):E5679–87. pmid:29871953
  33. Amir A. Cell size regulation in bacteria. Phys Rev Lett. 2014;112(20):208102.
  34. Jun S, Taheri-Araghi S. Cell-size maintenance: universal strategy revealed. Trends Microbiol. 2015;23(1):4–6. pmid:25497321
  35. Brenner N, Newman CM, Osmanović D, Rabin Y, Salman H, Stein DL. Universal protein distributions in a model of cell growth and division. Phys Rev E Stat Nonlin Soft Matter Phys. 2015;92(4):042713. pmid:26565278
  36. Biswas K, Brenner N. Universality of phenotypic distributions in bacteria. Phys Rev Res. 2024;6(2):L022043.
  37. Luo L, Bai Y, Fu X. Stochastic threshold in cell size control. Phys Rev Res. 2023;5(1):013173.
  38. Sassi AS, Garcia-Alcala M, Aldana M, Tu Y. Protein concentration fluctuations in the high expression regime: Taylor’s law and its mechanistic origin. Phys Rev X. 2022;12(1):011051. pmid:35756903
  39. Gardiner C. Stochastic methods. Berlin: Springer; 2009.
  40. Nordholt N, van Heerden JH, Bruggeman FJ. Biphasic cell-size and growth-rate homeostasis by single Bacillus subtilis cells. Curr Biol. 2020;30(12):2238-2247.e5. pmid:32413303
  41. Si F, Le Treut G, Sauls JT, Vadia S, Levin PA, Jun S. Mechanistic origin of cell-size control and homeostasis in bacteria. Curr Biol. 2019;29(11):1760-1770.e7. pmid:31104932
  42. Stawsky A, Vashistha H, Salman H, Brenner N. Multiple timescales in bacterial growth homeostasis. iScience. 2021;25(2):103678. pmid:35118352
  43. Taheri-Araghi S, Bradde S, Sauls JT, Hill NS, Levin PA, Paulsson J, et al. Cell-size control and homeostasis in bacteria. Curr Biol. 2015;25(3):385–91. pmid:25544609
  44. Tanouchi Y, Pai A, Park H, Huang S, Stamatov R, Buchler NE, et al. A noisy linear map underlies oscillations in cell size and gene expression in bacteria. Nature. 2015;523(7560):357–60. pmid:26040722
  45. Tiruvadi-Krishnan S, Männik J, Kar P, Lin J, Amir A, Männik J. Coupling between DNA replication, segregation, and the onset of constriction in Escherichia coli. Cell Rep. 2022;38(12):110539. pmid:35320717
  46. Yang D, Jennings AD, Borrego E, Retterer ST, Männik J. Analysis of factors limiting bacterial growth in PDMS mother machine devices. Front Microbiol. 2018;9:871. pmid:29765371
  47. Udrescu S-M, Tegmark M. AI Feynman: a physics-inspired method for symbolic regression. Sci Adv. 2020;6(16):eaay2631. pmid:32426452
  48. Antonelli G. Interconnected dynamic systems: an overview on distributed control. IEEE Control Syst Mag. 2013;33(1):76–88.
  49. Ge X, Yang F, Han Q-L. Distributed networked control systems: a brief overview. Inf Sci. 2017;380:117–31.