Elucidation of Genetic Interactions in the Yeast GATA-Factor Network Using Bayesian Model Selection

Understanding the structure and function of complex gene regulatory networks using classical genetic assays is an error-prone procedure that frequently generates ambiguous outcomes. Even some of the best-characterized gene networks contain interactions whose validity is not conclusively proven. Founded on dynamic experimental data, mechanistic mathematical models are able to offer detailed insights that would otherwise require prohibitively large numbers of genetic experiments. Here we attempt mechanistic modeling of the transcriptional network formed by the four GATA-factor proteins, a well-studied system of central importance for nitrogen-source regulation of transcription in the yeast Saccharomyces cerevisiae. To resolve ambiguities in the network organization, we encoded a set of five interactions hypothesized in the literature into a set of 32 mathematical models, and employed Bayesian model selection to identify the most plausible set of interactions based on dynamic gene expression data. The top-ranking model was validated on newly generated GFP reporter dynamic data and was subsequently used to gain a better understanding of how yeast cells organize their transcriptional response to dynamic changes of nitrogen sources. Our work constitutes a necessary and important step towards obtaining a holistic view of the yeast nitrogen regulation mechanisms; on the computational side, it provides a demonstration of how powerful Monte Carlo techniques can be creatively combined and used to address the great challenges of large-scale dynamical system inference.


.1 Established interactions
The following table provides a non-exhaustive list of references for each of the interactions characterized as established in the main text. Note that GATA-factor names have varied in the past: GAT1 has also been known as NIL1, DAL80 as UGA43 and GZF3 as DEH1 and NIL2.

Hypothesized interactions
An extensive literature search resulted in several interactions in the GATA network that remain unvalidated to date. They are briefly enumerated below and denoted with dashed lines on Figure 1 of the main text.
2. There have been a few reports about the existence of an autoactivation loop on GAT1. While initially discovered in 1997 based on the observation that GAT 1 is capable of self-activation in the absence of Gln3 in a ∆gzf 3 background using a LacZ assay [32] (main text), the interaction has disappeared [16] (main text) and reappeared [1] (main text), [10] (main text) in descriptions of the GATA network published in the following years. Interestingly, [16] (main text) reached the conclusion that Gat1 is not self-regulated by noting that deletion of GAT 1 actually increased pGat1-LacZ activity compared to wild type in a proline-grown ∆gat1 strain. It was later shown that Gat1 can bind to its own promoter [31] (main text), however the regulatory capacity and importance of this interaction is still unclear. Both observations, in favor or against the presence of the interaction, can be explained by indirect regulatory effects. 3. Same as Dal80 self-repression, the negative regulation of DAL80 by Gzf3 has been shown using assays that cannot differentiate between direct and indirect effects [15], [16] (main text). 4. The evidence on GZF3 regulation is the most contradictory. Several papers consider GZF3 expression to be almost constitutive [32] (main text), while others demonstrate its dependence on Gat1 and Gln3 [16] (main text). There is also reference to strong negative regulation of GZF3 by Dal80 [15], [16] (main text), but no further experiments have verified or disproved this interaction yet. While repression of DAL80 by Gzf3 is thought to only be effective in a repressive medium (where the Dal80 protein has very low abundance), GZF3 down-regulation by Dal80 must take place when Dal80 is active, i.e. in derepressive conditions. 5. Every NCR-sensitive promoter contains more than one GATA elements, which implies that the two GATA activators can be found simultaneously on the same promoter. Study of single and double deletions of Gln3 and Gat1 indicate the existence of cooperative behavior for some promoters [31] (main text). This has also been demonstrated in several cases where Gln3 binding efficiency on target (non-GATA) genes is greatly reduced in the absence of Gat1 [17]. The opposite phenomenon has also been observed, namely that Gln3 and Gat1 can compete for the same binding sites, either by mutual exclusion or by reducing the binding affinity of each other when bound to the same promoter [39]. Obtaining the same information for the GATA factor promoters is not an easy task, especially when using steady-state deletion data, hence no concrete evidence exists.

Choice of target genes
Given that all GATA factors recognize the same target sequences, and the fact that most NCR genes (including the GATA factors except Gln3) contain multiple GATA motifs in their promoters, it is reasonable to assume that all GATA TFs compete for the same binding sites [10], [16] (main text). However, not all target genes are regulated to the same extent by all GATA factors, since the presence or absence of auxiliary binding sites, and the number, spacing and flanking sequences of the GATA motifs play significant role in the control of each target promoter [36], [1], [27] (main text). The precise way in which all these elements affect GATA factor binding is still under investigation, and some targets have been studied in much more detail than others. A common characteristic of the target genes considered in this work is that they are known to be mainly controlled by the GATA factors during NCR.
• DAL1 (allantoinase) and DAL5 (allantoin permease): Both genes are verified GATA targets, with the latter often serving as a readout gene for GATA-factor activity (e.g. [17]). In addition, their expression is not induced by any intermediates of the allantoine pathway, as happens in the case of DAL4 and DAL7 [2] (main text).
• GLN1 (glutamine synthetase) and GLT1 (glutamate synthetase): Several pathways impinge on the expression of these genes. However, they are mainly regulated in response to changes in intracellular glutamine and glutamate levels respectively, that result from different available nitrogen sources. They also display a much simpler, mostly Gln3-dependent regulation pattern than other core nitrogen-metabolic enzymes, such as GDH1 and GDH2 [40], [2] (main text).
• MEP2 (ammonium permease): From the three known ammonium permeases, Mep1, Mep2, and Mep3, the second displays the highest sensitivity to the nitrogen source (as well as the highest affinity for ammonium) and appears to be controlled by both GATA activators [40].
• PUT4 (proline permease): The high-affinity permease has been included as a representative of the proline degradation pathway, which has long been established to be NCR-sensitive. In contrast to PUT1 and PUT2, PUT4 is not under the control of Put3, a transcription factor whose activity is induced by proline and is necessary for the full activation of PUT1 and PUT2 [2] (main text).
2 ODE modeling of the GATA network 2.1 Implementation of the steady-state assumption All initial conditions are calculated based on the assumption that the system is at steady-state at the beginning of each shift. This implies that a system of nonlinear algebraic equations has to be solved for the given parameter set. Given the structure of the mRNA equations, obtaining analytical solutions is a formidable task. However, steady-state protein abundances are relatively easy to calculate, given a set of mRNA values. Hence, we follow a simple fixed-point iteration scheme to approximate the steady-state initial conditions of the model: starting from some initial mRNA abundance values, the corresponding protein abundances are calculated analytically, and the outcome of the calculation is fed back into the mRNA equations, which give a new set of mRNA abundances. The process is then repeated a few times until convergence.

Relative TF contributions to fractional occupancy
The relative contribution of each GATA factor to the fractional occupancy of a target promoter can be defined with the help of the small example presented on Figure 8 of the main paper: the relative contribution As a more general example, consider the GAT1 promoter, whose fractional occupancy function is given in subsection 2.7 below. This promoter is regulated both by repressors and activators. The relative contribution of Gln3 is then defined as while the relative contribution of Dal80, is given by Given the above definition, it becomes obvious that the TFs with the greatest relative contribution to the fractional occupancy of a given target promoter are those that mainly control the expression of the corresponding target gene.

Transcription factor (full) model equations
Model states comprise the abundances of Gat1, Dal80 and Gzf3 mRNA, the nuclear and cytoplasmic parts of Gln3 and Gat1 (further divided into active and inactive cytoplasmic parts), and the Dal80 and Gzf3 monomers and homodimers. Table A describes the notation for model variables (both states and inputs).
Using this notation, the model equations derived from the set of chemical reactions given above are the following: mRNA dynamics: Protein dynamics:

Target model
If we denote by [m] the target gene mRNA abundance, we can write the following differential equation for its evolution:

/ 41
where we used the notation of the previous section for the transcription factor abundances. This model has four unmeasured inputs, generated from the upstream GATA model (and thus not independently controllable) and one measured output, which coincides with its state.
Of the eight new parameters introduced, we had to fix k m to ensure that the model is structurally identifiable from relative mRNA data, while leaving the rest of the parameters to be estimated from the experimental data. We can easily see why this step is necessary, without using any of the more advanced tools presented in the next Section: considering the TF profiles as input signals, we see that k m multiplies a function of the inputs that varies between 0 and 1. As the available mRNA data come in the form of relative abundances (i.e. normalized with respect to the first measurement), we can easily see that an infinite number of (k m , k dm , b m ) combinations is able to reproduce the same output y = m/m(0), unless one of the parameters is fixed. A detailed list of nominal parameter values is provided in Section 2.6.
Each additional target contributes 16 datapoints (8 for each shift), which increase the constraining capability of the augmented dataset. On the other hand, this approach entails a significant model expansion, as one has to consider one extra differential equation with seven free parameters for each target. To maintain the consistency of the model selection process, we kept the model of each target gene exactly the same across the different TF models, as well as its prior parameter distributions. In this way, the first model selection step can be seen as a restricted form of the second, since using just the TF mRNA data makes the downstream effects of the GATA factors "invisible".

GFP reporter model
Since the original GATA model did not include the GFP reporter dynamics, a small model of the reporter system was appended to the TF network model. The extension comprised three species: GFP mRNA, immature GFP and mature GFP, as described below: The regulation function of the GFP gene (f ) was assumed to be the same as that of the corresponding GATA promoter driving it, hence dependent on the predicted active form abundances of the GATA factor proteins. Model parameters were either set to arbitrary plausible values, or obtained directly from the available data. More specifically, both production rates (k rg , k pg ) were assumed to be equal to one (given that only relative predictions are of interest), while the mRNA half-life (log(2)/k drg ) and GFP maturation half-life (log(2)/k mat ) were set to, respectively, 20 minutes (median yeast mRNA half-life [41]) and 30 minutes (a plausible value deduced from the previously measured 39 minutes of in vivo EYFP maturation half-life in yeast [18]). Finally, the concentration of GFP was assumed to decrease only due to cell growth, and thus the degradation rates of both mature and immature GFP were set equal to the experimentally measured (time-varying) growth rate of each yeast strain (k dg (t)).
It should be noted that all alternative TF models were originally fitted to microarray data, and therefore relative abundances between species were not considered during model selection due to the unreliability of microarrays to quantify inter-species changes. For this reason, we did not expect the TF models to be able to reproduce the ratios of initial GFP measured across strains and conditions. In order to account for relative levels of initial GFP concentrations in the model, an intermediate fine-tuning step was performed in which all candidate TF models were fitted using a likelihood function that, besides the mRNA data, took into account the initial GFP ratios measured with the DAL80 promoter in the different strains and conditions. This modification allowed us to capture the relative abundance of each GATA 6 / 41 species when predicting the evolution of each GATA-factor levels in the different deletion backgrounds, in the two dynamic shifts.

Interpreting the GFP reporter measurements
A quantitative comparison of experimental validation data and model predictions would not be appropriate for several reasons: besides the fact that the promoter dynamics (e.g. (basal) production rates) of the endogenous GATA promoters and the plasmid-borne promoter fragments are not necessarily identical, the simple linear GFP reporter model is itself simplistic. Moreover, deletions of GATA factors, although not detrimental for cell growth, are bound to alter the behavior of the system. This is not taken into account in our tests, where all parameter values of the GATA network are assumed to be unchanged across all deletion strains. Exponent range in prior:

GATA model parameters
Input signal parameters Nominal value Gln+Rap (Exponent range in prior) Nominal value Pro→Gln (Exponent range in prior) A more detailed description is provided below. Note that, to ensure the structural identifiability of the full model 1 , all protein and mRNA production rates, as well as the deactivation and import rates for the two activators were fixed. Note that the fixing of the production rates establishes arbitrary unit scales for the abundance of different species types in the model. Further details can be found in [29].
mRNA degradation rates: The nominal estimates were based on data reported in [30] and correspond to half-lives of ∼ 2 − 6 min (t 1/2 = ln(2)/k, were k is the degradation rate). Transcription factor mRNA degradation rates are known to be among the highest in yeast, however the dataset of [30] contains mRNAs with even faster degradation (0.2 min). Since the distribution of of yeast mRNA half-lives has a median of 11 min with the 3 rd quartile at 33 min [30], we allowed an order of magnitude variation above and below the nominal values to allow for the possibility of quantification errors.
Protein degradation rates: Nominal estimates were based on the dataset of [3], corresponding to half-lives of ∼ 20 min. Since protein degradation measurement assays such as the one of [3] are typically fraught with artifacts [43], we allow for two orders of magnitude variation in the prior (the dataset of [3] contains no proteins with half-lives smaller than 2 min, while it is known that transcription factors are typically actively degraded, so a 200 min half-life is a reasonable upper bound).
Translocation rates: Very little is known about precise values for these parameters for the proteins studied here and we are not aware of large-scale studies on protein localization dynamics in yeast. From various targeted studies of other yeast transcription factors such as Msn2 and Hog1 [21,34,35], it is known that the processes of import and export can be completed in just a few minutes upon a shift in growth conditions. We therefore selected a high nominal value for the export rate, with relatively large uncertainty (±2 order of magnitude).
Dimerization/dissociation rates: Again, no information is available for the GATA repressors. In general, it is known that protein dimerization is a fast process that evolves in the timescale of seconds to minutes [27], which suggests a high nominal dissociation rate. Moreover, the available data supports the fact that the Dal80-Dal80 and Gzf3-Gzf3 interactions are quite strong [40] (main text). Since these are also the functional forms of the GATA repressors [40] (main text), we assumed an even higher nominal dimerization rate.
Basal transcription rates: It is not known what fraction of the protein expression in the GATA network is due to constitutive promoter activity. Overall, results from mutant strains, suggest that basal expression is quite low.

Regulation function parameters/interaction constants:
The DNA binding properties of the GATA factors have not been studied quantitatively, while various studies have shown that the binding strength of the TFs depends on the target promoter. The regulation function parameters are intimately connected with the TF protein abundances: when a TF is much more abundant than the inverse of its corresponding parameter, the promoter to which it binds operates close to saturation. Given that the maximum nominally attainable TF abundance in our model is in the order of a few thousands (as determined by the transcription, translation and degradation rates for mRNAs and proteins), we chose the nominal regulation function parameters and their range of variation so that the GATA promoters can operate both under saturated and unsaturated conditions. The interaction constant values were chosen solely based on intuition, due to lack of any relevant data.
Input signal parameters: The functions describing the upstream activation/deactivation rates contain the same basic features: an initial (i.e. prior to perturbation) value, V 0 , a final value (determined by V ), a steepness factor (n) and the timepoint of 50% saturation (θ). The initial and final values for these rates were selected to yield a ∼ 100-fold change in activity after each perturbation, and also lie on both sides of the (fixed) inactivation rate. The timing parameter, θ, was selected based on biological evidence on the speed of upstream signaling processes: it is known that activator localization is already visible after only a few (5-10) minutes after rapamycin treatment or a change in nitrogen source [12], [34] (main text) so the upstream signaling processes most likely evolve at a comparable or faster rate. The lower exponential bound of θ reflects the fact that signaling processes cannot evolve faster than a few tens of seconds, while the upper bound corresponds to a very slow and gradual response.

Target model parameters
The same arguments presented for the choice of the TF model priors were are also used for picking the target model priors. It should only be noted that the mRNA degradation rate (corresponding to a half-life of about 15 min) was set based on the median half-life values reported in the literature for yeast [30,41], which range between 10 and 20 min.

Parametrization of models corresponding to different hypotheses
Models corresponding to different combinations of hypotheses can be obtained from the full model by fixing certain subsets of parameters. The parameters corresponding to each single hypothesis are listed on Table F.

SBML implementation
As stated in the main text, all models were implemented using SBTOOLBOX2 [52] (main text). In order to enable the simulation of the model in other biochemical network simulators, we also converted it into the SBML format. One issue that proved difficult to address was the time varying Gln3 and Gat1 activation rates (and the Gln3 mRNA data interpolation), which are not supported by the SBML format; as a consequence, the SBML representation of our model is not simulation-ready. To simulate the SBML version of our model, the user will first have to import the SBML file into simulation software that allows the definition of time-varying reaction rates. Then, the activation rates of Gln3 and Gat1 (parameters k 1w and k 1x ) will need to be assigned the functional form given in the Methods section of the main text, depending on the shift that is simulated. Moreover, the Gln3 data (given below and shown on Fig. A) will need to be interpolated over time, to be used on the right-hand side of the GLN3 protein production ODE. To the best of our knowledge today, this assumption is valid [11], [42] (main text). Of course, all GATA factors have many confirmed and potential regulators according to the Yeastract database (http://www.yeastract.org). However, none of these regulators is considered to change significantly during the nitrogen source, or rapamycin shifts. This hypothesis is made implicitly in all studies of the GATA network, but remains to be proven conclusively in the future.
2. TF active forms: Dal80 and Gzf3 are functional as homodimers [40] (main text). On the other hand, both activators are thought to act as monomers. Regarding subcellular localization, Gln3 and Gat1 can be both nuclear (active) and cytoplasmic (both active and inactive), while Dal80 and Gzf3 are only nuclear [1] (main text). It is known that phosphorylation controls the localization of the activators, but no post-translational regulation of repressor activity, other than dimerization, is known.
3. Activator localization mechanism: TOR inhibition upon rapamycin treatment results in a nitrogen starvation phenotype, as it induces fast TF activation and localization in the nucleus. Conversely, the nutrient upshift results in restoration of TOR activity and a quick de-activation and nuclear export of the two activators. These (in)activation signals are assumed to drive the changes in Gln3 and Gat1 localization in response to nitrogen or rapamycin shifts, and therefore act as the primary driving input signals to the system. Capturing them accurately would require a full, validated mathematical model of the TOR network, as well as precise knowledge of how the nitrogen source affects TOR activity. However, the existing state-of-the-art TOR models [37] (main text) still contain unverified hypotheses. Our model was chosen to be simple, but still provide a realistic (in)activation and translocation mechanism, in line with evidence accumulated for similar systems in yeast [22], [18] (main text). According to this mechanism the TFs first get activated (e.g. by phosphorylation, release from an anchor protein or both) and then shuttled across the nuclear membrane, with both processes being reversible and governed by mass-action kinetics. Once quantitative dynamic translocation data become available, this model can be further tested and refined.
4. Gln3 input: We consider Gln3 mRNA (Fig. A) as a secondary input to our model, and further assume that mRNA changes are reflected in the abundance of Gln3 protein, which consequently has to change over time. This assumption is based on the observation that Gln3 mRNA displays moderate fluctuations during both shifts considered. These small but significant Gln3 mRNA changes in all timecourse experiments imply that GLN3 is weakly controlled by proteins other than the GATA factors (Gcn4 could be implicated in this regulation [42] (main text)).
5. Steady-state conditions: The system is assumed to be at steady-state prior to each shift, an assumption that is approximately correct given that the yeast cells were growing exponentially before the perturbations.
6. No Gat1-Gzf3 complexes: A recent study presented evidence for negative regulation of Gat1 by Gzf3 through the formation of a Gat1-Gzf3 heterodimer [31] (main text). Since it is still very unclear how this interaction occurs mechanistically, what is its strength and its physiological significance, we decided not to include it in the model. Previous work has presented hints of interactions between the zinc-finger regions of Gln3 and Dal80, as well as Gln3 and Gzf3, both much weaker than the interactions in the Dal80 and Gzf3 homodimers respectively, and has cautiously avoided giving biological significance to the finding [40] (main text). It is thus very likely that the Gat1-Gzf3 complexes are also a result of a "residual" interaction that arises due to the large degree of homology shared by all GATA factors.
7. No Dal80-Gzf3 dimers/active Gzf3 monomers: We elected to ignore the Dal80-Gzf3 heterodimer from our model, for two reasons: no DNA binding site has yet been identified for Dal80-Gzf3, while the strength of this interaction is most likely much weaker than the Dal80-Dal80 and Gzf3-Gzf3 interactions [40] (main text). We further assumed that Gzf3 monomers play no role in gene regulation: the experimental data in favor of the opposite is very weak [32] (main text), while it was recently shown that deletion of the Gzf3 dimerization domain abolishes its repressing ability [31] (main text), a very strong indication that repression is exerted by the homodimer.
8. Non-competitive TF binding: Transcription factor binding is assumed to be non-competitive, i.e. several TFs are assumed to be able to bind to the same promoter simultaneously. This assumption is reflected in the form of the fractional occupancy functions for each GATA factors promoter. The plethora of GATA binding sites both on GATA-factor and target promoters indicates that this simplification is realistic. Moreover, testing all possible binding configurations for each promoter would be infeasible computationally.
9. Model changes between shifts: The only parts of the model that are assumed to vary from one shift to the other are the parametrization of the input signal and the Gln3 mRNA input. All other parameters are assumed not to vary between shifts, as no concrete evidence is currently available regarding which parameters should vary and to what extent.
Among our assumptions, #5 is reasonable and necessary for preventing nonsensical model predictions, and #9 is made to prevent unnecessary overfitting given the available amount of data and the lack of biological evidence for the alternative. Assumptions #6, 7 and 2 are based on currently available evidence that allows us to simplify our model, while the lack of specific mechanistic details (such as the unknown binding site of Dal80-Gzf3 or the fact that Gzf3 monomers can still weakly bind to DNA but cannot repress expression) make their mathematical encoding very difficult. Similarly, assumption #8 is the most plausible choice based on the lack of concrete biological evidence and computational limitations. Finally, assumptions # 1, 3, and 4 are required for defining the model "boundaries" and its interactions with the rest of the cell.
also called the evidence for M k . The second is the Bayes factor [25] B ij of M i to M j , given by which can be interpreted as the "weight of evidence" provided by the data in favor of model i as opposed to model j. For our purposes, we will use the fact that a Bayes factor greater than 10 is commonly interpreted in practice as strong evidence in favor of model i. A scale that matched Bayes factor values to evidence support categories can be found in [25]. The Bayes factor is used to update our initial beliefs in models i and j and obtain the posteriors ratio 13 / 41 The model posterior probability P (M k |D) is the fundamental quantity in Bayesian model selection.
Having the Bayes factors, we can compute it as P (M k |D) can be interpreted as a measure of the "plausibility" of model k after all information provided by D has been considered. Thus, model selection is accomplished by choosing the most probable model (or models, in case the data is not discriminative enough) [8]. This approach to model selection naturally strikes a balance between data misfit and model complexity. The key to this balance comes from (2): generally speaking, a simple model M s has a limited predictive ability, which means that its evidence P (D|M s ) is concentrated in a small region of the space of all possible datasets D. On the contrary, a more complex model M c is able to generate a larger variety of datasets, which however implies that its evidence is spread over a larger region of the data space. If the data we observe fall within the high-density region of P (D|M s ), the simpler model will end up as the most probable model assuming that both models have equal priors (Ch.28 of [48] (main text)).

Discriminating among multiple hypotheses using a single multimodal prior
Assume that that there are K hypotheses to be tested, and that the absence or presence of hypothesized interaction k is equivalent to a certain model parameter, say θ k , being zero or nonzero. In this case, one can consider the "full" model (i.e. the model with all interactions present) and assign to each of those K parameters a "spike and slab" prior for, as in [31]: This prior is a mixture of a point mass at zero and an absolutely continuous distribution f k , whose support does not include zero. The weight w k indicates our prior belief that θ k is zero (eqv. the corresponding hypothesis is not present). If the w k 's are fixed for all θ k 's (e.g. to 0.5) and the priors for different parameters are independent, then the total posterior P ({θ k } (k=1,,K) |D) will be a mixture distribution with 2 K components, {p m } (m=1,,2 K ) , each corresponding to one of the possible products of point mass and f k priors. With respect to the model selection problem, the precise form of these posterior components does not matter. What matters instead, is their relative contributions to the overall posterior, because these indicate which combination of parameters is more likely to be non-zero. In turn, for each posterior mixture component, p m , this contribution is given by the integral of the likelihood with respect to π m , and these are precisely the numbers we report in this work.
A further generalization of the above problem would be a fully Bayesian approach, where the prior weights w k are not assumed constant, but are assigned a hyperprior distribution (as described, for example, in [16]), according to which each weight can only be 0 or 1 with certain probability. In this case, the goal is to infer the posterior weight distribution given the data; weights with a high posterior probability of being 1 correspond to parameters that should be set to zero. Here as well, it is easy to see that the computation of the weight posterior necessarily involves the computation of the marginal likelihoods corresponding to each combination of "spikes" and "slabs". Under the uninformative, uniform prior that assigns probability 2 −K to each weight combination, the maximum of the weight posterior coincides with the top-ranking model, as identified by our approach (different, more informative weight hyperpriors would of course result in different model ranking).

Algorithm
As stated in the main text, simple approaches for approximating the evidence, such as plain Monte Carlo integration, Importance Sampling-based estimators or even Markov Chain Monte Carlo (MCMC) samplers quickly become inefficient as the number of parameters grows: given the fact that the fraction of measured to the total number of states in large models is usually very small, likelihood distributions may become highly correlated and/or multimodal, due to model nonlinearities and the inevitable practical identifiability problems that follow. Moreover, the ratio of the feasible parameter volume (i.e. the volume of the parametric sets leading to good model fits) to the total available volume determined by the (usually very diffuse) priors typically decreases substantially with increasing model complexity.
The recently introduced Sequential Monte Carlo (SMC) (or Population Monte Carlo) [7,23,24], [23] (main text) methods are able to overcome the difficulties of sampling from high-dimensional, correlated and multimodal distributions. These methods are intimately connected with sequential importance (re)sampling and less related to Markov Chain Monte Carlo (although MCMC kernels are often among their ingredients), and are able to handle complex distributions by building a sequence of increasingly better proposals. Each distribution is represented by a large collection of random samples ("particles") that get propagated through the intermediate steps using ideas from importance sampling. Contrary to MCMC methods, the population at each step can depend in an arbitrary way on past samples, while the approximation to the target at each iteration remains unbiased and does not require throwing away burn-in samples [37]. On the other hand, bias can be accidentally introduced in these methods if the population diversity (quantified for example by its effective sample size) gets reduced significantly, and it is necessary to use large population sizes (at an increased computational cost) to properly sample high-dimensional distributions. Fortunately, recent results indicate that SMC algorithms are stable with an O(N d 2 ) computational cost, where N is the number of particles and d the dimension of the target distribution, when O(d) intermediate distributions are considered [4]. For comparison, the computational cost to maintain stability of importance sampling increases exponentially with d.
The SMC sampler implemented in this work is able to circumvent the problem of sampling from complex, high-dimensional posteriors by defining a sequence of bridging distributions f β according to a "cooling schedule": f βi (θ) ∝ P (D|M, θ) βi P (θ|M), for 0 = β 0 < β 1 < · · · < β N = 1. The basic idea of the algorithm is to propagate a population of particles sampled from the prior through this sequence of intermediate distributions that gradually morph into the target posterior. At each step, distribution f βn (θ) serves as an importance distribution for f βn+1 (θ). Starting from the sampled points of f βn (θ), the new samples from f βn+1 (θ) are drawn using a Markov chain transition kernel K n (·|θ j (n−1) ) that leaves f βn (θ) invariant. The great flexibility of the method stems from the fact that this is the only requirement imposed on the kernels, which can be selected to be arbitrarily complex. One of the simplest choices is to let K n (·|θ j (n−1) ) consist of a few (10-20) Metropolis-Hastings updates of θ j (n−1) with f βn (θ) as the target density. It is not necessary that the K n 's mix fast, although this is desirable for reducing the weight variance [33]. In any case, the importance weights may still end up with large variance after some number of annealing steps, resulting in a small effective sample size (ESS). This problem can be avoided by monitoring the ESS at each step n and resampling the particles from the current approximation of f βn if ESS < 0.5M [23] (main text).
The pseudocode for the SMC sampler is given in Algorithm 1. This is a modified version of Algorithm 3.1.1 given in [23] (main text), to take into account the specific form of Markov kernels used here, which are a special case of the generic transition kernels assumed in [23] (main text). Below we proceed to describe in more details our choices for the individual algorithm components.

Prior selection and parametrization
Only a few parameters in our model have been measured and reported in the literature, often for different growth conditions and yeast strains. One has thus to cope with the problem of large parameter uncertainty, which necessitates the use of wide prior distributions spanning several orders of magnitude. To model this uncertainty we use the log-uniform prior, which assigns equal likelihood to all orders of magnitude in a given interval and is scale-invariant. The same prior distributions were used for all common parameters across the different model structures.
To arrive at the log-uniform prior for each model parameter θ ′ i , we first determined a nominal value θ ′ i,nom . We subsequently defined the log-ratios θ i = log 10 . The uniform priors for the θ i 's, centered at θ i = 0 and supported on [−a i , a i ], correspond then to log-uniform distributions in the original (linear) parameter space. In paragraph 2.6.1.1 we provide further information on the specifications of these priors, namely the nominal values θ i,nom and exponential ranges [−a i , a i ]. From this point on, to simplify the presentation the θ i 's will be called the system parameters.

Noise model
With only three biological replicates for three timepoints in each experimental timecourse (as described in the main text), the reliable estimation of a noise model for the different biological replicates is not feasible. On the other hand, there is also a noise component stemming from the various sources of errors in the microarray processing, which is quite hard to model accurately. However, a general consensus seems to be that this noise is of a broadly multiplicative nature, i.e. its variance grows as mRNA abundance increases. In an attempt to combine both noise sources in a simple and tractable noise model, we thus made the assumption of independent, identically distributed additive Gaussian noise with time-varying variance, which makes each measurement normally distributed with a CV derived from the average CV of the fold changes in all mRNAs measured with the microarrays in each shift (3 timepoints x 3 replicates per gene). More concretely, we assume each measurement where T = {3, 7, 10, 14, 24, 56, 120} is the vector of measurement times, to be of the form where x is the actual mRNA fold change and w ∼ (0, y(t) 2 σ 2 ). This implies that y(t) ∼ N (x(t, θ), y(t) 2 σ 2 ), and CV y ≃ σ (if we assume x lies close to y). We used σ 2 1 = 0.025 for the Pro→Gln dataset and σ 2 2 = 0.01 for the Gln+Rap data.
Based on this noise model, the appropriate likelihood function P (D|θ) was defined. Its negative logarithm has the form of weighted squares sum: where c is the logarithm of the normalizing constant.

Algorithm 1
Require: A sequence of distributions {π n } N n=1 defined according to (7) Ensure: A weighted sample approximation {W

Cooling schedule
As described in Section 3.2, we start by defining a sequence of intermediate distributions that connect the diffuse (and known) parameter prior, P (θ|M), to the concentrated posterior P (θ|D, M), for which only know that is proportional to P (D|M, θ) · P (θ|M). We thus define the intermediate distributions as where 0 ≤ β 1 ≤ β 2 ≤ · · · ≤ β N = 1 is a predetermined cooling schedule, i.e. an increasing set of parameters that determines how fast we move from the initial to the final distribution, and can be thought of as "inverse temperature". The term cooling is used to reflect the fact that the sequence of distributions evolves from the most "disordered" one (the prior) to the most "ordered" (the posterior). While theoretically an optimal 2 temperature schedule can be derived for each problem instance, its practical evaluation is not possible except for very simple cases. The results obtained in [5] suggest however that geometric temperature schedule should be close to the optimal. In the same study it was also noted that as the uncertainty encoded by the priors increases, the more the intermediate temperatures have to be clustered towards 0, since the low temperature region is where the rate of change of the intermediate densities is greatest. This prompted us to use a temperature schedule of the form where n ranges from 0 to n and a large number of intermediate distributions is considered (N = 70). We arrived at this specific form by repeatedly testing different exponents and step numbers, with the objective of reducing the frequency of resamplings, particularly at high temperatures (low β values), where the parameter space has to be sufficiently explored.

Transition kernels
The SMC algorithm requires the use of a sequence of transition kernels {K n } N n=1 , that are used to propose new particles at each cooling step n by perturbing the particles {X (i) n−1 } obtained at step n − 1. We set each K n as an MCMC kernel with invariant distribution π n ), defined by 15 iterations of an independent Metropolis-Hastings (IMH) sampler with proposal distribution q n obtained by fitting a Gaussian mixture distribution to the particle population {X (i) n−1 } from step n − 1. The use of the more straightforward Gaussian random walk proposals in place of the IMH, while giving satisfactory results in smaller parameter dimensions, was not adequate for sampling in more than 15-20 dimensions. Its main drawback derives from the great differences between the extremes of the distribution sequence: for small β;s, large moves need to be proposed, in order to fully explore the parameter space and locate regions of potential interest. As the β increases, however, and the particles concentrate in high-density regions of the posterior, the proposed moves must be much smaller to ensure non-zero acceptance probabilities, otherwise the particles will be left essentially unperturbed and the SMC sampler will degenerate to the extremely inefficient Importance Sampler. One could try to alleviate this problem by tuning the covariance matrix of the proposal density in a way that depends on the cooling schedule, or use a mixture of kernels with a large range of bandwidths, yet this is neither an easily implementable or computationally efficient approach. Finally, use of Adaptive MCMC schemes [1,2,20] is not advisable in this case, since they cannot be used for such a small number of iterations (they converge to the target distribution after an adaptation phase).
On the other hand, Gaussian mixtures (GM's) are a versatile class of semi-parametric models that are very commonly used for density estimation [28]. Although the optimal number of components for a given distribution can be obtained using model selection methods, we chose to keep it constant in this work, after extensive offline experimentation with distributions generated from the AIS-SMC sampler, which showed that the optimal number of components very rarely exceeded six or seven. We used the Expectation-Maximization (EM) algorithm [32] to obtain the maximum-likelihood estimates for the mixture parameters, and considered full covariance matrices for each component.
The use of GM's as importance densities has recently been introduced in the framework Adaptive and Multiple Adaptive Importance Sampling with very promising results [6,11,42]. It has also been shown [11] that the maximum-likelihood parameter estimates for these mixtures are also the ones that minimize the Kullback-Leibler divergence from the target distribution (in our case π n ) to its mixture approximation, provided one considers the particle weights in the EM algorithm. In our GM models the particle population is considered unweighted to simplify a bit the computations, but our fits do not deviate much from the optimal case, since the variance of the particle weights is not allowed to increase much 3 . This is accomplished by monitoring the Effective Sample Size (ESS) of the population, which is not allowed to decrease below M/2, where M is the number of particles.

Resampling strategy
As mentioned above, the quality of our particle population at each cooling step is estimated using the Effective Sample Size (ESS) criterion, that takes values between 1 and M (where M is the population size), and represents the number of particles that effectively contribute to the current Monte Carlo approximation of the density at that step. The ESS is calculated from the particle weights at each step [23] (main text) and when it drops below M/2, the particles are resampled according to the residual resampling scheme [26], that offers reduced variance in comparison to the standard multinomial sampling method. The resampled particles are then assigned equal weights before getting propagated to the next target distribution.

Population size
The evidence estimates provided by the SMC algorithm have been shown to converge almost surely to the true value as the number of particles tends to infinity [23] (main text). In the same work, an analytical expression of the variance of the SMC sampler is provided, which however is impossible to evaluate practically. We therefore had to resort to numerical experiments to determine the necessary number of particles that achieves an acceptable 4 variance of the estimates. Therefore, the same practice was used in all of our tests: starting from a small population size, the SMC sampler was run for increasing numbers of particles, until all estimates obtained for any evidence were constrained sufficiently tightly. For each run of the first model selection round, a population of 15000 particles was used; runs of the second round used 10 5 particles, due to the increased dimensionality of the problem.

Sampling parameters for modular systems
The evidence decomposition described in Materials and Methods can be generalized when multiple subsystems are jointly affected by the first one but do not interact with each other. Consider for example the following form of f : 3 We have also verified experimentally that this minor change does not have any impact at all on the obtained parameter posteriors and evidence estimates 4 In the sense of allowing us to draw safe conclusions about model ranking

/ 41
where x i and θ i have to be interpreted as in the simpler case above. If we assume that there are available measurements of both x 2 and x 3 states (denoted by D 2 and D 3 respectively), we can write This setup is summarized in Figure B.
In this case we exploit the fact that the marginal posteriors of θ 2 and θ 3 are conditionally independent given θ 1 . More analytically, Arriving at the same form of integral as previously, we observe that the integrand factorizes into a product of two terms, whose common element is θ 1 . This enables us to use a blocking approach for its numerical evaluation, in which we first sample the θ 1 vector from its posterior and, given this, draw independent θ 2 and θ 3 samples to calculate the product term.

Parameter sampling for decomposed systems using SMC
To demonstrate how the reformulation of the evidence translates into the SMC implementation, we will take up the example of an upstream subsystem (denoted by M 1 ) and two downstream modules (denoted by M 2 and M 3 ) with the same parameter and dataset definitions as above (note that model notation bears no relation to GATA model naming). The module interconnections are shown on Figure B. Figure B. A system consisting of an upstream and two downstream subsystems. In the case of the GATA network, the upper module comprises the four GATA factors and their interactions, while the lower ones correspond to GATA-factor target genes. This structure of interactions is exploited in the next Section for facilitating parameter sampling in the high-dimensional parameter space of the composite system. Note that our approach can be easily generalized to the case of multiple downstream modules (i.e. target genes).

/ 41
The parameter "prior" in this case consists of the product of the upstream module posterior and the downstream module priors, and can be approximated arbitrarily well if we replace the posterior distribution by a good GM approximation. From this approximate prior we can then draw the initial samples required by the SMC algorithm. These samples are subsequently perturbed and propagated through the sequence of intermediate distributions, following the basic scheme of Algorithm 1, with a crucial change: the conditional independence of parameter blocks allows us to perform randomized block updates of the total parameter vector, θ = [θ 1 θ 2 θ 3 ], during the Metropolis Hastings iterations.
This block updating can be implemented efficiently by constructing three GM approximations (additional to the initial θ 1 posterior approximation) at each annealing step n: the first approximates π n−1 (θ 1 ) = P (D 2 |θ 1 ) βn−1 P (θ 1 |D 1 ) (the annealed target distribution of the previous step, marginalized over θ 2 and θ 3 ), while the second and third represent π n−1 (θ 2 ) = P (D 2 |θ 2 ) βn−1 π(θ 2 ) and π n−1 (θ 3 ) = P (D 3 |θ 3 ) βn−1 π(θ 3 ) (the corresponding annealed marginal target distributions at step n − 1). All GM distributions were fitted to the particle approximation {X (i) n−1 } of the annealed target distribution at step n − 1. Notice that it would have been more precise to employ GM approximations to π n−1 (θ 1 , θ 2 ) and π n−1 (θ 1 , θ 3 ) (the conditional distributions of downstream parameters given θ 1 ), however the cost of calculating the conditional variances at each proposal step made the computational cost of the method forbiddingly large. We thus resorted to a cruder -but at least two orders of magnitude fasterapproximation of the conditional distributions by the marginals. This approximation proved quite good in practice, since the correlations between the upstream and downstream parameters were not observed to be too strong. Even if this was not the case, we believe this approach would be much more efficient than true conditional approximation by a Gaussian Mixture.

Modular sampling for the GATA system
In the case of the GATA network, the module decomposition we consider is displayed on Fig. C below, with the "upstream" system being the transcription factor network, and the "downstream" systems being the six GATA targets.
To write the evidence decomposition for this particular example, we first define the following notation (cf. with Subsection 3.1): • D T F : TF mRNA dataset for both shifts considered together : target mRNA datasets for both shifts • θ T F k : parameter vector for the k-th TF model structure • {θ target,i } 6 i=1 : parameter vectors for the target models • M k : the k-th TF model structure, augmented with the six target models With these definitions, we can write the complete data evidence of model M k : This structure of interactions is exploited in the next Section for facilitating parameter sampling in the high-dimensional parameter space of the composite system.
following the approach (and notation) of the corresponding Materials and Methods section. We thus see that the evidence of M k is equal to the evidence of M k (already available from the first model selection round), multiplied by a factor F k that can be estimated with a subsequent SMC run, as shown in the previous Subsection. We should remark that the above evidence decomposition for model M k does not involve any approximation, only a sequential computation that requires two SMC runs, with the result of the first run feeding into the second.
5 Parameter inference and model selection with SMC 5.1 Example: parameter inference for the full model Using our SMC sampler we obtained parameter posteriors for each of the 32 models considered. Below we present in more detail the results for the full model (M 0 ), but we should remark that a detailed comparison of posteriors of different models indicates that several parameter marginals remain unchanged, while others display significant changes. This phenomenon can provide useful insight into how different model structures "adapt" to the available data, and how the different parts of the model contribute to its overall behavior, however it will not be further examined here.
Proper visualization of high-dimensional distributions is quite hard, so we have to base our presentation on several mutually complementary views of the results. Figure D displays the marginal posteriors for the free parameters of M 0 . (Note: whenever we refer to the "system parameters", the logarithmic fold-change parametrization will be implied -cf. with Section 3.3). We observe that some of the distributions are very wide, while other model parameters have very tight posteriors, indicating high likelihood sensitivity to these coordinates.
To get a better idea about parameter correlations in the posterior (which could be responsible for several of the wide marginals as well), we calculated the correlation matrix for the posterior population, which is displayed on Figure E. We can observe several strong correlations among parameters -a direct consequence of the limited amount of available experimental data, which are insufficient to properly 22 / 41 constrain all parameters independently of each other. Several wide posteriors can be thus explained as the result of strong pairwise correlations, such as the distributions of k xx , k xwc , k xz etc. We can also observe the behavior of a few truly insensitive parameters, such as k Y Y and k ZZ , which lack any significant correlation with others and yet display very wide posteriors (this result confirms our findings from an independent global sensitivity analysis of these parameters [38]). Shifting our view from individual parameters to parameter sets, we can get a good idea about "stiff" and "sloppy" parameter directions (following the terminology introduced in [19]) by examining the eigenvalues and eigenvectors of the population covariance matrix (Principal Component Analysis). The number of large eigenvalues of the covariance matrix, shown on Figure F indicates that several sloppy parameter combinations probably exist. On the other hand, the combinations of parameters to which the model appears to be the most sensitive (i.e. the "stiff" directions) are relatively few, as the left side of the eigenvalue distribution indicates. Figure G displays the eigenvectors that correspond to the 10 smallest eigenvalues of the covariance matrix, where we can observe the parameters with the greatest contributions to each eigenvector. We can see that the largest components of each eigenvector typically correspond to a few parameters, which exert the greatest control on the quality of the model fit to the data.
Overall, the above analysis confirms that the GATA model demonstrates the common features of large biological models, namely the existence of many parameters that cannot be precisely estimated and display an intricate correlation structure [15]. It also confirms the inadequacy of using point parameter estimates to characterize the behavior of such models and make meaningful predictions for future experiments.
As a final check of model consistency, Table G reports the absolute state values (i.e. not fold changes) for Gat1, Dal80 and Gzf3 mRNA, at the beginning and end of each shift, as predicted from the maximum-  Table G. Initial and final values of GATA-factor mRNA abundances the biological system, despite the fact that it was fitted with fold-change instead of absolute data. More specifically, Dal80 mRNA levels are very low in glutamine, in comparison to the other two transcription factors, which agrees with the fact that Dal80 mRNA is at the limit of detectability in nitrogen-rich conditions, while the Dal80 protein still remains unquantified for the same reason. The situation changes completely in proline, where Dal80 levels exceed Gzf3 levels. The similarity between proline-grown and rapamycin-treated phenotypes is manifested also at the GATA factor mRNA levels, as the third and fourth columns show, while the last column indicates that mRNA levels two hours after the shift to glutamine agree very well with the steady-state levels in glutamine, displayed on the second column.

Verification of SMC convergence and variability of estimates
In general, verification of SMC convergence is an extremely difficult problem when the correct "answers"   We verified by several runs of the SMC that the evidence of the candidate models is not particularly sensitive to small shifts or scaling of the support intervals of the priors (an example is displayed below). Of course, model ranking may change to a small extent for the closely ranked models of the first model selection round, but this effect is counteracted by our conservative approach at model elimination. Despite these precautions, however, it is still not possible to predict how the evidence of each model would be affected if 20 or 30 parameter priors were varied simultaneously and non-infinitesimally. Unfortunately, to the best of our knowledge, this is an open question.

Random perturbations to the prior:
In this test, we computed the evidence of model M 123 (top-ranking in the first model selection round) under small random perturbations of its prior support, by shifting individually the support of every prior in the model by a uniformly distributed random variable in [−0.2, 0.2] (note that the perturbations are not too small, considering that they are done in the log scale). The results are shown on Table H  . N ef f decreases at every temperature step, but is restored by the resampling procedure, which is described in Section 3.4. Lower panel: logarithm of the temperature schedule used for all models and described in Section 3.4. Figure N shows the model posterior probabilities obtained from three individual runs of the SMC algorithm using the GATA TF mRNA data. For each model, the results of each run are denoted with different color and demonstrate the very low variance of the estimates, especially for the top 20 models. Evidence estimates for the low-ranking models are more noisy. This is expected: for such models, only a very small volume of the parameter space can generate predictions that are even roughly compatible with the experimental data. Consequently, locating these regions becomes increasingly hard for the SMC sampler, resulting in higher variance in the evidence estimates. Model ranking is not affected, however, since the low-ranking models are anyway several orders of magnitude less likely than the top-ranking ones.  Table I summarizes the estimates of the factor F k ({D target,i } 6 i=1 ), for each of the top-ranking models considered in the second model selection round. The estimates from at least two independent SMC runs are shown.

Broadening selected prior ranges
The histograms of marginal posterior above show that some parameter posteriors end up close to the boundary of the prior distribution. To ensure that this behavior does not affect adversely the results of parameter inference and model selection, we broadened the prior ranges of selected parameters, as shown on Table J Table J. Although individual evidence estimates (not shown) increase by 1-2 orders of magnitude, the posterior model probabilities which depend on the relative evidence values (cf. with (5)) remain relatively unaffected. Despite a few exchanges in ranking positions, the overall ranking of Fig. N is preserved, indicating that no model "benefits" excessively by the increased prior ranges.  Table J. Parameter labels whose prior range was changed are in boxes. The rest of the plot details are the same with Fig. H.