Computational Modelling of Genome-Side Transcription Assembly Networks Using a Fluidics Analogy

Understanding how a myriad of transcription regulators work to modulate mRNA output at thousands of genes remains a fundamental challenge in molecular biology. Here we develop a computational tool to aid in assessing the plausibility of gene regulatory models derived from genome-wide expression profiling of cells mutant for transcription regulators. mRNA output is modelled as fluid flow in a pipe lattice, with assembly of the transcription machinery represented by the effect of valves. Transcriptional regulators are represented as external pressure heads that determine flow rate. Modelling mutations in regulatory proteins is achieved by adjusting valves' on/off settings. The topology of the lattice is designed by the experimentalist to resemble the expected interconnection between the modelled agents and their influence on mRNA expression. Users can compare multiple lattice configurations so as to find the one that minimizes the error with experimental data. This computational model provides a means to test the plausibility of transcription regulation models derived from large genomic data sets.


Introduction
The contribution of transcriptional regulatory proteins to the expression of every gene in a genome depends upon the DNA regulatory sequences present at each gene and the physiological environment of the cell. One of the challenges in genomics, systems biology, and the study of how genes are regulated is the integration of the myriad of regulatory interactions into a meaningful network [1]. Current intuitive approaches can handle a small number of parameters, but become unwieldy as the complex interrelationships of gene regulation are expanded. Moreover, with the advent of microarray technologies that allow the RNA output of thousands of genes to be monitored in a single experiment, it becomes increasingly more difficult to interpret and integrate thousands of output values and their changes, when the system is perturbed in multiple distinct ways.
Biological networks have been thought of in at least three categories. Genetic networks describe an unfolding cascade of gene expression events in which one or more genes influence the expression of other genes that go on to influence the expression of more genes [2,3]. Protein networks describe the set of all measurable protein-protein interactions within a cell. Biochemical networks describe the flow of metabolites from one enzyme to another [4,5,6]. Our goal here is to integrate parts of these networks with respect to the biochemical assembly of the transcription machinery at eukaryotic promoters, resulting in the conversion of nucleotide substrates into RNA.
Genes represent the source code for the components of a cell. When individual genes are ''read'', nucleotide substrates are converted to an RNA polymer product. Using a fluidics analogy, one can think of a pipe lattice in which a single fluid, mRNA, flows at rates determined by properties of the pipes in the lattice and subject to the influence of individual valves placed, by the modeller, on selected pipes in the lattice. The single fluid enters/ exits the lattice from external nodes driven by a pressure head whose value is part of the model specification. A valve on a pipe controls the flow through that pipe, and in the fluidics analogy the net effect of all valves in the lattice controls RNA output. Biologically, the valves correspond to proteins that control the assembly/disassembly of the transcription machinery at the beginning of genes (promoters). Once assembled, the RNA polymerase II component of the transcription machinery transcribes the gene, in effect converting nucleotides to RNA. Assembly of the transcription machinery involves a wide variety of proteins, including both positive and negative regulators. Thus, a piping analogy would include many valves. Our goal is to define a pipe network analogy and its associated valves that properly model RNA output at every modelled gene.
Flux simulators, which model the movement of substrates through a reaction pathway using deterministic and/or stochastic approaches [5,6,7], require explicit declarations of reaction steps, rate constants, and reactant concentrations. As a result they may be less suitable for modelling the assembly of the transcription machinery, where such parameters are ill-defined. Rather, a cruder approach may be needed to model less-defined systems with the purpose of evaluating the plausibility of potential regulatory mechanisms.
Based upon a wide variety of biochemical, genetic, and genomic experiments and conventional wisdom, we and others (see Ref. [8], and references therein) devised a minimal framework for the assembly of the transcription machinery through the TATA binding protein (TBP) at all measurable promoters, using the budding yeast Saccharomyces cerevisiae as a model system. Because the transcription machinery is fundamentally conserved in the eukaryotic lineage, this framework is potentially applicable to higher eukaryotes including humans. In this framework, two protein complexes, TFIID (transcription factor IID) and SAGA (Spt Ada Gcn5 Acetyltransferase), compete to assemble the transcription machinery via recruitment of TBP to promoters ( Figure 1). Consequently, they are functionally redundant, but only partially redundant since each pathway potentially produces quantitatively distinct outputs (i.e. different mRNA levels). Therefore within this framework, TFIID and SAGA provide two potential levels of control, one in which TFIID and SAGA compete for promoter binding and a second where promoterbound TFIID or SAGA drives a quantitatively distinct mRNA output. The combined action of the physiological milieu (protein network) and promoter DNA regulatory elements further tweaks these pathways to achieve gene-specific transcriptional control.
Previously, we utilized a set of mutations that eliminated or reduced the individual contributions of TFIID and SAGA (as well as other regulators) to monitor changes in gene expression on a genome-wide scale in yeast [8,9]. In one of those studies, we measured the effect of over 60 mutant combinations on ,6000 yeast genes. After clustering the genes into six groups based upon similar responses to a set of mutations (i.e., co-regulation), there were 360 (6066) summarized data points, all of which needed to be reconciled in the context of a generalized gene regulation model. Model plausibility was evaluated using a kinetic simulator [7,10] that allowed us to define the transcription process in terms of elementary steps that were relevant to the mutants under study. While this strategy was effective in shaping an all-inclusive model, it was primarily designed to model a forward pathway in which input parameters were declared and the algorithm calculated the output (mRNA production). The process was used interactively to test whether a series of reaction steps and input parameters defining assembly of the transcription machinery could accommodate the bulk of the data. This process, while effective, was laborious because it lacked an iterative optimization routine. For this reason we sought to develop a computational optimization process by which measured mRNA output levels could be used to derive input parameters that would model the output in a modelspecific manner.
TFIID and SAGA are but two of the many protein complexes that control mRNA output ( Figure 1A) [9,11,12,13,14]. The ultimate goal is to create a linked biochemical network that integrates all regulatory interactions. The magnitude of such a problem is substantial if one considers that in yeast alone there are ,6000 genes potentially regulated by up to ,400 proteins, thereby producing ,600062 400 possible bound/unbound states. Exhaustive experimental testing of such a theoretical network would require the impractical construction of 2 400 mutant strains to produce each state. As a tractable, albeit limited, means of elucidating parts of the network, we have focused on key components (i.e., TFIID, SAGA, and several other proteins) with the goal of creating a mathematical ''fluidics'' model that describes the contribution of TFIID, SAGA, and other selected proteins to genome-wide gene expression. The mathematical model is intended to evaluate the plausibility of ad hoc conceptual models of transcription regulation that explain changes in gene expression in response to defects in the regulatory interactions under study.

The Fluidics Analogy Model
A biochemical network describing the assembly of the transcription machinery at promoters can be thought of as a series of reversible fluid-flows that dynamically move forward (transcription machinery assembly) and backward (disassembly or inhibition), with mRNA output being the net flux of these forward and reverse flows. To model regulated mRNA production from a gene, we developed a piping analogy ( Figure 1B) in which a single fluid, namely mRNA, flows through the pipe at a rate governed by the pipe's resistance and the pressure drop across the ends of the pipe. In addition, a valve is deployed (on selected pipes in the lattice) to represent a regulatory event such as TFIID binding to a promoter. Since TFIID contributes to mRNA production [15], the valve is considered ''on'' when TFIID is bound at the promoter. When TFIID is experimentally removed by creating a mutation in TFIID, the valve is then ''off'' and mRNA output is decreased. Addition of SAGA to the system creates two inputs, or pipes, that converge to produce a single mRNA output. In this work a set of such pipes is referred to as a lattice, and is constructed by the user to conceptualize experimental observations. In principle, the 6000 yeast genes can be modelled by 6000 individual lattices. However, rather than creating a computationally unwieldy set of 6000 lattices, clusters of genes that behave similarly within experimental error, when ''valves'' are turned on/off via mutation, are approximated using a single lattice. In this paper, we describe the modelling of six clusters with six lattices. An important aspect of this model is that a valve is experimentally turned on by mutating a negative regulator or turned off by mutating a positive regulator.
Under the normal physiological state of the cell (i.e. wild type), valves will have default settings ranging from zero (off) to some maximal value (fully on). A valve in one lattice, representing a given gene cluster, may have a different default (wild type) setting than the same valve in a different lattice (representing a different gene cluster). For example, lowly expressed genes might have default valve settings for TFIID close to zero. Highly transcribed genes might have the TFIID valve set to a high value. When TFIID is mutated so as to turn off all TFIID valves, mRNA output from the lowly expressed gene is relatively unaffected, whereas mRNA output from highly transcribed genes might be severely curtailed. In a two-pipe lattice involving TFIID and SAGA, the change in mRNA output upon mutation of TFIID (or SAGA) will depend upon the default valve settings for TFIID and SAGA. Since complete elimination of certain regulatory proteins such as TFIID is lethal to cells [14], we must either measure RNA output soon after TFIID inactivation, or use TFIID mutants that are only partially defective, thereby requiring the model to tolerate a relatively high background level of RNA output when the TFIID valve is turned ''off''. The first option is employed in modelling the 2-branch lattice below, and the latter when modelling the 4branch lattice; the 2-and 4-branch lattices are described in the remainder of this paper. In either case, the effect of mutations, i.e. changing on/off valves' setting in our model, is measured as the difference of the resulting mRNA outflow from the all-valves-off state.
The resulting flow across a given valve has three possible states: no-flow, flow in the forward direction, or flow in the reverse direction depending (within the scope of the fluid-flow analogy of this model) on the pressure drop across the pipe holding the valve in question. The effect of the collective settings of all valves in a given lattice on the resulting net-outflow (i.e., mRNA production) has three possible states: increase, decrease, or no-change. Experimentally, this corresponds to a positive, negative, or no change, respectively, in mRNA levels for each gene (or gene cluster) being measured. Only the on/off states of each valve are controlled, and the corresponding net outflow is measured thereby enabling a quantitative inference of the change in flow through all valves.
Our previous study on the TFIID/SAGA assembly pathway included a third non-productive transcription complex assembly pathway [8]. This non-productive pathway loads TBP onto promoters in a state that fails to direct proper assembly of the transcription machinery. Promoter activation therefore requires removal of this inactive TBP [16,17], which is catalyzed by the potentially cooperative action of Mot1 (Modulator of transcription 1) and NC2 (Negative cofactor 2) [16,18,19,20,21]. In this model, Mot1 and NC2 also remove TBP delivered by SAGA [8,9], but not TBP delivered by TFIID [22]. The six clusters of co-regulated genes derived from that study are the subject of four-branched lattice modelling presented here.

A Two-branch Pipe Network
As a first step towards modelling complex lattices, we created a simplified two-branch model representing the dual contributions of TFIID and SAGA to mRNA production ( Figure 1B) [9]. This model is defined by 1) its connection scheme; 2) the pressure heads at all external nodes, v 0 , v 1 , v 2 ; 3) the resistance to flow along each pipe link, r 0 , r 1 , r 2 ; and 4) the on/off setting of valves s 1 , s 2 . Since the flow depends only on pressure drops between the input and output pressure heads, all pressure heads in the model can be specified relative to the output pressure head (v 0 ) which, without any loss of generality, is set to zero for convenience. The external pressure heads and pipe resistances are fixed ''model parameters'' which are specific to each lattice. The measurable ''output'' of an experiment is the single flow variable i 0 that is analogous to the production rate of mRNA for the given s 1 , s 2 setting minus the production rate of mRNA when both valves are closed; the latter is termed the ''background state''. The internal pressure head (p 0 ), and all flow variables i 0 , i 1 , i 2 , are computed from the model parameters consistent with a specific setting of the valves, and in accordance with the standard fluidics model equations: where the constant s i is assigned a value of 1 if valve s i is open (i.e. on), and a value 0 indicating that valve s i is in its off position. In addition, flow continuity at pipe connections requires: The model parameter r 0 defines a class of solutions under the transformation: Without any loss of generality, we arbitrarily set r 0 = 1, recognizing that a different setting of this model parameter will scale p 0 and the remaining model parameters according to Eq. (5) leaving all other model variables, most importantly i 0 , invariant. Simultaneously solving equations (1)-(4), with r 0 set to 1, yields the expression for the output flow i 0 The arguments (s 1 , s 2 ) of i 0 in Eq. (6) are intended to affirm the dependence of the output flow on the valves setting. It is evident from Eq. (6) that when both valves are closed i 0 vanishes providing the background case against which other valve-settings' measured outflow is compared. Hence in Eqs. (1)-(6) i 0 (s 1 , s 2 ) denotes the difference of the mRNA outflow for the valve setting (s 1 , s 2 ) from the mRNA outflow when both valves are closed, i.e. s 1 = 0 = s 2 .

Optimal Solution of the Two-Branch Model for the Model Parameters
With two valves in the two-branch model, each permitting two states, on or off, there is a total of four possible valve-setting combinations available, each yielding a model value of i 0 that correlates with a correspondingly measured value of mRNA relative to the background. Labeling branches 1 and 2 in Figure 1B as D and S, respectively, we designate the wild type state (i.e. unperturbed or starting state) as the experiment where the two valves are set to the on position, and set the value of i 0 to the differential measured mRNA flow: (Note that, for example, the subscript 3 on m 3 is the integer represented by the binary number 11 corresponding to the valves setting for the corresponding state). In Eq. (7), m 0 is the measured value for the background type state defined above. All other three states corresponding to the remaining settings of the valves are experimentally altered states corresponding to differential measured mRNA flows m i , i = 0,1,2:  (8) is automatically satisfied, hence it cannot be used in the process of evaluating the model parameters. Consequently the system of equations (7), (9), and (10) is underdetermined in its unknowns, the four model parameters in this case. Thus values of the model parameters are sought that best fit the model-computed values of i 0 to their measured values. This defines an optimization procedure and the optimal state was achieved by searching for the set of model parameters that minimizes the following quantities:  to pipe lattices it is physically acceptable to have negative pressure heads, v i , and flows, i 1 and i 2 , but not resistances, r i . A negative resistance would imply flow from low to high pressure, an unsustainable proposition in view of the intended analogy.
The optimization problem is formulated as a constrained nonlinear programming (NLP) problem where the objective is the minimization of the squared difference between the experimental and predicted differential outflow (from the background state's outflow) for all the included experiments. The resulting model is implemented using the General Algebraic Modelling System (GAMS) [23] which is a high-level language for the compact representation of mathematical programming models. Subsequently, the model is solved using the CONOPT3 solver which implements the Generalized Reduced Gradient (GRG) algorithm [24,25]. Search procedures perform the minimization locally so as to reduce computational demand. Hence, they do not guarantee a global minimum.
Applying the GAMS [23] optimization procedure to the measured mRNA output values presented in Table 1 yields the model parameters shown in Table 2. These values correspond to the constrained minimum error stated in equation (11) obtained in 1,000 iterations designed to explore a wider region in modelparameter space thus improving the chance of approaching the global minimum at a reasonable computational cost. Each iteration is comprised of a complete minimization sequence, with the various iterations differing from one another by the values assigned as initial guess to the model parameters and variables. For example the optimal model parameters presented in Table 2 were obtained by randomly selecting an initial guess as follows: These model parameters are physically acceptable in that no negative resistances appear. These optimal model parameters reproduced, within experimental error, the measured experimental output of mRNA [9] for each of the modeled six clusters [8] under each permutation of the wild type and mutant (TFIID and SAGA mutants) states ( Table 3). The C:E error is defined as the ratio of the computed value of i 0 for a given valve setting to its measured value corresponding to wild/altered-type yeast minus one. The error is the maximum of |C:E-1| over all valve settings for a given cluster. Table 3 shows that the error for all clusters is below 8.1%, well within the experimental variability. This modeling involves experimental data generated 45 min. after complete inactivation of TFIID via a temperature-shift in the mutant strain taf1-2.
Effectively this optimization procedure amounts to solving the inverse problem, whereby the measured mRNA output is known and attempts are made to infer the model parameters that most closely reproduce that output. Importantly, perturbations to the experimental system via mutation are used to alter the corresponding valve setting(s). While the under-determined nature of the two-branch model is unlikely to repeat in more complex models with more branches, the optimization procedure applied to this two-branch model is equally applicable to over-determined systems. This is further illustrated for the four-branch model below.
The construction of a mathematical model governing transcription complex assembly amounts to determining all model parameters that when deployed in the pipe-lattice model will   Figure 1B for lattice arrangement. Valve settings are denoted by s. Mutant status is indicated by S (spt3D) and D (taf1-2) [9]. Error is defined as the maximum absolute value of the error obtained between the measured [9] and calculated i 0 values. Measured i 0 for WT (experiment 3) corresponds to the average transcription frequency (mRNA/hr) using the data of Holstege [26] for the indicated clusters of genes (3, 4, 5, 6, 8, and 9) defined in Huisinga et al. [8].

A Four-Branch Model
Next we constructed a more complex multi-branch lattice reflecting contributions of the transcriptional regulators Mot1 and NC2 to the two-branch model involving TFIID and SAGA ( Figure 2). The four-branch model's construction is based upon evidence of these interactions presented elsewhere [9]. The number of control valves was set to the number of individual mutations being modelled. Six identical lattices were constructed, each modelling one of the six previously defined clusters of co- regulated genes [9]. Each lattice was tailored by allowing its model parameters, namely the external pressure head (v) at each external node and the resistance of each pipe link, to vary from one cluster to the other. In addition, the model variables, flow currents (i) and internal pressure heads at pipe intersections (p) vary across clusters and with varying valves settings.
The data set used for the four-branch model is different from that used for the two-branch model. In particular, TFIID was only partially inactivated using the taf1(DTAND) mutation, and all mutant states were constitutive (i.e. not induced by a temperature shift, as in the previous example). Consequently, in the all-off background state, a relatively high level of background mRNA remains.
The four-branch model is ''over-determined'' in that there are more conditions to satisfy (measured output from experiment) than model parameters to compute. This feature derives from the fact that the number of model parameters (i.e. number of pipes and external nodes) increases linearly with lattice complexity, while the number of valve-setting combinations increases exponentially, i.e. 2 K , with the number of valves, K. An optimal set of model parameters is sought that minimizes the deviation of the computed values of i 0 under various valve settings from their corresponding experimentally measured values. Additional constraints placed on the optimization procedure are described below.
This optimization procedure permits multiple optimal states and does not guarantee a global minimum in a reasonable amount of computational time. Hence, the results of this procedure, i.e. the determined model parameters, are understood to be neither unique nor physical, measurable quantities. Rather, the ''optimal'' set comprises one possible fully specified lattice (connection scheme and model parameters) that is capable of replicating a corresponding set of experiments to within the observed discrepancy. Different choices of the model parameters might produce equally good, or even better, agreement between model and experiment depending on the computational effort expended in their computation. Thus, the procedure is intended to provide a means of assessing the plausibility of a model by bringing to light internal inconsistencies or conflicts. In such event, the user could then alter the lattice connections and rerun the algorithm for all clusters to assess whether alternative lattice arrangements provide a better fit to the experimental data ( Figure 3).
The four-branch pipe-lattice model representing TFIID, SAGA, Mot1, and NC2 ( Figure 2B) comprises 10 equations for each valve-setting state. Three equations define the flow continuity conditions at the pipe intersections where the pressure head is denoted p i , i = 0,1,2: Two equations define the pressure-head drop relations across the pipes whose resistances are denoted r 5 and r 6 : Five equations define the pressure-head drop relations across the pipes whose resistances are denoted r i , i = 0,…,4: Here too i 0 (s 1 , s 2 , s 3 , s 4 ) denotes the difference of the mRNA outflow for the valve setting (s 1 , s 2 , s 3 , s 4 ) and the mRNA outflow with all valves closed.

Optimal Solution of the Four-Branch Model for the Model Parameters
Like the two-branch model, the model parameter r 0 defines a class of solutions realized by scaling the values of v i , q i , and r i with r 0 ; hence, without any loss of generality, we arbitrarily set the value of r 0 to 1. Using any real, positive value of r 0 will produce effectively the same flow in the lattice by the corresponding scaling of the model parameters and internal pressure drops. This model permits a total of 2 4 = 16 states corresponding to the combination of on/off settings of the four valves s i , i = 1,…,4. However, in contrast to the two-branch model, here the model is overdetermined in the sense that there are 15 non-background experimental values of i 0 (the difference of mRNA outflow for a given state minus mRNA for the background state) that must be replicated by the model via adjustment of only ten model parameters v i , i = 1,…,4, and r i , i = 1,…,6.
Equations 13-15 permit an analytical solution for i 0 in terms of the model parameters and the valve settings. The resulting expression is supplied to the optimization package, GAMS, requiring that the optimal set of model parameters v i , i = 1,…,4, and r i , i = 1,…6, satisfy the following conditions: 1) The difference between i 0 and the experimentally measured mRNA in excess of its background value for the corresponding valves-setting is minimized in the L ' norm (the maximum absolute value over all experiments in a cluster); 2) All model resistances are nonnegative: r i $0, i = 1,…6; 3) The sense of change (increase/ decrease) relative to the all-valves-on state (Experiment 15) is preserved.
In the pipe-lattice model, the default (wild type) setting for each valve is either on or off, reflecting whether the modelled biological interaction represents a facilitating or inhibiting interaction, respectively. As such, the lowest measured mRNA output in each cluster is assigned to the ''all off'' state (Experiment 0 in Table 4), regardless of the mutant status and is representative of the background mRNA level. If multiple states possess values that are experimentally indistinguishable from this smallest value, e.g. Cluster 8 discussed below, we arbitrarily select one of them to correspond to the background state. The effect of turning on one or more valves is computed as the difference of the resulting mRNA outflow from this background value. The mutant status for the ''all-off'' experiment is then directly linked to these valve settings. For example, in cluster 4 the off state of valve s 1 is represented by the SAGA mutant spt3D (S), the off state of s 2 is represented by the TFIID mutant taf1(DTAND) (D), the off state of s 3 is represented by wild type MOT1, and the off state of s 4 is represented by wild type BUR6 (NC2).
Next, each valve is turned on, one at a time (experiments 1-15), and assigned to the appropriate mutant status. Thus, in cluster 4 experiment 1, s 1 is turned on (wild type SAGA) while all other valves (s 2 , s 3 , s 4 ) remain off (mutant TFIID, wild type Mot1, and wild type NC2). When all valves are turned on TFIID and SAGA are in the wild type state and Mot1 (M) and NC2 (N) are mutant (experiment 15). This process is applied independently to each cluster, then using the GAMS optimization procedure described above, the model parameters shown in Table 5 are computed. The modelled mRNA output (i 0 +background mRNA) computed using these parameters is compared to the measured output in Table 4.   Table 3, except that the four-branch lattice in Figure 2B was modeled, data sets were from Huisinga et al. [8], and the valve-settings were adjusted such that ''all off'' (experiment 0) corresponded to the lowest mRNA output for each cluster. doi:10.1371/journal.pone.0003095.t004 With the exception of cluster 5, the error in the modelled output is well below 10%. This value provides a measure of uncertainty with regards to plausibility of the model in Figure 2B to represent the microarray expression data. The higher error associated with cluster 5 indicates that additional regulatory complexity may be associated with this cluster of genes that is not captured by the model. A similar conclusion was drawn regarding the overall validity of the model and the exception of cluster 5 using a different modelling paradigm [9].
Finally, we applied this tool to assess arbitrary alternative arrangements of the lattice model ( Figure 3). These alternative arrangements produced larger error when used to model all clusters ( Table 6), suggesting that they are poorer models of the underlying transcription mechanism.

Discussion
The approach described here provides a tool to help interpret large genomic data sets in the context of a model for transcription complex assembly that has ill-defined reaction steps, rate constants, and reactant concentrations. We applied this fluidics model to a large genome-wide microarray expression profile derived through the perturbation of one central aspect of transcription complex assembly (regulation of the TATA binding protein). The approach provided a measure of plausibility of the proposed model by demonstrating that within experimental error the four-branch model adequately represents the data. The results also illustrate the advantages and limitations of our new model in distinguishing good from poor pipe-lattice connection schemes. This modelling tool is not intended to prove that any particular model is correct, nor is it intended to derive a model for assembly. Rather, it provides a computationally expedient means to assess whether a conceptual model of the system that is grounded in conventional wisdom is inherently consistent with, or contradictory to, genomic microarray data.

Materials and Methods
Microarray expression data were obtained from public sources [8,9,26]. Model parameters were determined via GAMS optimization as described in the text.  Figure 2B (MNSD represents the lattice shown in Figure 2B). doi:10.1371/journal.pone.0003095.t006