
CLB conceived and designed the experiments, performed the experiments, analyzed the data, and contributed reagents/materials/analysis tools. BOP made the work possible by

The authors have declared that no competing interests exist.

The number of complete, publicly available genome sequences is now greater than 200, and this number is expected to rapidly grow in the near future as metagenomic and environmental sequencing efforts escalate and the cost of sequencing drops. In order to make use of these data for understanding particular organisms and for discerning general principles about how organisms function, it will be necessary to reconstruct their various biochemical reaction networks. Principal among these will be transcriptional regulatory networks. Given the physical and logical complexity of these networks, the various sources of (often noisy) data that can be utilized for their elucidation, the monetary costs involved, and the huge number of potential experiments (~10^{12}) that can be performed, experiment design algorithms will be necessary for synthesizing the various computational and experimental data to maximize the efficiency of regulatory network reconstruction. This paper presents an algorithm for experimental design to systematically and efficiently reconstruct transcriptional regulatory networks. It is meant to be applied iteratively in conjunction with an experimental laboratory component. The algorithm is presented here in the context of reconstructing transcriptional regulation for metabolism in Escherichia coli.

In recent years, the exploration of life has been bolstered through the advent of whole genome sequencing. This new data source significantly enables the reconstruction of genome-scale metabolic networks. After a metabolic reconstruction, it is necessary to discover the genetic control mechanisms that operate within an organism. Transcriptional regulatory network (TRN) reconstruction is costly both in terms of time and money, so it is critical that the reconstruction efforts be made as efficient as possible. Experiments must be designed so that the most new regulatory knowledge is discovered in each experiment. The huge number of possible experiments (~10^{12}) and the vast amount of heterogeneous data available for designing experiments overwhelm the human ability to assimilate. The authors have developed an algorithm that utilizes a mathematical model of a reconstructed metabolic network integrated with a partially reconstructed TRN to identify the experiment designs with the highest potential of yielding the most new regulatory knowledge. The authors show that the produced experiment designs are similar to those a human expert would produce, and that the algorithm has a facility to incorporate any relevant data source to design such experiments.

As of January 2006, the TIGR Comprehensive Microbial Resource [

This paper presents an algorithm for systematically and efficiently reconstructing the topology and condition-dependent logic of TRNs. Practically, this means discovering new transcription factor (TF)–target gene regulatory connections, the (Boolean) logic of how TFs regulate target genes, and the environmental stimuli that modulate TF activity. The algorithm is presented here in the context of reconstructing transcriptional regulation for metabolism in Escherichia coli.

The computational algorithm utilizes a dynamic simulation of the current integrated transcriptional regulatory and metabolic network reconstruction to design experiments. The new regulatory logic rules discovered by the experiments are then added to the reconstruction.

TF_{i}, transcription factor i; TF_{j}, transcription factor j.

Efficient reconstruction means minimizing the number of experimental iterations. Fewer iterations yield efficiency gains in two key ways: less time spent researching the next most informative experiment design (which could conceivably require weeks to months), and lower laboratory supply and personnel costs, which accrue with each iteration. Minimizing the total number of iterations necessitates maximizing the number of new regulatory interactions discovered in each iteration. Given the experimental protocol described above, maximal rule discovery in each iteration depends on a simultaneous maximization and minimization. Assuming for a moment that one had a knockout group (KG) of TFs in mind for which one wanted to discover the TRN, it would be necessary to identify two growth environments that each maximally activate the regulatory connections for the identified TFs. Simultaneously, one would want to minimize the connections identically activated in both environments, so as to minimize redundant discoveries.

This strategy is complicated by three facts. First, one would not have a complete TRN by which to judge whether the proposed growth environment shift–TF KG combination would be the one to yield the most new regulatory logic at the current stage of the reconstruction. Second, the best picture one could draw for the complete, real TRN would come from various data sources (e.g., literature, homology, expression profile based algorithms, location analysis, and TF binding site prediction algorithms) that are characterized by uncertainty and noise. Third, it would not be an optimal strategy to first choose a KG and then the best growth environment shift for that KG. Since the growth environment determines how a TRN is activated, and each of the possible KGs would be associated with different patterns of regulatory activity depending on the environment, it is the combination of growth environment shift and TF KG that determines the maximum yield of new regulatory logic.

The goal of designing a maximally efficient experimental strategy thus requires the resolution of three critical issues. First, how does one use criteria from various sources—with their attendant uncertainties, incompleteness, and noise—to infer as complete a picture as possible of the structure and logic of the TRN? Second, how does one utilize this incomplete picture to decide which growth environment shift–TF KG combination is most likely to yield the greatest amount of new regulatory knowledge? And third, how does one do this while steering away from previously discovered regulatory logic and towards new knowledge? After detailing the algorithm that addresses these issues in the next section, we describe how it has been applied and assessed in the Implementation section below.

The logic of the algorithm, depicted in

The algorithm ultimately produces experiment designs ranked by their potential for producing the most new regulatory rules.

TF_{i}, transcription factor i; TF_{j}, transcription factor j.

The procedure begins with a growth simulation using the

The example model M in (A) contains genes for three transcription factors and two enzymes and shows the 0/1 (“off/on”) state of each gene in each time step t_{n} of a simulation for a defined growth environment E. The dashed line in the model indicates that a regulatory connection between the two genes is not explicitly modelled, but is suspected to exist.

(B) shows how a “basic unit” is defined and shows the general formula for the parameterization of each cell. One basic unit exists for every known or suspected TF–target gene pair.

As shown in (C), for regulatory connections that are explicitly modelled, each cell of the basic unit receives either a “0” or a “1,” depending on whether its associated transcription factor and target gene were observed in the indicated 0/1 combination in any simulation time step.

(D) illustrates how the inferred TF–target gene regulatory connections and logic derived from experimental and/or computational data are incorporated in a basic unit.

In (E) the basic units from (C) and (D) are concatenated to form the activity profile for the simulation.

Any two environments that define a shift must differ by only one component; otherwise, the true causative agent of a regulatory response, and hence the inferred logic rules, would be ambiguous. Of all possible growth environment shifts, we therefore identify as “legal” those that differ by exactly one component.
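As an illustration of the single-component criterion, legal shifts can be enumerated as follows (a minimal sketch in Python; the tuple representation of environments and the function names are hypothetical, not taken from the original implementation):

```python
from itertools import combinations

def is_legal_shift(env_a, env_b):
    """A shift is 'legal' when the two environments differ in exactly
    one component, so any regulatory response can be attributed to a
    single causative agent."""
    return sum(a != b for a, b in zip(env_a, env_b)) == 1

def legal_shifts(environments):
    """Enumerate all legal shifts among growth-supporting environments."""
    return [(a, b) for a, b in combinations(environments, 2)
            if is_legal_shift(a, b)]
```

Representing each environment as a fixed-order tuple of its five source categories (carbon, nitrogen, phosphate, sulfur, electron acceptor) makes the component-wise comparison trivial.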

A “shift activity profile” summarizes how the integrated transcriptional regulatory and metabolic (ITRAM) network is utilized in both environments of a growth environment shift. It is created by combining the two activity profiles of the two growth environments, as illustrated in

To prevent the algorithm from re-suggesting previously implemented experiment design(s), a record is kept of those cells of the shift activity profiles that were used to suggest the experiments that were previously implemented. In this step, the history mask is applied to each shift activity profile generated in Step 3 (see

Enumerate all combinations of TFs for KGs in a range of sizes (e.g., all combinations of two, three, four, and five TFs). (See
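This enumeration step can be sketched directly (a hypothetical Python helper, not the authors' code):

```python
from itertools import chain, combinations

def enumerate_kgs(tfs, min_size=2, max_size=5):
    """Enumerate every knockout group (KG): all combinations of TFs
    of each size between min_size and max_size inclusive."""
    return list(chain.from_iterable(
        combinations(tfs, n) for n in range(min_size, max_size + 1)))
```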

The interconnectedness of a KG is the sum of regulatory connection weights between all pairs of TFs in a KG, where a regulatory connection is an instance in which one TF regulates the other, or the two TFs regulate a common (third) gene. The interconnectedness for a KG depends upon the particular growth environment. For each KG KG_{k} and legal growth environment shift E_{ij}, compute the interconnectedness, N_{I}(KG_{k}, E_{ij}), by summing the connection weights w_{ij} over all pairs of TFs in the KG.

The regulatory activity of a KG is the sum of the TF–target gene regulatory interaction weights for all TFs in a KG, where a regulatory interaction is a probabilistic TF–target gene link. The regulatory activity for a KG will be growth environment dependent. For each KG KG_{k} and legal growth environment shift E_{ij}, compute the regulatory activity, N_{R}(KG_{k}, E_{ij}), by summing the interaction weights p_{ij} over all TFs in the KG.

With all growth environment shifts and all KGs enumerated, identify the growth shift–KG combination that maximizes first the regulatory activity, N_{R}(KG_{k}, E_{ij}), and then the interconnectedness, N_{I}(KG_{k}, E_{ij}), where E_{ij} denotes the shift between environments E_{i} and E_{j}.
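This selection amounts to a lexicographic maximization: N_R first, with N_I breaking ties. A minimal sketch (the dictionary layout for candidate designs is hypothetical):

```python
def rank_designs(designs):
    """Rank candidate (shift, KG) designs: maximize N_R first, then
    use N_I as a tiebreaker.  Each design is a dict with keys
    'shift', 'kg', 'N_R', and 'N_I'."""
    return sorted(designs, key=lambda d: (d["N_R"], d["N_I"]), reverse=True)
```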

After an experiment design has been implemented in the experimental phase of the iterative cycle, a set of new logic rules will have been generated and old rules will have been confirmed. Some of these rules will be for TF–target gene interactions that were in the shift activity profile, whereas others will correspond to newly discovered regulatory interactions. For those rules in the former set, “mark” their corresponding cells in the history mask.

In practice, the algorithm does not produce just one experiment design. Since its purpose is to aid the experimentalist, it produces many designs in a structured format so that the experimentalist can factor in any additional criteria not considered by the algorithm. This aspect of the algorithm output is discussed further in the following Implementation section.

The issue of algorithm validation presents a difficult situation, for there is no complete TRN in existence for any organism. We were able, though, to perform a limited retrospective analysis using two experimental TRN reconstruction iterations performed before the development and completion of this algorithm. For these two experiments, human research and intuition were used to choose the growth environment shifts and TF KOs. The algorithm and its development were in no way influenced by the human-made choice of experiments for the first two iterations.

To date, only a limited amount of research has been reported towards the development of fictitious TRNs that would be complex, realistic, and logically consistent enough to test this algorithm. The logical consistency requirement is critical, for a “randomly-wired” TRN joined to a metabolic network would not be able to support simulated growth, and certainly not in a large number of growth environments. When such networks are successfully developed, they will be an additional test bed on which to evaluate the algorithm.

The two experimental reconstruction iterations previously performed are given in

Human-Made Designs of Double Perturbation Experiments

Although the Δfnr and ΔarcA strains from the first iteration and the ΔnarL strain from the second iteration were found to be highly informative (producing roughly 130 new rules for

For the comparison we implemented the algorithm using iMC1010^{v1}, a genome-scale reconstruction of the integrated transcriptional regulatory and metabolic network for Escherichia coli. From the components of iMC1010^{v1}, we constructed a library of 108,723 minimal media growth environments. By implementing Step 1 of the algorithm, we found that 15,580 of these were able to support growth. From these environments, we were able to form 21,121 legal growth environment shifts. Only published regulatory interactions (contained within iMC1010^{v1}) were utilized, so all probability values in the activity profiles were unity. We considered KGs composed of between two and five TFs. For metabolic regulation in Escherichia coli, iMC1010^{v1} currently includes 104 TFs, and a combinatorial calculation gives C(104,2) + C(104,3) + C(104,4) + C(104,5) = 96,748,106 unique KGs available for analysis. With 9.6 × 10^{7} KGs and 21,121 legal growth shifts for Escherichia coli, there are approximately 2 × 10^{12} potential experiment designs.
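The combinatorial figures quoted above can be checked directly:

```python
from math import comb

# Unique KGs of two to five TFs drawn from the 104 TFs in iMC1010(v1)
n_kgs = sum(comb(104, n) for n in range(2, 6))  # 96,748,106

# Combined with the 21,121 legal growth environment shifts,
# this yields roughly 2 x 10^12 candidate experiment designs.
n_designs = n_kgs * 21121
```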

Computer-Generated Designs of Double Perturbation Experiments

Inspection of the growth environment shifts in

The last four experiment designs are of a qualitatively different nature. These last four are mainly aimed at elucidating nitrogen-related regulation, and so do not probe regulation states that are as different as seen with electron acceptor and glucose shifts. They do this by shifting from a growth environment containing two components, where at least one can also function as a nitrogen source, to an environment containing just one component that functions as both the carbon and nitrogen source. These experiments are expected to reveal less, and less globally acting, regulation—as implied by the lower N_{R} values. The ranking of experiment designs in

The TFs suggested for the experiment designs in

Both of the experiment designs of

Just as the algorithm was run to retrospectively suggest experiments for the first iteration, we used the updated reconstruction resulting from the completed first iteration to retrospectively suggest experiment designs for the second iteration. We do not show an associated table for these results, but report that they are very similar to those of

The rationale for Step 8 of the algorithm is based on the power law nature of TRNs [

To grasp why the algorithm produces the experiment designs that it does and how it arrives at a rank ordering of their information potential, it is helpful to consider a result from previous work [

The observed clustering in

The demonstration of the algorithm with

Systems biology is characterized by the integration of heterogeneous high-throughput data into a mathematical model and the subsequent use of this model to gain understanding of the cellular system(s) under study in a way that would not be possible or feasible otherwise. Moreover, this is done iteratively, whereby new understanding is used to design subsequent experiments. Such a framework has been established and demonstrated in a four-step iterative procedure [

The algorithm has a modular aspect due to its ability to accommodate alternative transcriptional regulatory modeling strategies or different dynamical modeling frameworks altogether. In order to implement such modifications, four adaptations to the algorithm would be required. First, the basic unit will need to be able to capture the logical relationship between each TF–target gene pair, and do so in such a manner that external probabilistic data can be added. Second, a history mask will be needed that is appropriate for the new basic unit design. Third, a function for defining and quantifying the regulatory activity, N_{R}, of a KG will be required. And last, depending on the new approach, a function for defining and quantifying the regulatory connectedness, N_{I}, of a KG will be needed.

We have presented an algorithm for systematically reconstructing a TRN with efficiency and human expert–like reasoning as prime considerations. Efficiency is based on time and cost; time is minimized by automating an experiment design process that would otherwise likely take a human expert weeks to months, and cost is minimized by minimizing the number of laboratory experiments that need to be performed. The algorithm operates by deciding, given the current state of knowledge embodied in the TRN reconstruction, which experiment design—consisting of a group of TF KO strains and a growth environment shift—is most likely to yield the greatest number of new regulatory logic rules. The designs are equally applicable to overexpression studies. The algorithm has the ability to base its decisions on any data source that can assign a probability to the direct interaction of a TF and its regulatory targets and/or to the logical nature of its interactions. This aspect of the algorithm is significant, for it overcomes the finiteness inherent in a model and synthesizes a potentially vast amount of experimental and computational data in a way that would not be possible for a human. We performed a limited retrospective comparison involving two previously performed TRN reconstruction iterations and found that the experiment designs that would have been suggested by the algorithm closely match the experiment designs that were chosen by human experts. This result illustrated our second goal of developing an algorithm with human expert–like reasoning ability. We expect this ability, when coupled with large amounts of probabilistic experimental and computational data, will significantly augment the limited assimilation ability of human experimentalists.
Given the increasing number of organisms whose genomes have been sequenced and whose gene complement characterization is ever improving, we expect that this algorithm or others like or based on it will be necessary to efficiently uncover their transcriptional regulatory systems.

The algorithm presented herein is demonstrated using the first available genome-scale reconstruction of the integrated transcriptional regulatory and metabolic network for Escherichia coli (iMC1010^{v1}) [

Dynamic batch culture growth simulations [

Every growth simulation must occur in a defined nutritional environment. In this work, we simulated growth in all possible minimal media growth environments. In order to enumerate all such possible environments, we collected all environmental components that could have an effect in the model described above and categorized each one as a carbon, nitrogen, phosphate, sulfur, or electron acceptor source. Some components were placed in multiple categories (e.g., glucose 6-phosphate serves as both a carbon and a phosphate source). Additionally, each category contains “None.” All combinations consisting of one component from each category formed the library of minimal media.
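The library construction is a Cartesian product over the five source categories, with “None” included in each category. A sketch (the component lists and function name are hypothetical):

```python
from itertools import product

def build_media_library(sources):
    """Build the minimal media library: one component from each source
    category, where each category also contains None (i.e., the
    component is omitted from the medium)."""
    cats = ["carbon", "nitrogen", "phosphate", "sulfur", "electron_acceptor"]
    return list(product(*(sources[c] + [None] for c in cats)))
```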

The central element in the algorithm is the basic unit. The purpose of a basic unit is to summarize the observed logical relationship between a single TF–target gene pair during a particular growth simulation. Implicit in this summary is also the expression states of the two genes. A basic unit is created for every TF–target gene pair between which a regulatory interaction is known or suspected to occur. Each basic unit is composed of four cells, which correspond to the four possible (Boolean) combinations of TF and target gene states (i.e., TF “inactive”/target “not expressed,” TF “inactive”/target “expressed,” TF “active”/target “not expressed,” and TF “active”/target “expressed”). Each of these four cells contains a single numeric value, which is a probability value reflecting the degree to which it is believed that the physical and logical interactions actually occur in the organism in the given growth environment. Each probability value is computed as the joint probability of the TF activity state and the target gene expression state (Equation 1).

Basic units are parameterized using data from two qualitatively different sources. These two approaches are discussed in the following two sections and are illustrated in

For TF–target pairs whose direct interaction and logic of interaction have been experimentally confirmed (and so are explicitly accounted for in the model), the equation for parameterizing the cells of a basic unit is derived from Equation 1 and Bayes rule:

For regulatory connections between TF–target pairs that have not been experimentally confirmed but are suggested by data, computational and/or experimental data can be used to estimate the probability that (1) the TF represses/activates the target gene, (2) the TF is inactive/active in a particular growth environment, and/or (3) the TF binds to the promoter of the target. For integrating these varied types of data, Equation 2 is further expanded using Bayes rule to give

The first term is the probability that the TF is an activator (corresponding to the basic unit cells for TF/Target states 0/0 and 1/1) or is a repressor (corresponding to the basic unit cells for TF/Target states 0/1 and 1/0). The second term is the probability that the TF is inactive or active in the environment E. The third term is the probability that the TF binds the promoter for the target in E.
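Under this decomposition, each of the four basic unit cells is the product of three probabilities. The following sketch (a hypothetical function, simplified from the description above) shows how the cells could be parameterized; note that the first factor is P(activator) for the 0/0 and 1/1 cells and P(repressor) for the 0/1 and 1/0 cells:

```python
def basic_unit_cells(p_activator, p_active, p_binds):
    """Parameterize the four cells of a basic unit as the product of
    three probabilities: activator vs. repressor, TF inactive vs.
    active in environment E, and TF binds the target's promoter in E.
    Keys are (TF state, target state) pairs."""
    p_repressor = 1.0 - p_activator
    p_inactive = 1.0 - p_active
    return {
        (0, 0): p_activator * p_inactive * p_binds,  # activator off -> target off
        (0, 1): p_repressor * p_inactive * p_binds,  # repressor off -> target on
        (1, 0): p_repressor * p_active * p_binds,    # repressor on -> target off
        (1, 1): p_activator * p_active * p_binds,    # activator on -> target on
    }
```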

An “activity profile” summarizes the regulatory logic and gene expression states observed in a single growth simulation for the entire model. The procedure for creating an activity profile begins by recording the computed expression state of the genes in the model in every time step (see

We used a “shift activity profile” to summarize the regulatory activity and gene expression states for a simulated shift between two growth environments. It is created from the two activity profiles corresponding to simulated growth in each of the two environments. The shift activity profile has the same dimensions as the activity profiles, and each of its cells has the larger of the two values observed in the corresponding cells of the two activity profiles. The larger value is chosen because it reflects the highest confidence of having observed the particular TF–target gene off/on state combination.
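The combination rule is an element-wise maximum over the two activity profiles. A minimal sketch, treating each profile as an equal-sized nested list of probabilities:

```python
def shift_activity_profile(profile_a, profile_b):
    """Combine two activity profiles of identical dimensions into a
    shift activity profile by taking the element-wise maximum; the
    larger value reflects the higher confidence of having observed
    that TF-target state combination in either environment."""
    return [[max(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(profile_a, profile_b)]
```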

The algorithm presented herein is intended to be applied repeatedly and iteratively to systematically discover the TRN of an organism. To prevent the algorithm from repeatedly suggesting the same experiments, it is necessary to record which criteria were used to suggest any experiments performed in earlier iterations. As is explained in Steps 6–8 of the Algorithm section, these criteria consist of particular cells of the shift activity profiles. Thus, we record in a “history mask” those cells of the shift activity profile that were the criteria for choosing experiment designs that were actually implemented in previous iterations. The history mask has the same dimensions as a shift activity profile, and any cell whose corresponding logic has been confirmed by an experiment that was used to suggest a design is “marked.” The history mask is applied to each shift activity profile; those shift activity profile cells whose corresponding history mask cells are marked are overwritten with the value 0.0.
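Applying the history mask then reduces to zeroing the marked cells. A sketch consistent with the description above (Boolean mask; the data layout is hypothetical):

```python
def apply_history_mask(shift_profile, history_mask):
    """Overwrite with 0.0 every cell of a shift activity profile whose
    corresponding history-mask cell is marked (True), so criteria used
    in earlier iterations cannot re-suggest the same experiments."""
    return [[0.0 if marked else value
             for value, marked in zip(row, mask_row)]
            for row, mask_row in zip(shift_profile, history_mask)]
```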

One of the fundamental outputs of the algorithm is the identification of groups of TFs for which to create single-deletion KO strains. Such groups of TFs we term “knockout groups,” or KGs. A KG is characterized by the number of TFs it contains and the identity of the TFs. For a total of m TFs, there are C(m,n) unique KGs composed of n TFs.

In regards to the TFs of a KG, we define two types of TF interconnections. In the first type, one TF directly regulates the other. In the second type, both TFs directly regulate a common (third) gene. To quantify the (total) interconnectedness of a KG, N_{I,total}(KG_{k}), the connection weights w_{ij} are summed over all pairs (TF_{i}, TF_{j}) of TFs in the KG that are interconnected in either way.
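Given a function returning the (environment-dependent) connection weight between two TFs, the interconnectedness of a KG can be sketched as a sum over TF pairs (the interface is hypothetical):

```python
from itertools import combinations

def interconnectedness(kg, weight):
    """N_I for a KG: the sum of connection weights over all TF pairs
    in the KG, where weight(tf_a, tf_b) returns the regulatory
    connection weight between two TFs (0.0 if unconnected)."""
    return sum(weight(a, b) for a, b in combinations(kg, 2))
```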

We define the regulatory activity of a TF to be the probabilistically-weighted count of genes that it directly regulates. This number will be growth environment dependent. To quantify the regulatory activity of a TF_{i}, the interaction probabilities p_{ij} are summed over all target genes j that TF_{i} directly regulates.
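Summing these per-TF activities over a KG gives N_R. A sketch, with targets mapping each TF to its target genes and interaction probabilities (the data layout is hypothetical):

```python
def regulatory_activity(kg, targets):
    """N_R for a KG: the sum over its TFs of the probabilistically
    weighted count of genes each TF directly regulates, where
    targets[tf] maps each target gene to its interaction probability."""
    return sum(sum(targets.get(tf, {}).values()) for tf in kg)
```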

The clusters are projected into three-dimensional space, allowing visualization of the “space” of transcriptional regulation and metabolic functional capabilities. The numbers in parentheses by each cluster in the key are the numbers of different activity profiles in the cluster. Comparison of the clusters shows that they can be distinguished by the available electron acceptor (indicated by the ellipses) and the carbon source, and to a lesser degree by the nitrogen source. The units of each axis are in bits, as given by the Hamming distance computed between computation-based activity profiles that are contained within the clusters.


We thank Eric Knight for critical discussions and Markus Herrgard for helpful review of this manuscript. We also thank the reviewers and editor for helpful comments that improved the readability and presentation of the manuscript.

KG, knockout group

KO, knockout

N_{I}, interconnectedness of a knockout group

N_{R}, regulatory activity of a knockout group

ORF, open reading frame

TF, transcription factor

TRN, transcriptional regulatory network