
Conceived and designed the experiments: TG. Performed the experiments: EGT. Analyzed the data: EGT TG. Wrote the paper: EGT TG.

The authors have declared that no competing interests exist.

Genotype-to-phenotype maps exhibit complexity. This genetic complexity is mentioned frequently in the literature, but a consistent and quantitative definition is lacking. Here, we derive such a definition and investigate its consequences for model genetic systems. The definition equates genetic complexity with a surplus of genotypic diversity over phenotypic diversity. Applying this definition to ensembles of Boolean network models, we found that the in-degree distribution and the number of periodic attractors produced determine the relative complexity of different topology classes. We found evidence that networks that are difficult to control, or that exhibit a hierarchical structure, are genetically complex. We analyzed the complexity of the cell cycle network of

‘Genetic complexity’ is an often-discussed property of genotype-to-phenotype maps, but the term is used vaguely and inconsistently in the literature. We derived a definition of genetic complexity that assigns to every genotype-to-phenotype map a unique quantitative measure of its genetic complexity. This definition allows the meaningful comparison of complexity across systems, and also allows the identification of genetic-network features that contribute most significantly to genetic complexity. We applied this definition to ensembles of Boolean networks. Because all relevant quantities are precisely defined, Boolean networks provide an ideal arena in which to study genetic complexity systematically. Using this approach, we identified relationships between topological properties of networks and their genetic complexity. We also identified features specific to the cell-cycle network of yeast that impart its genetic complexity.

Biologists currently enjoy unprecedented access to genotype and phenotype data, and as the price of DNA sequencing continues to fall and high-throughput automated experimental techniques continue to develop, the amount of data will increase exponentially. The challenge that arises is to extract as much useful information from these data as possible. Molecular, cellular, and behavioral phenotypes are amenable to experimental measurement, but connections to relevant features of the genotype are often out of reach. Likewise, DNA sequencing provides quick and inexpensive access to vast amounts of genotype data, but using the genetic sequence of an organism to predict its phenotype remains a broadly unfulfilled goal. The genotype-to-phenotype map (GPM) encodes the relationship between genetic variations and phenotypes of interest. It is a mapping which assigns to genotypes their corresponding phenotypes. An understanding of GPMs is desirable both to facilitate the prediction of the phenotypic results of genetic perturbations and to identify underlying genotypic features on the basis of phenotypic measurements. The elucidation of GPMs is therefore of primary interest to contemporary genetics.

The properties and construction of GPMs have been the subject of recent studies (see

Consider the two GPMs depicted in

(A) An injective mapping illustrating one-to-one mapping of genotype to phenotype. (B) A non-injective mapping illustrating greater genetic complexity.

With this as a guiding principle, a rigorous set-theoretic definition for the genetic complexity of a GPM is derived in

The mapping describes a diploid organism with a phenotype that has two relevant loci (‘A’ and ‘B’), each with two alleles (differentiated by uppercase and lowercase letters).

We stress that the genetic complexity is a property of a GPM, not of a single genotype and its corresponding genetic network. Thus, the genetic complexity of a GPM is specific to a defined set of genotypes and to the experimental protocol and precise parameters of the phenotypic measurements to be carried out on those genotypes. Multiple phenotype measurements can be accommodated readily. In order for the complexity to be a meaningful quantity, the set of genotypes in the GPM should consist of all possible combinations of the alleles under consideration. We refer to this theoretical set of genotypes as a ‘Mendelian library’.

The definition also allows for stochastic behavior. If the phenotype of a particular genotype is measured several times and it is found that two or more distinct phenotypes are produced, then each distinct phenotype will contribute to the quantity

In

Having considered some general features of the definition of

Dynamic Boolean network models (“Boolean networks”, henceforth) have long been used to model biological networks

As with any mapping, a GPM is not fully specified until the domain on which the mapping acts is delineated. In the case of a GPM, this consists of a specification of the genotypes present in the system. In the abstract arena of Boolean networks, we could conceivably choose any arbitrary set of genotypes to form genotype libraries. However, to identify the effects of some property of the networks on the complexity, we should compare genotype libraries (in which each genotype takes the form of a truth table specifying a Boolean network) that differ only in the property of interest, and that otherwise include all relevant genotypes. In this section, we compare the complexities of network libraries as a function of

Two initial questions are how the complexity depends on the order and on the size of networks. The natural expectation is that, if all other properties of the networks are held fixed while the number of nodes and edges is increased, the genetic complexity will increase. We obtained results in accord with this expectation.

To test the dependence of complexity on network order, for each order we computed the complexity of the library of networks that comprises all possible genotypes with the appropriate number of nodes. Calculating the complexity of such libraries is computationally tractable only for orders of three or less (for example, there are 10^{18} possible genotypes with order four). We therefore calculated the complexity for all libraries with order less than four. The results are presented in

Another question to ask is how the complexity of a map depends on the size of the networks considered. To address this, for a given order and size we computed the complexity of the library of networks that comprises all possible genotypes with the appropriate number of nodes and edges. Our method of counting edges in a Boolean network is described in the

The topology of a genetic network encodes much information about the function of the network. Conserved topological motifs impart similar properties in different biological systems

We found that, for a fixed order and size, the relative complexity of topology classes is determined by two pieces of information:

1. The variance of the distribution of in-degrees in the topology class

2. The number of periodic attractors produced by the topology class

The in-degree distribution is the set of numbers of incoming edges incident to each node (see the ^{9} networks. No exception to the above rule was found. This strongly suggests that the rule will continue to hold for yet larger networks. A sample of data supporting this result is given in

Topology class | In-degrees | C
111 000 000 | (3, 0, 0) | 78.1818
011 001 000 | (2, 1, 0) | 4.57143
110 001 000 | (2, 1, 0) | 2.54545
110 000 100 | (2, 0, 1) | 2.54545
110 000 010 | (2, 0, 1) | 2.54545
100 101 000 | (1, 2, 0) | 2.54545
011 100 000 | (2, 1, 0) | 1.6
110 100 000 | (2, 1, 0) | 1.05263
110 000 001 | (2, 0, 1) | 1.05263
110 010 000 | (2, 1, 0) | 0.695652
100 100 100 | (1, 1, 1) | 0
100 100 010 | (1, 1, 1) | 0
010 100 100 | (1, 1, 1) | −0.36364
100 100 001 | (1, 1, 1) | −0.46154
010 001 100 | (1, 1, 1) | −0.5625
100 001 010 | (1, 1, 1) | −0.5625
100 010 001 | (1, 1, 1) | −0.65
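The in-degree column of the table can be recomputed from the topology strings. The following is a minimal sketch, assuming that each space-separated group of digits is one node's row and that a ‘1’ in position j marks an incoming edge from node j; this reading is inferred from the tabulated in-degrees and is not stated explicitly in the text. The function name `in_degrees` is illustrative only.

```python
from statistics import pvariance


def in_degrees(topology: str) -> tuple[int, ...]:
    """Count the incoming edges of each node.

    Assumption: each space-separated group is the row of one node,
    and a '1' in column j means that node receives an input from node j.
    """
    return tuple(row.count("1") for row in topology.split())


# A few topology classes taken from the table above.
classes = ["111 000 000", "011 001 000", "110 000 100", "100 100 100"]

for topology in classes:
    degrees = in_degrees(topology)
    # Population variance of the in-degrees, the first-level predictor
    # of relative complexity described in the text.
    print(topology, degrees, round(pvariance(degrees), 3))
```

Under this reading, the classes in the table are ordered by decreasing in-degree variance: (3, 0, 0), then the permutations of (2, 1, 0), then (1, 1, 1).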

The difference in complexity associated with different in-degree distributions is most pronounced when one topology class has a node with zero inputs and the second does not. This difference can be understood heuristically as follows. If a topology class has a node with zero in-degree, then that node's update rule does not depend on the current state of any node. The node is immediately locked into whatever state it is in at time

This effect is also active across libraries of different sizes, and accounts for the unexpectedly small difference in

If two classes have the same distribution of in-degrees but a different topology, then the relative complexity is determined by which class realizes more periodic attractors. The class with more periodic attractors has a lower complexity. An example of this is given in

1. Any two Boolean-network topology classes with the same in-degree distribution have the same number of genotypes.

2. Every topology class will realize all fixed point attractors.

C | No. periodic attractors | No. edges in loops | No. loops
0 | 0 | 1 | 1
0 | 0 | 1 | 1
−0.36 | 4 | 2 | 1
−0.46 | 6 | 2 | 2
−0.56 | 9 | 3 | 1
−0.56 | 9 | 3 | 2
−0.65 | 13 | 3 | 3
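The two loop columns can be recomputed from a topology string by enumerating simple directed cycles. The sketch below assumes the same edge convention as the in-degree column of the earlier table (a ‘1’ in row i, column j is an edge from node j into node i) and counts self-edges as loops of length one. The helper names are illustrative, and the enumeration is only practical for very small networks.

```python
def parse_edges(topology: str) -> set[tuple[int, int]]:
    """Return directed edges (source, target); row i lists the inputs of node i."""
    rows = topology.split()
    return {(j, i) for i, row in enumerate(rows)
            for j, bit in enumerate(row) if bit == "1"}


def simple_cycles(edges: set[tuple[int, int]], n: int) -> list[tuple[int, ...]]:
    """Enumerate simple cycles, each reported once from its smallest node."""
    successors = {v: [w for (u, w) in edges if u == v] for v in range(n)}
    cycles = []

    def walk(start: int, node: int, path: list[int]) -> None:
        for nxt in successors[node]:
            if nxt == start:
                cycles.append(tuple(path))
            elif nxt > start and nxt not in path:
                walk(start, nxt, path + [nxt])

    for start in range(n):
        walk(start, start, [start])
    return cycles


def loop_summary(topology: str) -> tuple[int, int]:
    """Return (number of loops, number of distinct edges lying on a loop)."""
    edges = parse_edges(topology)
    cycles = simple_cycles(edges, len(topology.split()))
    loop_edges = set()
    for cycle in cycles:
        for k, node in enumerate(cycle):
            loop_edges.add((node, cycle[(k + 1) % len(cycle)]))
    return len(cycles), len(loop_edges)


# The seven (1, 1, 1) topology classes from the in-degree table.
for topology in ["100 100 100", "100 100 010", "010 100 100", "100 100 001",
                 "010 001 100", "100 001 010", "100 010 001"]:
    print(topology, loop_summary(topology))
```

Applied to the seven classes with in-degrees (1, 1, 1), in the order in which they appear in the in-degree table, this yields loop and loop-edge counts that agree row by row with the table above.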

Because of fact 1 above, the relative genetic complexity of two topology classes with the same in-degree distribution is determined solely by the number of attractors each class realizes. Because of fact 2, the difference in the number of attractors between two such classes is completely determined by the difference in the number of periodic attractors. The topology class with more periodic attractors will have more phenotypes overall, and thus will have a lower complexity.
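The argument can be checked numerically against the table. Equation 1 is not reproduced in this text, so the sketch below uses an assumed form, C = (|G| − |P|)/(|P| − 1), chosen only because it reproduces the tabulated values; it should not be read as a quotation of the authors' definition. It further assumes |G| = 8 for the (1, 1, 1) classes (two non-constant single-input rules per node) and |P| = 8 fixed-point attractors plus the listed number of periodic attractors.

```python
from fractions import Fraction


def complexity(num_genotypes: int, num_phenotypes: int) -> Fraction:
    """ASSUMED form of the complexity, inferred from the tabulated values.

    This is not taken from Equation 1 of the paper, which is not
    available in this text.
    """
    return Fraction(num_genotypes - num_phenotypes, num_phenotypes - 1)


GENOTYPES = 8   # assumption: 2 rules per node, 3 nodes, in-degrees (1, 1, 1)
FIXED = 2 ** 3  # every state of a 3-node network is realized as a fixed point

for periodic in (0, 4, 6, 9, 13):
    c = complexity(GENOTYPES, FIXED + periodic)
    print(periodic, c, round(float(c), 2))
```

With |G| held fixed (fact 1) and the fixed-point count held fixed (fact 2), the only free quantity is the number of periodic attractors, and the assumed expression decreases as that number grows.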

The number of periodic attractors realized provides the second level of structure for the relative genetic complexity of topology classes. However, calculating the number of periodic attractors produced is typically just as difficult as calculating the complexity in the first place. One would like to have a way to predict the relative complexity of two libraries based solely on topological features that can be immediately assessed when presented with two libraries. We found that the number of loops in a topology class provides a qualitative predictor for the number of periodic attractors produced, and thus of the relative complexity of topology classes. An example of the relationship between number of loops and complexity is shown in

Topology classes with more loop structures have less complexity. This behavior is summarized in

It was proposed by R. Thomas that negative feedback loops are necessary for the dynamical realization of periodic attractors

Thus, given two topology classes with the same order and the same size, we can make an informed prediction about their relative complexity. If the in-degree distributions of the two topology classes differ, then we know with certainty which will have lower complexity. If the in-degree distributions are the same, the topology class with more loops is likely to have lower complexity. One class of biological networks that exhibit a paucity of loops are gene-regulation networks employing master regulators

In a recent paper

The authors of

Having studied the complexity of libraries of generic Boolean networks, we examined the complexity of a specific biological system. The cell cycle network (CCN) of

The nodes that make the greatest contribution to the complexity of the system are blue.

Genetic complexity characterizes a library of networks, not a single network, so in order to analyze the complexity of the CCN we must first construct a library which the yeast CCN determines. In the following analysis, for any given threshold network, its corresponding library consists of all threshold networks that have the same topology as the base network and where each node can exist in one of three states: 1) the wild type allele, where the update rule for the node is given by its inputs as defined by Li

We found that for the library generated by the yeast CCN, when all 177147 genotypes were allowed to update from all 2048 initial states, no periodic attractors were produced. Because all fixed attractors were guaranteed to be produced by our construction of the CCN library, the lack of periodic attractors indicates that the complexity of the yeast CCN is maximal. It might seem surprising that a model of a cyclic process gives rise to no periodic attractors. However, this is consistent within the framework of the model, where the G1 phase of the cell cycle is a steady state of the system and initiation of the cell cycle corresponds to an external perturbation. This reflects the biological reality that, at any given time, most cells are not actively progressing through the cell cycle. We then systematically perturbed a number of features of the yeast CCN in order to identify which aspects of the network were most crucial to maintaining its maximal complexity.

First, we calculated the complexity of the 29 networks created by removing a single edge from the CCN. We found two edges whose removal resulted in a network with lower complexity than the CCN. One of these edges is a positive edge from Clb1,2 to Cdc20, and the other is a negative edge from Clb5,6 to Sic1. Both edges relay information from B-type cyclins, which govern the transition from the G2 phase to the M phase, suggesting that this process contributes significantly to the complexity of the CCN. It is known that the G2/M transition is a key checkpoint of the cell cycle, and that the B-type cyclins play a crucial role in this process

We next systematically perturbed the inputs to each node. For each node, we considered all possible reassignments of its inputs, holding all other features of the network fixed. We then calculated how often the complexity is decreased by perturbing the inputs to each node. We found that there is a clear separation between nodes whose inputs are more or less important to maintaining maximal complexity, as shown in

Node | Avg. no. of periodic attractors per perturbation | Out-degree
MBF | 9 | 2
SBF | 1 | 2
Cln1,2 | 3 | 2
Cdh1 | 3 | 2
Swi5 | 2 | 1
Cdc20,14 | 32 | 5
Clb5,6 | 40 | 5
Sic1 | 24 | 3
Clb1,2 | 49 | 8
Mcm1/SFF | 55 | 3

We saw evidence in the previous section that networks with more loops tend to give rise to more periodic attractors. Investigations of the CCN support this conclusion in two ways. First, we added every possible single edge to the network and calculated the resulting complexity, recording how many loops each added edge created and of what size. We found that edge additions that create several small loops or very many large loops are more likely to produce periodic attractors (and thereby decrease the complexity) than a random edge addition.

We can also make a connection between the large-scale loop structure and information flow in the CCN and the number of periodic attractors produced. The dominant modes of information flow in the CCN are as shown in

We have formulated a rigorous, quantitative definition of the genetic complexity of a GPM. This definition provides a tool to unravel the properties of GPMs by providing a consistent means of comparing the relative complexities of genetic networks and identifying features of networks that lead to greater or lesser complexity. Genetic complexity is a surplus of genotypic diversity for a given level of phenotypic diversity. Conversely, it is a dearth of phenotypic diversity. It is this dearth that results in the intellectual sensation of surprise when a complex phenotype arises. In biomedicine, such surprises are often unwelcome, for example when the complex phenotype is an adverse reaction to a drug or treatment. With an increased understanding of the quantitative basis of genetic complexity, such surprises can be more predictable, detrimental surprises can be avoided, and the likelihood of salubrious surprises can be increased. Potential applications of the rigorous definition of complexity include evaluating different strategies for collecting data and designing experiments, evaluating the usefulness of statistical methods to determine relevant genes in genome-wide association studies (GWAS) or alternatives to GWAS, and investigating how genetic complexity arises evolutionarily.

We found that the genetic complexity of libraries of Boolean networks increases monotonically as a function of size and order, fulfilling a basic expectation of genetic complexity. We also found that the key determinants of the relative complexity of different topologies are the in-degree distribution and the number of periodic attractors produced by the class (which is qualitatively related to the number of loops in the topology). The central role played by the in-degree distribution in determining the genetic complexity also suggests that those topology classes that are more difficult to control are additionally more genetically complex.

We found that the cell-cycle network of yeast has maximal genetic complexity. In

The definition of complexity should continue to produce other such insights, when applied both to other computational models and to experimental results. The definition allows the identification of those features of genetic interaction networks that lead to more or less complexity and thus leads to a greater understanding of the structure of GPMs in general. Such insights can provide guidance to engineer genetic systems of desired complexity and to design experiments optimally so as to maximize the information gained by performing measurements.

In the Boolean network framework, the interacting entities (genes, proteins, complexes, etc.) are represented as binary nodes which can be either active (‘1’) or inactive (‘0’). The current state of the network is then completely specified by stating whether each node is 1 or 0. The network evolves dynamically in discrete time steps. The update rule at time

In the context of Boolean representations of genetic interaction networks, specification of a truth table for a node corresponds to specifying how a gene interacts with all other genes. We therefore equate specifying a truth table for a node with specifying an allele for the corresponding gene. In order to fully specify a genotype, then, we must assign a truth table to each node.
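To give a sense of how quickly the number of genotypes grows with order, the following sketch counts unrestricted truth-table assignments. It assumes that each node's truth table has one output bit for each of the 2^{n} network states and that no further restrictions or identifications between networks are applied; any additional conventions used in the original computations are not captured here.

```python
def truth_tables_per_node(order: int) -> int:
    """Number of Boolean update rules for one node: one bit per network state."""
    return 2 ** (2 ** order)


def genotype_count(order: int) -> int:
    """Number of ways to assign an unrestricted truth table to every node."""
    return truth_tables_per_node(order) ** order


for order in (1, 2, 3):
    print(order, genotype_count(order))
```

The count is 2^{n·2^{n}}, so each additional node multiplies the exponent by more than two, which is why exhaustive enumeration is feasible only for very small orders.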

Because there is a finite number of states available to a Boolean network of a given size, and because the update rules of the network are fully deterministic, if allowed to evolve in time a Boolean network will necessarily reach an attractor. The attractor can consist of a single state in which the network is forever stuck (a fixed attractor), or it can consist of a series of states that the network continuously visits in the same order (a periodic attractor). For a network with given genotype and initial state, we associate the resulting attractor with a phenotype.
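The attractor reached from a given initial state can be found by iterating the update until a state repeats. The sketch below is illustrative and is not the authors' C++ implementation; it assumes synchronous updating of all nodes, and the names `step` and `find_attractor` are hypothetical. The attractor is returned in a canonical rotation so that the same cycle reached from different initial states compares equal.

```python
def step(state, rules):
    """Synchronous update: rules[i] maps the full current state to node i's next value."""
    return tuple(rule(state) for rule in rules)


def find_attractor(state, rules):
    """Iterate until a state repeats, then return the repeating cycle.

    A cycle of length 1 is a fixed attractor; longer cycles are periodic.
    The cycle is rotated to start at its smallest state so that it can be
    used as a hashable, start-independent phenotype label.
    """
    first_seen = {}
    trajectory = []
    while state not in first_seen:
        first_seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    cycle = trajectory[first_seen[state]:]
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])


# Two-node example: node 0 copies node 1, node 1 negates node 0.
oscillator = (lambda s: s[1], lambda s: 1 - s[0])
# Two-node example: node 0 = AND of both nodes, node 1 = OR of both nodes.
settler = (lambda s: s[0] & s[1], lambda s: s[0] | s[1])

print(find_attractor((0, 0), oscillator))
print(find_attractor((1, 0), settler))
```

Storing the index at which each state was first seen makes the cycle easy to extract; because a network of n nodes has only 2^{n} states, the loop terminates after at most 2^{n} updates.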

With definitions of genotype and phenotype in hand, we construct a library of Boolean networks according to some unifying principle. We then allow each network in this library to evolve from each possible initial state and record the resulting phenotype. We count the number of unique phenotypes reached and calculate the complexity according to our definition (Equation 1). These computations were carried out on a desktop PC, using programs written in C++.

A network can be characterized by its

A set of nodes and edges constitutes a topology. Two ways to characterize a topology are by its in-degree distribution and by its out-degree distribution. For an order

In order to prove that the second level of structure of the genetic complexity of topology classes is determined by the number of periodic attractors produced, we relied on two facts:

1. Two topology classes with the same in-degree distribution have the same genotypic diversity.

2. Every topology class will realize all fixed point attractors.

We prove these two facts now.

First, we show that all topology classes realize all fixed point attractors. Consider any network belonging to a given topology class. This network is fully described by its truth table. For an n-node network, the truth table has 2^{n} rows and n columns. The ith column is the update rule of the ith node, specifying how that node will update for each of the 2^{n} possible states of the network. If we flip the bits of the ith column of the truth table, the ith node will still have inputs from the same set of nodes; its dependence on these nodes is simply inverted. Thus, we can flip the bits of any column of the truth table and produce another network in the same topology class. Suppose that we are given any single network representing any topology class. We can construct a member of the same topology class that realizes any state of the network, S = (S_{1}, S_{2}, … S_{n}), as a fixed point attractor, as follows. From the truth table, (S_{1}, S_{2}, … S_{n}) updates to the state (S′_{1}, S′_{2}, … S′_{n}). In order for S to be a fixed point attractor, we must find a truth table in the same topology class such that S_{i} = S′_{i} for all i. Such a truth table can be generated by flipping the bits of each column i for which S_{i}≠S′_{i}. As shown above, the resulting network resides in the same topology class, and the state S is a fixed point attractor of the network. Thus, every topology class will realize all fixed point attractors.
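The construction in this proof can be checked directly on a small example. The sketch below stores a truth table as n columns of 2^{n} bits, flips the columns required to make a chosen state a fixed point, and confirms that each node still depends on the same set of inputs. The three-node rules are arbitrary illustrations, and the function names are hypothetical.

```python
from itertools import product


def state_index(state):
    """Row of the truth table for a state, reading the first node as the high bit."""
    index = 0
    for bit in state:
        index = (index << 1) | bit
    return index


def make_fixed(table, state):
    """Flip every column i whose output at `state` differs from state[i]."""
    row = state_index(state)
    new_table = []
    for i, column in enumerate(table):
        if column[row] != state[i]:
            column = [1 - bit for bit in column]
        new_table.append(list(column))
    return new_table


def inputs_of(column, n):
    """Nodes whose value can change this column's output (its incoming edges)."""
    dependencies = set()
    for state in product((0, 1), repeat=n):
        for j in range(n):
            flipped = list(state)
            flipped[j] ^= 1
            if column[state_index(state)] != column[state_index(tuple(flipped))]:
                dependencies.add(j)
    return dependencies


n = 3
# Arbitrary example rules: node 0 = x1 AND x2, node 1 = NOT x0, node 2 = x0 XOR x1.
rules = [lambda s: s[1] & s[2], lambda s: 1 - s[0], lambda s: s[0] ^ s[1]]
states = list(product((0, 1), repeat=n))
table = [[rule(s) for s in states] for rule in rules]

for target in states:
    flipped_table = make_fixed(table, target)
    image = tuple(col[state_index(target)] for col in flipped_table)
    same_topology = all(inputs_of(a, n) == inputs_of(b, n)
                        for a, b in zip(table, flipped_table))
    print(target, image == target, same_topology)
```

Negating a column changes the output for every input combination but not which inputs matter, which is why the flipped network stays in the same topology class.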

Next, we prove that any two topology classes with the same in-degree distribution have the same genotypic diversity. We consider two topology classes with order

In order to show that any two topology classes with the same in-degree distribution also have the same genotypic diversity, [derivation not recoverable from the source; surviving fragments: an expression of the form ^{n} = |G|, and genotype lists (A_{1}, A_{2}, … A, … A_{n}) and (A_{1}, A_{2}, … a, … A_{n})]

The CCN as constructed by Li

A network given in the threshold formalism can always be converted to one in the truth table formalism. One potential stumbling block involves a small difference in topological notation between our earlier truth table framework and the threshold network framework. In the threshold network formalism, a self-regulation is represented by an arrow pointing from a node to itself. However, when translated into the truth table formalism, such a node will actually not have an edge pointing from it to itself. Nodes in the threshold network formalism without self-regulation

As mentioned in the

Update according to the threshold rules given above

Update always to off

Update always to on

Since there are 11 nodes in the CCN, there are 3^{11} = 177147 genotypes in the CCN library. For each of these genotypes, we cycle through all possible 2048 initial states and find the resulting attractor state. We count the number of unique attractors as the number of phenotypes and calculate the complexity
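The library construction can be illustrated on a toy network. The sketch below is not the yeast CCN: the 3-node weight matrix `WEIGHTS` is hypothetical, and the wild-type update is assumed to be a threshold rule in which a node switches on for a positive weighted input sum, off for a negative sum, and keeps its state for a zero sum. Each node carries one of three alleles (wild type, always off, always on), and every genotype is run from every initial state with synchronous updates.

```python
from itertools import product

# Hypothetical 3-node network; WEIGHTS[i][j] is the effect of node j on node i.
WEIGHTS = [
    [0, -1, 0],
    [1, 0, -1],
    [0, 1, 0],
]
WILD, OFF, ON = 0, 1, 2  # the three alleles available to each node


def step(state, genotype):
    """One synchronous update under the allele assigned to each node."""
    nxt = []
    for i, allele in enumerate(genotype):
        if allele == OFF:
            nxt.append(0)
        elif allele == ON:
            nxt.append(1)
        else:
            # Assumed threshold rule: sign of the weighted input sum,
            # with the node keeping its state when the sum is zero.
            total = sum(w * s for w, s in zip(WEIGHTS[i], state))
            nxt.append(1 if total > 0 else 0 if total < 0 else state[i])
    return tuple(nxt)


def attractor(state, genotype):
    """Return the attractor as a cycle rotated to start at its smallest state."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, genotype)
    cycle = seen[seen.index(state):]
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])


def library_phenotypes(n):
    """Unique attractors over all 3**n genotypes and all 2**n initial states."""
    phenotypes = set()
    for genotype in product((WILD, OFF, ON), repeat=n):
        for state in product((0, 1), repeat=n):
            phenotypes.add(attractor(state, genotype))
    return phenotypes


n = len(WEIGHTS)
phenotypes = library_phenotypes(n)
fixed = {p for p in phenotypes if len(p) == 1}
print("genotypes:", 3 ** n)
print("fixed attractors:", len(fixed))
print("periodic attractors:", len(phenotypes) - len(fixed))
```

Because the library contains the genotype in which every node is locked on or off, each of the 2^{n} states appears as a fixed attractor, so only the periodic attractors vary between libraries. For the 11-node CCN, the same loop would cover 3^{11} genotypes and 2048 initial states.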

For perturbations of the CCN, the calculation is carried out in an analogous manner. We start with the perturbed threshold network, construct the library as above, and count the number of unique phenotypes that result. The 2048 fixed states are always guaranteed to appear and the complexity is fully determined by the number of periodic attractors that are realized.


We thank Gregory Carter and Ilya Shmulevich for their contributions.