Conceived and designed the experiments: WN JBP JD. Performed the experiments: WN. Analyzed the data: WN. Wrote the paper: WN JBP JD.

The authors have declared that no competing interests exist.

The mechanisms by which adaptive phenotypes spread within an evolving population after their emergence are understood fairly well. Much less is known about the factors that influence the evolutionary accessibility of such phenotypes, a pre-requisite for their emergence in a population. Here, we investigate the influence of environmental quality on the accessibility of adaptive phenotypes of

Adaptation involves the discovery by mutation and spread through populations of traits (or “phenotypes”) that have high fitness under prevailing environmental conditions. While the spread of adaptive phenotypes through populations is mediated by natural selection, the likelihood of their discovery by mutation depends primarily on the relationship between genetic information and phenotypes (the genotype-phenotype mapping, or GPM). Elucidating the factors that influence the structure of the GPM is therefore critical to understanding the adaptation process. We investigated the influence of environmental quality on GPM structure for a well-studied model of

During adaptation, a population “moves” in genotype space in search of genotypes associated with high-fitness phenotypes. The success of adaptation depends crucially on the accessibility of such adaptive phenotypes. While adaptive phenotypes rely on natural selection for their fixation, their accessibility depends, primarily, on the structure of the genotype to phenotype mapping (GPM) and, secondarily, on the forces that move a population in genotype space – i.e. selection and genetic drift (see

By studying the factors that influence biologically relevant GPMs, we may gain insight into the accessibility of adaptive phenotypes. To that end, we have taken advantage of recent advances in the understanding of bacterial metabolic networks

We define a genotype's phenotype (equivalent, for our purposes, to fitness) using a model of metabolic flux. Specifically, a growing body of experimental and theoretical work

The protein products of the genes b0116, b0726, and b0727 combine to form a protein complex that catalyzes production of succinate coenzyme A (SUCCOA) from alpha-ketoglutarate (AKG) and coenzyme A, with the concomitant reduction of nicotinamide adenine dinucleotide (NAD) and release of carbon dioxide (CO_{2}). A matrix _{1}, _{2}, _{3}, _{4}, _{5}, and _{6} denote the rates of the above reaction, the production of AKG, NAD, and COA, and the utilization of CO_{2}, NADH, and SUCCOA, respectively (Note that this is a simplification of the way the reaction is actually represented in our model). At steady state

Below, we describe the results of our analyses of the influence of the environment on aspects of the structure of the _{glucose}>μ_{glycerol}>μ_{lactate}>μ_{pyruvate}>μ_{succinate}>μ_{acetate}

Before we begin presenting our results, we find it useful to put the results into perspective. The structure of an organismal GPM changes on both ecological and evolutionary time scales; changes to the GPM's structure may result from, among other factors, changes to the environment and the outcomes of interactions among individuals within a population. For a given GPM, our ability to make meaningful predictions about its structure by considering only a subset of the factors that determine that structure will depend on the degree of coupling between the underlying factors. The first set of results we describe below takes into account the effects of the environment on the GPM's structure, independently of population-level processes. For a particular environment, these results give insights into (static) statistical structures of the GPM, and they should be interpreted in that light. Subsequently, we show that some of these static insights are consistent with population-level simulations of the adaptation process and with analytic predictions of the relative speed of adaptation to different environments.

We begin by asking: how does the phenotype change as we move in genotype space, in search of genotypes associated with adaptive phenotypes? To answer this question, we computed the PPD, that is, the probability that two genotypes that are separated by a Hamming distance _{h} in genotype space map onto phenotypes whose fitnesses differ by _{e} _{h} ( = 1,…,166) from each reference genotype. We find that the PPD has a less rich structure in acetate, the poorest of the environments, than in the other six environments (see _{h}<30). At Hamming distances _{h}≥30 the PPD exhibits an interesting bi-modal behavior that is largely independent of _{h}. Therefore, _{h} = ∼30, which is equivalent to 18% of the metabolic network's genome size, can be thought of as a critical Hamming distance that marks a transition from local to global features of the distribution of the magnitude of phenotype changes that accompany changes to the genotype. In contrast, in acetate the PPD (see

The PPD was computed in acetate and glucose environments.

To gain further insight into the dependence of phenotype changes on genotype changes, we computed the CL of phenotype differences, which quantifies the robustness of the phenotype to genotype changes. The longer the CL, the more robust the phenotype. Longer CLs are also characteristic of GPMs that have a relatively smooth structure. The rank-ordering of the environments by CL (CL_{glucose}>CL_{glycerol}>CL_{pyruvate}>CL_{lactate}>CL_{lactose}>CL_{succinate}>CL_{acetate}) is consistent with the rank-ordering based on quality (see above), with the exception of pyruvate, which is poorer than lactate but is associated with a greater CL. The CLs for glycerol, lactate, and pyruvate are similar, which is consistent with the fact that both glycerol and lactate are converted into pyruvate by a small number of metabolic reactions. These results suggest that the GPM has a less rugged structure in qualitatively better environments.

Shown are the correlation length (CL) of phenotype differences, the normalized mutual information (NMI) of genotype differences relative to phenotype differences, and the number of essential genes (essentiality) found in the metabolic network, under different environmental conditions. The environments are listed in increasing order of quality, except in the case of lactose whose position in the rank-ordering is not known precisely. The NMI was computed as described in

In addition to the CL, we defined another statistic called the NMI of genotype differences relative to phenotype differences (see

We will use a simple example to explain what the NMI measures. Consider a hypothetical population of individuals with known fitnesses. Suppose we wish to know the difference d_{f} between the fitnesses of any two individuals randomly selected from the population. According to standard information-theoretic principles, our uncertainty about d_{f} is given by the entropy of the distribution p(d_{f}) of fitness differences between individuals found in the population: H(d_{f}) = −Σ p(d_{f}) log_{2} p(d_{f}), where the sum runs over the possible values of d_{f}. If all pairs of individuals have the same fitness difference, then our uncertainty about d_{f} will be H(d_{f}) = 0 “bit”; we will know with certainty the value of d_{f}. If, on the other hand, there are n equally likely fitness differences, then our uncertainty about d_{f} will be maximal: H(d_{f}) = log_{2}(n). Now suppose we also know the genotype difference d_{g} between the two individuals. If there is a consistent relationship between genotype differences and phenotype differences, then knowledge of d_{g} should decrease our uncertainty about d_{f}, that is, it should provide us with information about d_{f}. The amount of information that d_{g} provides about d_{f} is called the mutual information of d_{g} relative to d_{f} (denoted by I(d_{g}; d_{f})). The NMI is the ratio of I(d_{g}; d_{f}) to H(d_{f}), that is, it measures the proportional reduction in the uncertainty about d_{f} due to knowledge of d_{g}.
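The definition above can be illustrated with a small plug-in calculation. The sketch below is not the estimator used in the study (which, per the supporting information, depends on a per-position mutation rate); it simply computes entropy and mutual information from paired toy observations of genotype and fitness differences. The function names and the data are hypothetical, and returning 0 when the fitness-difference entropy is zero is an assumed convention.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) = H(X) + H(Y) - H(X, Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def nmi(genotype_diffs, fitness_diffs):
    """Proportional reduction in uncertainty about d_f due to knowledge of d_g."""
    h_f = entropy(fitness_diffs)
    if h_f == 0.0:
        return 0.0  # assumed convention: nothing left to learn about d_f
    return mutual_information(genotype_diffs, fitness_diffs) / h_f


# Hypothetical paired observations (d_g, d_f) for eight pairs of individuals.
d_g = [1, 1, 2, 2, 3, 3, 4, 4]
d_f_determined = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4]   # fixed by d_g
d_f_independent = [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2]  # unrelated to d_g

nmi_determined = nmi(d_g, d_f_determined)
nmi_independent = nmi(d_g, d_f_independent)
```

In the first toy data set the genotype difference fully determines the fitness difference, so the NMI is 1; in the second the two are statistically independent, so the NMI is 0.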

It is important to keep in mind that here we are concerned with measuring the amount of information that genotype differences convey about phenotype differences, on which natural selection acts during adaptive evolution, and not, as is often the case (e.g., see _{acetate}>NMI_{glycerol}>NMI_{succinate}>NMI_{lactate}>NMI_{glucose}>NMI_{pyruvate}>NMI_{lactose}) is inconsistent with the rank-ordering based on quality (see

Additional information about the structure of the GPM and its potential impact on the accessibility of adaptive phenotypes is provided by the sizes of neutral networks. Neutral networks are important because they allow the search for adaptive phenotypes to proceed (by neutral drift) even if the GPM has a rugged structure. We estimated the distribution of the sizes of neutral networks by performing neutral walks on the GPM (see

Neutral walks were performed as described in

In the preceding section we inferred, based on static pictures of the structure of the metabolic GPM, that the GPM has a less rugged structure in qualitatively better environments, suggesting that adaptive phenotypes could be comparatively more accessible in such environments. To gain further insight into the possible impact of environmental quality on the dynamics of adaptation, we simulated the evolutionary search for the highest-fitness phenotype in different environments. Specifically, we simulated the adaptive evolution of a population of size 1000, starting at randomly chosen genotypes with fitnesses ≤20% of the highest possible fitness (i.e., 1.0) (see

All evolving populations found the highest-fitness phenotype during adaptation to acetate, while 82% and 78% of the populations did so during adaptation to glycerol and succinate, respectively. In contrast, the highest-fitness phenotype was found by only 67% of populations adapting to glucose and by 63% of populations adapting to lactose. In addition, the populations that found the highest-fitness phenotype did so at a much faster rate in acetate, glycerol, and succinate than in either glucose or lactose (see

The fraction ^{th} generation is plotted against

The results presented above suggest the existence of a positive correlation between the NMI and the speed of adaptation. To shed additional light on this result, we now describe a simple mathematical model that makes explicit the relationship between the NMI and the speed of adaptation to a given environment, under the assumptions of Fisher's fundamental theorem of natural selection (e.g., see _{i}^{th} type. Let each type be characterized by its genotype, which is assumed to contain ^{th} type differ from the genotype of the abovementioned individuals, and let

Mathematically, we can express the relationship between the genotype and fitness differences as: ^{th} locus, and

The relationship between genotype and fitness differences for all types of individuals found in the population can be written as:

The mutual information of genotype differences relative to fitness differences is given by (e.g., see

Bacterial evolution experiments have demonstrated that the environment can exert an important influence on the structure of the genotype-phenotype map (GPM). For example, Remold and Lenski

We found that in all environments (except acetate) large genotype changes (>∼30) induce phenotype differences that follow an interesting bi-modal distribution. This bi-modal distribution is characteristic of the expected distribution of phenotype differences between randomly sampled genotypes, suggesting that in the considered environments the

In spite of the predicted ruggedness of the GPM in acetate, the poorest of the considered environments, very long (∼74% of the genotype length) neutral walks could still be performed on the GPM, suggesting that neutral drift can alter a substantial fraction of the genotype during evolution. In other words, a population evolving in acetate could explore large portions of genotype space by drifting on neutral networks, increasing its likelihood of discovering adaptive phenotypes. Furthermore, the NMI was largest in acetate and smallest in lactose, suggesting that the information-transmission capacity of the GPM does not necessarily increase in better environments.

To gain further intuition about how qualitative changes to the environment could influence the dynamics of adaptation, we simulated the adaptation of

Note that previous work

We conclude by pointing out some limitations of our empirical GPM model, and we discuss possible directions for future work. Firstly, our approach to analyzing

The GPM model we studied will add to the suite of available models (e.g., see

In addition, since the NMI affords an analytically tractable measure of evolvability, it could be useful to the mathematical investigation of the evolutionarily important relationship between evolvability and robustness (e.g., see

A number of reconstructions of the

A genotype of the metabolic network corresponds to a particular state of the network's genome (defined above). Mathematically, we represent the genotype as an ordered list of binary values (0 or 1), with a “1” at position ^{−2}, while the fitnesses of unviable genotypes were ≤∼1×10^{−9} (essentially equal to 0). No fitness values occurred between these two limits

A genotype (respectively phenotype) space refers to a structural arrangement of genotypes (respectively phenotypes) based on the Hamming (respectively Euclidean) distances between those genotypes (respectively phenotypes). A GPM is a mapping from genotype space onto phenotype space. When the phenotype is fitness, as is the case in the present study, the geometric structure of the GPM is called a fitness landscape.
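As a concrete illustration of the two metrics, the sketch below computes the Hamming distance between two binary genotypes and the distance between two scalar fitness values. The function names and example genotypes are hypothetical; because the phenotype here is a single number (fitness), the Euclidean distance reduces to an absolute difference.

```python
def hamming_distance(genotype_a, genotype_b):
    """Number of positions at which two equal-length genotypes differ."""
    if len(genotype_a) != len(genotype_b):
        raise ValueError("genotypes must have the same length")
    return sum(a != b for a, b in zip(genotype_a, genotype_b))


def phenotype_distance(fitness_a, fitness_b):
    """Euclidean distance between two scalar phenotypes (fitness values)."""
    return abs(fitness_a - fitness_b)


# Two toy genotypes: 1 = gene present/active, 0 = gene absent/inactive.
g1 = (1, 0, 1, 1)
g2 = (1, 1, 1, 0)
d_h = hamming_distance(g1, g2)        # the genotypes differ at positions 1 and 3
d_e = phenotype_distance(0.75, 0.25)  # hypothetical normalized fitnesses
```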

The probability p(d_{e}|d_{h}) that two genotypes that are separated by a Hamming distance d_{h} in genotype space map onto phenotypes that have a phenotype difference of d_{e} is given by p(d_{e}|d_{h}) = n(d_{e}|d_{h}) / Σ_{d_{e}} n(d_{e}|d_{h}), where n(d_{e}|d_{h}) denotes the number of instances when two genotypes separated by Hamming distance d_{h} map onto phenotypes that differ by d_{e}. p(d_{e}|d_{h}) is computed by the following uniform sampling algorithm:

1. Choose a reference genotype at random.

2. Sample exactly

3. Compute the phenotype/fitness (i.e., the optimal biomass yield) of each genotype sampled in step 2. Normalize the computed fitnesses by dividing by the highest-possible fitness in the current environment (this facilitates the comparison of fitnesses across environments). Calculate the absolute difference between the computed fitnesses and the fitness of the reference genotype.

4. Arrange the fitness differences computed in step 3 into (d_{e}|d_{h}) bins; note that only the d_{e} values were binned. Bins of size 0.01 were used (there were 100 bins, with right edges at 0.01, 0.02,…,1.0). Both smaller (0.001) and larger (0.05) bin sizes gave qualitatively similar distributions for p(d_{e}|d_{h}) (e.g., see

5. Repeat the above steps until convergence of p(d_{e}|d_{h}).

The above algorithm converges relatively fast (i.e., _{e}|_{h}) does not vary by >10% at convergence; e.g., see ^{6} data points in the process.
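The sampling steps above can be sketched as follows, under several stated assumptions. The fitness function `toy_fitness` is a hypothetical stand-in for the flux-balance computation of biomass yield; the genome length (20 rather than 166), the number of reference genotypes (`n_refs`), and the number of genotypes sampled per Hamming distance (`samples_per_distance`) are illustrative choices; and a fixed number of repetitions replaces the convergence check of the final step.

```python
import math
import random
from collections import defaultdict

GENOME_LENGTH = 20  # toy size (assumption); the genome in the text has 166 positions
N_BINS = 100        # bins of width 0.01 with right edges at 0.01, 0.02, ..., 1.0


def toy_fitness(genotype):
    """Hypothetical stand-in for the flux-balance fitness: fraction of genes present."""
    return sum(genotype) / len(genotype)


def mutate_at_distance(reference, d_h, rng):
    """Return a genotype at Hamming distance exactly d_h from `reference`."""
    flipped = set(rng.sample(range(len(reference)), d_h))
    return tuple(1 - g if i in flipped else g for i, g in enumerate(reference))


def bin_index(d_e):
    """Index of the first bin whose right edge is >= d_e (rounding guards float noise)."""
    return min(max(math.ceil(round(d_e * N_BINS, 9)) - 1, 0), N_BINS - 1)


def estimate_ppd(fitness, max_fitness, n_refs=200, samples_per_distance=5, seed=0):
    """Estimate p(d_e | d_h) as normalized bin counts, one row per Hamming distance."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: [0] * N_BINS)  # counts[d_h][bin]
    for _ in range(n_refs):  # step 5, with a fixed repetition count
        ref = tuple(rng.randint(0, 1) for _ in range(GENOME_LENGTH))  # step 1
        f_ref = fitness(ref) / max_fitness
        for d_h in range(1, GENOME_LENGTH + 1):
            for _ in range(samples_per_distance):  # step 2
                sampled = mutate_at_distance(ref, d_h, rng)
                d_e = abs(fitness(sampled) / max_fitness - f_ref)  # step 3
                counts[d_h][bin_index(d_e)] += 1  # step 4
    ppd = {}
    for d_h, row in counts.items():
        total = sum(row)
        ppd[d_h] = [c / total for c in row]
    return ppd


ppd = estimate_ppd(toy_fitness, max_fitness=1.0)
```

With this toy fitness every single mutation changes the normalized fitness by exactly 1/20, so all of the probability mass for d_{h} = 1 falls in the bin with right edge 0.05; a real flux-balance fitness would of course give a much richer distribution.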

The correlation function describes, for example, how the similarity between the phenotype of a given genotype and that of an ancestral genotype decays as the two genotypes diverge. The correlation function of phenotype differences can be obtained directly from the quantity _{h}. In (6), _{h}/CL), via minimization of the sum of squared errors.
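A minimal sketch of the least-squares step, assuming the correlation function is modeled as exp(−d_{h}/CL): given correlation values at a set of Hamming distances, the correlation length is the value of CL that minimizes the sum of squared errors. The golden-section search, the search bounds, and the synthetic data are choices made for this illustration and are not taken from the study.

```python
import math


def sum_squared_errors(cl, distances, correlations):
    """Sum of squared errors between exp(-d_h / CL) and the observed correlations."""
    return sum((math.exp(-d / cl) - r) ** 2 for d, r in zip(distances, correlations))


def fit_correlation_length(distances, correlations, lower=0.1, upper=1000.0, tol=1e-8):
    """Golden-section search for the CL in [lower, upper] that minimizes the SSE."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lower, upper
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    while b - a > tol:
        if (sum_squared_errors(x1, distances, correlations)
                < sum_squared_errors(x2, distances, correlations)):
            b = x2  # the minimum lies in [a, x2]
        else:
            a = x1  # the minimum lies in [x1, b]
        x1 = b - inv_phi * (b - a)
        x2 = a + inv_phi * (b - a)
    return (a + b) / 2


# Synthetic, noise-free correlation function with a known correlation length.
TRUE_CL = 25.0
distances = list(range(1, 167))
correlations = [math.exp(-d / TRUE_CL) for d in distances]
estimated_cl = fit_correlation_length(distances, correlations)
```

On noise-free synthetic data the fit recovers the correlation length used to generate it; with empirical correlation functions the quality of the exponential model should be checked before interpreting the fitted CL.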

Note that the above statistical methods are applicable to any mapping from a combinatorial set (e.g., the set of possible metabolic genotypes, which consist of sequences defined on a binary alphabet) onto a set consisting of either continuous- (e.g., the set of possible metabolic phenotypes/maximum biomass yields) or discrete-valued entities, whenever both the domain and range of the mapping are equipped with appropriate metrics (e.g., d_{h} and d_{e}). The applicability of the methods does not depend on the specifics (e.g., folding thermodynamics, in the case of RNA GPMs, or flux-balance analysis, in the case of our metabolic network GPM) of the mapping under consideration.

The mutual information is a standard information-theoretic quantity

When computing the NMI,

In this work, we estimated the value of

A neutral walk proceeds as follows

1. A “walker” starts at an initial, randomly chosen viable genotype,

2. A genotype,

3. The walker moves to

4. Steps 2 and 3 are repeated until it becomes impossible for the walker to move further.
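Because parts of the step descriptions are not recoverable here, the following is only a sketch of one common form of neutral walk, and its move rule is an assumption: at each step the walker moves to a randomly chosen single-mutant neighbor that has the same fitness as the starting genotype and lies farther (in Hamming distance) from the start, and it stops when no such neighbor exists. The viability rule `toy_fitness`, the set `ESSENTIAL`, and the genome length are hypothetical.

```python
import random

ESSENTIAL = frozenset({0, 3, 7})  # hypothetical essential gene positions
GENOME_LENGTH = 12                # toy size (assumption)


def toy_fitness(genotype):
    """Hypothetical viability rule: fitness 1.0 only if every essential gene is on."""
    return 1.0 if all(genotype[i] == 1 for i in ESSENTIAL) else 0.0


def neutral_walk(start, fitness, rng):
    """Return (number of steps, final genotype) of one neutral walk from `start`."""
    f_start = fitness(start)
    current = start
    flipped = set()  # positions at which `current` differs from `start`
    steps = 0
    while True:
        candidates = []
        for i in range(len(current)):
            if i in flipped:
                continue  # flipping back would reduce the distance from `start`
            neighbor = current[:i] + (1 - current[i],) + current[i + 1:]
            if fitness(neighbor) == f_start:  # neutral move
                candidates.append((i, neighbor))
        if not candidates:
            return steps, current  # the walker cannot move further
        i, current = rng.choice(candidates)
        flipped.add(i)
        steps += 1


rng = random.Random(0)
# Random viable starting genotype: essential genes on, the rest chosen at random.
start = tuple(1 if i in ESSENTIAL else rng.randint(0, 1) for i in range(GENOME_LENGTH))
steps, end = neutral_walk(start, toy_fitness, rng)
```

Under this toy rule every non-essential position can be flipped without changing fitness, so the walk length equals the number of non-essential positions (9 of 12), regardless of the random choices made along the way.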

We ran 100

Computer implementations of the methods and algorithms described above are available upon request.

Reactions found in the E. coli central metabolic network analyzed in this study

(0.14 MB DOC)

Conditional probability distribution of phenotype differences. The distributions were computed as described in the main text. Phenotype differences were binned using bins of size 0.01.

(1.05 MB TIF)

Conditional probability distribution of phenotype differences. The distributions were computed as described in the main text. Phenotype differences were binned using bins of size 0.05.

(1.72 MB TIF)

Convergence of the conditional probability distribution of phenotype differences. Shown is the Kolmogorov-Smirnov distance between the distribution of phenotype differences d_{e} conditioned on genotype differences d_{h} obtained after t iterations of the uniform sampling algorithm described in the main text (denoted p(d_{e}|d_{h},t)) and the distribution p(d_{e}|d_{h},t−10), for three values of d_{h} spanning a wide range. The Kolmogorov-Smirnov distance is given by max{abs(p(d_{e}|d_{h},t)−p(d_{e}|d_{h},t−10))}. The data were collected in a glucose environment.

(0.21 MB TIF)

Rank-ordering of metabolic environments based on the normalized mutual information (NMI). The environments are listed in increasing order of quality, except in the case of lactose whose position in the rank-ordering is not known precisely. The NMI was computed as described in the main text, using different values of p, the mutation rate per genotype position. The measurement scales of NMI values corresponding to different values of p were adjusted in order to facilitate their presentation on the same graph.

(0.20 MB TIF)

We are grateful to Simon Levin, Ned Wingreen, and three anonymous reviewers for very constructive comments on an earlier version of this manuscript; Michael Desai for very helpful discussions; and the laboratory of Bernhard Palsson at the University of California, San Diego, CA, for providing access to the latest reconstruction of the