Table 1.
Mutation encoding scheme (dummy variables).
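A dummy-variable encoding of this kind can be sketched as follows. This is a minimal illustration, not the scheme tabulated in the paper: the function name `encode_mutations`, the alphabetical ordering of the three alternative bases, and the toy 4-site wild type are all assumptions made for the example.

```python
BASES = "ACGT"

def encode_mutations(seq, wild_type):
    """Return a flat 0/1 list with three dummy variables per site.

    Assumption: at each site the three non-wild-type bases are taken in
    alphabetical order; the wild-type base maps to (0, 0, 0).
    """
    if len(seq) != len(wild_type):
        raise ValueError("sequence and wild type must have equal length")
    features = []
    for base, wt_base in zip(seq, wild_type):
        # The three possible mutations at this site.
        alternatives = [b for b in BASES if b != wt_base]
        features.extend(1 if base == alt else 0 for alt in alternatives)
    return features

wild_type = "ACGT"   # hypothetical wild type, for illustration only
mutant = "ATGA"      # differs from the wild type at sites 1 and 3
x = encode_mutations(mutant, wild_type)
```

With this encoding the wild type is the all-zeros vector, so each linear coefficient is the change in phenotype caused by one specific mutation.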
Figure 1.
Stem plot of the linear coefficients.
The three circles on each stem show the change in phenotype for each of the three possible mutations at that site. CRP and RNAP are each known to bind at two sites (magenta and cyan areas). Red circles mark the mutations needed to reach the consensus sequences.
Figure 2.
Histogram of phenotype values of uniformly random sequences for the inferred epistatic model.
Random sequences have very low inferred phenotype values because of the specificity of the binding sites. The peak of the distribution indicates the phenotype values that evolve under neutral conditions. The wild-type value (green line) is much higher than the neutral value, indicating selective pressure.
Figure 3.
a) Matrix of the sum of the absolute values of the pair interaction coefficients for each pair of sites (3 mutations per site gives 9 interactions per pair) for the chosen statistical model. The clusters near the diagonal are interactions within the RNAP and CRP binding sites, and the off-diagonal clusters are interactions between the binding sites. b) Red: site-specific sum of absolute values of additive coefficients, divided by 3 (the number of possible mutations). Black: site-specific sum of absolute values of epistatic coefficients, divided by 9 (the number of possible mutation pairs). Epistatic and additive effects are strongly correlated, with a correlation coefficient of 0.90.
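The aggregation described in this caption can be sketched as below. The data layout (dictionaries keyed by `(site, mutation)` indices), the function names, and the toy coefficients are hypothetical. In particular, taking the per-site epistatic value as the row sum of the pair matrix divided by 9 is one reading of the caption, not a confirmed detail of the analysis.

```python
def pair_matrix(n_sites, pair_coeffs):
    """S[i][j] = sum of |coefficient| over the mutation pairs of sites i and j."""
    S = [[0.0] * n_sites for _ in range(n_sites)]
    for ((i, _a), (j, _b)), value in pair_coeffs.items():
        # Fill both triangles so the matrix is symmetric (assumes i != j).
        S[i][j] += abs(value)
        S[j][i] += abs(value)
    return S

def additive_profile(n_sites, additive_coeffs):
    """Per-site sum of |additive coefficient|, divided by 3 mutations."""
    profile = [0.0] * n_sites
    for (i, _a), value in additive_coeffs.items():
        profile[i] += abs(value) / 3.0
    return profile

def epistatic_profile(S):
    """Per-site row sum of the pair matrix, divided by 9 (an assumed reading)."""
    return [sum(row) / 9.0 for row in S]

# Toy coefficients for 3 sites; keys are (site, mutation) index pairs.
pair_coeffs = {
    ((0, 0), (1, 2)): -0.5,
    ((0, 1), (1, 0)): 0.25,
    ((1, 1), (2, 2)): 1.0,
}
additive_coeffs = {(0, 0): 0.3, (0, 2): -0.6, (2, 1): 0.9}

S = pair_matrix(3, pair_coeffs)
add_prof = additive_profile(3, additive_coeffs)
epi_prof = epistatic_profile(S)
```

Taking absolute values before summing keeps positive and negative interactions from cancelling, so each matrix entry measures the total strength of coupling between two sites rather than its net sign.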
Table 2.
The interaction coefficients are clustered around the subunits of the system: CRP, RNAP, and their constituent binding sites (outlined by white rectangles in Figure 3a).
Figure 4.
Coefficients of the non-epistatic model without glucose (normal levels of cAMP; blue) and with glucose (no cAMP; red).
CRP is activated by cAMP and does not bind without it.
Figure 5.
2D histogram of expression in the two environments, no cAMP (glucose) and cAMP (no glucose), for random sequences (orange) and for sequences from the experiment (blue), which are closer to the wild type (plus sign).
The wild type lies close to the optimal front: very few sequences have both higher expression with cAMP and lower expression without cAMP (above and to the left of the plus sign). The phenotype values range from 1 to 5 in these experiments. The dissimilarity between measured expression and the expression predicted for random sequences along the vertical axis, but not the horizontal axis, likely signals the presence of poorly understood biophysical mechanisms that are employed differently in the two environments.
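The "optimal front" statement can be checked by counting the sequences that beat the wild type in both environments at once. The sketch below is illustrative only: the function name, the `(expression without cAMP, expression with cAMP)` tuple layout, and the toy values are assumptions, not data from the experiment.

```python
def fraction_dominating(points, wild_type):
    """Fraction of sequences above and to the left of the wild type.

    Each point is (expression without cAMP, expression with cAMP), i.e.
    (horizontal, vertical) in the 2D histogram. A sequence dominates the
    wild type if it has strictly higher expression with cAMP and strictly
    lower expression without cAMP.
    """
    wt_off, wt_on = wild_type
    better = sum(1 for off, on in points if on > wt_on and off < wt_off)
    return better / len(points)

# Toy phenotype values on the 1-to-5 scale (hypothetical).
points = [(2.0, 3.0), (1.5, 4.5), (3.0, 4.8), (1.0, 2.0)]
wild_type = (1.8, 4.0)
frac = fraction_dominating(points, wild_type)
```

A fraction near zero means that almost no sequence improves on the wild type in both environments at once, which is what the caption means by "nearly on the optimal front".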
Figure 6.
Generalizing the fitted function by replacing the output values with a non-linear function
improves the least squares fit.
Constrained non-linear optimization found the optimal for the linear model with
. The non-linearity is due to the first few bins being dominated by background fluorescence and not gene expression.
Figure 7.
The LASSO solution of the quadratic model was computed for 100 values of .
Blue is the value, and red is the 10-fold cross-validated
. The green curve is the variance of
for randomly generated sequences. The variance is too large even for values of
that are larger than the optimal value predicted by the maximum of the
curve. We choose the model with
(dashed line) for further analysis. This model has
non-zero coefficients, most of which are epistatic.
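For reference, a standard form of the LASSO objective and of the coefficient of determination used to score such fits is written out below. The symbols are assumptions, since the caption's own notation did not survive: \lambda is the regularization parameter, \beta the vector of additive and epistatic coefficients, x_n the dummy-encoded sequence, y_n the measured phenotype, and N the number of sequences. The 1/(2N) normalization is one common convention and may differ from the one used in the analysis.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta}\;
  \frac{1}{2N}\sum_{n=1}^{N}\bigl(y_n - x_n^{\top}\beta\bigr)^2
  + \lambda \lVert \beta \rVert_1,
\qquad
R^2 = 1 - \frac{\sum_{n}\bigl(y_n - \hat{y}_n\bigr)^2}{\sum_{n}\bigl(y_n - \bar{y}\bigr)^2}
```

In 10-fold cross-validation, the second expression is evaluated on each held-out fold using coefficients fitted to the other nine folds. Larger values of \lambda set more coefficients exactly to zero, which is why the selected model has a limited number of non-zero terms.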
Figure 8.
Sensitivity of the epistatic coefficients to the choice of the regularization parameter .
As in Fig. 3, we show the matrices of the sums of the absolute values of the pair interaction coefficients for each pair of sites. a) Coefficients for the model with maximum
(
). b) Coefficients for the full model:
. Notice the same general structure of the coefficients for varying
, including
in Fig. 3. This indicates stability under changes of the parameter.