Fig 1.
Schematic visualization of the fitness landscape over the sequence space (green curve). Two models (red and blue curves) are inferred to assign high fitness values to sequences found in the multiple sequence alignment (MSA) of a protein family. A complex model (red curve) can be a better predictor of the landscape globally while scoring poorly in predicting single-point mutations around a specific wild-type sequence (see the local fitness landscape in the zoomed area). Conversely, a simple model (blue curve) fitted on a local subset of sequences can give a better local approximation of the landscape, but will likely fail in distant regions of the fitness landscape.
Fig 2.
Behaviour of model predictive performance with different selections of training data.
A. Distribution of Hamming distances to the wild-type (wt) sequence (RNA-binding domain of Pab1-Yeast) in the MSA of [37]. Note the log scale on the y-axis. The three colored lines correspond to three possible sequence selections performed by excluding sequences farther than a certain threshold dcut from wt. A smaller dcut corresponds to fewer sequences with a lower mean Hamming distance to the wt, denoted as D. B. Comparison between predicted and experimental fitness effects of mutations for an independent-site model trained on the three sub-MSAs corresponding to, respectively, dcut = 32 (orange), 43 (purple), and 82 (green). The Spearman correlation coefficient ρ between predicted and experimental values defines the predictive performance of the model. C. Same analysis as in panel B, repeated for all possible cutoffs between dcut = 32 and dcut = 82 (the sequence length). The non-monotonic behavior of the predictive performance indicates that a trade-off between the number of sequences (denoted as B) and their proximity to wt controls the predictive performance of the inferred model. D. Systematic analysis of the predictive power ρ as a function of the mean Hamming distance D of sub-alignments with fixed size B (top), and of the sub-alignment size B at fixed mean Hamming distance D (bottom). Each individual point shows the average over n = 5 sub-samples obtained at the corresponding values of D and B (see Methods). The dashed curves and error bars are computed by binned average and standard deviation over the displayed individual points. All significance levels refer to the Spearman rank correlation of the individual points. *** P < 0.001.
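The sequence selection described in panel A can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `hamming` and `select_sub_msa` and the toy alignment are hypothetical, sequences are assumed to be pre-aligned strings of equal length, and a sequence is kept when its distance to wt is at most dcut.

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def select_sub_msa(msa, wt, d_cut):
    """Keep the sequences within Hamming distance d_cut of wt.

    Returns the retained sequences, their number B, and their mean
    Hamming distance D to wt (assumed convention: distance <= d_cut).
    """
    dists = [hamming(seq, wt) for seq in msa]
    kept = [(seq, d) for seq, d in zip(msa, dists) if d <= d_cut]
    if not kept:
        raise ValueError("no sequence within d_cut of wt")
    sub = [seq for seq, _ in kept]
    B = len(sub)
    D = sum(d for _, d in kept) / B
    return sub, B, D


# Toy alignment (hypothetical data, not the Pab1 MSA).
wt = "ACDEF"
msa = ["ACDEF", "ACDEG", "ACKLG", "WYKLG"]
sub, B, D = select_sub_msa(msa, wt, d_cut=3)
print(B, D)
```

Lowering `d_cut` reduces both B and D at once, which is why the two quantities have to be disentangled by sub-sampling, as in panel D.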
Fig 3.
Quantity-relevance trade-off for lattice proteins.
A: Cubic fold that defines the protein family in the lattice model. Amino acids on sites that are in proximity to each other interact and define the energy of the protein (Methods). B: Predictive performance ρ for single mutations of 5 Sparse Potts models with different degrees of sparsity (defined by K, the number of pairwise links included in the energy function; K = 0 is the independent model) vs. […]. The collapse of the results is in agreement with Eq (5). C: Squared bias vs. mean Hamming distance in the sequence data, see Eq (4), for the same Sparse Potts models as in panel B. Line plots and error bars show the mean and standard deviation at a given D across different values of B. D: Variance σ2 vs. estimated variance σ2 in Eq (3) for the same Sparse Potts models as in panel B. E: Bias factor J0(K) (divided by J0(0)) obtained by fitting the squared bias as a linear function of the mean Hamming distance for the various K-link models in panel C. F: Visualization of pairwise couplings inferred by a fully connected Potts model, highlighting the larger variance of couplings associated with structural contacts (in orange) compared to non-structural ones (in blue); note the log scale on the y-axis. G: Normalized value of J0(K) (divided by J0(0)) obtained with an effective theory using the variance of couplings associated with modeled and un-modeled structural contacts, see Appendix A in S1 Text. H: Scaling of the predictive performance ρ of our statistical models for single-point mutations as a function of the sum of the estimated squared bias J0D and of the variance σ2 in Eq (3). J0(K) (denoted as […] in the plot axis label) is fitted for each value of K by maximizing the scaling correlation, as explained in the main text. I: Bias factor J0(K) (normalized by J0(0)) inferred from maximizing the scaling correlation as in panel H.
Fig 4.
Relevance-quantity trade-off explains the predictive performance of statistical modelling.
A: Predictive performance for single-point mutations using the independent-site model on the RNA-bind protein, shown as a function of the mean Hamming distance of the MSA (top) and of the variance estimated from the alignments (bottom). B: Predictive performance for single-point mutations as a function of the linear sum of squared bias and variance. The scaling correlation rS is computed as the absolute value of the Spearman correlation coefficient of J0D + σ2 vs. ρ. The bias factor J0 is inferred by maximizing rS, as done in Fig 2E. C: Scaling correlation rS for the seven protein families, compared to chance levels. The chance distribution is built by destroying the relationship between the performance ρ and the two descriptors through random shuffling of their order, then repeating the J0 inference procedure to account for the scaling optimization during its estimation. Error bars show standard deviations over n = 100 repetitions of the random shuffling. D: Top: RNA-bind family, predictive performance ρ as a function of the cutoff distance dcut, showing the existence of an optimal cutoff dopt (black dashed line). Bottom: individual contributions of squared bias (J0D, purple line), variance (σ2, green line) and their sum (blue line). The red dashed line indicates the minimum of J0D + σ2, which corresponds to the predicted maximum-performance cutoff dbv. E: Values of predictive performance ρ at the optimal cutoffs compared to the full alignments for the seven protein families. F: Ratio between the performance increase at cutoffs of interest and at the optimal cutoff for the seven protein families.
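The inference of J0 in panel B and the predicted cutoff dbv in panel D can be illustrated with a short sketch. It rests on assumptions that are not taken from the paper: J0 is searched over a small hypothetical grid, the variance estimates σ2 are supplied as inputs (Eq (3) is not reproduced here), and all numbers are toy values. The Spearman coefficient is computed as the Pearson correlation of average ranks.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def scaling_correlation(j0, D, var, rho):
    """r_S = |Spearman(J0 * D + sigma^2, rho)|."""
    score = [j0 * d + v for d, v in zip(D, var)]
    return abs(spearman(score, rho))


def fit_j0(D, var, rho, grid):
    """Pick the grid value of J0 that maximizes r_S (grid search is an assumption)."""
    return max(grid, key=lambda j0: scaling_correlation(j0, D, var, rho))


# Toy inputs (hypothetical): one entry per sub-alignment.
d_cuts = [35, 45, 55, 65, 75]          # cutoff used for each sub-alignment
D = [10, 20, 30, 40, 50]               # mean Hamming distance to wt
var = [0.9, 0.4, 0.2, 0.1, 0.05]       # estimated variance sigma^2
rho = [0.40, 0.50, 0.45, 0.35, 0.22]   # measured predictive performance
grid = [0.0, 0.01, 0.03, 0.1]

j0 = fit_j0(D, var, rho, grid)
r_s = scaling_correlation(j0, D, var, rho)
# Predicted best cutoff: the one minimizing J0 * D + sigma^2.
best = min(range(len(D)), key=lambda i: j0 * D[i] + var[i])
d_bv = d_cuts[best]
print(j0, round(r_s, 3), d_bv)
```

A chance level in the spirit of panel C could be obtained by shuffling `rho` and rerunning `fit_j0` on each shuffle, so that the optimization over J0 is also applied to the null data.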
Fig 5.
The bias factor J0 depends on the model expressivity.
A: Scaling correlation between predictive performance ρ and J0D + σ2 for the RNA-bind protein, modeled with the Sparse Potts Model with different numbers K of couplings. N is the length of the protein (82 sites). B: Values of the bias factor J0 as a function of the number of modeled couplings in the Sparse Potts Model for the RNA-bind protein. C: Same as B for the seven protein families combined; the black line and the blue area represent the mean and the standard deviation over the seven protein families. D: Relation between the bias factor J0(K) and the improvement at the best cutoff Δρ(dopt) for the RNA-bind protein. E: Same as D for the seven families combined. Values of K range from K = 0 to K = N. Each color corresponds to a different protein family, as reported in the legend.
Table 1.
From left to right: number of sites N, number of sequences B (after removal of redundant sequences from the alignment), number of tested single mutations, number of possible single mutations M1, and corresponding references.