Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data

doi:10.1371/journal.pcbi.1011521

Fig 1.

Schematic visualization of the fitness landscape over sequence space (green curve). Two models (red and blue curves) are inferred so as to assign high fitness values to sequences found in the multiple sequence alignment (MSA) of a protein family. A complex model (red curve) can be a better predictor of the landscape globally while scoring poorly in predicting single-point mutations around a specific wild-type sequence (see the local fitness landscape in the zoomed area). Conversely, a simple model (blue curve) fitted on a local subset of sequences can give a better local approximation of the landscape, but will likely fail in distant regions of the fitness landscape.

Fig 2.

Behaviour of model predictive performance with different selections of training data.

A: Distribution of Hamming distances to the wild-type (wt) sequence (RNA-binding domain of Pab1-Yeast) in the MSA of [37]. Note the log scale on the y-axis. The three colored lines correspond to three possible sequence selections, each obtained by excluding sequences farther than a threshold d_cut from the wt. A smaller d_cut yields fewer sequences with a lower mean Hamming distance to the wt, denoted D. B: Comparison between predicted and experimental mutational effects on fitness for an independent-site model trained on the three sub-MSAs corresponding to d_cut = 32 (orange), 43 (purple), and 82 (green), respectively. The Spearman correlation coefficient ρ between predicted and experimental values defines the predictive performance of the model. C: Same analysis as in panel B, repeated for all cutoffs between d_cut = 32 and d_cut = 82 (the sequence length). The non-monotonic behavior of the predictive performance indicates that a trade-off between the number of sequences (denoted B) and their proximity to the wt controls the predictive performance of the inferred model. D: Systematic analysis of the predictive power ρ as a function of the mean Hamming distance D of sub-alignments with fixed size B (top), and of the sub-alignment size B at fixed mean Hamming distance D (bottom). Each point shows the average over n = 5 sub-samples obtained at the corresponding values of D and B (see Methods). The dashed curves and error bars are computed as the binned average and standard deviation over the displayed points. All significance levels refer to the Spearman rank correlation of the individual points. *** P < 0.001.
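
The selection procedure in panels A to C can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes the MSA is already encoded as an integer array of shape (B, N), uses a pseudocount of 1 for the independent-site model, and computes the Spearman coefficient from simple ranks that ignore ties. All function names are hypothetical.

```python
import numpy as np

def hamming_to_wt(msa, wt):
    """Hamming distance of every aligned sequence (row) to the wt sequence."""
    return (msa != wt).sum(axis=1)

def select_sub_msa(msa, wt, d_cut):
    """Exclude sequences farther than d_cut from wt.

    Returns the sub-MSA, its mean Hamming distance D, and its size B.
    """
    dist = hamming_to_wt(msa, wt)
    keep = dist <= d_cut
    return msa[keep], float(dist[keep].mean()), int(keep.sum())

def independent_site_fields(msa, q, pseudocount=1.0):
    """Per-site log-frequencies (assumed pseudocount regularization)."""
    n_seq = msa.shape[0]
    # counts[i, a] = number of sequences carrying symbol a at site i
    counts = (msa[:, :, None] == np.arange(q)).sum(axis=0)
    freqs = (counts + pseudocount) / (n_seq + q * pseudocount)
    return np.log(freqs)

def mutation_scores(fields, wt):
    """scores[i, a] = predicted effect of mutating site i of wt to symbol a."""
    sites = np.arange(len(wt))
    return fields - fields[sites, wt][:, None]

def spearman(x, y):
    """Spearman correlation from ordinal ranks (ties are not averaged)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(rank(x), rank(y))[0, 1])
```

Scanning `d_cut` over a range, retraining on each sub-MSA, and calling `spearman` on predicted versus measured effects of the tested mutations would reproduce the kind of curve described for panel C.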

Fig 3.

Quantity-relevance trade-off for lattice proteins.

A: Cubic fold that defines the protein family in the lattice model. Amino acids on sites that are in proximity to each other interact and define the energy of the protein (Methods). B: Predictive performance ρ for single mutations of 5 sparse Potts models with different degrees of sparsity (defined by K, the number of pairwise links included in the energy function; K = 0 is the independent model) vs. . The collapse of the results is in agreement with Eq (5). C: Squared bias vs. mean Hamming distance in the sequence data, see Eq (4), for the same sparse Potts models as in panel B. Line plots and error bars show the mean and standard deviation at a given D across different values of B. D: Variance σ² vs. estimated variance σ² in Eq (3) for the same sparse Potts models as in panel B. E: Bias factor J₀(K) (divided by J₀(0)) obtained by fitting the squared bias as a linear function of the mean Hamming distance for the various K-link models in panel C. F: Visualization of pairwise couplings inferred by a fully connected Potts model, highlighting the larger variance of couplings associated with structural contacts (in orange) compared to non-structural ones (in blue); note the log scale on the y-axis. G: Normalized value of J₀(K) (divided by J₀(0)) obtained with an effective theory using the variance of couplings associated with modeled and un-modeled structural contacts, see Appendix A in S1 Text. H: Scaling of the predictive performance ρ of our statistical models for single-point mutations as a function of the sum of the estimated squared bias J₀D and of the variance σ² in Eq (3). J₀(K) (denoted as in the plot axis label) is fitted for each value of K by maximizing the scaling correlation, as explained in the main text. I: Bias factor J₀(K) (normalized by J₀(0)) inferred by maximizing the scaling correlation as in panel H.
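
The linear fit behind panel E can be illustrated as follows. This is only a sketch under a stated assumption: since the caption writes the squared bias as J₀D, the fit below is a least-squares line through the origin. Eq (4) is not reproduced in this listing, so whether the authors allow an intercept is not known here. The function names are hypothetical.

```python
import numpy as np

def fit_bias_factor(mean_distance, squared_bias):
    """Least-squares slope J0 of squared_bias ~ J0 * D (assumed zero intercept)."""
    D = np.asarray(mean_distance, dtype=float)
    b2 = np.asarray(squared_bias, dtype=float)
    return float(D @ b2 / (D @ D))

def normalized_bias_factors(j0_by_k):
    """Divide every J0(K) by J0(0), the value for the independent model."""
    return {k: j0 / j0_by_k[0] for k, j0 in j0_by_k.items()}
```

Applying `fit_bias_factor` separately to the (D, squared bias) points of each K-link model, and then `normalized_bias_factors` to the resulting dictionary, gives a curve of the kind plotted in panel E.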

Fig 4.

Relevance-quantity trade-off explains the predictive performance of statistical modelling.

A: Predictive performance for single-point mutations using the independent-site model on the RNA-bind protein, shown as a function of the mean Hamming distance of the MSA (top) and of the variance estimated from the alignments (bottom). B: Predictive performance for single-point mutations as a function of the linear sum of squared bias and variance. The scaling correlation r_S is computed as the absolute value of the Spearman correlation coefficient of J₀D + σ² vs. ρ. The bias factor J₀ is inferred by maximizing r_S, as done in Fig 2E. C: Scaling correlation r_S for the seven protein families, compared to chance levels. The chance distribution is built by destroying the relationship between the performance ρ and the two descriptors through random shuffling of their order, then repeating the J₀ inference procedure to account for the scaling optimization during its estimation. Error bars show standard deviations over n = 100 repetitions of the random shuffling. D: Top: RNA-bind family, predictive performance ρ as a function of the cutoff distance d_cut, showing the existence of an optimal cutoff d^opt (black dashed line). Bottom: individual contributions of the squared bias (J₀D, purple line), the variance (σ², green line), and their sum (blue line). The red dashed line indicates the minimum of J₀D + σ², which corresponds to the predicted maximum-performance cutoff d^bv. E: Values of the predictive performance ρ at the optimal cutoffs compared to the full alignments for the seven protein families. F: Ratio between the performance increase at cutoffs of interest and at the optimal cutoff for the seven protein families.
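
The quantities in panels B to D can be sketched as follows. This is an illustrative reading of the caption rather than the authors' implementation: J₀ is searched over a user-supplied grid (the optimizer actually used is not stated here), the Spearman coefficient uses ordinal ranks that ignore ties, and all function names are hypothetical.

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation from ordinal ranks (ties are not averaged)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

def scaling_correlation(j0, D, var, rho):
    """r_S = |Spearman(J0 * D + sigma^2, rho)| for a given bias factor J0."""
    return abs(spearman(j0 * D + var, rho))

def infer_bias_factor(D, var, rho, j0_grid):
    """Pick the J0 on the grid that maximizes r_S (grid search is an assumption)."""
    scores = np.array([scaling_correlation(j, D, var, rho) for j in j0_grid])
    best = int(np.argmax(scores))
    return float(j0_grid[best]), float(scores[best])

def predicted_cutoff(d_cuts, D, var, j0):
    """Cutoff minimizing J0 * D + sigma^2, i.e. the predicted best cutoff d^bv."""
    return d_cuts[int(np.argmin(j0 * D + var))]

def chance_level(D, var, rho, j0_grid, n_shuffles=100, seed=0):
    """Shuffle rho, redo the J0 inference each time, and summarize r_S."""
    rng = np.random.default_rng(seed)
    values = [
        infer_bias_factor(D, var, rng.permutation(rho), j0_grid)[1]
        for _ in range(n_shuffles)
    ]
    return float(np.mean(values)), float(np.std(values))
```

Re-running the J₀ inference inside every shuffle matters: because r_S is maximized over J₀, even shuffled data yield a positive r_S, and the chance level has to include that optimization bias.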

Fig 5.

The bias factor J₀ depends on the model expressivity.

A: Scaling correlation between predictive performance ρ and J₀D + σ² for the RNA-bind protein, modeled with the sparse Potts model with different numbers K of couplings. N is the length of the protein (82 sites). B: Values of the bias factor J₀ as a function of the number of modeled couplings in the sparse Potts model for the RNA-bind protein. C: Same as B for the seven protein families combined; the black line and the blue area represent the mean and the standard deviation over the seven protein families. D: Relation between the bias factor J₀(K) and the improvement at the best cutoff Δρ(d^opt) for the RNA-bind protein. E: Same as D for the seven families combined. Values of K range from K = 0 to K = N. Each color corresponds to a different protein family, as reported in the legend.

Table 1.

From left to right: Numbers N of sites, B of sequences (after removal of redundant sequences from the alignment), of tested single mutations, M¹ of possible single mutations, and corresponding references.
