Benchmarking Inverse Statistical Approaches for Protein Structure and Design with Exactly Solvable Models

doi:10.1371/journal.pcbi.1004889

Fig 1.

From lattice-protein sequence space to inferred Potts model.

Protein families, each corresponding to a particular structure S, represent portions of sequence space (colored blobs), in which all sequences (colored dots) fold into a unique conformation. Many sequences are expected to be non-folding and not to belong to any family (black dots). Protein families differ in how designable they are, i.e. in the number of sequences folding onto their corresponding structures, represented here by the sizes of the circles. S_A and S_B are the least designable folds, while S_C and S_D are realized by larger numbers of sequences, see Table 1. From a multi-sequence alignment (MSA) of one family, we infer the maximum-entropy pairwise Potts model reproducing the low-order statistics of the MSA. The model is then tested for structural prediction and for generating new sequences with the same fold. An important issue is to unveil the meaning of the inferred pairwise couplings J, which depend on both the family fold and the competitor folds.


Fig 2.

Inverse statistical approaches are able to extract structural information from sequence data of lattice proteins.

A. Left: Structures S_A, S_B, S_C, S_D. Amino acids (blue circles) are shown with their number, running from 1 to 27 along the protein backbone (black line). There are 28 contacts between nearest-neighbor amino acids not supported by the backbone. Right: Positive predictive values (PPV), defined as the fraction of true contacts among the k top scores, with the MI, DCA, PLM, ACE, and Projection procedures (Methods). Multi-sequence alignments with M = 5 ⋅ 10⁴ sequences were generated with Monte Carlo sampling at inverse temperature β = 10³ (Methods). B. Structural predictions for biased alignments (M = 10⁵ sequences). The left panel shows the PPV with the PLM procedure for MSAs of structure S_B, generated with four values of the bias b (Methods). Squares and dots correspond to predictions made with and without reweighting, respectively (Methods). Predictions for the weakest bias (b = 0.01) are identical to the unbiased case (b = 0) shown in Fig 2A. The right panel reports the histograms of the Hamming distances to the Wild Type (WT) sequence, KCLIDRTEFKAREVLVPAKCCEFKECL, randomly chosen from the unbiased MSA of S_B. The effective numbers of sequences (Methods) in the MSAs were M_eff = 100000, 75000, 6378, 102 for, respectively, b = 0.01, 0.05, 0.075, 0.1.
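The PPV defined in panel A can be illustrated with a minimal Python sketch. The function name `ppv_curve`, the dictionary input format, and the toy scores are assumptions made for illustration only; they are not taken from the paper's Methods, and the scores could come from any of the listed procedures.

```python
def ppv_curve(scores, contacts):
    """Return [PPV(1), PPV(2), ...] for pairs ranked by decreasing score.

    scores   -- dict mapping a site pair (i, j), i < j, to its score
                (hypothetical input format)
    contacts -- set of pairs (i, j), i < j, in contact on the native fold
    """
    # Rank pairs from highest to lowest score (ties keep insertion order).
    ranked = sorted(scores, key=scores.get, reverse=True)
    curve, hits = [], 0
    for k, pair in enumerate(ranked, start=1):
        hits += pair in contacts      # count true contacts among the top k
        curve.append(hits / k)        # PPV(k) = true contacts / k predictions
    return curve


# Toy example with made-up scores, not data from the paper.
toy_scores = {(0, 2): 0.9, (0, 3): 0.8, (1, 3): 0.7, (0, 1): 0.1}
toy_contacts = {(0, 2), (1, 3)}
print(ppv_curve(toy_scores, toy_contacts))
```

With 28 native contacts per structure, the entry at k = 28 of such a curve, multiplied by 28, gives the number of correctly predicted contacts after 28 predictions.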


Table 1.

Designabilities and entropies of structures S_A to S_D.

Estimates of how designable the protein families associated with structures S_A, S_B, S_C, S_D are: largest eigenvalue of the contact map matrix c (1st column) [25], entropy of the Potts-ACE model (2nd column), shown to be generative in Fig 3A, and mean percentage of identity between sequences in the attached multi-sequence alignments (MSA) (3rd column). The mean identity is defined as the number of sites carrying consensus amino acids, averaged over all sequences in the MSA and divided by the length of the protein (L = 27); low identity corresponds to a diverse MSA and, hence, to large designability. According to our estimates of the entropies, the volumes (Fig 1) associated with structures S_B and S_C are, respectively, about 4 ⋅ 10²⁰ and 8.5 ⋅ 10²⁴ sequences, while the total number of sequences is 20²⁷ ≃ 10³⁵. For more information about the meanings of designability and entropy, see Section I.B in S1 Text.
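The mean identity defined above can be computed as in the following sketch. The function names and the three-letter toy alignment are hypothetical, and ties in the consensus are broken by first appearance in the column, which is an assumption rather than the paper's stated convention.

```python
from collections import Counter


def consensus_sequence(msa):
    """Most frequent amino acid in each column of an equal-length alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*msa))


def mean_identity(msa):
    """Fraction of sites carrying the consensus amino acid, averaged over the MSA."""
    consensus = consensus_sequence(msa)
    # Number of consensus-matching sites in each sequence.
    matches = [sum(a == c for a, c in zip(seq, consensus)) for seq in msa]
    # Average over sequences, then divide by the sequence length.
    return sum(matches) / (len(msa) * len(consensus))


# Toy alignment (made up, length 3 instead of L = 27).
toy_msa = ["KCL", "KCA", "RCL"]
print(consensus_sequence(toy_msa), mean_identity(toy_msa))
```

Multiplying the returned fraction by 100 gives a percentage, the unit used in the third column of the table.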


Table 2.

Quality of contact prediction with the different methods of inference, and average pressure.

Number of correctly predicted contacts after 28 predictions with the methods MI, DCA, PLM, ACE, and Projection, see Fig 2A, and pressures λ, defined in Eq [2], averaged over the pairs of amino acids in contact in the native folds.


Fig 3.

Inferred Potts-ACE model generates sequences with high folding probabilities and diversities.

A. Folding probabilities P_nat(S_B|A), Eq [5], for four sets of 10⁴ sequences A randomly generated with the Independent-site Model (IM, green), the Potts-ACE (red), the Potts-PLM (orange) and the Potts-Gaussian (blue) models vs. their Hamming distances to the consensus sequence of the ‘natural’ MSA of structure S_B used to infer the four models. Black symbols show results for the ‘natural’ sequences, sampled at inverse temperature β = 10³ (Methods). Most sequences drawn from the Potts-ACE model have high folding probabilities, while most sequences drawn from the IM have low values of P_nat; sequences generated with the Potts-PLM model lie in between. Sequences drawn from the Potts-Gaussian model have very high folding probabilities, but are very close to the consensus sequence, and fail to reproduce the diversity of sequences seen in the ‘natural’ MSA (black) and Potts-ACE (red) data. Hamming distances for the Potts-ACE and PLM models have been shifted by model-specific offsets to improve visibility. Filled ellipses show domains corresponding to one standard deviation of the effective Hamiltonian, Eq [6]. B. Scatter plot of the ‘energy’, Eq [9], computed with the inferred Potts-ACE model (x-axis) vs. the effective Hamiltonian, Eq [6] (y-axis), for the sequences in the MSA generated with the Potts-ACE model for structure S_B. Only sequences within the 90%-100% percentiles of P_nat values are shown. Colors identify intervals of values for P_nat, see legend in panel. The energy of the best folder has been subtracted from the energies of the sequences computed with the Potts-ACE model, so that the minimal energy is zero.


Fig 4.

Inferred Potts couplings encode energetics and structural information about native and competitor folds, reflecting both positive and negative designs.

A. Values of J_ij(a, b) (inferred from an MSA of structure S_B with the Potts-ACE method) vs. −E(a, b) across all pairs of sites i, j and of amino acids a, b (found at least once in the MSA on those sites). Couplings and MJ energy parameters are shown in the consensus gauge, in which the entries attached to the most probable amino acid at each site are fixed to zero. Red symbols correspond to pairs (i, j) in contact, while blue symbols correspond to pairs not in contact. B. Lower triangle: contact map c_ij of structure S_B. Full blue squares correspond to pairs of sites i, j in contact. Green and red dots show, respectively, true and false positives among the 28 largest scores with the ACE method (Methods). Upper triangle: average contact map, computed over all competitor folds weighted with their Boltzmann weights (Methods). The four missed contacts (all touching the central site 4) correspond to large entries of this average contact map. Red squares locate the four false positives. C. Pressure λ_ij for each pair of sites (i, j), computed from Eq [2], vs. the corresponding entry of the average contact map, for structure S_B. The 195 pairs of sites which can never be in contact on any fold due to the lattice geometry are shown with magenta pluses. The 28 contacts on S_B (red symbols) are partitioned into the Unique-Native (UN, 14 full triangles) and Shared-Native (SN, 14 empty triangles) classes, according to, respectively, their absence or presence in the closest competitor structure, S_F (Fig 4D). The remaining 128 pairs of sites (blue symbols) are not in contact on S_B, and are partitioned into the Closest-Competitor (CC, 14 full squares) and the Non-Native (NN, 114 empty squares) classes, according to, respectively, whether they are in contact or not in the closest competitor structure, S_F. Similar results are found for S_A, S_C and S_D, see Table 2 and Figs H, I, and J in S1 Text.
As in Fig 4A, we use coupling and MJ entries expressed in the consensus gauge, since the consensus sequence corresponds to, or is close to, the best-folding sequence, used as a reference sequence in our theoretical calculation of the pressure (S1 Text, Section III). Changing the gauge, e.g. to the least-probable gauge, affects the amplitudes of the pressures λ_ij, but does not qualitatively alter the results. D. Structure S_F, the closest competitor structure to S_B. Note that the four missed contacts (among the top 28 F^APC scores with the ACE method) are carried by the center of the cube (site i = 4 on S_B and S_F), see fold S_B in Fig 2A and its contact map in Fig 4B. Two of the four false positives are contacts on S_F, and are thus in the CC class.
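The partition of site pairs used in panel C can be sketched as follows. The function `classify_pairs` and the toy contact sets are hypothetical; the sketch assumes that the pairs which can never be in contact on the lattice have already been removed from `allowed_pairs`.

```python
def classify_pairs(native_contacts, competitor_contacts, allowed_pairs):
    """Split site pairs into the UN, SN, CC and NN classes.

    native_contacts     -- set of pairs in contact on the native fold (e.g. S_B)
    competitor_contacts -- set of pairs in contact on the closest competitor (e.g. S_F)
    allowed_pairs       -- iterable of pairs that can be in contact on some fold
    """
    classes = {"UN": [], "SN": [], "CC": [], "NN": []}
    for pair in allowed_pairs:
        in_native = pair in native_contacts
        in_competitor = pair in competitor_contacts
        if in_native:
            # Native contact: shared with the competitor (SN) or unique (UN).
            classes["SN" if in_competitor else "UN"].append(pair)
        else:
            # Non-native pair: contact on the competitor (CC) or on neither (NN).
            classes["CC" if in_competitor else "NN"].append(pair)
    return classes


# Toy contact sets (made up, not the S_B and S_F contact maps).
toy = classify_pairs(
    native_contacts={(0, 3), (1, 4)},
    competitor_contacts={(0, 3), (2, 5)},
    allowed_pairs=[(0, 3), (1, 4), (2, 5), (0, 5)],
)
print({name: len(pairs) for name, pairs in toy.items()})
```

For the counts quoted in the caption, the four classes hold 14 + 14 + 14 + 114 = 156 pairs, which together with the 195 geometrically excluded pairs account for all 27 ⋅ 26 / 2 = 351 pairs of sites.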
