
Fig 1.

From lattice-protein sequence space to inferred Potts model.

Protein families, each corresponding to a particular structure S, represent portions of sequence space (colored blobs) in which all sequences (colored dots) fold into a unique conformation. Many sequences are expected to be non-folding and not to belong to any family (black dots). Protein families differ in how designable they are, i.e. in the number of sequences folding onto their corresponding structures, represented here by the sizes of the circles. SA and SB are the least designable folds, while SC and SD are realized by larger numbers of sequences (see Table 1). From a multi-sequence alignment (MSA) of one family, we infer the maximum-entropy pairwise Potts model reproducing the low-order statistics of the MSA. The model is then tested for structural prediction and for generating new sequences with the same fold. An important issue is to unveil the meaning of the inferred pairwise couplings J, which depend on both the family fold and the competitor folds.
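The low-order statistics of an MSA mentioned in this caption can be illustrated with a minimal sketch. It assumes that "low-order" means the single-site and pairwise amino-acid frequencies, which is the usual choice for a pairwise Potts model; the function name `low_order_statistics` and the toy alignment are hypothetical and are not taken from the article.

```python
from collections import Counter
from itertools import combinations


def low_order_statistics(msa):
    """Return single-site and pairwise amino-acid frequencies of an MSA.

    msa: list of equal-length strings, one aligned sequence per entry.
    """
    n_seq = len(msa)
    length = len(msa[0])
    # single[i][a]: fraction of sequences carrying amino acid a at site i
    single = [Counter() for _ in range(length)]
    # pair[(i, j)][(a, b)]: fraction carrying a at site i and b at site j
    pair = {(i, j): Counter() for i, j in combinations(range(length), 2)}
    for seq in msa:
        for i, a in enumerate(seq):
            single[i][a] += 1 / n_seq
        for i, j in combinations(range(length), 2):
            pair[(i, j)][(seq[i], seq[j])] += 1 / n_seq
    return single, pair


# Hypothetical toy alignment (3 sites, 4 sequences), for illustration only.
toy_msa = ["KCL", "KCI", "RCL", "KDL"]
single, pair = low_order_statistics(toy_msa)
```

A maximum-entropy pairwise model is then one whose fields and couplings are tuned so that the model reproduces these frequencies; the inference step itself is not shown here.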


Fig 2.

Inverse statistical approaches are able to extract structural information from sequence data of lattice proteins.

A. Left: Structures SA, SB, SC, SD. Amino acids (blue circles) are shown with their number, running from 1 to 27 along the protein backbone (black line). There are 28 contacts between nearest-neighbor amino acids not supported by the backbone. Right: Positive predictive values (PPV), defined as the fraction of contacts among the k top scores, with the MI, DCA, PLM, ACE, and Projection procedures (Methods). Multi-sequence alignments with M = 5 ⋅ 10⁴ sequences were generated with Monte Carlo sampling at inverse temperature β = 10³ (Methods). B. Structural predictions for biased alignments (M = 10⁵ sequences). The left panel shows the PPV with the PLM procedure for MSAs of structure SB, generated with four values of the bias b (Methods). Squares and dots correspond to predictions made with and without reweighting, respectively (Methods). Predictions for the weakest bias (b = 0.01) are identical to the unbiased case (b = 0) shown in Fig 2A. The right panel reports the histograms of the Hamming distances to the Wild Type (WT) sequence, KCLIDRTEFKAREVLVPAKCCEFKECL, randomly chosen from the unbiased MSA of SB. The effective numbers of sequences (Methods) in the MSAs were Meff = 100000, 75000, 6378, 102 for, respectively, b = 0.01, 0.05, 0.075, 0.1.
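The PPV defined in panel A (the fraction of true contacts among the k top-scoring pairs) can be sketched as follows. This is an illustration only: the function name `ppv_curve`, the dictionary layout, and the toy scores are assumptions, not the article's code.

```python
def ppv_curve(scores, contacts):
    """Positive predictive value as a function of the number k of predictions.

    scores:   dict mapping a site pair (i, j) to its coupling score
    contacts: set of site pairs (i, j) in contact in the native fold
    Returns a list whose entry k-1 is the PPV among the k top scores.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = 0
    curve = []
    for k, site_pair in enumerate(ranked, start=1):
        if site_pair in contacts:
            hits += 1
        curve.append(hits / k)
    return curve


# Hypothetical toy scores and contacts, for illustration only.
toy_scores = {(1, 4): 0.9, (2, 5): 0.7, (1, 6): 0.6, (3, 6): 0.2}
toy_contacts = {(1, 4), (1, 6)}
curve = ppv_curve(toy_scores, toy_contacts)
```

With 28 native contacts, as for the structures in this figure, the entry `curve[27]` multiplied by 28 would give the number of correctly predicted contacts after 28 predictions.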


Table 1.

Designabilities and entropies of structures SA to SD.

Estimates of how designable the protein families associated with structures SA, SB, SC, SD are: largest eigenvalue of the contact map matrix c (1st column) [25], entropy of the Potts-ACE model (2nd column), shown to be generative in Fig 3A, and mean percentage of identity between sequences in the attached multi-sequence alignments (MSA) (3rd column). The mean identity is defined as the number of sites carrying consensus amino acids, averaged over all sequences in the MSA and divided by the length of the protein (L = 27); low identity corresponds to a diverse MSA and, hence, to large designability. According to our estimates of the entropies, the volumes (Fig 1) associated with structures SB and SC are, respectively, about 4 ⋅ 10²⁰ and 8.5 ⋅ 10²⁴ sequences, while the total number of sequences is 20²⁷ ≃ 10³⁵. For more information about the meanings of designability and entropy, see Section I.B in S1 Text.
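The mean identity defined in this caption can be sketched in a few lines. The function name `mean_identity` and the toy alignment are hypothetical; ties in the consensus are broken by first occurrence, which is an arbitrary choice. The last lines check the order of magnitude of the total number of sequences, assuming a 20-letter amino-acid alphabet and L = 27.

```python
import math
from collections import Counter


def mean_identity(msa):
    """Mean fraction of sites carrying the consensus amino acid.

    msa: list of equal-length strings, one aligned sequence per entry.
    Returns (mean identity in [0, 1], consensus sequence).
    """
    length = len(msa[0])
    # Consensus: most frequent amino acid at each site.
    consensus = "".join(
        Counter(seq[i] for seq in msa).most_common(1)[0][0]
        for i in range(length)
    )
    # Number of consensus-carrying sites in each sequence.
    matches = [sum(a == c for a, c in zip(seq, consensus)) for seq in msa]
    return sum(matches) / len(msa) / length, consensus


# Hypothetical toy alignment, for illustration only.
identity, consensus = mean_identity(["KCL", "KCI", "RCL", "KDL"])

# log10 of the number of sequences of length 27 over 20 amino acids.
log10_total = 27 * math.log10(20)
```

Multiplying `identity` by 100 gives a percentage, the unit used in the third column of the table.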


Table 2.

Quality of contact prediction with the different methods of inference, and average pressure.

Number of correctly predicted contacts after 28 predictions with the methods MI, DCA, PLM, ACE, and Projection, see Fig 2A, and pressures λ, defined in Eq [2], averaged over the pairs of amino acids in contact in the native folds.


Fig 3.

Inferred Potts-ACE model generates sequences with high folding probabilities and diversities.

A. Folding probabilities Pnat(SB|A), Eq [5], for four sets of 10⁴ sequences A randomly generated with the Independent-site Model (IM, green), the Potts-ACE (red), the Potts-PLM (orange) and the Potts-Gaussian (blue) models vs. their Hamming distances to the consensus sequence of the ‘natural’ MSA of structure SB used to infer the four models. Black symbols show results for the ‘natural’ sequences, sampled at inverse temperature β = 10³ (Methods). Most sequences drawn from the Potts-ACE model have high folding probabilities, while most sequences drawn from the IM have low values of Pnat; sequences generated with the Potts-PLM model lie in between. Sequences drawn from the Potts-Gaussian model have very high folding probabilities, but are very close to the consensus sequence, and fail to reproduce the diversity of sequences seen in the ‘natural’ MSA (black) and Potts-ACE (red) data. Hamming distances for the Potts-ACE and PLM models have each been shifted by a fixed offset to improve visibility. Filled ellipses show domains corresponding to one standard deviation of the effective Hamiltonian, Eq [6]. B. Scatter plot of the ‘energy’, Eq [9], with the inferred Potts-ACE (x-axis) vs. the effective Hamiltonian, Eq [6] (y-axis), for the sequences in the MSA generated with the Potts-ACE model for structure SB. Only sequences within the 90th to 100th percentiles of Pnat values are shown. Colors identify intervals of values for Pnat, see legend in panel. The energy of the best folder has been subtracted from the energies computed with the Potts-ACE model, so that the minimal energy is zero.


Fig 4.

Inferred Potts couplings encode energetics and structural information about native and competitor folds, reflecting both positive and negative designs.

A. Values of Jij(a, b) (inferred from an MSA of structure SB with the Potts-ACE method) vs. −E(a, b) across all pairs of sites i, j and of amino acids a, b (found at least once in the MSA on those sites). Couplings and MJ energy parameters are shown in the consensus gauge, in which the entries attached to the most probable amino acid at each site are fixed to zero. Red symbols correspond to pairs (i, j) in contact, while blue symbols correspond to pairs not in contact. B. Lower triangle: contact map cij of structure SB. Full blue squares correspond to pairs of sites i, j in contact. Green and red dots show, respectively, true and false positives among the 28 largest scores with the ACE method (Methods). Upper triangle: average contact map, computed over all competitor folds weighted with their Boltzmann weights (Methods). The four missed contacts (all touching the central site 4) correspond to large values of the average contact map. Red squares locate the four false positives. C. Pressure λij for each pair of sites (i, j), computed from Eq [2], vs. the corresponding entry of the average contact map for structure SB. The 195 pairs of sites which can never be in contact on any fold due to the lattice geometry are shown with magenta pluses. The 28 contacts on SB (red symbols) are partitioned into the Unique-Native (UN, 14 full triangles) and Shared-Native (SN, 14 empty triangles) classes, according to, respectively, their absence or presence in the closest competitor structure, SF (Fig 4D). The remaining 128 pairs of sites (blue symbols) are not in contact on SB, and are partitioned into the Closest-Competitor (CC, 14 full squares) and the Non-Native (NN, 114 empty squares) classes, according to, respectively, whether they are in contact or not in the closest competitor structure, SF. Similar results are found for SA, SC and SD, see Table 2 and Figs H, I, and J in S1 Text.
As in Fig 4A, we use coupling and MJ entries expressed in the consensus gauge, since the consensus sequence corresponds to, or is close to, the best-folding sequence, used as a reference sequence in our theoretical calculation of the pressure (S1 Text, Section III). Changing the gauge, e.g. to the least-probable gauge, affects the amplitudes of the pressures λij, but does not qualitatively alter the results. D. Structure SF, the closest competitor structure to SB. Note that the four missed contacts (among the top 28 FAPC scores with the ACE method) are carried by the center of the cube (site i = 4 on SB and SF), see fold SB in Fig 2A and its contact map in Fig 4B. Two of the four false positives are contacts on SF, and are thus in the CC class.
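The pair counts quoted in panel C can be checked with a short sketch. It assumes the standard cubic-lattice rule that two non-bonded sites can touch only if their backbone separation is odd and at least 3; this rule is our reading of "due to the lattice geometry" and is not stated in the caption. The function `classify_pair` and the toy contact sets are hypothetical.

```python
L = 27  # protein length on the 3 x 3 x 3 cube

# All unordered pairs of sites, numbered 1..27 along the backbone.
pairs = [(i, j) for i in range(1, L + 1) for j in range(i + 1, L + 1)]

# Assumed rule: a contact requires an odd separation of at least 3
# (cubic lattice parity, backbone neighbors excluded).
allowed = [(i, j) for i, j in pairs if (j - i) % 2 == 1 and j - i >= 3]
n_forbidden = len(pairs) - len(allowed)


def classify_pair(site_pair, native, competitor):
    """Assign an allowed pair to one of the four classes of panel C."""
    if site_pair in native:
        # Native contact: shared with the closest competitor or unique.
        return "SN" if site_pair in competitor else "UN"
    # Non-native pair: contact on the closest competitor or on neither.
    return "CC" if site_pair in competitor else "NN"


# Hypothetical toy contact sets, for illustration only.
toy_native = {(1, 4), (2, 5)}
toy_competitor = {(1, 4), (3, 6)}
labels = {p: classify_pair(p, toy_native, toy_competitor)
          for p in [(1, 4), (2, 5), (3, 6), (1, 6)]}
```

Under this assumption the numbers agree with the caption: 351 pairs in total, of which 195 are forbidden and 156 = 28 + 128 are geometrically allowed.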
