Benchmarking Inverse Statistical Approaches for Protein Structure and Design with Exactly Solvable Models
Fig 1
From lattice-protein sequence space to inferred Potts model.
Protein families, each corresponding to a particular structure S, occupy portions of sequence space (colored blobs) within which all sequences (colored dots) fold into the same unique conformation. Many sequences are expected to be non-folding and to belong to no family (black dots). Protein families differ in their designability, i.e. in the number of sequences folding onto the corresponding structure, represented here by the sizes of the circles. SA and SB are the least designable folds, while SC and SD are realized by larger numbers of sequences (see Table 1). From a multi-sequence alignment (MSA) of one family, we infer the maximum-entropy pairwise Potts model that reproduces the low-order statistics of the MSA. The model is then tested on two tasks: predicting the structure and generating new sequences with the same fold. An important issue is to unveil the meaning of the inferred pairwise couplings J, which depend on both the family fold and the competitor folds.
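To make the inference input concrete, the following minimal sketch computes the statistics that a pairwise model would be asked to reproduce, assuming that "low-order statistics" means the single-site and pairwise amino-acid frequencies of the MSA. The function name `msa_statistics` and the toy alignment are hypothetical and serve only as an illustration; they are not taken from the paper, and the sketch covers only the empirical statistics, not the inference of the couplings J.

```python
from collections import Counter
from itertools import combinations


def msa_statistics(msa):
    """Return single-site and pairwise frequencies of an aligned set of sequences.

    msa: list of equal-length strings, one per sequence (hypothetical input format).
    f_i[i][a] is the fraction of sequences with symbol a at site i;
    f_ij[(i, j)][(a, b)] is the fraction with a at site i and b at site j (i < j).
    """
    n_seq = len(msa)
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must have the same length")

    site_pairs = list(combinations(range(length), 2))
    single = [Counter() for _ in range(length)]
    pair = {ij: Counter() for ij in site_pairs}

    # Count symbol occurrences per site and per pair of sites.
    for seq in msa:
        for i, a in enumerate(seq):
            single[i][a] += 1
        for i, j in site_pairs:
            pair[(i, j)][(seq[i], seq[j])] += 1

    # Normalize counts into frequencies.
    f_i = [{a: c / n_seq for a, c in counts.items()} for counts in single]
    f_ij = {ij: {ab: c / n_seq for ab, c in counts.items()}
            for ij, counts in pair.items()}
    return f_i, f_ij


# Toy alignment of four sequences of length three (illustrative only).
toy_msa = ["ACA", "ACC", "GCA", "GTA"]
f_i, f_ij = msa_statistics(toy_msa)
```

In practice, alignments are usually reweighted and regularized with pseudocounts before any model is fitted; those steps are omitted here for brevity.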