Size and structure of the sequence space of repeat proteins

doi:10.1371/journal.pcbi.1007282

Fig 1.

Repeat proteins fold into characteristic accordion-like folds.

Example structures of three protein families are shown: ankyrin repeats (ANK), tetratricopeptide repeats (TPR), and leucine-rich repeats (LRR), with the repeating unit highlighted in magenta. All show regular folding patterns with defined contacts within and between repeats.

Table 1.

Entropies (in bits, i.e. units of ln(2)) of sequences made of two consecutive repeats, for the three protein families shown in Fig 1.

Entropies are calculated for models of different complexity: a model of random amino acids (S_rand = 2L ln(21), divided by ln(2) when expressed in bits); an independent-site model (S₁); a pairwise interaction model (S₂); a pairwise interaction model with constraints due to repeat similarity λ_ID (S_full); and a pairwise interaction model of two non-interacting repeats learned without (S_ir) and with (S_ir,λ) constraints on repeat similarity. Fig 2 shows graphically some of the information contained in this table.
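
As a quick check of the random-model baseline, the formula S_rand = 2L ln(21) can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the repeat length L = 33 for ANK is inferred from the relation W = L − 1 = 32 quoted in the Fig 3 caption.

```python
import math

def random_model_entropy_bits(repeat_length, n_states=21, n_repeats=2):
    """Entropy, in bits, of sequences drawn uniformly at random.

    Implements S_rand = (n_repeats * L) * ln(n_states) / ln(2).
    n_states=21 follows the ln(21) in the caption.
    """
    n_sites = n_repeats * repeat_length
    entropy_nats = n_sites * math.log(n_states)
    return entropy_nats / math.log(2)

# Assumed repeat length for ANK (L = 33, so that L - 1 = 32).
print(round(random_model_entropy_bits(33), 1))
```

Every other entropy in the table is bounded above by this value, since each additional constraint can only shrink the accessible sequence space.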

Fig 2.

Contributions of within-repeat interactions (S₁ − S_ir, green), repeat-repeat interactions (S_ir,λ − S_full, purple), and phylogenic bias between consecutive repeats (S_ir − S_ir,λ, blue), to the entropy reduction from an independent-site model.

All three contributions are comparable, but with a larger effect of within-repeat interactions and phylogenic bias in TPR. The fourth bar (orange) quantifies the redundancy between two constraints with overlapping scopes: the constraint on consecutive-repeat similarity, and the constraint on repeat-repeat correlations. This redundancy is naturally measured within information theory as the difference in the impact (i.e. entropy reduction) of one constraint depending on whether or not the other constraint is already enforced.
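
One way to write this redundancy out, using only the entropies defined in Table 1, is sketched below. This is a reading of the caption and not a formula quoted from the paper: it assumes that S₂ is the model with repeat-repeat interactions but without the similarity constraint, and the symbols Δ and R are introduced here for illustration.

```latex
\begin{align}
\Delta_{\lambda}^{\text{no inter}} &= S_{ir} - S_{ir,\lambda}
  && \text{(impact of the similarity constraint without repeat-repeat interactions)} \\
\Delta_{\lambda}^{\text{inter}} &= S_{2} - S_{full}
  && \text{(impact of the same constraint with repeat-repeat interactions enforced)} \\
R &= \Delta_{\lambda}^{\text{no inter}} - \Delta_{\lambda}^{\text{inter}}
   = S_{ir} - S_{ir,\lambda} - S_{2} + S_{full}
\end{align}
```

The expression is symmetric in the two constraints: regrouping the same four terms gives (S_ir − S₂) − (S_ir,λ − S_full), the difference in the impact of repeat-repeat interactions without and with the similarity constraint.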

Fig 3.

Entropy reduction as a function of the range of interactions between residue sites.

A) Entropy of two consecutive ANK repeats, as a function of the maximum allowed interaction distance W along the linear sequence. The entropy of the model decreases as more interactions are added and constrain the space of possible sequences. After a sharp initial decrease at short ranges, the entropy plateaus until interactions between complementary sites in neighbouring repeats lead to a secondary sharp decrease at W = L − 1 = 32 (dashed line), due to structural interactions between consecutive repeats. B) Entropy of two consecutive ANK repeats as a function of the maximum allowed three-dimensional interaction range. The entropy decreases rapidly until ∼10 Angstrom, after which the decay becomes slower. In both panels, entropies are averaged over 10 realizations of fitting the model; see Methods for details of the learning and entropy estimation procedure. Error bars are estimated from fitting errors between the data and the model; see Methods and S1 Fig for error bars calculated as standard deviations over 10 realizations of model fitting.
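
The range restriction in both panels amounts to choosing which pairs of sites are allowed to carry an interaction term before the model is fitted. The following is a minimal sketch of how such a selection could be built, not the authors' procedure: the function names are hypothetical, and the coordinates passed to the second function stand in for per-residue positions from a reference structure.

```python
import numpy as np

def linear_range_mask(n_sites, max_range):
    """Boolean (n_sites, n_sites) mask of pairs with 0 < |i - j| <= max_range."""
    idx = np.arange(n_sites)
    dist = np.abs(idx[:, None] - idx[None, :])
    return (dist > 0) & (dist <= max_range)

def spatial_range_mask(coords, max_dist):
    """Boolean mask of distinct pairs whose 3D distance is <= max_dist.

    coords: array of shape (n_sites, 3), one position per residue
    (an assumed input; units must match max_dist, e.g. Angstrom).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    mask = dist <= max_dist
    np.fill_diagonal(mask, False)  # a site does not interact with itself
    return mask

# Two consecutive repeats of assumed length L = 33, with W = L - 1 = 32.
mask = linear_range_mask(2 * 33, 32)
print(mask.sum() // 2, "allowed site pairs")
```

Increasing the range only adds allowed pairs, which is consistent with the entropy decreasing monotonically as W grows in panel A.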

Fig 4.

A rugged energy landscape is characterized by the presence of local minima, where protein sequences can get stuck during the evolutionary process.

The set of sequences that evolve to a given local minimum defines the basin of attraction of that minimum.

Fig 5.

Interactions within and between repeats sculpt a rugged energy landscape with many local minima.

Local minima were obtained by performing a zero-temperature Monte-Carlo simulation with the energy function in Eq (2), starting from initial conditions corresponding to naturally occurring sequences of pairs of consecutive ANK repeats. A, bottom) Rank-frequency plot of basin sizes, where each basin is defined as the set of sequences falling into a particular minimum. A, top) Energy of local minima vs the size-rank of their basin, showing that larger basins often also have lower energies. The gray line indicates the energy of the consensus sequence, for comparison. B) Pairwise distance between the minima with the largest basins (comprising 90% of natural sequences), organised by hierarchical clustering. The panel directly above the matrix shows the sizes of the basins of the minima corresponding to the entries of the distance matrix. A clear block structure emerges, separating different groups of basins with distinct sequences. C-D) Same as A) and B) but for single repeats. Since single repeats are shorter than pairs (length L instead of 2L), they have fewer local energy minima, yet still show a rich multi-basin structure. Equivalent analyses for LRR and TPR are shown in S3 and S4 Figs.
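
The zero-temperature descent and the resulting basin counts can be sketched as follows. This is a hedged illustration, not the authors' code: Eq (2) is not reproduced in this section, so the energy below is an assumed generic pairwise form with fields `h` and couplings `J`, moves are assumed to be single-site substitutions, and all function names are hypothetical.

```python
import numpy as np
from collections import Counter

def energy(seq, h, J):
    """Assumed pairwise energy: E = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j]."""
    n = len(seq)
    e = -h[np.arange(n), seq].sum()
    for i in range(n):
        for j in range(i + 1, n):
            e -= J[i, j, seq[i], seq[j]]
    return float(e)

def descend(seq, h, J, rng):
    """Zero-temperature Monte Carlo: accept a single-site change only if it
    strictly lowers the energy; stop when a full sweep finds no such change."""
    seq = np.array(seq, copy=True)
    n, q = h.shape
    current = energy(seq, h, J)
    improved = True
    while improved:
        improved = False
        for k in rng.permutation(n * q):  # random order of (site, state) proposals
            i, a = divmod(int(k), q)
            if a == seq[i]:
                continue
            trial = seq.copy()
            trial[i] = a
            e = energy(trial, h, J)
            if e < current - 1e-12:
                seq, current, improved = trial, e, True
    return seq, current

def basin_sizes(starts, h, J, seed=0):
    """Count how many starting sequences end in each local minimum."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for s in starts:
        minimum, _ = descend(s, h, J, rng)
        counts[tuple(int(x) for x in minimum)] += 1
    return counts.most_common()  # sorted by basin size, largest first
```

Calling `basin_sizes` on a set of starting sequences returns the minima ranked by basin size, which is the quantity on the horizontal axis of panel A. Recomputing the full energy for every proposal keeps the sketch short; a practical implementation would evaluate only the energy difference of the mutated site.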
