A percolation theory analysis of continuous functional paths in protein sequence space affirms previous insights on the optimization of proteins for adaptability

doi:10.1371/journal.pone.0314929

Table 1.

List of variables.

Fig 1.

Continuous functional paths.

For illustrative purposes, a small region of protein sequence space is depicted as a 10 x 10 grid where sequences that differ by a single mutation are directly above, below, to the left, or to the right of each other. Each sequence has a probability P_loc of being functional. Functional sequences are depicted as grey squares, and a starting sequence (top-left) and an ending sequence (bottom-right) are depicted as black squares. Neighboring sequences are within a certain number of mutations, n_m, of each other. Continuous paths of functional sequences (CFPs) are identified by orange turning arrows. The P_loc and the n_m for the identified CFP are listed for each grid. (a) P_loc = 30%, n_m = 1. Only one CFP extends from the starting sequence, and no CFPs extend for significant distances. (b) P_loc = 50%, n_m = 1. Multiple CFPs connect the starting and ending sequences. As P_loc increases, the number and average length of CFPs also increase. (c) P_loc = 30%, n_m = 2. No CFPs of immediate neighbors (n_m = 1) connect the starting and ending sequences. CFPs do connect the starting and ending sequences if one nonfunctional sequence can reside between two functional sequences (n_m = 2). (d) P_loc = 30%, n_m = 3. No CFPs connect the starting and ending sequences for n_m = 1 or 2, but CFPs connect them for n_m = 3.
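
The grid picture in this caption can be explored with a short simulation. The following is a minimal sketch, not the authors' code. It assumes that two functional cells count as neighbors when their Manhattan distance on the grid is at most n_m; that is one reading of "within a certain number of mutations". The names `make_grid` and `has_cfp` are hypothetical.

```python
import random
from collections import deque


def make_grid(size, p_loc, rng):
    """Mark each cell functional with probability p_loc.

    The two corner cells (start and end) are forced to be functional.
    """
    grid = [[rng.random() < p_loc for _ in range(size)] for _ in range(size)]
    grid[0][0] = True
    grid[size - 1][size - 1] = True
    return grid


def has_cfp(grid, n_m=1):
    """Breadth-first search from the top-left to the bottom-right cell.

    Assumption: a step may join two functional cells whose Manhattan
    distance is between 1 and n_m.
    """
    size = len(grid)
    start, goal = (0, 0), (size - 1, size - 1)
    offsets = [
        (dr, dc)
        for dr in range(-n_m, n_m + 1)
        for dc in range(-n_m, n_m + 1)
        if 0 < abs(dr) + abs(dc) <= n_m
    ]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if (0 <= nr < size and 0 <= nc < size
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


# Fraction of random 10 x 10 grids with a connecting path, per panel setting.
rng = random.Random(0)
for p_loc, n_m in [(0.3, 1), (0.5, 1), (0.3, 2), (0.3, 3)]:
    hits = sum(has_cfp(make_grid(10, p_loc, rng), n_m) for _ in range(200))
    print(f"P_loc={p_loc:.0%}, n_m={n_m}: {hits / 200:.2f} of grids connected")
```

Because a larger n_m only adds allowed steps, any grid connected at n_m = 1 is also connected at n_m = 2 or 3 under this neighbor rule.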

Table 2.

Comparison of P_th to P_fs.

The table lists for several peptides, polypeptides, and proteins their length (L), percolation threshold (P_th), proportion of functional sequences (P_fs), and the ratio of the percolation threshold to the proportion of functional sequences (R_b). It also lists the minimum allowed number of mutations between neighboring sequences (n_min) for P_th to approximate or drop below P_fs. The n_min values were converted to sequence identities (SI) using Eq 4. The study that reported a P_fs value is cited next to the protein’s name. Proteins are listed in order of descending P_fs.

Fig 2.

Biasing of functional sequences in sequence space.

Sequence space is depicted as a 100 x 100 grid with neighboring sequences of a given sequence located directly above, below, to the left, and to the right. The proportion of functional sequences, P_fs, in every grid is close to 5%. The functional sequences are depicted as grey squares. The local probability of a sequence being functional, P_loc, is weighted along the y-axis by a normal distribution centered in the middle with a standard deviation of σ. The σ of the weighting function and the ratio, R_b, of P_loc in the center to P_fs are listed for each grid. (a) σ = infinity, R_b = 1. Functional sequences are distributed uniformly throughout the grid. (b) σ = 15, R_b = 1.6. (c) σ = 10, R_b = 4.0. (d) σ = 2, R_b = 19. Only the last grid contains continuous functional paths extending from the grid’s left side to its right side.
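
One way to generate such a biased grid is sketched below. The caption does not state how the weighting is normalized, so the sketch assumes that the row probabilities are rescaled so their mean equals P_fs, and that R_b is the largest row probability divided by P_fs. Under these assumptions the computed R_b need not match the values listed in the caption. The function name `biased_grid` is hypothetical.

```python
import math
import random


def biased_grid(size=100, p_fs=0.05, sigma=10.0, seed=0):
    """Return (grid, r_b) for a grid whose P_loc varies along the y-axis.

    Assumption: P_loc(y) is proportional to a normal curve centered on the
    middle row and is rescaled so that its mean over rows equals p_fs.
    """
    rng = random.Random(seed)
    center = (size - 1) / 2
    if math.isinf(sigma):
        weights = [1.0] * size  # uniform case, as in panel (a)
    else:
        weights = [math.exp(-0.5 * ((y - center) / sigma) ** 2)
                   for y in range(size)]
    mean_w = sum(weights) / size
    # Cap at 1 so that every entry remains a valid probability.
    p_loc = [min(1.0, p_fs * w / mean_w) for w in weights]
    grid = [[rng.random() < p_loc[y] for _ in range(size)]
            for y in range(size)]
    r_b = max(p_loc) / p_fs
    return grid, r_b


for sigma in (math.inf, 15.0, 10.0, 2.0):
    grid, r_b = biased_grid(sigma=sigma)
    frac = sum(map(sum, grid)) / (len(grid) * len(grid[0]))
    print(f"sigma={sigma}: R_b={r_b:.1f}, functional fraction={frac:.3f}")
```

The overall functional fraction stays near P_fs for every σ, while a smaller σ concentrates the functional sequences in the central rows.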

Fig 3.

Clusters including center sequence.

Sequence space is depicted as a 40 x 40 grid where 50% of the sequences were randomly assigned as functional. The functional sequences are depicted by grey squares. Neighboring sequences are above, below, to the left, and to the right. A cluster is the set of sequences that are connected to each other through continuous functional paths. Sequences in the center cluster, which is the cluster that includes the square (20, 20), are colored blue. (a) The center cluster includes only 4 sequences. (b) The center cluster extends throughout the grid.
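
The cluster definition above corresponds to a flood fill from the chosen square. The following is an illustrative sketch with a hypothetical function name, `cluster_containing`. It assumes zero-based (row, column) indexing and that the center square itself is treated as functional.

```python
import random
from collections import deque


def cluster_containing(grid, start):
    """Return the set of functional cells joined to `start` by up, down,
    left, and right steps. The set is empty if `start` is not functional."""
    rows, cols = len(grid), len(grid[0])
    if not grid[start[0]][start[1]]:
        return set()
    cluster = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] and (nr, nc) not in cluster):
                cluster.add((nr, nc))
                queue.append((nr, nc))
    return cluster


rng = random.Random(3)
grid = [[rng.random() < 0.5 for _ in range(40)] for _ in range(40)]
grid[20][20] = True  # assumption: the center square is functional
print("center cluster size:", len(cluster_containing(grid, (20, 20))))
```

Different random seeds produce center clusters of very different sizes, which is the contrast that panels (a) and (b) illustrate.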

Fig 4.

Average size of clusters of functional sequences that include the starting sequence.

The size of the cluster that included the starting sequence was averaged over 20 matrices, each randomly populated with the same proportion of functional sequences, P_fs. The average cluster size increased dramatically after P_fs rose above a critical value, which is approximately 0.2% above the estimated percolation threshold, P_th (Eq 2: n_m = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where A_t = A – 1. The estimated percolation thresholds are identified by dashed grey lines.
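
A measurement of this kind can be sketched as follows. The caption describes fully populated matrices; this sketch instead decides lazily whether each sequence is functional the first time it is visited, and it stops growing a cluster at a cap. Both are simplifications introduced here to keep the example small. Sequences are tuples of residue indices, neighbors differ at exactly one position (n_m = 1), and the names `neighbors` and `cluster_size` are hypothetical.

```python
import random
from collections import deque


def neighbors(seq, alphabet_size):
    """Yield every sequence that differs from `seq` at exactly one position."""
    for pos, residue in enumerate(seq):
        for alt in range(alphabet_size):
            if alt != residue:
                yield seq[:pos] + (alt,) + seq[pos + 1:]


def cluster_size(start, alphabet_size, p_fs, rng, max_size=10_000):
    """Size of the functional cluster containing `start`, capped at max_size.

    Assumption: the starting sequence is functional, and every other
    sequence is functional with probability p_fs, decided on first visit.
    """
    functional = {start: True}

    def is_functional(seq):
        if seq not in functional:
            functional[seq] = rng.random() < p_fs
        return functional[seq]

    cluster = {start}
    queue = deque([start])
    while queue and len(cluster) < max_size:
        seq = queue.popleft()
        for nb in neighbors(seq, alphabet_size):
            if nb not in cluster and is_functional(nb):
                cluster.add(nb)
                queue.append(nb)
                if len(cluster) >= max_size:
                    break
    return len(cluster)


# Average over 20 independent trials for a small space (L = 6, A = 4).
rng = random.Random(0)
start = (0,) * 6
for p_fs in (0.02, 0.05, 0.10, 0.20):
    sizes = [cluster_size(start, 4, p_fs, rng) for _ in range(20)]
    print(f"P_fs={p_fs:.2f}: mean cluster size {sum(sizes) / len(sizes):.1f}")
```

The small L and A in the usage lines are chosen only so that the loop finishes quickly; they are not the parameters used in the figure.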

Fig 5.

Percentage of starting sequences residing in large clusters.

All clusters contained either fewer than 300 sequences or more than 500,000 sequences. Because of this sharp division between small and large clusters, the average percentage of starting sequences in large clusters was calculated for 100 matrices for each set of parameter values. No large clusters were identified until the proportion of functional sequences, P_fs, rose above a critical value roughly 0.2% above the estimated percolation threshold, P_th (Eq 2: n_m = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where A_t = A – 1. The estimated percolation thresholds are identified by dashed grey lines.

Fig 6.

Average number of attempts required to generate a CFP between the starting sequence and the target.

The target was the set of all sequences that differ from a target sequence at no more than 5 amino acid positions. Matrices were randomly generated until a continuous functional path (CFP) connected the starting sequence to the target. The required number of attempts to generate a connecting CFP was averaged over 20 trials, where each attempt in a trial started with a matrix randomly populated with a specific proportion of functional sequences, P_fs. The average number of required attempts grew quickly as P_fs decreased below a critical value, which is approximately 0.2% above the estimated percolation threshold, P_th (Eq 2: n_m = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where A_t = A – 1. The probability of a sequence residing in a CFP that extends to the target approximates the reciprocal of the average number of attempts. The estimated percolation thresholds are identified by dashed grey lines.

Table 3.

Region neighboring wildtype sequences where P_fs is greater than P_th.

The percolation threshold, P_th, was calculated from Eq 2 for n_m = 1 and A_t = 7.5. The maximum Hamming distances, n_max, from wildtype sequences where P_fs > P_th were determined from experimental data. Data for β-lactamase come from Bershtein et al. (2006), for GFP from Sarkisyan et al. (2016), and for HisA from Lundin et al. (2018). The maximum Hamming distances were converted to sequence identities (SI) using Eq 4. For all of the proteins, the region where P_fs > P_th is where the SI with a wildtype sequence is greater than approximately 95%.

Fig 7.

Sequence transitions in simulation.

In this example, the sequence space corresponds to all sequences that are 10 amino acids long where each location in the sequence could hold one of seven possible amino acids. The positions in the sequence are labeled with N’s, and the number of the amino acid located at each position is labeled with A’s. The figure depicts two steps in a CFP. (a) The initial sequence has the 2nd amino acid at the 3rd position and the 4th amino acid at the 7th position. (b) The first mutation occurs at the 3rd position, and it replaces the 2nd amino acid with the 4th amino acid. (c) The second mutation occurs at the 7th position, and it replaces the 4th amino acid with the 5th amino acid. Each sequence is randomly assigned a value that determines if it is functional.
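
The two transitions in this caption can be written out with a simple tuple encoding. This is an illustrative sketch only. The caption specifies only positions 3 and 7, so the other eight positions are filled with amino acid 1 as a placeholder. The rule that a sequence is functional when its random value falls below P_fs is also an assumption, and the names `mutate`, `hamming`, and `is_functional` are hypothetical.

```python
import random

LENGTH, ALPHABET = 10, 7  # L = 10 positions, A = 7 possible amino acids


def mutate(seq, position, new_residue):
    """Return a copy of `seq` with `new_residue` at the 1-based `position`."""
    if not 1 <= position <= len(seq):
        raise ValueError("position out of range")
    if not 1 <= new_residue <= ALPHABET:
        raise ValueError("residue out of range")
    if seq[position - 1] == new_residue:
        raise ValueError("not a mutation: residue is unchanged")
    return seq[:position - 1] + (new_residue,) + seq[position:]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


_values = {}


def is_functional(seq, p_fs, rng):
    """Assign each sequence one random value on first use.

    Assumption: the sequence is functional when that value is below p_fs.
    """
    if seq not in _values:
        _values[seq] = rng.random()
    return _values[seq] < p_fs


# (a) initial sequence; positions other than 3 and 7 are placeholders
initial = (1, 1, 2, 1, 1, 1, 4, 1, 1, 1)
step_b = mutate(initial, 3, 4)  # (b) position 3: amino acid 2 -> 4
step_c = mutate(step_b, 7, 5)   # (c) position 7: amino acid 4 -> 5

rng = random.Random(0)
for label, seq in (("a", initial), ("b", step_b), ("c", step_c)):
    print(label, seq, "functional:", is_functional(seq, 0.5, rng))
```

Each step changes exactly one position, so consecutive sequences in the path are neighbors for n_m = 1.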
