Table 1.
List of variables.
Fig 1.
For illustrative purposes, a small region of protein sequence space is depicted as a 10 x 10 grid in which sequences that differ by a single mutation lie directly above, below, to the left of, or to the right of each other. Each sequence has a probability Ploc of being functional. Functional sequences are depicted as grey squares, and a starting sequence (top-left) and an ending sequence (bottom-right) are depicted as black squares. Neighboring sequences are within a certain number of mutations, nm, of each other. Continuous paths of functional sequences (CFPs) are identified by orange turning arrows. The Ploc and the nm for the identified CFPs are listed for each grid. (a) Ploc = 30%, nm = 1. Only one CFP extends from the starting sequence, and no CFPs extend for significant distances. (b) Ploc = 50%, nm = 1. Multiple CFPs connect the starting and ending sequences. As Ploc increases, the number and average length of CFPs also increase. (c) Ploc = 30%, nm = 2. No CFPs of immediate neighbors (nm = 1) connect the starting and ending sequences. CFPs do connect the starting and ending sequences if one nonfunctional sequence can reside between two functional sequences (nm = 2). (d) Ploc = 30%, nm = 3. No CFPs connect the starting and ending sequences for nm = 1 or 2, but CFPs connect them for nm = 3.
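The role of nm in the grid picture can be sketched with a breadth-first search. This is only an illustration under stated assumptions, not the procedure used to draw the figure: the function names `make_grid` and `has_cfp` are hypothetical, and two functional squares are treated as neighbors when their Manhattan distance on the grid is at most nm.

```python
import random
from collections import deque


def make_grid(size, p_loc, seed=0):
    """Random size x size grid; True marks a functional sequence (assumed model)."""
    rng = random.Random(seed)
    grid = [[rng.random() < p_loc for _ in range(size)] for _ in range(size)]
    grid[0][0] = True                # starting sequence (top-left)
    grid[size - 1][size - 1] = True  # ending sequence (bottom-right)
    return grid


def has_cfp(grid, n_m=1):
    """Return True if functional squares link top-left to bottom-right.

    Two functional squares count as neighbors when they are at most
    n_m grid steps (mutations) apart.
    """
    size = len(grid)
    start, end = (0, 0), (size - 1, size - 1)
    offsets = [
        (dr, dc)
        for dr in range(-n_m, n_m + 1)
        for dc in range(-n_m, n_m + 1)
        if 0 < abs(dr) + abs(dc) <= n_m
    ]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == end:
            return True
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if (0 <= nr < size and 0 <= nc < size
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


# Hand-built 3 x 3 example: the corners are functional, everything else is not.
demo = [
    [True, False, True],
    [False, False, False],
    [True, False, True],
]
print(has_cfp(demo, 1), has_cfp(demo, 2))  # False True
```

In the hand-built example, no path exists between immediate neighbors, but allowing one nonfunctional square between functional ones (nm = 2) connects the corners, mirroring the contrast between panels (a) and (c).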
Table 2.
For several peptides, polypeptides, and proteins, the table lists the length (L), the percolation threshold (Pth), the proportion of functional sequences (Pfs), and the ratio of the percolation threshold to the proportion of functional sequences (Rb). It also lists the minimum number of mutations allowed between neighboring sequences (nmin) for Pth to approximate or drop below Pfs. The nmin values were converted to sequence identities (SI) using Eq 4. The study that reported each Pfs value is cited next to the protein’s name. Proteins are listed in order of descending Pfs.
Fig 2.
Biasing of functional sequences in sequence space.
Sequence space is depicted as a 100 x 100 grid with neighboring sequences of a given sequence located directly above, below, to the left, and to the right. The proportion of functional sequences, Pfs, in every grid is close to 5%. The functional sequences are depicted as grey squares. The local probability of a sequence being functional, Ploc, is weighted along the y-axis by a normal distribution centered in the middle with a standard deviation of σ. The σ of the weighting function and the ratio, Rb, of Ploc in the center to Pfs are listed for each grid. (a) σ = infinity, Rb = 1. Functional sequences are distributed uniformly throughout the grid. (b) σ = 15, Rb = 1.6. (c) σ = 10, Rb = 4.0. (d) σ = 2, Rb = 19. Only the last grid contains continuous functional paths extending from the grid’s left side to its right side.
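A biased grid of this kind can be sketched as follows. The caption does not state how the normal weighting was normalized, so this sketch assumes the row probabilities are rescaled so that their mean equals Pfs; the function names are hypothetical, and the resulting ratios need not match the listed Rb values exactly.

```python
import math
import random


def row_probabilities(size, p_fs, sigma):
    """Per-row Ploc: a normal weight centered on the middle row,
    rescaled so the mean over rows equals p_fs (an assumed normalization)."""
    if math.isinf(sigma):
        return [p_fs] * size  # uniform case, as in panel (a)
    mid = (size - 1) / 2
    weights = [math.exp(-((y - mid) ** 2) / (2 * sigma ** 2)) for y in range(size)]
    mean_w = sum(weights) / size
    # Clip at 1 because a probability cannot exceed 1.
    return [min(1.0, p_fs * w / mean_w) for w in weights]


def biased_grid(size, p_fs, sigma, seed=0):
    """Grid whose rows are populated with the row-dependent probability."""
    rng = random.Random(seed)
    probs = row_probabilities(size, p_fs, sigma)
    return [[rng.random() < probs[y] for _ in range(size)] for y in range(size)]


def bias_ratio(size, p_fs, sigma):
    """Ratio of the largest (central) Ploc to p_fs under the assumed normalization."""
    return max(row_probabilities(size, p_fs, sigma)) / p_fs


grid = biased_grid(100, 0.05, sigma=10)
observed = sum(map(sum, grid)) / (100 * 100)
print(f"observed Pfs = {observed:.3f}, ratio = {bias_ratio(100, 0.05, 10):.2f}")
```

Narrowing σ concentrates the same overall number of functional sequences into fewer rows, which raises the local probability in the center without changing Pfs.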
Fig 3.
Clusters including center sequence.
Sequence space is depicted as a 40 x 40 grid where 50% of the sequences were randomly assigned as functional. The functional sequences are depicted by grey squares. Neighboring sequences are above, below, to the left, and to the right. A cluster is a set of sequences that are connected to each other through continuous functional paths. In each panel, sequences in the center cluster, the cluster that includes the square (20, 20), are colored blue. (a) The center cluster includes only 4 sequences. (b) The center cluster extends throughout the grid.
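The cluster containing a given square can be found with a flood fill. The following is a minimal sketch, assuming 4-neighbor connectivity as described above; the function name `cluster_containing` is hypothetical.

```python
import random
from collections import deque


def cluster_containing(grid, start):
    """Return the set of functional squares connected to `start`
    through up/down/left/right moves; empty if `start` is nonfunctional."""
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    if not grid[r0][c0]:
        return set()
    cluster = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] and (nr, nc) not in cluster):
                cluster.add((nr, nc))
                queue.append((nr, nc))
    return cluster


# Usage: a 40 x 40 grid with roughly 50% functional squares.
rng = random.Random(0)
grid = [[rng.random() < 0.5 for _ in range(40)] for _ in range(40)]
grid[20][20] = True  # make the center square functional for the example
print(len(cluster_containing(grid, (20, 20))))
```

The size of the returned set varies widely between random grids at this density, which is the contrast panels (a) and (b) illustrate.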
Fig 4.
Average size of clusters of functional sequences that include the starting sequence.
The average size of the cluster that included the starting sequence was calculated over 20 matrices randomly populated with the same proportion of functional sequences, Pfs. The average cluster size increased dramatically after Pfs rose above a critical value, which is approximately 0.2% above the estimated percolation threshold, Pth (Eq 2: nm = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where At = A – 1. The estimated percolation thresholds are identified by dashed grey lines.
Fig 5.
Percentage of starting sequences residing in large clusters.
Every cluster contained either fewer than 300 sequences or more than 500,000 sequences. Because of this sharp division between small and large clusters, the average percentage of starting sequences in large clusters was calculated over 100 matrices for each set of parameter values. No large clusters were identified until the proportion of functional sequences, Pfs, rose above a critical value roughly 0.2% above the estimated percolation threshold, Pth (Eq 2: nm = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where At = A – 1. The estimated percolation thresholds are identified by dashed grey lines.
Fig 6.
Average number of attempts required to generate a CFP between the starting sequence and the target.
The target was the set of all sequences that differ from a target sequence at no more than 5 amino acid positions. Matrices were randomly generated until a continuous functional path (CFP) connected the starting sequence to the target. The required number of attempts to generate a connecting CFP was averaged over 20 trials, where each attempt in a trial started with a matrix randomly populated with a specific proportion of functional sequences, Pfs. The average number of required attempts grew quickly as Pfs decreased below a critical value, which is approximately 0.2% above the estimated percolation threshold, Pth (Eq 2: nm = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where At = A – 1. The probability of a sequence residing in a CFP that extends to the target approximates the reciprocal of the average number of attempts. The estimated percolation thresholds are identified by dashed grey lines.
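The reciprocal relationship follows from treating each attempt as an independent trial: if an attempt succeeds with probability p, the expected number of attempts is 1/p. The sketch below illustrates the counting procedure on a small two-dimensional grid used as a stand-in for the much larger sequence matrices; the grid size, the single-square target, and the function names are all assumptions made for illustration.

```python
import random
from collections import deque


def connects(grid):
    """True if functional squares link the top-left square to the bottom-right one."""
    n = len(grid)
    if not (grid[0][0] and grid[n - 1][n - 1]):
        return False
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        if (r, c) == (n - 1, n - 1):
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < n and 0 <= nc < n
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def attempts_until_cfp(size, p_fs, rng):
    """Regenerate random grids until one contains a connecting path; count attempts."""
    attempts = 0
    while True:
        attempts += 1
        grid = [[rng.random() < p_fs for _ in range(size)] for _ in range(size)]
        grid[0][0] = True  # the starting sequence is functional by construction
        if connects(grid):
            return attempts


def average_attempts(size, p_fs, trials=20, seed=0):
    """Average the attempt count over several trials, as in the caption."""
    rng = random.Random(seed)
    return sum(attempts_until_cfp(size, p_fs, rng) for _ in range(trials)) / trials


avg = average_attempts(size=5, p_fs=0.8)
print(f"average attempts = {avg:.2f}, estimated probability = {1 / avg:.2f}")
```

With only 20 trials the estimate is noisy; increasing `trials` tightens it at the cost of run time.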
Table 3.
Region neighboring wildtype sequences where Pfs is greater than Pth.
The percolation threshold, Pth, was calculated from Eq 2 for nm = 1 and At = 7.5. The maximum Hamming distances, nmax, from wildtype sequences where Pfs > Pth were determined from experimental data. Data for β-lactamase comes from Bershtein et al. (2006), for GFP from Sarkisyan et al. (2016), and for HisA from Lundin et al. (2018). The maximum Hamming distances were converted to sequence identities (SI) using Eq 4. The region where Pfs > Pth for all the proteins is where their SI with a wildtype sequence is greater than approximately 95%.
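Eq 4 is not reproduced in this caption. As a rough sketch, the conversion can be illustrated under the assumption that it takes the usual form for equal-length sequences, SI = (L - n) / L, where n is the Hamming distance. The function name and the example values below are hypothetical and are not taken from the table.

```python
def sequence_identity(length, hamming_distance):
    """Fraction of matching positions between two equal-length sequences.

    Assumes SI = (L - n) / L; the paper's Eq 4 may differ.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= hamming_distance <= length:
        raise ValueError("hamming_distance must lie between 0 and length")
    return (length - hamming_distance) / length


# Hypothetical example: a 100-residue protein at Hamming distance 5.
print(sequence_identity(100, 5))  # 0.95
```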
Fig 7.
Sequence transitions in simulation.
In this example, the sequence space corresponds to all sequences that are 10 amino acids long where each location in the sequence could hold one of seven possible amino acids. The positions in the sequence are labeled with N’s, and the number of the amino acid located at each position is labeled with A’s. The figure depicts two steps in a CFP. (a) The initial sequence has the 2nd amino acid at the 3rd position and the 4th amino acid at the 7th position. (b) The first mutation occurs at the 3rd position, and it replaces the 2nd amino acid with the 4th amino acid. (c) The second mutation occurs at the 7th position, and it replaces the 4th amino acid with the 5th amino acid. Each sequence is randomly assigned a value that determines if it is functional.
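The two steps in the figure can be sketched in code. This is an illustration only: the amino acids at positions other than 3 and 7 are not given in the caption and are set to 1 as placeholders, the function names are hypothetical, and the hash-based assignment is just one assumed way to give each sequence a fixed random value without storing the whole matrix.

```python
import hashlib

L, A = 10, 7  # sequence length and number of possible amino acids


def mutate(seq, position, new_aa):
    """Return a copy of `seq` with the amino acid at a 1-based position replaced."""
    if not 1 <= position <= len(seq):
        raise ValueError("position out of range")
    if not 1 <= new_aa <= A:
        raise ValueError("amino acid index out of range")
    if seq[position - 1] == new_aa:
        raise ValueError("replacement must differ from the current amino acid")
    return seq[:position - 1] + (new_aa,) + seq[position:]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_functional(seq, p_fs, seed=0):
    """Assign each sequence a reproducible pseudo-random value in [0, 1)
    and call it functional when that value is below p_fs (assumed scheme)."""
    digest = hashlib.sha256(f"{seed}:{seq}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") / 2 ** 64
    return value < p_fs


# (a) amino acid 2 at position 3 and amino acid 4 at position 7; others are placeholders.
start = (1, 1, 2, 1, 1, 1, 4, 1, 1, 1)
step1 = mutate(start, 3, 4)  # (b) position 3: amino acid 2 -> 4
step2 = mutate(step1, 7, 5)  # (c) position 7: amino acid 4 -> 5
print(step1, step2, hamming(start, step2))
```

Because the value depends only on the sequence and the seed, revisiting a sequence always gives the same functional status, which a path search requires.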