Abstract
A key question in protein evolution and protein engineering is the prevalence of evolutionary paths between distinct proteins. An evolutionary path corresponds to a continuous path of functional sequences in sequence space leading from one protein to another. Natural selection could direct a mutating coding region in DNA along a continuous functional path (CFP), so a new protein could arise far more easily than if a coding region were randomly mutating without any constraints. The distribution and length of CFPs undergird theories on the origin of natural proteins and strategies for engineering artificial proteins. This study examined the distribution of long CFPs within the framework of percolation theory, which addresses the proportion of randomly filled sites in a lattice above which long continuous paths of neighboring filled sites become common (the percolation threshold). It also used a simulation to demonstrate that the percolation threshold in protein sequence space approximates the reciprocal of the average number of protein variants that could result from a single mutation. For diverse proteins, the ratio was calculated between the percolation threshold and the proportion of sequences reported to perform a protein’s function, relative to the total number of sequences of that protein’s length. This ratio measures the biasing in the distribution of functional sequences required for evolutionary paths to possibly exist, so it provides a means to quantify the specificity in protein sequence and structure required for a protein to develop new catalytic functions. The consistently high ratio demonstrates that CFPs can only connect distinct proteins if the biasing in the distribution of functional sequences in sequence space is often extremely large. Regions in sequence space are identified where the biasing is sufficient to allow for extensive CFPs.
The calculated levels of required biasing and the identified regions of high biasing reinforce the conclusion of previous studies that some proteins are highly optimized, so mutations can enable or enhance catalytic functions while maintaining the protein’s structure. The conclusions of this study also challenge the results of a previous application of percolation theory to sequence space that did not properly incorporate the percolation threshold. Steps are outlined for integrating the percolation threshold and the biasing measure into studies of protein sequence space.
Citation: Miller BJ (2024) A percolation theory analysis of continuous functional paths in protein sequence space affirms previous insights on the optimization of proteins for adaptability. PLoS ONE 19(12): e0314929. https://doi.org/10.1371/journal.pone.0314929
Editor: Andre van Wijnen, University of Vermont College of Medicine, UNITED STATES OF AMERICA
Received: November 26, 2023; Accepted: November 18, 2024; Published: December 5, 2024
Copyright: © 2024 Brian J. Miller. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: https://github.com/drbjmiller/Sequence-Space.git.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Background
A central question in protein engineering and protein evolution is how new proteins can emerge de novo or through the modification of existing proteins. The answer depends largely on two key factors. The first is the prevalence of amino acid sequences that perform a particular function. The prevalence can be determined for all of sequence space, which is the multidimensional space of all possible amino acid sequences. The proportion of functional sequences, Pfs, is then defined as the number of sequences that perform the protein’s function and maintain its structure divided by the total number of sequences with the protein’s length. The proportion of functional sequences can also be defined in a local region of sequence space. The local proportion, Ploc, is the number of functional sequences divided by the total number of sequences in that region.
The variables used in this study are listed in Table 1 in their order of appearance.
A second key factor is the distribution in sequence space of paths of functional neighboring sequences, which are termed continuous functional paths (CFPs). A neighboring sequence typically differs from an initial sequence by a single mutation. A continuous path is the series of neighboring sequences that results from a series of specific mutations. A CFP is a continuous path where every sequence in the path is functional.
In certain contexts, neighbors are defined as those sequences that are separated by not just one but up to a set number of mutations, nm. A CFP is then defined as a series of functional sequences that results from a series of sets of nm or fewer mutations. For instance, in species with very large populations, individuals can acquire multiple mutations at once, so nm would represent the maximum possible number of simultaneous mutations. Relatedly, nonessential proteins might acquire a disabling mutation but persist in the species for many generations. The protein could potentially acquire an additional mutation or mutations that reactivated it before natural selection removed the initial mutation from the population. In this situation, nm would represent the maximum number of mutations that a protein could potentially acquire within the available timeframe to regain function if the protein lost function with the first nm − 1 mutations. As a final example, an insertion/deletion (indel) could add or remove multiple amino acids in a single event, so nm would represent the number of amino acids that an indel adds or removes. Indels that do more than add or remove a few amino acids are almost always harmful [1], so they can be ignored.
Examples of CFPs for different Ploc and nm are illustrated in Fig 1. The diagrams display how a protein’s Ploc correlates with the number and average length of CFPs and how lower Ploc values require higher nm for CFPs to extend significant distances. The value of Ploc in different regions of sequence space constrains theories of protein evolution and strategies for engineering new proteins.
For illustrative purposes, a small region of protein sequence space is depicted as a 10 x 10 grid where sequences that differ by a single mutation are directly above, below, to the left, or to the right of each other. Each sequence has a probability Ploc of being functional. Functional sequences are depicted as grey squares, and a starting sequence (top-left) and an ending sequence (bottom-right) are depicted as black squares. Neighboring sequences are within a certain number of mutations, nm, of each other. Continuous paths of functional sequences (CFPs) are identified by orange turning arrows. The Ploc and the nm for the identified CFP are listed for each grid. (a) Ploc = 30%, nm = 1. Only one CFP extends from the starting sequence, and no CFPs extend for significant distances. (b) Ploc = 50%, nm = 1. Multiple CFPs connect the starting and ending sequences. As Ploc increases, the number and average length of CFPs also increase. (c) Ploc = 30%, nm = 2. No CFPs of immediate neighbors (nm = 1) connect the starting and ending sequences. CFPs do connect the starting and ending sequence if one nonfunctional sequence can reside between two functional sequences (nm = 2). (d) Ploc = 30%, nm = 3. No CFPs connect the starting and ending sequence for nm = 1 or 2, but CFPs connect them for nm = 3.
Information about CFPs is critical to understanding how new proteins emerge since many proteins correspond to Pfs so small that they could not have arisen through an undirected search in sequence space. The Pfs values commonly cited for peptides, polypeptides, and proteins are listed in Table 2. By comparison, the largest number of possible variants of a protein in all organisms for the entire history of the earth is on the order of 10^38 [2]. For a protein to have arisen through random mutations in a freely evolving coding region of DNA, its Pfs must be larger than the reciprocal of this value (10^−38). Yet the reported Pfs for most proteins is close to or smaller than this cutoff.
The table lists for several peptides, polypeptides, and proteins their length (L), percolation threshold (Pth), proportion of functional sequences (Pfs), and the ratio of the percolation threshold to the proportion of functional sequences (Rb). It also lists the minimum allowed number of mutations between neighboring sequences (nmin) for Pth to approximate or drop below Pfs. The nmin values were converted to sequence identities (SI) using Eq 4. The study that reported a Pfs value is cited next to the protein’s name. Proteins are listed in order of descending Pfs.
For Pfs below the cutoff, an evolutionary search could only discover a functional sequence if it explored a vastly smaller portion of sequence space than what would have been required for an unconstrained randomly mutating sequence. If a CFP connected two proteins, natural selection could constrain a search along the CFP, making the evolution of one protein into the other feasible. The existence of CFPs that extend long distances in sequence space depends on whether Ploc is above what is termed the percolation threshold, Pth. When Ploc rises above the threshold, a phase change occurs where long CFPs become common. Below the threshold, CFPs are almost always only a few sequences in length. The value of the threshold in different contexts has been a central subject of study in percolation theory.
Extensive research has been conducted on identifying Pth for multidimensional lattices, and multiple studies demonstrated that it approximates the reciprocal of the number of nearest neighboring sites, z, to a site inside a lattice [10, 11]:
$$P_{th} \approx \frac{1}{z} \tag{1}$$
For example, if sites inside a multidimensional lattice had z = 100 neighbors, the percolation threshold would approximate 1/100 or 1%. Gavrilets (2003) derived the same equation for genotype space, where z represents the number of genotypes accessible through a single mutation [12].
This relationship should also hold in amino acid sequence space. For sequences of length L, the average number of nearest neighbors (sequences accessible through a single mutation) to any sequence is AtL, where At is the average number of amino acids that an amino acid could transition into through a single mutation. By extension, the number of neighbors within nm mutations is approximately (AtL)^nm / nm!. The single-mutation neighbor count (z = AtL) is raised to the power of nm since each mutation could make any possible change and sequences that are reached by fewer than nm mutations are also neighbors. The factorial term is required since the order of the alterations does not matter. An approximation for Pth directly follows from Eq 1:
$$P_{th} \approx \frac{n_m!}{(A_t L)^{n_m}} \tag{2}$$
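As a quick numerical check of Eq 2, the following Python sketch evaluates the estimated threshold for a given L, At, and nm. The function name `percolation_threshold` is illustrative and is not taken from the study's repository.

```python
from math import factorial

def percolation_threshold(L, A_t, n_m=1):
    """Estimate P_th as n_m! / (A_t * L)**n_m (Eq 2)."""
    z = A_t * L  # average number of single-mutation neighbors
    return factorial(n_m) / z ** n_m

# Parameters of the initial simulation trials, where A_t = A - 1
print(f"{percolation_threshold(10, 6):.3%}")  # L = 10, A = 7 -> 1.667%
print(f"{percolation_threshold(13, 4):.3%}")  # L = 13, A = 5 -> 1.923%
```

These values round to the 1.7% and 1.9% thresholds quoted for the simulation.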
If a protein’s Pfs is considerably below its Pth, the distribution of functional sequences in local regions of sequence space must be highly biased for Ploc to be sufficiently large for CFPs to extend significant distances (Fig 2). A measure of the level of required biasing is the ratio, Rb, of the percolation threshold to the proportion of functional sequences:
$$R_b = \frac{P_{th}}{P_{fs}} \tag{3}$$
If Rb were 100 in a corridor extending through sequence space, Ploc would need to be at least 100 times larger than Pfs for a CFP to have a significant chance of extending through the corridor.
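The ratio can be evaluated directly once Pth and Pfs are known. The sketch below is a minimal illustration; the function name and the example inputs (a 100-residue protein with Pfs = 10^-10) are hypothetical and are not values from Table 2.

```python
def required_biasing(L, p_fs, A_t=7.5, n_m=1):
    """Return R_b = P_th / P_fs, with P_th estimated from Eq 2.

    Only n_m = 1 and n_m = 2 are handled here to keep the sketch short.
    """
    z = A_t * L
    p_th = 1 / z if n_m == 1 else 2 / z ** 2
    return p_th / p_fs

# Hypothetical inputs for illustration only
r_b = required_biasing(L=100, p_fs=1e-10)
print(f"R_b = {r_b:.2e}")  # P_loc must exceed P_fs by roughly this factor
```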
Sequence space is depicted as a 100 x 100 grid with neighboring sequences of a given sequence located directly above, below, to the left, and to the right. The proportion of functional sequences, Pfs, in every grid is close to 5%. The functional sequences are depicted as grey squares. The local probability of a sequence being functional, Ploc, is weighted along the y-axis by a normal distribution centered in the middle with a standard deviation of σ. The σ of the weighting function and the ratio, Rb, of Ploc in the center to Pfs are listed for each grid. (a) σ = infinity, Rb = 1. Functional sequences are distributed uniformly throughout the grid. (b) σ = 15, Rb = 1.6. (c) σ = 10, Rb = 4.0. (d) σ = 2, Rb = 19. Only the last grid contains continuous functional paths extending from the grid’s left side to its right side.
Current study
The current study applies percolation theory to protein sequence space for diverse proteins to determine the level of required biasing in the distribution of functional sequences to allow for CFPs to extend for long distances. Quantifying the biasing requires identifying the percolation threshold in protein sequence space, so a simulation was developed to confirm that Eq 2 reliably approximates proteins’ Pth. The simulation generated matrices that model protein sequence space for short amino acid chains, and it tested for the presence of CFPs for varying Pfs. The simulation results demonstrate that Eq 2 accurately predicts the percolation threshold, so it was used to calculate Rb for several peptides, polypeptides, and proteins.
All proteins included in the analysis demonstrate very high Rb values, indicating that long CFPs can only exist in regions of sequence space with high levels of biasing. Sources of strong biasing were identified based on protein mutation studies and protein evolution research. The results of this study reinforce the conclusion of previous studies that many proteins possess highly special features that allow them to evolve new catalytic functions.
In addition, a new approach is proposed for incorporating Rb values into analyses of protein sequence space. Previous studies applying percolation theory to biological research have typically focused on genomic sequence space and mutations’ impact on physical traits [13–16] instead of individual proteins. Investigations that applied percolation theory to proteins typically studied their physical structure [17, 18]. The conceptual lattice corresponded to the possible positions of amino acids in a protein’s molecular architecture, and neighboring cells corresponded to amino acids that directly interact instead of amino acid sequences that are separated in sequence space by a set number of mutations.
One of the only studies that employed percolation theory to understand protein sequence space is Buchholz et al. (2017). The investigators concluded that a single cluster of functional sequences likely connects all proteins in the same superfamily [19], and a few sequences within these extensive clusters have many neighbors. These “hub sequences” are expected to tolerate multiple mutations that may give rise to new functions [20, 21]. The investigators suggest that these sequences could serve as promising starting points for directed evolution experiments [22]. Their methodology, however, did not account for the percolation threshold, an omission that raises questions about the validity of their conclusions.
No identified methodology properly applied percolation theory to protein sequence space to ascertain CFPs. The new approach fills this gap; it should lead to more accurate evaluations of the distributions of functional sequences, extensive CFPs, and large clusters. Steps are outlined for applying the approach to protein evolution and protein design studies.
Results
Simulation
The simulation generated multidimensional matrices corresponding to the sequence space for amino acid chains of length, L, where each position could correspond to any of A amino acids. In most trials, each amino acid could directly transition into any other amino acid, but not to itself, so the number of transition options, At, was A – 1. For each trial, a matrix was generated with a set proportion of functional sequences, Pfs, uniformly distributed. For one set of trials, the simulation calculated the size of the cluster of neighboring functional sequences that included the starting sequence. A cluster is all sequences that are connected to each other through CFPs (Fig 3). For another set of trials, the simulation determined if a CFP (nm = 1) extended from a starting sequence to a target.
Sequence space is depicted as a 40 x 40 grid where 50% of the sequences were randomly assigned as functional. The functional sequences are depicted by grey squares. Neighboring sequences are above, below, to the left, and to the right. A cluster is the set of sequences that are connected to each other through continuous functional paths. Sequences in the center clusters that include the square (20, 20) are colored blue. (a) The center cluster only includes 4 sequences. (b) The center cluster extends throughout the grid.
The simulation was initially run with L = 10, A = 7 and L = 13, A = 5. The trials demonstrated the expected phase transition near Pth (Eq 2: nm = 1). For L = 10, Pth = 1.7% and for L = 13, Pth = 1.9%. When Pfs dropped below a critical value approximately 0.2% above Pth, clusters were small (Fig 4), and CFPs extending throughout sequence space ceased to exist (Fig 5). Correspondingly, the required number of attempts, Natt, to generate a matrix with a CFP connecting the starting sequence to the target quickly increased when Pfs dropped below the same critical value approximately 0.2% above Pth (Fig 6).
The average cluster size that included the starting sequence was calculated for 20 matrices randomly populated with the same proportion of functional sequences, Pfs. The average cluster size increased dramatically after Pfs rose above a critical value, which is approximately 0.2% above the estimated percolation threshold, Pth (Eq 2: nm = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where At = A – 1. The estimated percolation thresholds are identified by dashed grey lines.
All clusters were either below 300 sequences or above 500,000 sequences. Due to the dramatic division between small and large clusters, the average percentage of starting sequences in large clusters was calculated for 100 matrices for each set of parameter values. No large clusters were identified until the proportion of functional sequences, Pfs, rose above a critical value roughly 0.2% above the estimated percolation threshold, Pth (Eq 2: nm = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where At = A – 1. The estimated percolation thresholds are identified by dashed grey lines.
The target was all sequences that match a target sequence by all but at most 5 amino acids. Matrices were randomly generated until a continuous functional path (CFP) connected the starting sequence to the target. The required number of attempts to generate a connecting CFP was averaged over 20 trials, where each attempt in a trial started with a matrix randomly populated with a specific proportion of functional sequences, Pfs. The average number of required attempts grew quickly as Pfs decreased below a critical value, which is approximately 0.2% above the estimated percolation threshold, Pth (Eq 2: nm = 1), for both (a) L = 10, A = 7 and (b) L = 13, A = 5, where At = A – 1. The probability of a sequence residing in a CFP that extends to the target approximates the reciprocal of the average number of attempts. The estimated percolation thresholds are identified by dashed grey lines.
All clusters comprised fewer than 300 sequences or greater than 500,000 sequences. A CFP that is part of the larger cluster class extends throughout sequence space, so it is designated an extensive continuous functional path. Identifying the presence of extensive CFPs is crucial since they are required for natural selection to assist the evolution of one protein into another distinct protein.
For the initial trials where Pfs = Pth, the ratio of Natt to the size of the entire sequence space corresponded to roughly one successful attempt per million sequences in sequence space. This ratio is analogous to an ancestral protein’s Pfs. The targets encompassed more than 0.1% of the sequence spaces, which is analogous to a descendant protein’s Pfs. These proportions are far larger than the Pfs for even peptides that perform such simple functions as sticking to an ATP molecule. Moreover, the shortest path from the starting sequence to the target is only 5 steps for L = 10 and 7 steps for L = 13. Protein evolution and protein engineering entail searches in much larger spaces, involve much smaller Pfs, and often require larger numbers of amino acid alterations. Consequently, the percolation thresholds observed in the simulation should not exceed the thresholds in actual protein sequence space.
Another set of trials was run to identify the critical value for the phase transition where large clusters appeared for the following parameters (L, A, At): (7, 10, 9), (7, 20, 6), (7, 20, 19), (10, 5, 4), (10, 7, 2), (10, 8, 7). The transition for the largest sequence spaces was consistently close to 0.2% above the percolation threshold estimate (Eq 2). The transition was more than 0.2% above the estimate for trials with smaller A for a given L: (7, 10, 9) and (10, 5, 4). The transition also occurred at higher Pfs for trials that only allowed amino acids to transition to a limited set of possible amino acids: (7, 20, 6) and (10, 7, 2). The latter result was expected since allowing amino acids to transition to all other amino acids provides each sequence the easiest access to all of sequence space. A single mutation in proteins can result in changes to fewer than half of the available amino acids [23], further supporting that the simulation results do not underestimate the actual thresholds.
The results for all trials indicate that the phase transition occurs 0.2% or more above the percolation threshold estimate. The results should scale with L since they were consistent for L varying by almost a factor of two. This conclusion is further supported by theoretical analyses of complex multidimensional lattices, which demonstrated that the percolation threshold never drops below the reciprocal of the number of nearest neighbors (Eq 1) but only approaches it as the number of nearest neighbors and the size of the lattice grow [24]. Consequently, Pth from Eq 2 represents a reliable lower bound to the percolation threshold for protein sequence space.
Biasing and nonfunctional intermediates
Since Pth represents a conservative lower bound for the percolation threshold, Rb conservatively estimates the level of biasing required in the distribution of functional sequences in a region of sequence space for extensive CFPs to exist in that region. The Pth (nm = 1), Pfs, and Rb values for multiple peptides, polypeptides, and proteins are listed in Table 2. The data includes all commonly cited Pfs values. The Rb values for the proteins are exceedingly large, and they are significant even for the peptides and polypeptides.
The importance of the Rb values is clearer when they are connected to the lowest nm required for Rb to drop below 1 (Pth to drop below Pfs). The lowest nm for the membrane embedding peptides is 4. Consequently, functional sequences are on average over 3 amino acid changes away from each other. The required nm for the ATP binding polypeptides is 5 and much higher for the proteins. For a CFP to connect even functional peptides and polypeptides without significant biasing, individuals would have to tolerate multiple nonfunctional intermediate sequences before the next functional sequence was discovered along a CFP. Consequently, biasing must be extremely large for CFPs to connect different proteins (see Discussion).
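The lowest nm at which the estimated threshold falls to or below Pfs can be found by stepping nm upward in Eq 2. The sketch below illustrates this; the function name, the cap `n_cap`, and the example inputs (an 80-residue polypeptide with Pfs = 10^-11) are hypothetical choices for illustration and are not read from Table 2. It uses a strict "at or below" test, whereas Table 2 also accepts a Pth that approximates Pfs.

```python
from math import factorial

def min_neighbor_mutations(L, p_fs, A_t=7.5, n_cap=50):
    """Smallest n_m with n_m! / (A_t * L)**n_m <= p_fs, or None if not reached."""
    z = A_t * L
    for n_m in range(1, n_cap + 1):
        p_th = factorial(n_m) / z ** n_m  # Eq 2 for this n_m
        if p_th <= p_fs:
            return n_m
    return None

# Hypothetical example: 80 residues, one functional sequence per 1e11
print(min_neighbor_mutations(L=80, p_fs=1e-11))  # -> 5
```

With these inputs the thresholds for nm = 4 and nm = 5 are about 1.9 x 10^-10 and 1.5 x 10^-12, so nm = 5 is the first value at which Pth drops below Pfs.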
Sources of sufficient biasing to allow for extensive CFPs have been identified. One source is proteins in nature (wildtype) being highly optimized for stability and function, so wildtype sequences are more tolerant to mutations than proteins that have already accumulated several amino acid changes [25–28]. The number of amino acid changes, n, a protein acquires is referred to as the Hamming distance from the wildtype sequence. Ploc is highest next to wildtype sequences and decreases with n, often faster than exponentially. For sufficiently small n, Ploc is larger than Pth, allowing for extensive CFPs.
For some proteins, the decrease of Ploc with increasing n approximates a mathematical function, which can be designated Ploc(n). In such cases, Ploc(n) can be set to Pth and solved for n to determine the maximum Hamming distance, nmax, where Ploc(n) still exceeds Pth. The Hamming distance can be converted to the percentage of amino acids a sequence shares with the wildtype. This percentage is termed the sequence identity (SI), and the conversion follows a simple relationship:
$$SI = \left(1 - \frac{n}{L}\right) \times 100\% \tag{4}$$
These calculations were performed on the proteins β-lactamase, GFP, and HisA since their Ploc(n) were reported or could be derived (see Methods). All nmax correspond to sequences that differ from the wildtype sequences by approximately 5% (Table 3), so sequences only have a significant probability of residing within an extensive CFP if their SI with a wildtype is around 95% or larger.
The percolation threshold, Pth, was calculated from Eq 2 for nm = 1 and At = 7.5. The maximum Hamming distances, nmax, from wildtype sequences where Ploc > Pth were determined from experimental data. Data for β-lactamase comes from Bershtein et al. (2006), for GFP from Sarkisyan et al. (2016), and for HisA from Lundin et al. (2018). The maximum Hamming distances were converted to sequence identities (SI) using Eq 4. The region where Ploc > Pth for all the proteins is where their SI with a wildtype sequence is greater than approximately 95%.
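The conversion from Hamming distance to sequence identity can be sketched as follows, assuming SI is the percentage of positions that still match the wildtype. The protein length used in the example (263 residues) is an assumption for illustration; the lengths used in the study are those listed in its tables.

```python
def sequence_identity(n, L):
    """Percentage of positions shared with the wildtype after n changes."""
    return 100.0 * (L - n) / L

def hamming_distance(si_percent, L):
    """Inverse conversion: number of changed positions for a given SI."""
    return L * (1 - si_percent / 100.0)

# Assumed length of 263 residues, with 10 amino acid changes
print(f"{sequence_identity(10, 263):.1f}%")  # -> 96.2%
```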
Methods
Simulation
The simulation was programmed in Python and run on a Linux server using parallel computing. The search space is a matrix of dimension L where each dimension corresponds to a single location along the sequence. Every dimension has a size of A, which corresponds to the A possible amino acids that could reside at that location. For most trials, any amino acid could transition into any other amino acid, so At = A – 1 (Fig 7). Each cell in the matrix corresponds to a sequence, and it is assigned a random number between 0 and 1. If the value is below the Pfs designated for the sequence space, the sequence corresponding to the cell is classified as functional; otherwise, it is classified as nonfunctional. For instance, the cell [1, 3, 2] corresponds to the sequence (1st amino acid, 3rd amino acid, 2nd amino acid). If Pfs = 5%, the cell would be functional if it were assigned a value below 0.05 and nonfunctional if assigned a value equal to or above 0.05.
In this example, the sequence space corresponds to all sequences that are 10 amino acids long where each location in the sequence could hold one of seven possible amino acids. The positions in the sequence are labeled with N’s, and the number of the amino acid located at each position is labeled with A’s. The figure depicts two steps in a CFP. (a) The initial sequence has the 2nd amino acid at the 3rd position and the 4th amino acid at the 7th position. (b) The first mutation occurs at the 3rd position, and it replaces the 2nd amino acid with the 4th amino acid. (c) The second mutation occurs at the 7th position, and it replaces the 4th amino acid with the 5th amino acid. Each sequence is randomly assigned a value that determines if it is functional.
A recursive algorithm started from the sequence consisting entirely of the 1st amino acid and traversed every CFP for nm = 1. For each matrix, the simulation either calculated the size of the cluster including the starting sequence or determined if a CFP connected the starting sequence and a target. The target corresponded to a target sequence consisting entirely of the 2nd amino acid and the set of sequences that match it by all but at most a specified number of amino acids, designated Tol for tolerance.
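The cluster search can be sketched in a few lines of Python. This is not the repository code: it swaps the recursive traversal for an iterative breadth-first search, assigns each sequence's functional state lazily on first visit instead of pre-filling the whole matrix, and assumes the starting sequence is treated as functional. The names `cluster_size` and `max_size` are illustrative.

```python
import random
from collections import deque

def cluster_size(L, A, p_fs, seed=0, max_size=None):
    """Size of the cluster of functional sequences containing the start (n_m = 1)."""
    rng = random.Random(seed)
    functional = {}  # sequence tuple -> bool, filled in on first visit

    def is_functional(seq):
        if seq not in functional:
            functional[seq] = rng.random() < p_fs
        return functional[seq]

    start = (0,) * L            # sequence made entirely of the 1st amino acid
    functional[start] = True    # assumption: the start counts as functional
    seen = {start}
    queue = deque([start])
    while queue:
        seq = queue.popleft()
        for pos in range(L):            # each position along the sequence
            for aa in range(A):         # each alternative amino acid (A_t = A - 1)
                if aa == seq[pos]:
                    continue
                nb = seq[:pos] + (aa,) + seq[pos + 1:]
                if nb not in seen and is_functional(nb):
                    seen.add(nb)
                    queue.append(nb)
        if max_size is not None and len(seen) >= max_size:
            break                       # optional early stop for large clusters
    return len(seen)

# Small illustrative space (4**6 = 4096 sequences); P_th estimate is 1/18
for p_fs in (0.03, 0.06, 0.12):
    print(p_fs, cluster_size(L=6, A=4, p_fs=p_fs, seed=1))
```

An iterative search avoids Python's recursion limit, and lazy assignment keeps memory proportional to the sequences actually visited, which matters because the full spaces in the study hold hundreds of millions of cells.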
Initial trials used the parameters L = 10, A = 7 or L = 13, A = 5 since they represent two of the largest sequence spaces the program could manage. Tol was set to 5 for both the L = 10 and L = 13 trials since the proportions of the sequence spaces contained in the targets were comparable. The code for the simulation and the output data are available on GitHub: https://github.com/drbjmiller/Sequence-Space.
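The proportion of a sequence space contained in a target can be counted directly, assuming the target is every sequence within Tol amino acid differences of the target sequence. The function name below is illustrative.

```python
from math import comb

def target_fraction(L, A, tol):
    """Fraction of all A**L sequences within `tol` differences of one sequence."""
    # k differing positions: choose the positions, then A - 1 alternatives at each
    hits = sum(comb(L, k) * (A - 1) ** k for k in range(tol + 1))
    return hits / A ** L

print(f"{target_fraction(10, 7, 5):.2%}")  # L = 10, A = 7, Tol = 5 -> 0.80%
```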
For one set of trials, 20 matrices were generated in parallel with the same Pfs, and the average size of clusters containing the starting sequence was recorded. Running 20 parallel processes was the maximum number the server could manage. Trials were run with Pfs values that ranged from 0.2% below Pth to 0.5% above Pth. The simulation could only manage paths shorter than 15,000 sequences, so the cluster size output from the simulation became increasingly inaccurate after roughly 0.4% above Pth.
For another set of trials, matrices were generated with the same Pfs until a CFP between the starting sequence and the target was identified. For each Pfs, the average number of attempts was recorded. Trials were run with Pfs values that ranged from 0.2% below Pth to 0.5% above Pth. Below the lower bound, the computational time required to generate a matrix with a connecting CFP grew extremely long.
Another set of trials was added due to the discovery that clusters were almost always either smaller than 300 sequences or larger than 500,000 sequences. The larger clusters extended throughout sequence space. Consequently, the percentage of starting sequences that are part of large clusters represents a more meaningful statistic than the average cluster size. For each Pfs, 100 matrices were generated, and the percentage of large clusters was recorded. The exception was for Pfs = Pth, where 100,000 matrices were generated with a modified version of the simulation that stopped searching a matrix when the cluster size exceeded 20,000 sequences, since all clusters over that size extended throughout sequence space. Trials were run with Pfs values that ranged from 0.2% below Pth to 0.5% above Pth. No large clusters were observed below Pth.
The graphs of the percentage of large clusters clearly identified the percolation phase transitions for initial trials, so the modified version of the simulation was used to identify the phase transition for additional trials to ensure that the results did not vary significantly with L or A. The modified simulation was then further modified to only allow amino acids to transition to a limited number of other amino acids to ensure results did not vary significantly with At. The results from all trials are accessible in the GitHub repository.
Comparing Pth to Pfs
The commonly cited Pfs values for peptides, polypeptides, and proteins were used in this study. Knopp et al. (2019) provided the data for membrane embedding peptides, Keefe et al. (2019) for ATP binding polypeptides, Taylor et al. (2001) for chorismate mutase, Reidhaar-Olson and Sauer (1990) for λ-repressor, Yockey (1977) for cytochrome c, Axe (2004) for the larger domain of TEM-1 β-lactamase, and Tian and Best (2017) for the other 10 single-domain proteins. The source for each Pfs value is cited in Table 2. Pth was calculated for each entry using Eq 2 with nm = 1 and At = 7.5 since that value is the average number of possible amino acid transitions resulting from a single mutation [23].
Determining Ploc(n)
The Ploc(n) functions were derived from the empirical data for the proteins TEM-1 β-lactamase from E. coli [25], GFP from Aequorea victoria [26], and HisA from Salmonella enterica [27]. The maximum Hamming distance, nmax, from a wildtype sequence where Ploc(n) > Pth was determined for each protein. The value of nmax for β-lactamase was initially derived from Table 1 in Bershtein et al. (2006), which lists the percentage of tolerated nonsynonymous mutations after increasing rounds of mutagenesis with selection between rounds. The value for Ploc(n) was estimated by multiplying the highest percentage of tolerated mutations, Ptol, in the listed range for each added mutation. Specifically, Ptol was set to 100% for the first mutation, since Ptol was not listed, 61% for the next two mutations, 55% for the next two, and 39% for subsequent mutations yielding nmax = 10.
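The stepwise estimate for β-lactamase can be reproduced by multiplying the per-mutation tolerances until the running product falls to the threshold. In the sketch below, the protein length of 263 residues used to compute Pth is an assumption for illustration, and the function name is hypothetical.

```python
def n_max_from_tolerances(p_th, first_steps, tail, n_cap=1000):
    """Largest n for which the product of per-mutation tolerances stays above p_th."""
    p_loc = 1.0
    n_max = 0
    for n in range(1, n_cap + 1):
        # Use the listed tolerance for early mutations, then the tail value
        step = first_steps[n - 1] if n <= len(first_steps) else tail
        p_loc *= step
        if p_loc <= p_th:
            break
        n_max = n
    return n_max

p_th = 1 / (7.5 * 263)  # Eq 2 with n_m = 1, A_t = 7.5, assumed L = 263
steps = [1.00, 0.61, 0.61, 0.55, 0.55]  # P_tol for the first five mutations
print(n_max_from_tolerances(p_th, steps, tail=0.39))  # -> 10
```

The running product is about 1.0 x 10^-3 after ten mutations and about 4.0 x 10^-4 after eleven, so the threshold of roughly 5 x 10^-4 is crossed between them.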
The investigators also measured the percentage of functional proteins with different numbers of mutations for trials not applying selection between rounds. They determined Ploc(n) by fitting the data to different functions. The data were best fit by a decaying hyper-exponential:
$$P_{loc}(n) = e^{-\alpha n - \beta n^{2}} \qquad (5)$$
The investigators included all mutations in their analysis, so I rescaled the variable n to count only nonsynonymous mutations (those that alter the sequence) by replacing n with n/0.69, yielding α = 0.104 and β = 0.019. Setting the equation equal to β-lactamase’s Pth yields nmax = 17. The actual nmax likely lies between the two estimates. The lower value is likely preferable since the context is environments undergoing high levels of selection.
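The threshold crossing can be checked in closed form. The sketch below assumes the hyper-exponential has the form Ploc(n) = exp(-αn - βn²), so that setting Ploc(n) = Pth gives a quadratic in n. It also assumes Pth ≈ 1/(L × At) with L = 263 and At = 7.5. The helper names are hypothetical.

```python
import math


def rescale_to_nonsynonymous(alpha_all: float, beta_all: float,
                             frac: float = 0.69) -> tuple[float, float]:
    """Coefficients after substituting n -> n / frac in exp(-a*n - b*n**2)."""
    return alpha_all / frac, beta_all / frac ** 2


def n_crossing(alpha: float, beta: float, p_th: float) -> float:
    """Positive root of beta*n**2 + alpha*n + ln(p_th) = 0."""
    if beta <= 0 or not 0.0 < p_th < 1.0:
        raise ValueError("requires beta > 0 and 0 < p_th < 1")
    target = -math.log(p_th)                    # positive because p_th < 1
    disc = alpha ** 2 + 4.0 * beta * target
    return (-alpha + math.sqrt(disc)) / (2.0 * beta)


# Assumed threshold for TEM-1: 1 / (263 * 7.5); alpha and beta as quoted above.
n_star = n_crossing(0.104, 0.019, 1.0 / (263 * 7.5))
print(f"crossing at n = {n_star:.1f}, n_max = {math.floor(n_star)}")
```

With these inputs the curve crosses the assumed threshold between n = 17 and n = 18, in line with the reported nmax = 17.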
Ploc(n) for GFP was derived by curve fitting the reported data on Ploc(n) for n between 2 and 10 to Eq 5 using the function optimize.curve_fit from the SciPy Python library. The data best fit with α = -0.062 and β = 0.058. The hyper-exponential function is the standard choice for proteins that display pervasive negative epistasis [29], which the investigators reported. The data fit the equation very well except for n = 1. This value represents such an extreme outlier that it was not included in the curve fitting. Setting Ploc(n) to GFP’s Pth yields nmax = 12.
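To illustrate the fitting step without external dependencies, the sketch below swaps in a different technique from the one used in the study: instead of SciPy's nonlinear `curve_fit`, it fits ln Ploc(n) = -αn - βn² by ordinary linear least squares, assuming that form for Eq 5. The data are synthetic, generated from the reported α and β, and are not the measured GFP values. Fitting in log space also weights the points differently from a fit to the raw values.

```python
import math


def fit_hyper_exponential(ns, p_values):
    """Least-squares fit of ln P = -alpha*n - beta*n**2 (no intercept term).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    ys = [math.log(p) for p in p_values]
    s11 = sum(n ** 2 for n in ns)
    s12 = sum(n ** 3 for n in ns)
    s22 = sum(n ** 4 for n in ns)
    b1 = -sum(n * y for n, y in zip(ns, ys))
    b2 = -sum(n ** 2 * y for n, y in zip(ns, ys))
    det = s11 * s22 - s12 ** 2
    alpha = (b1 * s22 - b2 * s12) / det
    beta = (s11 * b2 - s12 * b1) / det
    return alpha, beta


# Synthetic points for n = 2..10 (n = 1 excluded, as in the text).
ns = list(range(2, 11))
synthetic = [math.exp(0.062 * n - 0.058 * n ** 2) for n in ns]
alpha, beta = fit_hyper_exponential(ns, synthetic)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

On noise-free synthetic input the fit simply recovers the generating coefficients. With real measurements the two fitting approaches can give somewhat different values.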
The value of nmax for HisA was derived from the “selection” function reported by Lundin et al. (2018), which is also a hyper-exponential function with α = 0.165 and β = 0.065. The selection function represents the relative growth rate of bacteria with mutated HisA proteins. The function is not identical to Ploc(n), but the rapid loss of function with accumulating mutations [28] suggests that the selection function serves as a reasonable proxy since both functions approach 0 at the same time. HisA’s Ploc(n) approaches its Pth around nmax = 10.
Discussion
Biasing of distribution of functional sequences
Since many proteins correspond to functional sequences that are too rare for them to have been discovered through a random search, they could only have originated from ancestral protein sequences connected by CFPs to modern protein sequences [30]. A modern protein could be either a protein found in nature today or a newly engineered protein. The ancestral protein would then be, respectively, either the earlier protein that evolved into the modern protein or the initial protein that was engineered into the new protein. This study quantified the level of biasing required in the distribution of functional sequences in sequence space for extensive CFPs to exist.
If artificial or natural selection is constraining the evolution of a protein, the distance between successive functional sequences along a path often cannot exceed one mutation: under selection strong enough to constrain an evolutionary path along a CFP, nonfunctional intermediates would be quickly removed from the population. In addition, when possessing only one of two specific mutations is disadvantageous, the time required for both to arise in the same individual is often prohibitively long [31].
An nm = 2 CFP could realistically constrain an evolutionary search for species with very large populations since two coordinated mutations would occur in individuals sufficiently often even if possessing only one were detrimental. For instance, P. falciparum acquires two coordinated mutations that impart resistance to chloroquine with a per-parasite probability on the order of 1 in 10^20 parasite multiplications [32]. The number of multiplications per year for many species is larger than 10^20.
Only in species with the largest populations over geological timescales could nm = 3 CFPs constrain an evolutionary search. Natural selection would much more often remove mutations that disabled a protein before two additional mutations reactivated it, and the number of multiplications required to obtain three simultaneous specific mutations is on the order of 10^30 based on the number required for two. Even without purifying selection, the number of coordinated mutations that could spread through a population is less than 10 under any circumstance [33]. The value of Pth does not drop below Pfs for any protein referenced in this study even for nm = 10 (Table 2).
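The order-of-magnitude scaling from two to three coordinated mutations can be written out explicitly. This is a back-of-the-envelope sketch that assumes each specific mutation arises independently with the same per-multiplication probability p, a simplification that is not stated in the cited sources:

```latex
p^{2} \approx 10^{-20} \;\Rightarrow\; p \approx 10^{-10},
\qquad
p^{3} \approx \left(10^{-10}\right)^{3} = 10^{-30}
```

Under that assumption, roughly 10^30 multiplications would be needed on average before a specific triple mutation appears in a single individual.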
One might postulate that indels could substantially lower Pth since they greatly increase the number of potential neighboring sequences. Yet the low ratio of documented indels to single nucleotide changes in the coding regions of DNA in diverse taxa [34] indicates that indels should not significantly decrease the effective Pth. They certainly would not decrease Pth sufficiently to drop below the cited protein Pfs values.
Proteins’ consistently high Rb even for unrealistically high nm indicates that the distribution of functional sequences must typically be extremely biased for CFPs to assist evolutionary searches. Such strong biasing occurs in regions of sequence space close to wildtype sequences. Mutation studies of multiple proteins reported Ploc remaining above the percolation threshold in regions where sequences do not differ from a wildtype by more than approximately 5%. At farther Hamming distances, the biasing must result from proteins possessing special properties that result in narrow tendrils extending through sequence space where Ploc > Pth.
This prediction supports the conclusion of previous studies that proteins capable of evolving new functions naturally or through engineering have highly optimized structures to generate such biasing. In many cases, different CFPs branching from a protein’s wildtype sequence have been shown to lead to the protein performing different functions at high efficiency without significantly altering its overall structure [35, 36]. In what are termed promiscuous enzymes, mutations can modify the active site to enable or enhance many possible catalytic activities, allowing an organism to adapt to new environments or a protein engineer to achieve multiple target functions [37]. For instance, a β-lactamase enzyme in Arthrobacter sp. gained the ability to digest human-manufactured nylon by acquiring only two mutations [38].
The consistently high Rb values calculated in this study further support the conclusion that proteins’ ability to evolve new functions results from highly specialized structural features. In addition, a protein’s Rb helps quantify the level of specification required in its sequence and structure to enable its evolvability.
Improvements to methodologies
As mentioned, Buchholz et al. (2017) applied percolation theory to protein sequence space, but they failed to properly incorporate the percolation threshold. They analyzed the distribution of the size of clusters of functional protein sequences in six superfamilies [39]. For each superfamily, they identified a protein sequence as a member of a cluster if it neighbors another sequence in the cluster based on the criterion that its SI with the neighbor is above a set cutoff. The investigators observed that the number, N, of clusters of size s follows a power-law with exponent -τ:
$$N(s) \propto s^{-\tau} \qquad (6)$$
They argued that this observation is a direct expectation of percolation theory.
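For readers unfamiliar with how such an exponent is obtained, the sketch below estimates τ from a cluster-size distribution by least squares on log-log data, assuming N(s) ∝ s^(-τ). This is a simple generic estimator and not necessarily the procedure Buchholz et al. used. The cluster counts are synthetic.

```python
import math


def estimate_tau(sizes, counts):
    """Estimate tau in N(s) ~ s**(-tau) as minus the log-log regression slope."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic counts generated with tau = 2 (hypothetical, for illustration).
sizes = [1, 2, 4, 8, 16, 32]
counts = [1000.0 * s ** -2.0 for s in sizes]
print(f"tau = {estimate_tau(sizes, counts):.2f}")
```

An exponent estimated this way describes only the range of cluster sizes and identity cutoffs actually sampled.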
For each superfamily, they calculated τ for cutoffs that ranged from 60% SI to 90% SI. They found that τ increases linearly with increasing SI. They then extrapolated the linear relationship to estimate τ for clusters where neighbors were separated by a single amino acid (i.e., SI approaching 100%), so clusters represent interlinking extensive CFPs. Based on this result, the investigators concluded that CFPs should interconnect all proteins in a superfamily since N(s) remains finite for large s.
Yet this extrapolation is not justifiable. The authors assumed that the power-law behavior would apply for all SI, but it only applies when Pfs is not significantly below Pth. Below the percolation threshold, extensive CFPs cease to exist. For 60% sequence identity, nm is often sufficiently large that Pfs is greater than Pth (Eq 2), but for cutoffs approaching 100%, Pth greatly exceeds Pfs. The SI cutoffs for the proteins included in this study are listed in Table 2. The frequency of extensive CFPs for Pfs below the percolation threshold cannot be determined by extrapolating results from data corresponding to Pfs above the threshold.
A better approach for studies of protein sequence space is to first determine Pth, Pfs, and Ploc(n) for proteins under investigation and then incorporate the percolation phase change into analyses by treating the different regimes independently. In the region below the threshold, investigators could use Rb to determine the level of required biasing of the distribution of functional sequences to allow for extensive CFPs. They could also identify where the biasing results from the proximity of sequences to wildtype proteins (i.e., small Hamming distances) and where it results from proteins being optimized for easily developing new catalytic functions.
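The first step of this workflow, deciding which regime a protein falls in and how much biasing is required, can be sketched as follows. The sketch assumes Rb is the ratio Pth/Pfs and works in log10 units because Pfs can be far smaller than ordinary floating-point ratios handle comfortably. The function names and the numerical inputs are hypothetical.

```python
import math


def regime(p_th: float, p_fs: float) -> str:
    """Classify a protein relative to the percolation threshold."""
    return "above threshold" if p_fs >= p_th else "below threshold"


def log10_rb(p_th: float, p_fs: float) -> float:
    """log10 of the required biasing, assuming R_b = P_th / P_fs."""
    return math.log10(p_th) - math.log10(p_fs)


# Hypothetical inputs: P_th for a 150-residue chain with A_t = 7.5,
# and a placeholder P_fs that is not taken from Table 2.
p_th = 1.0 / (150 * 7.5)
p_fs = 1e-20
print(regime(p_th, p_fs), f"log10(Rb) = {log10_rb(p_th, p_fs):.1f}")
```

Proteins classified as below the threshold would then be analyzed through Rb and Ploc(n), while those above it could be treated with the standard cluster statistics.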
The source of the biasing could be explored with tools such as the sequence evolution with epistatic contributions (SEEC) model developed by Alvarez et al. (2022), which identifies evolutionarily relevant epistatic interactions between amino acids [40]. The model’s developers used it to guide the modification of enzymes to effectively explore sequence space to enhance target functions. A similar tool was developed by Durston et al. (2012) that employs a k-modes attribute clustering algorithm to connect sets of amino acids to a protein’s structure and function [41]. Such tools could help identify how a protein’s adaptability, and by extension the biasing of sequence space, is facilitated by sequence and structural motifs.
References
- 1. Al Aboud NM, Basit H, Al-Jindan FA. Genetics, DNA Damage and Repair. StatPearls. 2019.
- 2. Chatterjee K, Pavlogiannis A, Adlam B, Nowak MA. The Time Scale of Evolutionary Innovation. Beerenwinkel N, editor. PLoS Comput Biol. 2014;10: e1003818. pmid:25211329
- 3. Knopp M, Gudmundsdottir JS, Nilsson T, König F, Warsi O, Rajer F, et al. De novo emergence of peptides that confer antibiotic resistance. mBio. 2019;10. pmid:31164464
- 4. Keefe AD, Szostak JW. Functional proteins from a random-sequence library. Nature. 2001;410: 715–718. pmid:11287961
- 5. Taylor S V., Walter KU, Kast P, Hilvert D. Searching sequence space for protein catalysts. Proc Natl Acad Sci U S A. 2001;98: 10596–10601. pmid:11535813
- 6. Tian P, Best RB. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis. Biophys J. 2017;113: 1719–1730. pmid:29045866
- 7. Reidhaar-Olson JF, Sauer RT. Functionally acceptable substitutions in two α-helical regions of λ repressor. Proteins: Structure, Function, and Genetics. 1990;7: 306–316. pmid:2199970
- 8. Yockey HP. A calculation of the probability of spontaneous biogenesis by information theory. J Theor Biol. 1977;67: 377–398. pmid:198618
- 9. Axe DD. Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds. J Mol Biol. 2004;341: 1295–1315. pmid:15321723
- 10. Essam JW. Percolation theory. Reports on Progress in Physics. 1980;43: 833.
- 11. Gaunt DS, Sykes MF, Ruskin H. Percolation processes in d-dimensions. Journal of Physics A: General Physics. 1976;9: 1899–1911.
- 12. Gavrilets S. Perspective: models of speciation: what have we learned in 40 years? Evolution. 2003;57: 2197–2215. pmid:14628909
- 13. Gavrilets S. Perspective: models of speciation: what have we learned in 40 years? Evolution. 2003;57: 2197–2215. pmid:14628909
- 14. Sidorova A, Levashova N, Garaeva A, Tverdislov V. A percolation model of natural selection. Biosystems. 2020;193–194: 104120. pmid:32092352
- 15. Gravner J, Pitman D, Gavrilets S. Percolation on fitness landscapes: effects of correlation, phenotype, and incompatibilities. J Theor Biol. 2007;248: 627. pmid:17692873
- 16. Gavrilets S, Gravner J. Percolation on the Fitness Hypercube and the Evolution of Reproductive Isolation. J Theor Biol. 1997;184: 51–64. pmid:9039400
- 17. Deb D, Vishveshwara S, Vishveshwara S. Understanding protein structure from a percolation perspective. Biophys J. 2009;97: 1787–1794. pmid:19751685
- 18. Brunk NE, Twarock R. Percolation Theory Reveals Biophysical Properties of Virus-like Particles. ACS Nano. 2021;15: 12988–12995. pmid:34296852
- 19. Buchholz PCF, Fademrecht S, Pleiss J. Percolation in protein sequence space. PLoS One. 2017;12. pmid:29261740
- 20. Bauer TL, Buchholz PCF, Pleiss J. The modular structure of α/β-hydrolases. FEBS J. 2020;287: 1035–1053. pmid:31545554
- 21. Buchholz PCF, Van Loo B, Eenink BDG, Bornberg-Bauer E, Pleiss J. Ancestral sequences of a large promiscuous enzyme family correspond to bridges in sequence space in a network representation. J R Soc Interface. 2021;18. pmid:34727710
- 22. Buchholz PCF, Zeil C, Pleiss J. The scale-free nature of protein sequence space. PLoS One. 2018;13: e0200815. pmid:30067815
- 23. Thorvaldsen S. A Mutation Model from First Principles of the Genetic Code. IEEE/ACM Trans Comput Biol Bioinform. 2016;13: 878–886. pmid:26485722
- 24. Hu Y, Charbonneau P. Percolation thresholds on high-dimensional Dn and E8-related lattices. Phys Rev E. 2021;103: 062115.
- 25. Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS. Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature. 2006;444: 929–932. pmid:17122770
- 26. Sarkisyan KS, Bolotin DA, Meer M V., Usmanova DR, Mishin AS, Sharonov G V., et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533: 397–401. pmid:27193686
- 27. Lundin E, Tang P-C, Guy L, Näsvall J, Andersson DI. Experimental Determination and Prediction of the Fitness Effects of Random Point Mutations in the Biosynthetic Enzyme HisA. Mol Biol Evol. 2018;35: 704–718. pmid:29294020
- 28. Tokuriki N, Tawfik DS. Stability effects of mutations and protein evolvability. Curr Opin Struct Biol. 2009;19: 596–604. pmid:19765975
- 29. Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS. Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature. 2006;444: 929–932. pmid:17122770
- 30. Ferrada E, Wagner A. Evolutionary innovations and the organization of protein functions in genotype space. PLoS One. 2010;5. pmid:21152394
- 31. Hössjer O, Bechly G, Gauger A. Phase-Type distribution approximations of the waiting time until coordinated mutations get fixed in a population. Springer Proceedings in Mathematics and Statistics. Springer New York LLC; 2018. pp. 245–313. https://doi.org/10.1007/978-3-030-02825-1_12
- 32. White NJ. Antimalarial drug resistance. Journal of Clinical Investigation. 2004;113: 1084. pmid:15085184
- 33. Behe MJ, Snoke DW. Simulating evolution by gene duplication of protein features that require multiple amino acid residues. Protein Sci. 2004;13: 2651. pmid:15340163
- 34. Chen JQ, Wu Y, Yang H, Bergelson J, Kreitman M, Tian D. Variation in the Ratio of Nucleotide Substitution and Indel Rates across Genomes in Mammals and Bacteria. Mol Biol Evol. 2009;26: 1523–1531. pmid:19329651
- 35. Dellus-Gur E, Toth-Petroczy A, Elias M, Tawfik DS. What Makes a Protein Fold Amenable to Functional Innovation? Fold Polarity and Stability Trade-offs. J Mol Biol. 2013;425: 2609–2621. pmid:23542341
- 36. Trudeau DL, Tawfik DS. Protein engineers turned evolutionists—the quest for the optimal starting point. Curr Opin Biotechnol. 2019;60: 46–52. pmid:30611116
- 37. Khersonsky O, Tawfik DS. Enzyme Promiscuity: A Mechanistic and Evolutionary Perspective. Annu Rev Biochem. 2010;79: 471–505. pmid:20235827
- 38. Negoro S, Ohki T, Shibata N, Mizuno N, Wakitani Y, Tsurukame J, et al. X-ray crystallographic analysis of 6-aminohexanoate-dimer hydrolase: molecular basis for the birth of a nylon oligomer-degrading enzyme. J Biol Chem. 2005;280: 39644–52. pmid:16162506
- 39. Buchholz PCF, Fademrecht S, Pleiss J. Percolation in protein sequence space. PLoS One. 2017;12. pmid:29261740
- 40. Alvarez S, Nartey C, Mercado N, Morcos F. Novel sequence space explored by functional proteins generated through computational evolution-based design. Biophys J. 2022;121: 45a.
- 41. Durston KK, Chiu DK, Wong AK, Li GC. Statistical discovery of site inter-dependencies in sub-molecular hierarchical protein structuring. EURASIP J Bioinform Syst Biol. 2012;2012. pmid:22793672