Percolation in protein sequence space

The currently known protein sequences are not distributed equally in sequence space, but cluster into families. Analyzing the cluster size distribution gives a glimpse of the large and unknown extant protein sequence space, which has been explored during evolution. For six protein superfamilies with different fold and function, the cluster size distributions followed a power law with slopes between 2.4 and 3.3, which represent upper limits to the cluster distribution of extant sequences. The power law distribution of cluster sizes is in accordance with percolation theory and strongly supports connectedness of extant sequence space. Percolation of extant sequence space has three major consequences: (1) It transforms our view of sequence space as a highly connected network where each sequence has multiple neighbors, and each pair of sequences is connected by many different paths. A high degree of connectedness is a necessary condition of efficient evolution, because it overcomes the possible blockage by sign epistasis and reciprocal sign epistasis. (2) The Fisher exponent is an indicator of connectedness and saturation of sequence space of each protein superfamily. (3) All clusters are expected to be connected by extant sequences that become apparent as a higher portion of extant sequence space becomes known. Being linked to biochemically distinct homologous families, bridging sequences are promising enzyme candidates for applications in biotechnology because they are expected to have substrate ambiguity or catalytic promiscuity.


Introduction
Despite the rapidly growing amount of DNA data due to advances in DNA sequencing techniques, only a tiny fraction of all protein sequences existing in the biosphere has been sequenced, yet. While we currently know the sequences of almost 10 8 proteins [1], the number of extant sequences was estimated to be 10 34 , and up to 10 43 different protein sequences might have been explored during 4 Gyr of evolution [2]. Though this number seems to be large, it is infinitesimally small as compared to the theoretical sequence space (10 400 possible sequences for a medium-sized protein), and it would be highly improbable to find functional proteins by random search [3]. Therefore, the Darwinian model of protein evolution based on mutation of the genotype and subsequent natural selection of the phenotype excludes the possibility of extant sequences being scattered in the theoretical sequence space, but they are expected to form a connected network, where functional sequences and mutations form the nodes and a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 edges, respectively [4]. In his fundamental article about the structure of sequence space, J. Maynard Smith asked the questions whether indeed all existing proteins are part of a single network with a single starting point, what fraction of the functional sequence space has been explored yet, and how large is the space of functional, but never-born proteins [5]. Although the sequence space of functional proteins is unknown, we can reliably measure distances between sequences by global or local alignment methods [6]. The currently known protein sequences are not equally distributed, but cluster into homologous families [7]. However, due to the sparsity of the known sequence space, in most clusters even neighboring nodes differ by multiple mutations. As an exception, the TEM β-lactamase family has a very high microdiversity, and the variants form a dense single network with nodes connected by single mutations [8].
The apparent sparsity of the known sequence space is a consequence of our limited knowledge of the extant sequences in the biosphere. Therefore, we expect that as we know more sequences, all nodes will gradually form a connected network. As an alternative explanation of sparsity, the observed separation between clusters is the consequence of ancestor sequences having become extinct during evolution [9].
In this work, the known sequence space was explored by applying percolation theory to learn about the extant sequence space. Percolation theory describes the cluster distribution on a randomly populated lattice, with a parameter p describing the occupancy of the lattice sites [10]. For increasing values of p, the characteristic cluster size s ξ and the fraction P of sites belonging to the largest cluster increases. As p approaches the percolation threshold p c , an infinite cluster appears for the first time on an infinite lattice, while on a finite-sized lattice the largest cluster percolates between the lattice boundaries. The core of percolation theory is a set of scaling relations that depend on |p c -p|, such as s ξ * |p c -p| -1/σ and P * (p-p c ) β with critical exponents σ and β which depend on the geometry of the lattice. Most importantly, percolation theory predicts that the cluster size distribution N(s) (the number N of clusters with size s) decreases for s << s ξ as N(s) * s -τ and decays exponentially for s >> s ξ . Near to percolation (p!p c ), s ξ becomes infinite. Thus, for s spanning many orders of magnitude log N(s) depends linearly on log s, with the Fisher exponent τ describing the ratio of small to large clusters.
Thus, investigating the cluster size distribution N(s) of homologous protein families provides insights into the structure of the known sequence space and gives a glimpse of the extant sequence space, despite our limited knowledge.
For each protein database, sequence identities of high-scoring sequence pairs were calculated by the USEARCH software suite (version 9.2) [19]. Sequence pairs with a distinct sequence identity cutoff were clustered by the Python module graph-tool (https://graph-tool. skewed.de/) (version 2.17).

Cluster size distribution
For the six protein superfamilies, the cluster size distribution N(s) was analyzed for cluster sizes s = 1, 2, 3,. . ., 1000. Because for large cluster sizes data becomes increasingly sparse, a histogram distribution was generated by counting the number of clusters N i,j = S j s = i, N(s) with cluster sizes between i and j.
The observed cluster size distributions were compared to three model distributions: a Gaussian distribution N(s)~exp(-½Á(s-μ) 2 /σ 2 ), an exponential distribution N(s)~exp(-bÁs) and a power law distribution N(s)~s -τ with the Fisher exponent τ characterizing the model distribution (S1 Fig). Excel sheets for the calculation of the distributions are provided as supporting information (S1 File). The log-log plots of the three model distributions differ considerably: log N(s) of the Gaussian distribution increases gradually with log s and decays rapidly for s>μ, while for the exponential distribution is decays rapidly for all s>0. In contrast, for the power law distribution log N(s) depends linearly on log s with a slope of -τ.
For each model distribution, the respective histogram distribution was calculated. Qualitatively, the histogram distributions were similar to the model distributions. For power law distributions with τ>1, the corresponding histogram distribution could also be approximated by a straight line with a slope of -τ h . However, the two slopes -τ and -τ h deviated.
For each histogram distribution of the six protein families, the slope -τ h was determined by fitting the initial linear decay (N 1,10, N 11,100 , and N 101,1000) by linear regression, and the Fisher exponent of the respective cluster size distribution was derived from τ h by varying τ of the model distribution to fit the observed τ h .

Sequence space
The known protein sequence space is rapidly increasing, but it represents only a tiny fraction of the extant sequence space, that has been explored during evolution. In turn, the extant sequence space represents a fraction p of the much bigger sequence space coding for functional proteins. Although both the extant and the functional sequence space and therefore also p are unknown, the scaling properties of the cluster size distribution can be used as an indicator of p: if the cluster size distribution in the extant sequence space follows a power law over many orders of magnitude, p is close to a critical percolation threshold p c .
The scaling properties of the extant sequence space are investigated by analyzing the scaling properties of the much smaller space of known sequences. Because a typical protein superfamily consists of 10 4 −10 5 protein sequences, the cluster size range is limited to 2-3 orders of magnitude. The sparsity of the known sequence space has three major consequences: (1) Because of the poor statistics of the cluster size distribution N(s) between s = 1 and 1000, the number of clusters with a size between 1 and 10 (N 1,10 ), 11 and 100 (N 11,100 ), and 101 and 1000 (N 101,1000 ) are analyzed, and the corresponding cluster size distribution is derived from this histogram distribution. (2) Except for very few families, e.g. TEM β-lactamases, it is rare that two members of a protein superfamily differ by only one amino acid. Therefore, neighbor relationships are established by global sequence identity as a cutoff criterion. Using a 90% cutoff criterion, two proteins of 400 amino acids are considered to be neighbors if they differ in less than 40 positions. As a consequence, the structure of the resulting network and the Fisher exponent τ depend on the cutoff criterion for the neighborhood relationship. (3) The Fisher exponent τ depends on the number of known sequences. As the number of known sequences increases, the protein families become more densely populated, and the number of large clusters is expected to increase. As a consequence, the Fisher exponent τ decreases. Therefore, the observed Fisher exponent τ as evaluated from the known protein superfamilies represents an upper limit to the Fisher exponent of the extant sequence space.

Cluster size distribution
For each of the six protein superfamilies, the sequences were clustered by a cutoff criterion of 60% global sequence identity which is often applied for defining homologous families. The number N of clusters with size s was analyzed in a histogram with logarithmic bins for s between 1 and 10, 11 and 100, 101 and 1,000, 1,001 and 10,000, and 10,001 and 100,000 to improve statistical sampling (Fig 2). Intuitively, we had expected a Gaussian normal distribution, assuming a random distribution of cluster sizes. However, in contrast to intuition, the distribution of cluster sizes followed a power law N(s)~s -τh , indicated by a linear dependency of log s and log N(s) for the six protein superfamilies (abH, SDR, oTA, CYP, DC, bHAD). The Fisher exponent τ h of a histogram describes the ratio between small and large clusters and is derived from linear regression in the log-log plot of the histogram [20]. From the Fisher exponent τ h of the histogram, the Fisher exponent τ of the underlying cluster size distribution was calculated by fitting the observed τ h of the histogram to a model distribution of cluster sizes following a power law distribution. Though the protein families differ in size, structure, and function, for four of the five (SDR, oTA, DC, bHAD) the Fisher exponent τ varied only slightly (1.8-1.9). The smallest Fisher exponent was derived for the CYP superfamily (τ = 1.6). For the largest superfamily (abH), the Fisher exponent was 2.0. These values are in agreement with the Fisher exponent of τ%2 determined for the protein family size distribution of the Gene3D database [21] or the TRIBES resource [22], while the distribution of protein folds showed a slightly larger exponent of 2.5 [23].

Dependency of τ on the cluster criterion
While the Fisher exponent τ was almost independent of the protein family and its size, its absolute value depended on the cutoff criterion used for clustering. Upon clustering of the six families with six cutoffs between 60 and 90%, the cluster size distributions followed a power law for all cutoffs (S2 Fig). With increasing clustering cutoff, the relative number of small clusters increases, while the number of large clusters decreases. Consequently, the Fisher exponent τ increased almost linearly with increasing cutoff (Fig 3)

Dependency of τ on the number of sequences
Of the 395,000 abH sequences, 50, 25, or 12.5% were randomly selected and clustered, and the cluster size distribution was determined for four distinct cutoff values (S3 Fig). With a decreasing number of sequences, the relative number of small clusters increased, while the number of Thus, the observation of a power law cluster size distribution results from the connectedness of extant sequence space which is as a consequence of Darwinian evolution. Interestingly, a model that describes protein structural evolution on a three dimensional lattice also results in a power law cluster size distribution with an exponent of 2.3 [25]. It is a tempting observation that the two foundations of protein evolution, the connectedness of extant sequence space and the formation of a stable fold, both result in a power law cluster size distribution with a similar exponent. This observation relates to the fundamental property of protein folds: the stability of a fold is closely related to its evolvability. The more stable a fold is, the more sequences can adopt it, thus forming larger and better connected sequence networks.

Connectedness and saturation of sequence space
The cluster size distribution of the known sequence space of six protein superfamilies followed a power law, with the extrapolated Fisher exponent τ 100 being an upper limit to the Fisher exponent of the extant sequence space. The observation of few clusters containing many sequences might relate with the assumption that more stable protein folds are more evolvable, thus forming larger and higher connected clusters of mutations. The extrapolated Fisher exponent is independent of characteristic properties of the protein families such as family size (Table 1). Because the Fisher exponent measures the ratio of small to large clusters, it can be interpreted as an indicator of the global connectedness of the known sequences of a protein family. The protein families oTA, SDR, bHAD and abH (τ 100 = 2.3, 2.4, 2.5 and 2.6, respectively) had a smaller τ 100 and thus a higher ratio of large to small clusters than the protein families DC, or CYP (τ 100 = 2.8 and 3.3, respectively). A high ratio of large to small clusters indicates a high connectedness. There are at least three possible reasons for a high connectedness of a protein family: (1) The protein family is well explored; thus, a high fraction of its extant sequence space is already known. (2) The protein family has a high microdiversity. (3) The protein family covers only a small region in sequence space, thus overall variability is low.
Our observation that the connectedness gradually increased as more sequences become known is supported by the concept of gradual saturation of sequence space. This concept describes the observation that the number of newly sequenced genes that form separate clusters plotted over time decreases to zero [26]. Rather than expanding, the sequence space of protein families is gradually becoming denser and more connected. As τ 100 measures the connectedness of the protein family, it also measures the current level of saturation, with the protein families SDR and CYP having the highest and lowest saturation, respectively.

Bridges between homologous families
The six protein families showed a similar linear dependency of τ on the clustering cutoff. Thus, for high cutoff values many small clusters were observed, which gradually combine into larger clusters as the clustering cutoff was decreased, and bridges between clusters gradually appeared (S4 and S5 Figs). These bridges were formed by sequences that had been part of one cluster and then became part of a second cluster, or were recruited from previously isolated sequences, as the clustering cutoff was decreased. These bridging sequences are interesting, as they belong to both clusters. If global sequence similarity relates to biochemical function, a cluster is characterized by a similar biochemical function that differs between the clusters. The bridging sequences, having similarities to two or even more clusters, are therefore promising candidates with substrate ambiguity [27] or even catalytic promiscuity [28].

Protein evolution
By analyzing the known sequence space, we predict that extant proteins form a percolating, highly connected network where each sequence has multiple neighbors, and each pair of sequences is connected by many different paths, as expected from evolution [4]. However, the density in sequence space is not uniform, but follows a power law distribution which indicates that certain folds were more evolvable than others. Percolation allows for the concept of evolution as adaptive walks on a fitness landscape [29], where sequences at the ends of the walks may substantially differ from one another [30]. A high degree of connectedness also overcomes the possible blockage by sign epistasis and reciprocal sign epistasis [31] and thus is a necessary condition of efficient evolution, despite the fact that only an infinitesimally small portion of the theoretical sequence space been explored during the course of life on Earth [2]. In a highly connected sequence network as a model of evolution [32], sequences are found that form bridges between two clusters. Since the number of bridges is much smaller than the number of cluster members, they only gradually appear as the number of sequenced genes increases. Consequently, the observed separation of families is merely a consequence of our limited knowledge of extant sequence space. With increasing sequence data from genomics and metagenomics projects, we expect more and more sequences to occur which form bridges between yet separated families and thus contribute to the connectedness of known sequence space.
These bridging sequences are equivalent to reconstructed ancestral sequences in binary trees [33]. Since they form a link between two branches, ancestral proteins are assumed to be generalists with a broader substrate spectrum or even multiple activities [28]. While the binary tree model of evolution assumes that the ancestor sequences have disappeared from the biosphere, the network model of evolution assumes that bridging sequences still exist. For any two neighboring, biochemically distinct clusters, we expect bridging sequences to exist that contribute to the formation of a continuous network. It will be challenging to analyze how the biochemical properties change as we walk across the bridges. Most probably, bridging sequences are multi-functional or promiscuous enzymes with known or latent activities of both sub-families. In contrast to ancestors, these generalists already exist in the biosphere and are waiting to be found.  Table 1 (α/β-hydrolases in blue, shortchain dehydrogenases/reductases in green, ω-transaminases in black, cytochrome P450 monooxygenases in red, thiamine diphosphate-dependent decarboxylases in cyan and β-hydroxyacid dehydrogenases/imine reductases in orange).