Evolution of tunnels in α/β-hydrolase fold proteins—What can we learn from studying epoxide hydrolases?

The evolutionary variability of a protein’s residues is highly dependent on protein region and function. Solvent-exposed residues, excluding those at interaction interfaces, are more variable than buried residues whereas active site residues are considered to be conserved. The abovementioned rules apply also to α/β-hydrolase fold proteins—one of the oldest and the biggest superfamily of enzymes with buried active sites equipped with tunnels linking the reaction site with the exterior. We selected soluble epoxide hydrolases as representative of this family to conduct the first systematic study on the evolution of tunnels. We hypothesised that tunnels are lined by mostly conserved residues, and are equipped with a number of specific variable residues that are able to respond to evolutionary pressure. The hypothesis was confirmed, and we suggested a general and detailed way of the tunnels’ evolution analysis based on entropy values calculated for tunnels’ residues. We also found three different cases of entropy distribution among tunnel-lining residues. These observations can be applied for protein reengineering mimicking the natural evolution process. We propose a ‘perforation’ mechanism for new tunnels design via the merging of internal cavities or protein surface perforation. Based on the literature data, such a strategy of new tunnel design could significantly improve the enzyme’s performance and can be applied widely for enzymes with buried active sites.

Introduction specificity in closely related family members. Additionally, we raised the following question: are there any mechanisms or schemes that can be adopted during protein engineering to mimic new tunnels' appearance? Our results indicate that most tunnels in soluble epoxide hydrolases can be considered as conserved features, and we have proposed a "perforation" model that can be applied as a strategy for de novo tunnel design. Due to high structural similarity between members of α/β-hydrolases superfamily, our results could be expanded and applied into other superfamily members including acetylcholinesterase, dienelactone hydrolase, lipase, thioesterase, serine carboxypeptidase, proline iminopeptidase, proline oligopeptidase, haloalkane dehalogenase, haloperoxidase, epoxide hydrolase, hydroxynitrile lyase and others [38]. We need to emphasise that since we analysed tunnels identified in relatively small protein structures with narrow tunnels (usually 1.0-2.0 Å), some processes leading to tunnel formation or modification cannot be covered. This includes long insertion or deletion, dimerization, or quaternary protein structure organisation.

Results
For this study, we chose only the unique and complete structures of sEHs deposited in the Protein Data Bank (PDB) [39]. Any structures with information missing about the positions of any of their amino acid residues could have provided bias, and therefore were excluded. The resulting selection of seven epoxide hydrolase structures represent the clades of animals (Mus musculus, msEH, PDB ID: 1CQZ, and Homo sapiens, hsEH, PDB ID: 1S8O), plants (Solanum tuberosum, StEH1, PDB ID: 2CJP), fungi (Trichoderma reesei, TrEH, PDB ID: 5URO), bacteria (Bacillus megaterium, bmEH, PDB ID: 4NZZ) and thermophilic enzymes collected in hot springs in Russia and China from an unknown source organism (Sibe-EH, and CH65-EH, PDB IDs: 5NG7, and 5NFQ, respectively).

Model description and referential compartment evolutionary analysis
sEHs consist of two domains: the main domain, featuring eight β-strands surrounded by six αhelices; and the mostly helical cap domain, which sits atop the main domain. The cap domain is inserted between the strands of the main domain and is connected by an element called the NC-loop. The cap-loop is inserted between two helices of the cap domain [40]. The active site of the sEHs is buried inside the main domain, and therefore the transportation of substrates and products is facilitated by tunnel (either single or in a network) [29].
We performed an entropy analysis of the residues making up particular protein compartments with the use of the Schneider entropy metric implemented in the BALCONY package [41]. As an input BALCONY requires multiple sequence alignment (MSA) and a list of residues building up particular compartments. We analysed the compartments listed in S1 Table  (i.e. residues forming the active site; buried and surface residues; main and cap domains; NCloop; cap-loop; and α-helices, loops, and β-strands). In order to determine the positions' variability, we used Schneider entropy metric [42] calculated for each position in the MSA. To avoid bias and position-specific conservation scores we trimmed the MSA removing positions that did not correspond to the analysed proteins' sequences. To evaluate the overall compartments' variability we calculated the difference between the median distances of positions of the proteins' compartments and the remaining positions of the trimmed MSA (Fig 1 and S2 Table, see also Methods section for the description of the MSA trimming). Negative values of the difference between median distances of the selected proteins' compartments and the trimmed MSA (S2 Table) indicate compartments with lower variability, and positive values indicate compartments with higher variability in comparison to the remaining positions in the trimmed MSA. For quantitative statistical analysis, we compared the calculated Schneider entropy values of these compartments with the remaining positions of the trimmed MSA using the Epps-Singleton test [43].
Based on the obtained differences in median distances and the results of the Epps-Singleton test, the active site residues were classified as conserved, i.e. with lower entropy scores in comparison to the remaining positions in the MSA. The surface residues (classified as solventexposed residues according to the NetSurfP server [44]) were classed as the most variable. Entropy analysis showed that the variability of the buried residues was significantly lower than the variability of the surface residues (Fig 1 and S2 Table). These results are in agreement with the general findings mentioned previously [3,16,45]. With regards to the structural elements specific to sEHs, all compartments (main domain, cap domain, cap-loop, and NC-loop (except for the NC-loop in CH65-EH)) were classified as variable among all the selected sEHs. In all analysed proteins, α-helices and loops were also classified as variable (however, in the case of hsEH and StEH1 the information about the variability of loops was not statistically significant). In all analysed proteins, except for msEH, β-strands were found to be conserved which stays in agreement with the work of Sitbon and Pietrokovski [18] (Fig 1 and S2 Table).

Tunnel identification and comparison
We identified tunnels providing access to the active site using a geometry-based approach implemented in CAVER software [46] for both crystal structures and in molecular dynamics (MD) simulations, and then compared their geometries (for details see the Methods section). CAVER software identified between three and nine tunnels in the analysed crystal structures. Those tunnels were then compared with the tunnels identified during MD simulations to find their corresponding counterparts (S3 Table), based on the similarity of their tunnel-lining residues (for more details see the Methods section). We marked all identified tunnels according to their localisation within the epoxide hydrolase's domains as was shown in our previous work [47]. We identified tunnels passing through three regions of the sEH structure: i) the main domain (marked as Tg, Tm, Tback, and Tside), ii) the cap domain (marked as Tcap), as well as iii) the border between the cap and main domains (marked as Tc/m).
We identified seven tunnels in the main domain, six in the cap domain, and three at the border between those domains (Fig 2). It should be pointed out that the Tc/m tunnel was identified as multiple tunnels by CAVER (Tc/m1, Tc/m2, and Tc/m3). This issue is related to the asymmetric shape of the Tc/m tunnel, which makes it difficult to classify in a geometry-based approach (S1 Fig). Closer analysis of the tunnels identified in crystal structures and during MD simulations by CAVER showed that the tunnels identified in crystal structures are well-defined; however, their parts located closer to the protein surface are, in some cases, coiled. For most tunnels identified during MD simulations, the interior parts of tunnels were well-defined, whereas the tunnels' mouths were widely distributed on the protein surface. Such an observation might suggest that those regions are tightly packed and/or lined by bulky residues which can change their conformation to open/close a particular tunnel.  Table. All pairwise differences (except for loops (LOO) in hsEH and StEH1, marked red) are statistically significant (Epps-Singleton test). In the bottom right corner, a schematic representation of the analysed structure-specific comportments is provided. Abbreviations: AS-active site; BAA-buried amino acids; SAAsurface amino acids; MD-main domain; CD-cap domain; CL-cap-loop; NCL-NC-loop; HEL-helices; LOO-loops; STRstrands. https://doi.org/10.1371/journal.pcbi.1010119.g001

Tunnel evolutionary analysis
In the case of sEHs, tunnels can perform several distinct functions: i) transport and positioning of substrates and products, ii) control of the solvent access to the catalytic cavity, and iii) transport of catalytic water. Only those tunnels which maintain at least one of those functions can undergo evolutionary pressure. As we confirmed during the referential compartments' evolutionary analysis, surface residues are more variable than buried residues. Indeed, Fig 3 shows protein structures coloured according to Schneider entropy values, where thin blue lines represent regions with lower entropy, and yellow thick lines represent regions with higher entropy values. We also coloured the identified tunnels according to their frequency of detection (i.e. based on the number of frames in which they were identified) in MD simulations (darker = more frequent). The overall position of the tunnels was similar among all the protein structures; however, there were large differences concerning their frequency during the MD

PLOS COMPUTATIONAL BIOLOGY
Evolution of tunnels in α/β-hydrolase fold proteins-What can we learn?
simulations. Cross-sections of these structures suggest that the protein core is composed of residues with lower variability (lower entropy values), whereas the tunnel mouths, located at the protein surfaces, are surrounded by residues of both higher and lower variability (higher and lower entropy values, respectively).
We identified the residues lining these particular tunnels during the MD simulations. During MD simulations, the protein is not a rigid body and the residues gain some level of

PLOS COMPUTATIONAL BIOLOGY
Evolution of tunnels in α/β-hydrolase fold proteins-What can we learn? flexibility, which may cause the opening and closing of identified tunnels. Moreover, due to the residues' movements, the identified tunnels may branch (either near the active site, in the middle of the tunnel, or near the surface). Since we observed many cases of tunnels branching near the surface, the list of identified tunnel-lining residues may be overrepresented by the surface residues. Therefore, we decided to perform an entropy analysis of: i) all tunnel-lining residues; ii) surface tunnels-lining residues; and iii) tunnel-lining residues without the surface residues. An evolutionary analysis of the tunnel-lining residues without the surface residues is presented in Fig 4. Analysis was performed using the same procedure as in the case of the referential compartments. Complete lists of tunnel-lining residues are shown in S5-S11 Tables. A detailed analysis of the sEHs tunnels is shown in S3-S9 Figs. Median distances of all analysed proteins are listed in S12 Table.

PLOS COMPUTATIONAL BIOLOGY
Evolution of tunnels in α/β-hydrolase fold proteins-What can we learn?
Based on the median distances between all tunnel-lining residues and the remaining residues' positions in the MSA, we concluded that almost all analysed tunnels should be considered as conserved. Following exclusion of the surface residues from the tunnel-lining residues, differences in median values decreased indicating that the conserved character of tunnels comes from the buried residues (Fig 4 and S12 Table). It is clear that the surface tunnel-lining residues generally reach higher entropy values than the other analysed tunnel-lining residues (S3-S9 Figs).
Presented violin plots (Fig 4) provide insight into tunnel's residues entropy distribution. To perform that, the right violin shape from each pair has to be analysed. For example, in bmEH the distribution of the entropy values among residues creating Tc/m1 tunnel shows a trianglelike shape with a wide base of residues with low entropy values, which corresponds to the prevailing contribution of conserved residues. In contrast, the distribution of the entropy values among residues lining the Tc/m_back tunnel resembles a rectangle or even hourglass-like shape which means that both variable and conserved residues build that tunnel. Thus, analysis of the shape of the violin plots provides descriptive information about the variability of the residues creating each tunnel. The differences between violin plots for all tunnel-lining residues, and those with excluded surface residues, clearly confirm the variable character of tunnels' entries.

Detailed analysis of selected tunnels
The violin plots provide information about the general variability of the tunnel-lining residues, but do not give insight into the location of the variable/conserved residues along the tunnel. To analyse that we performed a more advanced analysis. We selected three different tunnels which were identified in three different sEHs. The Tc/m tunnel of hsEH and the Tm1 tunnel of StEH1 represent the most commonly identified tunnels, and the Tc/m_back tunnel of bmEH represents an interesting case of a tunnel which already was engineered. The entropy values of the tunnel-lining residues are presented in S13 Table. As we pointed out elsewhere [47] the Tc/m tunnel whose mouth is located between the main and cap domains can be seen as an ancestral tunnel created during cap domain insertion and preserved in nearly all epoxide hydrolases. In hsEH this tunnel (Fig 5A) has an average length of~13.3 Å. It was open during 59% of the simulation time, with an average bottleneck radius of 1.6 Å, reaching a maximum of 2.7 Å. It is lined by residues with both low and high values of entropy, which makes the overall entropy distribution nearly flat (when surface residues are included) or exponential (when surface residues are excluded) which corresponds to the hourglass-like and triangle-like shape of the violin plot, respectively. The majority of variable residues is located close to the surface or at the interface between the cap and main domains. Close inspection of the tunnel revealed also a highly variable residue (i.e. with higher entropy value)-F497 (Schneider entropy value 0.7946)-located approximately in the middle of the tunnel and situated between two less-variable residues (i.e. with lower entropy values)-D496 (Schneider entropy value 0.0336), from the active site, and V498 (Schneider entropy value 0.4713). The F497 residue might act as a molecular gate [48] since its position in several other crystal structures differs substantially, and was identified as a surface residue (S10 Fig).
The Tm1 tunnel of StEH1 is the shortest identified in this structure (Fig 5B). Similar tunnels were identified in three other analysed sEHs: msEH, hsEH, and TrEH. The tunnel mouth is located in the main domain, near the NC-loop and hinge region. A close inspection of this tunnel revealed that it was~13 Å long on average, and was always open during MD simulation. It had an average bottleneck radius of 1.9 Å, with a potential to increase up to 3.1 Å. The analysis of the violin plots suggests overrepresentation of the variable residues (reversed triangle-like shape of the violin plot, when surface residues are included), and nearly flat distribution of entropy values (hourglass-like violin shape, when surface residues are excluded). The majority of the tunnel-lining residues showed relatively high entropy values, while the residues with lower entropy values were located in proximity to the active site. In our previous analysis of StEH1 [49], we identified three residues, namely P188 (Schneider entropy value 0.5117) (not shown in Fig 5), L266 (Schneider entropy value 0.7946), and I270 (Schneider entropy value 0.5594), as potentially useful during protein engineering. Here we present that those residues are also variable, which may suggest that their substitution might not affect protein stability. Interestingly, approximately in the middle of the tunnel length, a less-variable H131 (Schneider entropy value 0.2524) residue was also identified. Residues are coloured according to entropy score. For the sake of clarity, less-frequently detected amino acid residues were omitted, and those creating the active site are shown as red lines. The active site cavity is shown as the interior surface, and the representative tunnel detected during molecular dynamics (MD) simulations as centerlines; middle section-cumulative distribution function (CDF) of entropy score for the tunnel-lining residues without the surface residues (cyan dots) and corresponding counterpart (black dots); and bottom section-scatterplot of the tunnel residues' entropy values relative to distance from the geometric centre of the α carbons of the enzyme, along with a marginal histogram of entropy value counts in respective intervals. Scatterplot points as well as histogram counts grouped into classes based on residue classification (active site-red; surface residues-blue; buried-grey). https://doi.org/10.1371/journal.pcbi.1010119.g005

PLOS COMPUTATIONAL BIOLOGY
Evolution of tunnels in α/β-hydrolase fold proteins-What can we learn?
The last example is the Tc/m_back tunnel of bmEH (Fig 5C) which was already engineered by Kong et al. [50]. This tunnel was identified as a third tunnel during MD simulation and had an average length of 26.7 Å. It was open only for 18% of the simulation time, with an average bottleneck radius of 1.0 Å, and the potential to increase up to 1.8 Å. The mouth of this tunnel was located in the main domain. Both violin plots (with surface residues included and excluded) show a similar hourglass-like shape. Close inspection of the tunnel revealed that residues with lower entropy values contributed to the binding cavity inside the main domain, while residues with higher entropy were located in the area surrounding a deep pocket on the protein surface. We also found two residues, namely F128 (Schneider entropy value 0.5798) and M145 (Schneider entropy value 0.4678), which had lower entropy values than their neighbours. Those residues were successfully modified to create a novel tunnel leading to the bmEH active site, allowing conversion of bulky substrates [50].

Discussion
In our study, we focused on sEHs, which are enzymes belonging to the α/β-hydrolase superfamily. Members of this superfamily share a barrel-like scaffold of eight anti-parallel β-strands surrounded by α-helices with a mostly helical cap domain sitting on top of the entrance to the active site [45], which seems to be also the oldest [51] and most stable [52] fold used by one the largest groups of enzymes [53]. Structural and evolutionary analyses of EHs have been reported systematically [40,47,[54][55][56], providing valuable insights into their structural and functional features. In our work, we first assessed the system-specific compartments described previously by Barth et al., such as the main and cap domains, the NC-loop, and the cap-loop, along with secondary structure elements such as strands, helices, and loops. Based on an alignment of 95 EH sequences, three available crystal structures, and several homology models, they showed that the main and cap domains are conserved, while the NC-loop and cap-loop are variable [40].
Here, we analysed an alignment of 1455 EH sequences and additionally performed an indepth analysis of the seven complete crystal structures representing different clades (animals, plants, fungi, and bacteria). By calculating the difference between median distances of Schneider entropy values of a selected compartment and the remaining positions of the trimmed MSA-we were able to determine the variability of each compartment. The calculated median distances for all analysed compartments confirmed well-known observations: active sites comprised highly conserved residues, with greater variability exhibited by surface residues than by buried residues [3,16,45]. Our results were also consistent with the work of Barth et al. [40], showing that the cap-loop and NC-loop should be considered as variable features. However, in contrast to their work, for such a large set of sequences, the whole main and cap domains were considered variable. In all analysed proteins, α-helices, and loops were found to be variable (S2 Table), while β-strands were found to be conserved in all analysed proteins, except for msEH. Such a tendency was shown previously for other systems elsewhere [18]. Further, since we were able to identify structural compartments of the seven sEHs analysed, observations regarding the modularity of EHs are still applicable [37].
The main aim of our analysis was to perform what was, to our knowledge, the first systematic analysis of the evolution of tunnels in a large family of sEHs. Therefore, our results can be applied mostly to the EHs, and-with some minor adjustments-to other members of the α/βhydrolases superfamily. We identified multiple tunnels of different sizes and shapes, located in three different regions: the cap and main domains, as well as at the border between those domains. We hypothesised that tunnels are conserved structural features equipped with variable parts, such as gates responsible for different substrate specificity profiles in closely related family members. This hypothesis was based on two assumptions: i) that the surface residues are more variable in comparison to the buried residues, and ii) that access to the active site cavity should be preserved to sustain the catalytic activity of the enzyme. Our results confirmed both assumptions. Moreover, we identified the Tc/m tunnel which was present in all analysed sEHs, and is located in the border between the cap and main domains. The cap domain is thought to be a result of a large insertion into the α/β-hydrolase main domain [45,47]. Both domains interact, creating a hydrogen bond network [57]; they co-evolved to preserve access to the buried active site while also ensuring the flexibility required for transport of the substrate and the products [58]. Most of the residues with lower entropy values in the cap domain face the main domain. This finding confirms previously presented information about the main and cap domains' relative flexibility [40,[59][60][61][62].
We also proposed two ways of the analysis of the tunnel residues variability. The violin plots allow analysis of the contribution of variable and conserved residues, which provides a general overview of each tunnel. They also allow assessment of the variability of a particular compartment relative to the remaining positions of the MSA (as shown in Fig 1). The scatterplots (similar to those in Fig 5A-5C) provide detailed insight and can be used to draw further conclusions regarding the distribution of entropy values of tunnel-lining residues along an analysed tunnel. They can also be used to identify the most variable and conserved tunnel-lining residues. In general, after excluding the active site and surface residues, the analysed examples (Fig 5) show three cases of entropy distribution among tunnel-lining residues: i) the flat distribution of the entropy values (Fig 5A); ii) the overrepresentation of residues with higher entropy values (Fig 5B); and iii) the quasi-sigmoidal distribution (Fig 5C; most of the residues have values of the entropy in the range of 0.25-0.7).
Our results confirmed the conserved character of the tunnels. Moreover, we found that even conserved tunnels can be lined with more variable residues, located not only at the surface (tunnels' entry). Close inspection of the Tc/m tunnel of hsEH allowed us to detect variable S412 and F497 residues (Schneider scores 0.618 and 0.795, respectively), among which phenylalanine was observed to be the most flexible amino acid, and which was even observed in a different conformation in crystal structures (S10 Fig). This indicates a potential role for F497 as a gate, controlling access through this tunnel [48]. On the other hand, the Tc/m tunnel is also lined by more conserved residues, such as the highly conserved substrate-stabilising tyrosine located in the cap domain (Y466 in hsEH, Schneider score 0.0323) [63,64].
Analysis of the variability of particular amino acid positions could be used in the search for feasible key amino acids (hot-spots) [65]. More variable positions might be considered as favourable locations for the introduction of mutations. Such residues can be detected even for the shortest tunnels, and have already been shown to enable fine-tuning of enzyme properties [66]. For example, the Tm1 tunnel of StEH1 is lined with several variable residues which may have a role to play in the fine-tuning of the enantioselectivity of that enzyme [67]. Such a strategy is acknowledged as one of the most likely to succeed, since it does not significantly disturb protein activity and stability, and the different locations of hot-spots along the transport pathway may enable modification of geometric/electrochemical constraints, thus contributing to the enzyme selectivity.
In our other study, we showed a relationship between a tunnel's shape and location, and the enzyme's function [47]. Thus, the evolution of the tunnel network can be considered as an additional mechanism that allows the enzyme to adapt and catalyse the conversion of different substrates. Mimicking such a process could provide a straightforward strategy for enzyme reengineering. As we pointed out above, the insertion of the cap domain has created the buried active site cavity and the Tc/m tunnel ensuring access to that cavity. This tunnel can be considered as an ancestral tunnel and it seems to be well-preserved among nearly all sEHs family members. However, the insertion of large fragments into existing structures appears to be a high-risk strategy. Based on our results, we can suggest a much easier approach that can be used for tunnel network redesign.

Perforation mechanism of the tunnel formation
The observed entropy values of tunnel-lining residues usually range from 0.25 to 0.7 (S13 Table). As we showed, the scatterplots can be used to identify the most variable and conserved residues. Variable residues are considered potentially safe hot-spots for single-point mutations [65]. We can imagine that new tunnels providing access to the protein interior can appear as a result of a "perforation" via a mutation occurring: i) in the surface layer of protein or ii) at the border of large cavities affecting surface integrity (Fig 6). Such a process can be easily mimicked and adopted for enzyme modification. We showed [47] that, in some cases, tunnels behave more like a series of small cavities which are rarely open. In the case of such tunnels, a mutation resulting in a permanently open cavity might be a driving force for future tunnel widening and modulation of selectivity or activity of enzymes, or otherwise provide additional regulation of activity.
The appearance of a new tunnel, resulting from a single-point mutation, via the proposed perforation mechanism provided significant freedom and flexibility for α/β-hydrolases to modify their activity and selectivity. Since the mechanism for hydrolysis performed by the sEHs involves deprotonation of the nucleophile in the hydrolysis step (proton shuttling) and water attack, it requires precise transport of water molecules. New tunnels could significantly improve the enzyme's performance by separating the substrate/products transport pathways from water delivery tracks.
A perfect example of the mimicking of the proposed surface perforation model is the transformation of the Tc/m_back tunnels of the bmEH shown by Kong et al. [50] in which they turned a substrate inaccessible tunnel into an accessible one in order to improve the enzyme's functionality. As we showed here these two residues whose substitution to alanine led to the opening of a side tunnel, improving the activity of bmEH upon α-naphthyl glycidyl ether, had higher entropy values than their neighbours. This work also led us to a hypothesis that mutations of such variable residues could also appear spontaneously and may drive the evolution of the active site accessibility via surface perforation and/or joining of internal cavities. Identification of such residues which are prone to cause such an effect might easily be adopted as part of protein reengineering processes. These conclusions are supported by the observations of Aharoni et al. [68], who noticed that most mutations affecting protein functionality (mostly activity and selectivity) were located either on the protein surface or within the active site cavity. Indeed, the investigation of long and narrow tunnels, not obviously relevant at first glance during protein engineering, should be regarded as a strategy for new pathway creation, as illustrated by Brezovsky et al. in their de novo tunnel design study which resulted in the most active dehalogenases known so far [69]. Dehalogenases are closely related to sEHs and belong also to the α/β-hydrolases superfamily, thus further supporting the rationality of our approach.
The tunnels described in our findings which we consider conserved provided substantial information about their origin, and about the evolution of enzymes' families more broadly. On the other hand, our results suggest that after the ancestral occlusion of the active site, the further evolution of α/β-hydrolases may be driven by perforation of either the surface or of the internal cavities, which mostly comprised variable residues. Tunnels themselves can be equipped with both conserved residues, which are potentially indispensable for their performance, as well as highly variable ones, which can be easily used for fine-tuning an enzyme's properties. Such hotspots can be easily identified using the approach presented here.

Workflow
Evolutionary analysis was divided into two parts: system-specific compartment analysis, and tunnel analysis. Prior to those analyses, the positions of the residues that contribute to compartments and tunnels needed to be mapped in an MSA comprising sequences of epoxide hydrolases. The identified residues were then used as input for an evolutionary analysis using the BALCONY software [41]. Tunnels were identified by CAVER software [46] in both crystal structures and during MD simulations and then compared with each other to find their corresponding counterparts. Finally, the tunnel-lining residues, the surface tunnel-lining residues, and the tunnel-lining residues without surface residues were used for the evolutionary analysis using BALCONY software (Fig 7).

Structure preparation
Ligands were manually removed from each structure, and only one chain was used for the analysis. For the msEH and hsEH structures, only the C-terminal domain, with the hydrolytic activity, was used. Several referential structural compartments were selected for further analysis (see S2 Table): the active site; buried and surface residues; main and cap domains; cap-loop; NC-loop; and α-helices, β-strands, and loops. The definitions of the cap-loop and NC-loop were taken from the works of Barth et al. [40] and of Smit and Labuschagne [70]. The NetSurfP service [44] was used to identify both buried and surface residues. Tunnels identified by CAVER software were also selected for further analysis.

CAVER analysis
Tunnel identification and analysis in each system were carried out using CAVER 3.02 software [46] in two steps: i) the crystal structure of the enzyme was analysed by the CAVER plugin for PyMOL [71]; ii) tunnels were identified and analysed in 50,000 snapshots of multiple MD simulations by the standalone CAVER 3.02 software. The parameters used for both steps are shown in S14 Table. The tunnels found in MD simulations and in crystal structures were ranked and numbered based on their throughput value [46].

Tunnels comparison
The tunnels identified during MD simulations and in crystal structures were compared based on the occurrence of tunnel-lining residues. For crystal structures, the occurrence was defined as the number of atoms of a particular amino acid that were identified as tunnel-forming

PLOS COMPUTATIONAL BIOLOGY
Evolution of tunnels in α/β-hydrolase fold proteins-What can we learn?
atoms. For the sake of simplicity, no weighting scheme was used: Cα, backbone atoms, and side-chain atoms were considered to be of the same importance. For MD simulations' results, tunnel occurrence was defined differently: as the number of MD snapshots in which particular amino acid was detected for a particular tunnel cluster. Therefore, this number could vary between 1 and the number of MD snapshots (50,000 in the performed analyses).
Despite the different definitions of occurrence used for crystal structures and for MD results, interpretation can be conducted in exactly the same way for each. Therefore, the above-defined occurrences can be directly used for fine-tuning the list of residues that form tunnels, i.e. by applying certain threshold values. In this study, the threshold was a number in the open range (0,1) and amino acids were retained only if the condition o > max(o) × τ was satisfied, where o is the occurrence and τ is the threshold value.
For sets of tunnels detected in both the crystal structures and in the MD data, a distance matrix was calculated using the Jaccard distance formula [72]: where T A and T B are A and B tunnels, respectively, and d is the Jaccard distance.
Elements of the matrix with lower distance values correspond to crystal structures/MD data pairs of similar tunnels. Further improvements in distance calculation accuracy were achieved by fine-tuning the tunnels' amino acids with thresholds. For each of the compared pairs, two independent thresholds were used, and τ values for both lists of tunnel-forming residues in the crystal structure and MD simulations were scanned in the range of [0.05, 0.95] with a step of 0.05 (361 combinations in total). The combination of τ values which yielded the minimal distance was selected as the optimal one.

Obtaining protein sequences, and MSA
Each of the amino acid sequences of the selected sEHs (PDB IDs: 1S8O, 1CQZ, 2CJP, 4NZZ, 5URO, 5NFQ, and 5NG7) was used as a separate query for a BLAST [73] search of similar protein sequences. The obtained results were merged and duplicates were removed, providing 1484 unique sequences (including those primarily selected). The 12 outlying sequences were detected and individually checked in the Uniprot database [74]. Nine sequences were trimmed according to the hydrolase domain, and three were removed since there was no information or similarity with other sequences. Next, in order to eliminate proteins other than EHs from the set of sequences, the conserved motifs described by van Loo et al. [55] were used, and only sequences with motifs H-G-X-P and G-X-Sm-X-S/T were preserved (where X is usually an aromatic residue, and Sm is a small residue). As a result, 29 sequences were discarded during the analysis. In the last step of MSA preparation, additional domains (e.g. phosphatase domain) were removed. To detect sequences with an additional domain, a histogram of sequence lengths was prepared, and long sequences (> 420 residues) were trimmed all at once in a temporary MSA. In the end, a final MSA of 1455 epoxide hydrolase sequences was prepared with Clustal Omega [75] using default parameters (S11 Fig). BALCONY analysis BALCONY (Better ALignment CONsensus analYsis) [41], an R package, was used to analyse the MSA and map selected structural compartments/tunnels onto the correct positions in aligned reference UniProt sequences. The Schneider metric [42] was calculated for each alignment position. Selected structures of M. musculus, H. sapiens, S. tuberosum, T. reesei, and B. megaterium sEHs, as well as the two thermophilic enzymes collected in hot springs (respective PDB IDs: 1CQZ, 1S8O, 2CJP, 5URO, 4NZZ, 5NG7, and 5NFQ), were divided into compartments/tunnels as shown in S1 and S5-S11 Tables. The compartment/tunnel residues were then appropriately mapped with MSA, and Schneider entropy values were collected for each position in the MSA.

Variability analysis
To assess the variability of a particular tunnel/compartment, their positions were compared with selected positions of the MSA. The MSA was trimmed only to positions where at least one residue was present of the seven structures (PDB IDs: 1CQZ, 1S8O, 2CJP, 5URO, 4NZZ, 5NG7, and 5NFQ) (S12 Fig). The MSA containing 1455 sequences was trimmed from 722 to 419 positions. This way, for each comparison, Schneider entropy values of a compartment/ tunnel positions were compared to the Schneider entropy values of selected positions of the MSA in which were present: i) neither one of residues of the currently analysed compartment/ tunnel, and ii) at least one residue of the seven analysed structures. In order to determine whether a compartment was to be classed as variable, a median distance was calculated, which was defined as a difference between medians of Schneider entropy values of a selected compartment/tunnel and the selected positions in the MSA. If the median distance was > 0, then the analysed compartment was considered variable. To compare the distributions of entropy scores of analysed compartments/tunnels with the distribution of the selected positions of the MSA, the Epps-Singleton two-sample test [43] was used. The advantage of this test is the comparison of the empirical characteristic functions (the Fourier transform of the observed distribution function) instead of the observed distributions. The comparison analysis was performed using the es.test() function from GitHub repository [76]. In an attempt to visualise the variability of selected tunnels (Fig 5), the collected entropy values of selected tunnel-lining residues without the surface residues and the selected MSA positions were sorted separately, and cumulative distribution functions (CDF) were calculated. For each position in the selected tunnel, a paired one from the selected position in the MSA was found, based on the minimal CDF. Plots of CDF as a function of entropy score were prepared.   Table. List of amino acids forming analysed tunnels in bmEH structure. (XLSX) S10 Table. List of amino acids forming analysed tunnels in Sibe-EH structure. (XLSX) S11 Table. List of amino acids forming analysed tunnels in CH65-EH structure. (XLSX) S12 Table Table consists of: the MD simulation tunnel ranking (MDRank), tunnel name, its average length, number of tunnel-lining residues, number of tunnel-lining residues without the surface residues, number of surface tunnel-lining residues, and the last three columns comprise of the differences between the median distances of particular tunnel-lining residues and the remaining positions of the trimmed MSA: median distances of tunnel-lining residues (7th column), median distances of the tunnel-lining residues without the surface residues (8th column), and median distances of the surface tunnel-lining residues (9th column). Differences between median distances values that are marked bold passed the statistical significance of the Epps-Singleton two-sample test (p-value <0.05). Hyphen (-) indicates that there were no surface residues. NA means that the number of surface residues was insufficient to obtain the p-value median distance of the Epps-Singleton twosample test. (XLSX) S13 Table. Entropy values for selected tunnels, Tm1 from StEH1, Tc/m1 from hsEH, and Tc/m_back from bmEH. Surface amino acids are in italics. Active site residues are marked by an asterisks ( � ). (XLSX) S14 Table. The list of parameters set for both CAVER plugin and CAVER 3.0 tunnels identification for each of the analysed systems.