The response to selection in Glycoside Hydrolase Family 13 structures: A comparative quantitative genetics approach

The Glycoside Hydrolase Family 13 (GH13) is both evolutionarily diverse and relevant to many industrial applications. Its members hydrolyze starch into smaller carbohydrates and members of the family have been bioengineered to improve catalytic function under industrial environments. We introduce a framework to analyze the response to selection of GH13 protein structures given some phylogenetic and simulated dynamic information. We find that the TIM-barrel (a conserved protein fold consisting of eight α-helices and eight parallel β-strands that alternate along the peptide backbone, common to all amylases) is not selectable since it is under purifying selection. We also show a method to rank important residues with higher inferred response to selection. These residues can be altered to effect change in properties. In this work, we define fitness as inferred thermodynamic stability. We show that under the developed framework, residues 112Y, 122K, 124D, 125W, and 126P are good candidates to increase the stability of the truncated α-amylase protein from Geobacillus thermoleovorans (PDB code: 4E2O; α-1,4-glucan-4-glucanohydrolase; EC 3.2.1.1). Overall, this paper demonstrates the feasibility of a framework for the analysis of protein structures for any other fitness landscape.


A.1 Shape Simulation
The phenotype was simulated following equation 5 in the main paper, where the phylogenetic effects (a) were simulated using the rbv function in the MCMCglmm R package [13] to create a matrix of randomly generated multivariate normal phylogenetic effects. The generation of phylogenetic effects was constrained with a known p × p genetic matrix (G) that was drawn from an inverse Wishart distribution with a scale matrix with the digit 1 in the diagonal and 0.5 in the off diagonal entries. The input tree for the simulation of the phylogenetic effects was generated by the functions compute.brtime and rtree of the ape R package [36]. The error component (e) was built by performing a Cholesky factorization of a known p × p covariance matrix (E) and multiplying the decomposed matrix with a matrix of random values of shape n × p. The known E matrix was also drawn from an inverse Wishart distribution, and its scale matrix contained the digit 1 in the diagonal and 0.1 in the off diagonal. Summarizing, the simulation was performed by: 1. Creating the known phylogenetic (G) and error (E) covariance matrices 2. Using G to constrain the simulation of the phylogenetic effects (a), simulating the n × p matrix a given a random tree 3. Incorporating the desired covariation E into a n × p random matrix

Adding the error and phylogenetic effects
Here, the number of observations (n) and the number of variables/traits (p) were controlled.

A.1.1 Including a within group component in the simulation
To include a within group component to the simulation the same approach used before was applied, but a dynamic term was included in the simulation and the model. This term was performed as follows:

A.3 Testing Linear Mixed Models implementations
The estimations of G and E matrices with the Lynch's PMM model were performed using the Lynch's algorithm implementation in the R package ape [36]. The Bayesian Generalized Linear Mixed Model (BGLMM) estimations of G and E were performed using the R package for BGLMM MCMCglmm [13].

A.4 Testing the accuracy of the estimations
To test the extent of the bias in the estmations of G and E, the mean correlation and corresponding p-values of the Cheverud's Random Skewer (RS) test [6,7] implemented in the R package phytools [41] were used. Works such as Bégin et al. [3] have contentions about any given covariance matrix comparison methods; however, Cheverud's test is better suited for the framework under study. It introduces random vectors of change and compares the correlation of resulting vectors. This is in line with the quantitative genetic framework as in equation 3 in the main paper. It is also better suited for comparative studies than other tests of equality such as Anderson's maximum likelihood test of equality of covariances [2] or the common principal component (CPC) analysis [38]. Those tests have big biases given the sample size and the number of traits. Steppan [47] showed a positive relationship between the number of traits and the likelihood of rejecting equality.  number of traits (p), the computation becomes untractable in terms of memory and time. For p traits, p(p + 1)/2 covariance components need to be estimated per random effect, and thus the estimation burden increases quadratically [29,18]. Table   2 shows the time and memory spent in a simulation fixing n to 100 and varying p.
The estimation was performed using Lynch's PMM model as implemented in the R package ape [36], where many matrix inversions and Kronecker products are involved [see 27, for details]. This computation was performed in a PC Intel R Xeon R CPU E5-2620 v2 @ 2.10GHz Intel i3 3.10GHz 128 Gb 2x hexa core processor. Table 2 shows that at 16 traits over 300 Mb of memory and over 9.4 hours are required to compute the model. The behaviour with 32 traits is erratic. In the current simulation the computation cannot be performed since the memory requirements are too high. The problem scales up quickly, not only with the number of traits, but also with the number of individuals. This trend is due to computation of the inverse of the relationship and the identity matrices. The computation of these matrices take the most time (≈94%) and memory (≈60%). Also, the time and memory required to compute the Kronecker product of these very large matrices scale up quickly with the showing the dependency between complexity and both n and p. The reliability of the estimates is also affected. The sample size needs to be increased in order to estimate the covariance matrices with confidence. However, by increasing the sample size the computation becomes more complex. To test the extent of the bias Table 2 shows the mean correlation and corresponding p-values of the Cheverud's Random Skewer (RS) test [6,7].
Surprisingly, despite the expected instability of the matrix estimation given the sample size, most of the estimations were highly correlated (>0.70) and significant.
It is important to state that the (RS) test does not evaluate equality of the covariances (in fact most covariances were very dissimilar). However, the overall response to a vector of selection is highly correlated. For the purposes of this work, the matrix equality in response to disturbances is more relevant than the exact match between the covariance matrices' values. This is because the response to disturbances expressed in the RS test follows the same framework as the expression detailed in equation 3 in the main paper.

B.1.1 Bayesian solution to the memory requirements
Given that Bayesian generalized linear mixed models (BGLMM) use Markov chain Monte Carlo (MCMC) simulations and usually Gibbs sampling, the memory requirements should lower significantly. However, Table 2 show the same trend on a different scale when using the R package for BGLMM MCMCglmm [13].
We simulated up to 32 traits using approximately 2 hours in the bigger dataset when the sample size and the number of MCMC iterations were held constant.
The results show that it was not the memory but the time that benefited from the Bayesian approach since over 1Gb was used (Table 2). However, this memory requirement can be lowered if fewer MCMC iterations are performed. Nevertheless, lowering the number of iterations can only be done on a per-case basis since the convergence has to be guaranteed. Table 2 describe the accuracy of this approach. With these particular data, there is higher or similar accuracy to that of Lynch's approach while being faster when the number of traits is high. With the given dataset, when more than 32 variables were analysed, MCMCglmm [13] reported ill-conditioned priors when completely flat priors (identity matrices) were used. With this in mind, and the fact that on 32 traits the memory requirements remain over 1Gb, a new approach was required.

B.1.2 Sample size effect in the estimation
It is known that the sample size is also an issue in the estimations of G especially when the number of traits increases. To test this assumption, we simulated and measured time and memory usage for the estimations following the same strategy used as before. In this case, the simulation had a fixed p equal to 8 and varying levels of n: low (16 observations), medium (64 observations), and high (256 observations). Table 3 shows the requirements in time and memory, as well as the average accuracy and the standard deviation for 10 replicates. The estimated mean is slightly biased resulting in a decrease in the mean. It is expected that accuracy increases with sample size. However, Table 3 shows that this is only significantly true for the residual matrices. This phenomenon might be attributable to the differences in the scaling structure in the original simulated matrices.
Data from Table 3

B.2 Beyond the OTUs: partitioning the variance within taxonomic units
We know by the decomposition of the phenotype into its components that the phenotypic variance can be explained by the genetic, environmental, and interaction components. It is also known that if repeated measures of a trait are available, the variance can be further partitioned into a third component. In general, comparative evolutionary biology, such components, may include differences among populations, phenotypic plasticity, sampling variation, instrument-related error, physiological state, variation related to age, sex, season, or time of day, among others [20]. All these sources of variations greatly depend on the way the sampling was done and the trait in hand. By including a within-group, within individual in traditional quantitative genetics, or within species in comparative studies, nuisance parameters can be dealt with. Many studies have shown the importance of dealing with measurement errors and deviations from the between-group analyses [15,20,9,12,10,45]. Therefore the within-group analysis is ideal. In protein structures, repeated measures can be taken as snapshots of molecular dynamic simulations of a protein in solution (refer to methods for details), thereby adding a partition to the variance. In this set up, another variable is added to the model: where m is the matrix of individual effects or effects of the dynamics of a protein.
This approach has an application in structural biology since it allows partitioning the structural variation into:  can be seen in Table 4.
With the inclusion of extra parameters there is a significant increase in the time and memory required for the estimation. However, the high accuracy in the phylogenetic covariance matrix estimation is surprising since a more unstable estimation was expected. The accuracy was anticipated to diminish quickly with the number of variance-covariance components to be estimated. For 3 random effects, there are 3q(q+1) 2 variance-covariance components that have to be estimated. This complexity makes the estimation of parameters unstable and intractable. However, data in Table 4 show high correlation values for the phylogenetic and acceptable values for the dynamic component. In the case of the residual value, most correlations were low and non significant. This is also an interesting observation since the trend seen in previous sections showed that an accurate estimation of the residual matrix was more feasible than the phylogenetic one.
Despite an improvement in accuracy, it can be seen that the time and memory required for the computation are still prohibitive; therefore, when q and n are high, another approximation is needed.

B.3 Fitness landscape in the GH13 dataset
To infer the source and target structures, a fitness landscape must be defined. In this paper we computed the average ∆G • unf old per residue as∆G • unf old = ∆G • unf old n , n being the number of residues. With this∆G • unf old as proxy for fitness we can try to explore the fitness surface. To do this, we used the first two principal components of a PCA analysis of the shapes as X and Y axes;∆G • unf old in the Z axis.