Crystal structure of Thermus thermophilus methylenetetrahydrofolate dehydrogenase and determinants of thermostability.

The elucidation of mechanisms behind the thermostability of proteins is extremely important both from the theoretical and applied perspective. Here we report the crystal structure of methylenetetrahydrofolate dehydrogenase (MTHFD) from Thermus thermophilus HB8, a thermophilic model organism. Molecular dynamics trajectory analysis of this protein at different temperatures (303 K, 333 K and 363 K) was compared with homologous proteins from the less temperature resistant organism Thermoplasma acidophilum and the mesophilic organism Acinetobacter baumannii using several data reduction techniques like principal component analysis (PCA), residue interaction network (RIN) analysis and rotamer analysis. These methods enabled the determination of important residues for the thermostability of this enzyme. The description of rotamer distributions by Gini coefficients and Kullback–Leibler (KL) divergence both revealed significant correlations with temperature. The emerging view seems to indicate that a static salt bridge/charged residue network plays a fundamental role in the temperature resistance of Thermus thermophilus MTHFD by enhancing both electrostatic interactions and entropic energy dispersion. Furthermore, this analysis uncovered a relationship between residue mutations and evolutionary pressure acting on thermophilic organisms and thus could be of use for the design of future thermostable enzymes.


Introduction
While considerable experimental and theoretical advances have been made, the precise underlying mechanisms of protein thermostability remain elusive. However, several important features of thermophilic protein stabilization mechanisms have been identified. Thermostability of proteins appears to involve enhancing the number of salt bridges, hydrogen bonds and charged amino acids like arginines [1][2][3] and a better hydrophobic packing and/or larger hydrophobic core including more amino acids like tyrosines [4][5][6]. Certainly, thermostability can be thus linked to a multitude of different mechanisms with no single mechanism uniquely determining the thermostability of all proteins. One emerging motif is the possible presence of a network of stabilizing charged residues in thermophilic and especially extreme thermophilic proteins [7,8].
Recent studies have analyzed thermostability using sampling simulation techniques like molecular dynamics (MD) [9,10]. Complemented with known analysis tools, these techniques have been fruitful in characterizing the dynamic rather than just the static causes of thermostability. MD analysis has been complemented by more global analyses of trajectories, including normal-mode analysis (NMA) [11], principal component analysis (PCA) [12] and residue interaction network (RIN) analysis [13,14].
Here, we have solved by X-ray crystallography the structure of methylenetetrahydrofolate dehydrogenase (MTHFD) from Thermus thermophilus. This organism is an extreme thermophilic model organism, which has been described as having an optimal growth ranging from 65˚C to 72˚C and being resistant to maximum temperatures of 85˚C [15,16]. MTHFD is a key enzyme in the folate-dependent one carbon metabolism, which is important for the biosynthesis of several amino acids like glycine, alanine and serine, as well as nucleotide bases, formylated methionine and some pro-vitamins [17,18]. In some bacteria like E. coli, MTHFD (called FolD in this species) is a dual function enzyme that catalyzes the conversion of 5,10-methylenetetrahydrofolate and NADP + to 5,10-methenyltetrahydrofolate and NADPH (dehydrogenase function), and further converts the first product to 10-formyltetrahdrofolate (cyclohydrolase function) [19] (Fig 1A). Some microorganisms have a separate enzyme for catalyzing the second cyclohydrolase function (FchA) [20], and some species have an alternative pathway for 10-formyltetrahdrofolate production using formyltetrahydrofolate synthetase (Fhs) [21]. In eukaryotes, MTHFD is an important anticancer therapy target [22,23].
To analyze possible determinants of Thermus thermophilus MTHFD (TtMTHFD) thermostability, we have carried out 300 ns MD simulations of this enzyme at three temperatures (29.9˚C / 303 K, 59.9˚C / 333 K and 89.9˚C / 363 K). Additionally, the enzyme was compared with two related prokaryotic enzymes from the thermoacidophilic archaeon Thermoplasma acidophilum (PDB identifier 3NGL) [24], which is known to grow at temperatures between 45˚C and 62˚C with an optimum temperature of 59˚C [25,26] and the mesophilic gram-negative opportunistic human pathogenic bacterium Acinetobacter baumannii (PDB identifier 4B4U) [27], which thrives between 25˚C and 45˚C, with maximum growth rates at 37˚C [28]. To analyze the obtained MD trajectories, we combined several well-known analysis tools such as Principal Component Analysis (PCA) and Residue Interaction Network (RIN) analysis. For rotamer analysis, we adapted methods from other research fields, like Gini coefficient analysis and Kullback-Leibler (KL) divergence analysis. All these methods, together with clustering methods, provide data and dimension reduction capabilities which we used to identify important amino acids that determine the heat stabilization of the protein.
Principal component analysis (PCA) is a dimensionality-reduction method that can be applied to simplify complex data sets [12]. Briefly, a covariance matrix of the atomic coordinates of a MD trajectory is diagonalized to obtain its eigenvectors and associated eigenvalues, which represent a measure of the main divergence of the original trajectory. Thus, PCA can be used determine the major movements of proteins, which are called, in this sense, 'principal components' (PCs).
Residue Interaction Network (RIN) analysis was carried out using the concept of betweenness centrality (BC). Essentially, every amino acid in a protein (as represented by the position of its C α and C β atoms) is considered a node in an interaction network with the edges characterizing node interaction defined when the C β atoms are located within a given cut off distance. The number of the shortest paths between two residues that pass through a chosen node is termed 'betweenness centrality' of the node. BC can be considered a measure of the structural prominence of a residue [13].
Gini coefficients, originally introduced in econometrics to describe the equality of income distributions [29], are defined as the ratio between the areas under a true income distribution and a total equality distribution. In our analysis, we borrowed this notion and applied it as a simple measure of the breadth of distribution of side chain rotamers. Side chain rotamers are defined by the usual χ-angle torsion conformations of a protein's side chains [30]. Smaller Gini coefficients indicate a rotamer distribution tending towards a uniform distribution. Negative differences of the Gini coefficients of rotamer distributions at two different temperatures were interpreted as a measure of entropy-related energy dispersion effects. Gini coefficient analysis was additionally corroborated using Kullback-Leibler (KL) divergence analysis, which is a finer measure of discrete probability distributions. KL divergence (D KL ), often called the relative entropy, is defined as where P and Q are two discrete probability distributions satisfying 0 < p i ; q i < 1, It was originally introduced by Shannon to express the information content of a message transmitted to a receiver [31]. When the sender and receiver probability distributions share similar representations of a message's information content, D KL (P||Q) vanishes. Non-vanishing KL divergence is a qualitative measure of discrimination between sender and receiver information content. Here, we interpret D KL (P||Q) to discriminate two side-chain rotamer probability distributions at different temperatures in terms of information content but, more importantly for our purposes, as an indication of entropy related energy dispersion effects.
Finally, to combine these different data reduction methods, dimensionless Z-scores (in form of standard deviations away from the mean) were introduced and clustering techniques applied. Clustering is an unsupervised machine learning technique that groups objects based on their features. Hierarchical clustering is based on distance connectivity [32] and was used here to filter the MD analysis data for important residues involved in the thermostability of TtMTHFD.

Crystallization and crystallography
TtMTHFD was crystallized by the hanging-drop method using 1 μL drop/1 μL reservoir ratio conditions in 24-well plates under conditions containing 800mM Potassium Sodium Tartrate and 100 mM HEPES buffer pH 7.5 as crystallization buffer. The obtained crystals were measured after flash-cooling in liquid nitrogen using crystallization buffer supplemented with 26% glycerol. Two datasets at 1.5406 Å and 1.4587 Å were obtained on a Rigaku MicroMax-007 HF microfocus rotating anode diffractometer with an R-AXIS IV++ image plate detector at the Analytical Center at the Chemistry Institute of São Paulo University (IQ-USP, São Paulo, Brazil) and the MX2 Beamline at the Brazilian Synchrotron Light Laboratory (LNLS, Campinas, Brazil) [33]. Crystallographic data were processed with XDS [34]. The structure of MTHFD was solved by molecular replacement (MR) using the Phaser [35] module of PHENIX [36]. Electron-density maps were inspected and the structural model was built using Coot [37].

Molecular dynamics (MD)
MD simulations were carried out using the AMBER [38] simulation software package. Parametrization of the NADP + nucleotide was performed using antechamber and the semi-empirical AM1 with bond charge correction (AM1-BCC) method [39]. The MTHFD NADP + complexes were processed with tleap using the ff14SB AMBER force field for the protein and the Generalized Amber Force Field (GAFF) for the nucleotide, neutralized with counter-ions and solvated with the TIP3P water model in a cuboid integration box with 14 Å solvent margins from the complexes. After structure minimizations, short heating MD runs and density equilibration MD runs, 300 ns molecular dynamic production runs of all three dimeric complexes were performed using the GPU-version of pmemd at three temperatures (303 K, 333 K and 363 K). Production runs were carried out using 2 fs time steps, constant pressure with isotropic position scaling, Particle Mesh Ewald (PME) periodic boundary conditions with an 8 Å classical non-bonded cut off, a 2 ps collision frequency Langevin thermostat, the SHAKE algorithm with bonds involving hydrogens constrained and bond interactions involving hydrogens omitted. A total of 200.000 snapshots were produced. RMSD, RMSD per-residue and radius of gyration analysis was carried out with cpptraj.

Principal component analysis
Principal Component (PC) Analysis was carried out to compare and analyze main trajectory divergences between trajectories at two temperatures using cpptraj [40] as described by Galindo-Murillo et al. [41]. Basically, both trajectories were concatenated, RMS-fitted to the first frame and an average trajectory calculated that was fitted to all other frames. Next, the covariance matrix between the three coordinates of all C α atoms was calculated and diagonalized to determine its eigenvectors and associated eigenvalues. The coordinates of the trajectories were then projected onto the eigenvectors to determine the influence of residues on the principal components by scalar multiplication p ¼ hxi � e, where x are the three vector components of the average position of each atom and e the corresponding vector components of a specific eigenvector. A mass weighted sum of the P values of all atoms in a residue was calculated to obtain each residue's participation on the specific principal component. Additionally, principal component trajectory interpolations were carried out. Principal component distribution was evaluated by histograms in bins by projecting the C α trajectory coordinates on the corresponding eigenvector components. The obtained values were used to calculate interpolated trajectories, such as P ¼ hXi þ lE, where hxi is the average coordinate over the whole vector space and E a specific eigenvector and λ evaluated from the histograms.

Residue Interaction Network (RIN) analysis
RIN analysis was carried out using MD-TASK [13]. Trajectories were reduced to C α and C β atoms using cpptraj. Betweenness centrality (BC) was calculated using the calc_network.py script. Networks at two different temperatures were compared using the compare_networks. py script.

Rotamer analysis
The rotamer distribution of the amino acid side chains (chip, chi2, chi3, chi4) of the proteins were evaluated using cpptraj based on a library of the most common rotamers in protein structures [42], using the algorithm described by Haddad et al. [43]. Differences between the Gini coefficients at different temperatures were calculated using following equation for the Gini coefficients where n is the total number of rotamers of an amino acid and x i is the observed number of the i th rotamer and hxi is the average value of the rotamer sampled over a trajectory. The KL divergence D KL (P||Q) was calculated using the distribution of side-chain rotamers relative to a uniform distribution of those side-chain rotamers.

Heat maps
For heatmap calculations, sequences were aligned with Clustal Omega [44] and inserts between structures removed after manual inspection. Per-residue PCA component values from temperature difference PCA analysis, Gini coefficient differences, temperature dependent BC analysis values and difference KL (ΔKL) values for two temperature distributions when compared with the corresponding uniform distributions were then loaded into a comparison matrix using Z i scores (based on the ΔKL, GCD, PC1, DBC for the three different temperatures), where Z i ¼ ðx À hxiÞ=sdðxÞ. Besides correlation analysis, a hierarchical cluster analysis was performed on the comparison matrix using the Python Seaborn library (https:// github.com/mwaskom/seaborn/tree/v0.8.1). Additionally, an overall per-residue Z-score was calculated as z ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi X z 2 i q .

Residue identity histograms
Residues with similar composite Z-scores of a protein were binned. Then, the average amino acid identity of these binned amino acids calculated in comparison with one of the other proteins. Finally, the calculated amino acid identity values were plotted over the Z-score bins.

Results and discussion
The MTHFD fold In order to gain more insight into the temperature resistance of T. thermophilus MTHFD (TtMTHFD), we produced the enzyme recombinantly in E. coli BL21(DE3), purified and Pearson's correlation coefficient calculated with data set randomly split in half.
e R work Vare the observed and calculated structure factor amplitudes.
f R free was calculated as R work with 10% of the data omitted from structure refinement.

PLOS ONE
Thermostability Determinants of MTHFD molecules in the asymmetric unit. The model can be described after refinement by a crystallographic R factor of 21.3% with a corresponding R free of 24.4% ( Table 1). TtMTHFD consists of two α/β-fold sub-domains connected by two large α-helices, which are at the same time the Nterminal and C-terminal α-helices. The N-terminal α/β-fold sub-domain is the catalytic domain that binds the main substrate, and the C-terminal α/β-fold sub-domain is the nucleotide (co-substrate) binding domain, which are variants of the structurally conserved Rossmann dinucleotide-binding domain fold (Fig 1B). Both domains are interconnected by an extended loop-helix-loop motif that we here termed 'Connector Helix (CH) region'. In the nucleotide binding domain, we were able to unambiguously identify and fit a molecule of NADP + as shown in Fig 1C. The obtained structure showed, as expected, a high degree of conservation when compared to similar bacterial MTHFD structures (Fig 1D and 1E). In order to compare TtMTHFD, we chose two related MTHFD structures from Thermoplasma acidophilum, an organism with growth optimum between 45˚C and 62˚C (PDB entry 3NGL) and Acinetobacter baumannii, an organism with growth optimum between 25˚C and 45˚C (PDB entry 4B4U). Compared to the Thermoplasma acidophilum MTHFD (TaMTHFD) and Acinetobacter baumannii MTHFD (AbMTHFD) structures, T. thermophilus MTHFD (TtMTHFD) showed an overall RMSD of 0.98 Å (with an overall sequence identity of 36%) and an overall RMSD of 1.29 Å (with an overall sequence identity of 39%), respectively (Fig 1D).
Concerning salt bridges, both TtMTHFD and TaMTHFD have 67 charged surface amino acids, while AbMTHFD has 57 charged surface amino acids. Conversely, using a cut-off value of 4 Å, a higher amount of salt-bridges in the crystal structure of TtMTHFD (n = 26), when compared with TaMTHFD (n = 16) and AbMTHFD (n = 10), could be identified. Interestingly, again TtMTHFD showed a higher number of inter-chain salt bridges (n = 6) in comparison with TaMTHFD (n = 2) and AbMTHFD (none), indicating the importance of salt bridges for the thermo-stabilization of dimers. Furthermore, TtMTHFD had a higher number of arginine-formed salt-bridges (n = 23) compared with TaMTHFD (n = 5) and AbMTHFD (n = 6), as expected for thermophilic proteins [47].
As crystal structures measured under cryo-conditions give a static view of proteins, 300 ns molecular dynamics (MD) simulations of the dimeric form of all three proteins where carried out at three temperatures (303 K, 333 K and 363 K). These temperatures partially represent the appropriate temperatures for mesophilic organisms, like A. baumannii, a thermophilic organism like T. acidophilum and an even more thermoresistant organism like T. thermophilus. The corresponding RMSD values over the trajectories of these simulations are shown in Fig 2A, 2B and 2C and the corresponding per-residue average RMSD values of all non-hydrogen atoms are shown in Fig 2D, 2E and 2F. Besides small trends, we could not detect significant shifts of RMSD values (Fig 2A, 2B and 2C) nor the radii of gyration. The radius of gyration of the monomers oscillated at the three temperatures around an average of 19.3 Å ± 0.2 Å for AbMTHFD, 19.1 Å ± 0.2 Å for TaMTHFD and 19.4 Å ± 0.2 Å for TtMTHFD.
Regarding the RMSD values over time (Fig 2A, 2B and 2C), a slight trend to higher RMSD values with temperature could be seen in all three proteins. This is especially true for the mesophilic A. baumannii structure where a separation of the RMSD values could be seen, indicating a possible partial denaturation of the protein with higher temperatures. Moreover, the per-residue RMSD (Fig 2D, 2E and 2F) showed a similar distribution in all three structures. In this analysis, a high correlation of RMSD values between the residues in each monomer of the simulated dimers could be observed. Pearson correlation coefficients ranged from 0.75 to 0.96 with an average value of 0.90, indicating a coherent behavior of the MD simulations. Higher RMSD values correlated with loop regions, as expected. All three proteins showed similar distributions of RMSD values, with the notable exception of the T. acidophilum lid loop region, which showed a slightly lower RMSD values when compared to the homologous regions of the other two structures (Fig 2D, 2E and 2F).

Principal component analysis
Because molecular dynamics is a thermodynamic sampling technique, we hypothesized that a more systemic variation technique like principal component analysis (PCA) could be deployed to extract the main distinctions in movement at different temperatures. Thus, to obtain more insight on possible underlying mechanisms behind thermo-resistance, PCA was carried out so that two temperatures were compared, e.g. comparing 303 K to 363 K trajectories.
Important insight can be gained by interpolating movements along the eigenvectors of the PCs (Fig 3A, 3B and 3C). Several types of movements could be identified. The most prominent movements were: opening/closing of the lid loop and different directed movements leading to opening/closing or distortion of the substrate binding domain. In the case of TtMTHFD, displacements of C α atoms between 303 K and 363 K showed precisely these two movements. Firstly, the lid loop closes downwards in direction of the substrate binding site. Secondly, the substrate binding domains followed a well-defined movement (1 st PC, eigenvalue of 295, 2 nd PC, with a similar eigenvalue of 231). These movements evolve around hinges on the N-terminal helix and the region of the connector helix. Similar patterns could be detected in the PCA of the 333 K to 363 K and 303 K to 333 K trajectories of TtMTHFD (not shown). Altogether this indicates that TtMTHFD shows similar coordinated movements at different temperature steps. In the case of AbMTHFD, the movements are similar. In the case  These main movements described here, are possibly a structural characteristic that is intrinsic to the dual α/β-domain fold of MTHFD, as similar movements were deduced by simply comparing the two monomers in the dimer of previously described crystal structures [17,18]. In these dimeric crystal structures, superposition of the NCS related monomers show a torsion of the substrate binding domain at 'hinges' on the N-terminal and α-helical connecting domains, thus in regions very similar to the ones identified by our PCA. Therefore, overall, PCA analysis confirms qualitatively the expected motion of the substrate binding domain.
To expand the overall motion to the residue level, trajectory projections were calculated. In detail, these trajectory projections consist of multiplying the average coordinate of the C α atoms of the trajectories by the corresponding components of the eigenvectors to obtain the absolute value of the contribution of each C α to the 1 st PC (Fig 3D). In conclusion, PCA highlighted the main differences between the protein at different temperatures and could be described at the amino acid level using the vector projections. Many of these displacements showed a similar trend at different temperatures, indicating that PC movements are qualitatively based on the structural movements of the protein fold. We therefore set out to determine other important descriptions of differences in temperature that could be linked to the observations derived from PCA.

Residue interaction networks
As PCA showed important conformational changes, we then carried out residue interaction network analysis (RIN) on the trajectories of the three proteins (Fig 4A, 4B and 4C). Again, these images showed subtle but detectable differences between the three proteins, this time centered around the connector helix region. Like in the case of 1st PC projections, also RIN analysis showed less movement for the more thermostable protein TtMTHFD. As a whole, the BC distribution of amino acids affected by temperature jumps were similar in all three structures. AbMTHFD showed a movement that could be interpreted as partial denaturation. TaMTHFD showed a border-line behavior between TtMTHFD and AbMTHFD. Regarding BC differences, the more thermophilic TtMTHFD showed smaller movements reminiscent of a more thermo-resistant protein. This allows to narrow down the analysis of movements to obtain a more mechanistic view of the temperature effects. However, the interpretation of both the PC and RIN analysis could be more related to effects than causes. Therefore, a more causative explanation of the observed displacements was required.

Rotamers
To obtain more causative insight, we decided to undertake a rotamer analysis of the MD trajectories. This was initially motivated by the hypothesis that charged residues play a role in dynamic thermostability by acting as energy dissipation structures [7,8]. Analysis of side chain conformations of amino acids carried out in crystal structures show a preference for a limited number of the so called rotamer conformations. Thus, a published rotamer library was used to classify the rotamers in the MD trajectories onto one of these conformations. Next, to describe rotamer distributions, Gini coefficients were introduced as a measure of these distributions. Gini coefficients are normally used in econometrics to measure the inequality (higher Gini values) of income distributions. These Gini coefficients of the conformation distribution among the main rotamers were calculated for the MD trajectories and compared at different temperatures for each protein. For the sake of simplicity, in the following we will refer to this 'rotamer Gini coefficient' simply as the Gini coefficient. To detect essential amino acids with a particularly high differences in Gini coefficients, residue plots were constructed (Fig 5A, 5B and 5C).
Because of the simplicity of Gini coefficients, we extended this analysis using the concept of KL divergence. In computational science, KL divergence is an important measure of comparisons between probability distributions. We thus calculated the KL divergence for different temperatures in reference to a uniform distribution of rotamer probabilities, so as to corroborate possible energy dissipation in rotamers (Fig 5D, 5E and 5F). Because of the uniform distribution reference, reduced KL divergence values indicate a more uniform distribution.
As expected, in TtMTHFD most residues show a reduction in both KL divergence and Gini coefficients upon transition from 303 K to 363 K (Fig 5C and 5F). Interestingly, in these figures, the more thermo stable TtMTHFD shows a slightly more uniform distributed peaks with lower intensities, both for the Gini coefficients and KL divergence values. In addition, several charged residues and hydrophobic residues appear to have especially reduced Gini coefficients and KL divergences. This is particularly striking for the KL divergencies of TtMTHFD, where most significant peaks are from charged residues (Fig 5F). Thus, the rotamer analysis confirms the hypothesis that temperature resistant proteins have more charged large amino acids like arginine and lysins on their solvent exposed surface possibly because these residues act like heat energy dissipators. To corroborate the validity of Gini and KL analysis, average Gini and KL values were plotted against the temperature (Fig 5G and 5H, respectively). As can be seen, both average Gini and KL values are reduced with higher temperatures, corroborating their validity.

Determinants of thermostability
Taken together all four analyses indicated important determinants of thermostability in MTHFD. To obtain more mechanistic insight, it is however essential to integrate the different approaches in order to identify correlations between the different descriptors. To scale these ΔKL, GCD, PC1 and DBC values at the three temperatures for comparison, different Z i scores (that is number of standard deviations from the mean) were calculated for these per-residue values. Then, heat map analysis was carried out with the different perresidue Z i scores using hierarchical clustering (Fig 6A). The clustering showed strong correlations between the difference KL values (ΔKL) and the Gini coefficient differences (GCD). Additionally, the 1 st PC values (PC1) at different temperatures tended to correlate with each other, as did the ΔBC values (DBC).
To highlight better the clustering of residues, composite per-residue Z-score, defined as the RMS of the individual Z i, were calculated. When plotted on the structure, these residues with high composite Z-scores are mostly of two kinds. They are mostly charged solvent exposed residues or semi-buried hydrophobic residues (Fig 6B). As examples of these high composite Zscore residues can be cited: Val 228, Asn 229, Val 231 and Glu 232 from the lid loop; Arg 213, Glu 242, Glu 245 from the nucleotide binding domain; Glu 108, Phe 119, Pro 121, Arg 126, from the connector helix region; Arg 102 from the substrate binding domain. Of these cited 12 residues, four are identically conserved in TaMTHFD and two in AbMTHFD. Except the charged residues from the lid loop, all other charged residues from TtMTHFD mentioned form salt-bridges. Thus, some of the residues with high composite Z-scores indeed are involved in salt bridges, which appear to be one of the determinants of thermo stabilization of TtMTHFD. Residues involved in salt bridges in this structure, have an average composite Zscore of 3.8, which can be considered below 1 standard deviation (1.9) above the average composite Z-score (2.9), indicating that not all salt bridges have elevated Z-scores. Regarding the ΔKL value from the 303 K to 363 K jump, charged amino acids of TtMTHFD not involved in salt bridges have a mean value of -0.14, whereas charged amino acids in salt bridges have a value of -0.19. This is an interesting result, as it indicates that salt bridges have an elevated capacity to absorb heat by energy dissipation. Thus, an important conclusion is that in TtMTHFD a large amount of salt bridges probably contributes both statically, i.e. through more elevated Coulomb interactions, as well as dynamically, through higher energy dissipation capabilities, to the stabilization of the protein.
Because it can be assumed that residues with high composite Z-scores probably are structurally important for thermostability, average Pearson correlation coefficients for the different evaluated parameters were calculated. These Pearson correlation coefficients are shown in Table 2 for all residues of TtMTHFD. Interestingly, a high correlation was found between GCD and ΔKL (0.68) and a slight correlation between PC1 and GCD (-0.37). However, when the residues with high composite Z-score (i.e. with more than two standard deviations from the average) were compared, more significant correlations for GCD/ ΔKL (0.74) and PC1/GCD (-0.50) ( Table 3) could be obtained. Thus, a high composite Z- score seems to indicate amino acids which show elevated correlating behavior between principal components and rotamer energy dispersion. The (negative) correlation between PC1 and GCD is particularly intriguing because it indicates that movement (as described by the PC1 calculated from joint 303 K and 363K trajectories) actually "transfers" energy to rotamers movements during temperature increases. To visualize rotamer energy dispersion, PCA-analysis of the rotamer library value changes at different temperatures during the MD trajectories was carried out. Fig 6C shows PCA of the charged residues with 2 sigma level composite Z-scores at 303 K and Fig 6D shows PCA at 363 K. From these PCAprojections, the energy dispersion effect of the charged residues becomes apparent, corroborating the notion that charged residue networks could be of great importance for the thermostabilization of proteins.
To check if there is any correlation of composite Z-scores with amino acid mutations between TtMTHFD and the other two proteins, residue identity histograms comparing proteins were generated. Indeed, there was a trend of amino acids with high composite Z-scores to be more mutated between TtMTHFD and TaMTHFD (Fig 6E). This trend was even more significant when TtMTHFD was compared with mesophilic AbMTHFD (Fig 6F). In the case of heat-maps and residue identity histograms for TaMTHFD and AbMTHFD, similar results could be found (S1A and S1B Fig).
As an overall conclusion, this indicates that composite Z-scores like the one defined in this work, could be considered important descriptors of thermostability in proteins. Thus, composite Z-scores integrating rotamer analysis by KL divergence and Gini coefficients, PC and RIN analysis data from MD simulations can be used to define amino acids that play a more active role in thermostability of protein folding. Furthermore, this type of analysis can eventually be used to predict the effect of mutations on the thermostability of proteins and thus be of help in the protein design of thermostable mutants. We therefore propose an analytical mechanism to detect important mutated residues that determine the thermostability of proteins. It is tempting to speculate, as postulated by Ladenstein et al. [7], that rotamer energy dispersion at higher temperatures is causally related to thermostability in proteins like TtMTHFD via a mechanism involving interacting charged residues, as we showed here with help of rotamer distribution analysis. Further research has to be carried out to corroborate and expand this hypothesis.