Dynamics based clustering of globin family members

A methodology to cluster proteins based on their dynamics’ similarity is presented. For each pair of proteins from a dataset, the structures are superimposed, and the Anisotropic Network Model modes of motions are calculated. The twelve slowest modes from each protein are matched using a local mode alignment algorithm based on the local sequence alignment algorithm of Smith–Waterman. The dynamical similarity distance matrix is calculated based on the top scoring matches of each pair and the proteins are clustered using a hierarchical clustering algorithm. The utility of this method is exemplified on a dataset of protein chains from the globin family and a dataset of tetrameric hemoglobins. The results demonstrate the effect of the quaternary structure of globin members on their intrinsic dynamics and show good ability to distinguish between different states of hemoglobin, revealing the dynamical relations between them.


Introduction
Protein structures are dynamic rather than static, and it is protein dynamics that play a key role in executing their functions [1][2][3][4][5][6][7][8][9]. Understanding the relation between protein function and dynamics is fundamental for comprehending the protein structure-dynamics-function relationship. Such an understanding can stem from a comparison of the dynamics of related proteins or the same protein in different states. Comparison of sequences and structures is a common approach in the study of proteins. BLAST [10] is by far the most widely used software for sequence similarity detection, and structure based comparison and classification algorithms like CATH [11], SCOP [12], and DALI [13] provide a good overview of the entire protein structure universe. In recent years, several tools and techniques have been developed for comparison of protein dynamics and contributed to the development of the field of comparative dynamics [14][15][16].
Many of the tools and techniques for dynamical comparison rely on analysis of low-frequency normal modes from coarse-grained elastic network models (ENM) or principal components analysis of biomolecular structures and dynamic simulation. These studies have proven useful in unraveling the collective modes and, in particular, those at the low frequency end of the mode spectrum that underlie protein equilibrium dynamics [17]. Simple and computationally efficient normal mode models were developed to study protein dynamics [18,19].
A detailed structure of a protein is not necessary to obtain the global dynamics, rather the shape of the molecule plays a predominant role in determining the eigenvectors [20]. Even low-resolution on-lattice models can provide insights into functionally important global dynamics [21]. Low-frequency normal modes from coarse-grained ENMs have been shown to match experimentally observed conformational changes [22,23]. Recently, using Anton supercomputing technology [24,25], a comparison was made between PCA modes obtained from micro-to milli-second full atomic MD simulations and modes obtained from the Anisotropic Network Model (ANM) [26,27]. Close overlap was found between the principal modes of these two techniques, reinforcing normal mode analysis as a tool for exploring protein dynamics [28].
Clustering algorithms rely on similarity (or distance) scores to identify closely related proteins. Sequence clustering algorithms used percentage sequence identity of aligned proteins to measure sequence similarity. In order to reduce computational cost, short word filtering of non-relevant sequences may precede the alignment step such as in BLAST [10] or CD-HIT [29]. Structure clustering algorithms use structural similarity scores such as the Root Mean Squares positional Deviation (RMSD) or TM-score [30,31]; for comprehensive review see [32]. For example, the DALI server optimizes a structural alignment; that is, a sequential set of one-to-one correspondences between C α atoms [33]. Similarly, there is a need to develop dynamical similarity scores; such scores can enable us to classify proteins according to their dynamics. Previously, Zen et al. developed a method to align proteins based on their equilibrium dynamics inferred from ENM [34]. Munts et al. showed that dynamics similarity can be measured by comparing Dynamic Fingerprint Matrix [35]. We previously showed that the similarity of normal modes of motion can be measured using alignment algorithms [36,37] based on the global sequence alignment algorithm of Needleman-Wunsch [38]. The optimal way of comparing complex molecular motions is, however, far from trivial and no efficient methods were developed so far for large scale dynamics-based clustering of proteins.
A new dynamics similarity score (DSS) based on local alignment of ANM modes of motion is presented here. The algorithm is based on the local sequence alignment algorithms of Smith-Waterman [39]. Its ability and usefulness to measure dynamics similarity is demonstrated on the globin family of proteins. Globins are globular proteins comprising 6-8 α-helices (labelled A-H), with members distributed across all three domains of life: bacteria, archaea and eukaryotes. Each globin polypeptide binds a single molecule of iron-protoporphyrin-IX (Heme B) that can bind diatomic gaseous ligands such as O 2 , CO, NO, and other small ligands [40]. In mammals, hemoglobin (Hb) acts as an O 2 carrier to transport O 2 from lungs to tissues, while myoglobin (Mb) is responsible for intracellular O 2 storage in muscles and its transport from the plasma membrane to mitochondria. The heteromeric quaternary structure provides the hemoglobin a mechanism for cooperative oxygen-binding and allosteric regulatory control [41], so the protein switches between two forms: tense (T) low-affinity state and relaxed (R) high-affinity state upon ligand binding [42]. There are two classical models proposed for describing the allosteric mechanism of Hb: the Monod, Wyman, and Changeux (MWC) concerted model [43] and the Koshland, Nemethy, and Filmer (KNF) sequential model [44]. After solving the R conformation the R2 conformation was discovered, this conformation more accurately represents the liganded Hb end state [45]. Two different structural sub-classes of the globin fold are recognized. The 3-on-3 fold is the canonical Hb fold, exemplified by Mb. The '3-on-3' designation refers to the α-helical 'sandwich' formed by the A-G-H and B-E-F helices [46]. Members of this class include, for example: Hb, Mb, non-symbiotic Hb and protoglobin. The second structural class is the truncated Hb (trHb) class, also called '2-on-2' Hbs, based on the arrangement of the B-E and G-H helical pairs [47].
The three-dimensional (3-D) structure of globins is well preserved, but their sequences are very different [48]. However, it was possible to identify, from a set of aligned protein structures, a core set of residues that are located at relatively invariant 3-D positions [49]. Maguid et al. [50] showed that two slowest Gaussian network model (GNM) normal modes of motions are conserved within this family, indicating common dynamics within the globin family. The present study demonstrates that the DSS can be used to cluster proteins based on their dynamics similarity. The DSS score matrix is calculated based on the top scoring matches of each pair of globin members and the proteins are clustered using a hierarchical clustering algorithm. The utility of this method is exemplified on a dataset for protein chains from the globin family and a dataset of tetrameric hemoglobins. The results demonstrate the effect of the quaternary structure of globin members on their intrinsic dynamics and the dynamical relations between different available Hb structures.

Globin datasets
A list of protein structures was compiled based on the globins Superfamily (1.10.490.10) of the CATH database [11,51]. Protein structures were downloaded from the Protein Data Bank (PDB) [52] and a total of 1030 structures and 1289 unique chains were obtained. Two datasets were created; the first (dataset1) contains 117 randomly selected globin chains. The second dataset (dataset2) consists of all 320 tetrameric Hb excluding three structures 1ITH, 4HRR and 4HRT that have a unique quaternary structure and do not superimpose with other Hbs. The PDB codes of two datasets are listed in S1 and S2 Tables (Supporting Information).

ANM calculation
The ANM modes of motion of the superimposed structures were calculated as previously reported [27,53,54]. Each residue was represented by a single node positioned at its C α atom and a cutoff distance of 15Å was used. Heme groups were represented as four nodes corresponding to CHA, CHB, CHC and CHD atoms. ANM modes were calculated to the biological unit. In the case when only a single chain was compared, and the chain was a part of the biological unit, only the part of the modes that corresponds to this chain was used in the alignment. Every pair of compared structures was superposed using the TM-align software [30,31] prior to ANM calculations. Calculating ANM modes of motion to a large dataset, especially when some of the structures are large, is very time consuming. Therefore, fast ANM calculations of only the slowest 40 modes were performed using the Spectra library for large scale eigenvalue problems [55]. This library is a C++ implementation of ARPACK [56] and uses the Lanczos algorithm [57].

Anisotropic network model modes local alignment
The commonly used Smith-Waterman [39] local sequence alignment algorithm was modified to align ANM modes with few modifications. ANM mode analysis results in a set of vectors The score for residues i and j upon alignment of modes k and l is defined as: S ij will be positive if their cosine value is greater than C that is the two deformation vectors pointing in the same direction and negative if their cosine is smaller than C. Here we used C = 0.7 radians (~40˚) to define the threshold for vector similarity. In case of alignment of homologous or identical proteins, it is possible to guide the algorithm to prefer the matching of spatially close residues by applying distance constraints. Distance constraints were applied in the present work by modifying the alignment score S ij as follows: where r ij is the C α distance between residues i and j and R c is the cutoff distance set here to 10Å. Since the sign of the fluctuations in each mode is arbitrary, the alignment of two modes, a and b, is done twice. Once between the two original modes (a and b) and once between mode a and the negative of the second mode-b, with the best alignment (highest total score) being used. For each pair from the n slowest aligned modes, an alignment matrix is created and the best non-overlapping (up to 200) gapless matches with minimal length of seven residues (gapless alignment of a single mode pair) are kept. The top 2n matches (best scores) are selected and best S ij is kept, for residues i and j of the first and second aligned proteins, for each mode combination. The final residue dynamical similarity score of each residue is the sum of all its best (kept) S ij . The sum of the average residue dynamical similarity score of both proteins divided by two is defined as their DSS. The ability of the current algorithm to identify local dynamics similarity is demonstrated in a recent paper [58] where we show a detailed dynamical comparison between myoglobin and hemoglobin.

Clustering and principal component analysis
Clustering calculations were performed in R environment for statistical computing and graphics [59]. Clustering was performed using the hierarchical clustering function hclust with default parameters after converting the DSS matrix into a distance matrix. Preliminary calculation showed that both hclust and agnes functions give similar results. Principal Component Analysis was performed using the R package Bio3D [60] in combination with the MUSCLE program for multiple sequence alignment [61].

Results
In order to determine the optimal number of slow modes to use for comparing the dynamics of globin family members, the DSSs between all chains in dataset1 were calculated using a series of slow nodes (n = 2,4,6, . . . 20). The correlation coefficient was calculated between successive DSSs and the results are depicted in Fig 1. The first data point marks the correlation coefficients between DSSs, calculated using the 2 and 4 slowest modes. The second between DSSs calculated using the 4 and 6 slowest modes, and so on. The more slow modes being used, the higher the correlation coefficient until it reaches a plateau around the 12 slowest modes. Therefore, the 12 slowest modes were used in the following calculations. The results also indicate that the algorithm is not very sensitive to the exact number of top 2n matches (see Methods) that are used to calculate the DSSs.

Clustering chains of globin members
Clustering of dataset1 was performed by using the DSS scores matrix and the dendrogram is depicted in Fig 2, with the number of chains in each biological assembly indicated in blue next to each structure. Overall, the dendrogram classification follows the number of chains in the biological assembly, indicating the strong effect of the quaternary structure on the dynamics of its subunits. The globin chains are divided into three major groups: Hb-α chains, Hb-β chains, and Mb and another small group of dimeric Hbs. There are three noticeable outliers (highlighted in maroon). The PDB code 3ZJM [62] (Fig 2 top) is dimeric protoglobin from Methanosarcina acetivorans, a strictly anaerobic methanogenic Archaea, whose biological role is still unknown. Protoglobin is the first globin identified in Archaea [63]. Another two noticeable outliers are the PDB codes 1MWB [64] and 2BKM [65] (Fig 2 middle); both these structures belong to the truncated hemoglobin family.
The fact that Hb α and β chains are clustered in different groups indicates that there is a difference in the dynamics of the two chains. The first mode of motion of Hb describes the R2 to T transition [66]. Fig 3 depicts this motion for the liganded human Hb 1BBB [67] using a porcupine plot from the viewpoint of the α (top) and β (bottom) chains. There are seven helices in α subunit of Hb and eight in β subunit, with the difference between them in the region that connects helices B and E in the α subunits. This region is composed of one short helix and a large loop in α subunit and a short helix-loop-short helix in β subunit; therefore, helix D is missing in the α chain. The 1BBB is presented using α-and β-subunits viewpoint with similar orientation of the helix E. The circular motion of α subunit is in approximately 30˚off the helix E direction while the circular motion of β chain is parallel to helix E direction. That is, the two chains have distinguishable motion of the slowest mode and the difference increases with the second and third modes.

Clustering tetrameric hemoglobins
The vast majority of the Hb proteins are tetrameric. These globin members were clustered separately in order to obtain a better understanding of their dynamical relations. A set of 320 tetrameric Hb structures were compiled (dataset2) from the CATH globin family and they are listed in S2 Table. PDB entries 1A3N (T) [68] and 1BBB (R2) [67] were used to represent human Hb in the unbound and bound states, respectively. The intact protein structures were superimposed prior to the ANM mode calculations, then the DSS scores matrix was calculated, and the proteins were clustered as before. The complete dendrogram is presented in S1 Fig. The resulting dendrogram divides the Hbs into two main groups, one includes the T-state and the other includes the R2-state structure. The nearest neighbors of the T and R2 state, they are highlighted in blue and maroon, respectively, and are listed in Table 1. For the majority of the proteins listed in the table, we were able to verify from the literature that they are indeed Tand R2 state structures. Two structures are distinguished as outliers in the dendrogram 3AT6 [69] and 2M6Z [70]. 3AT6 is the yellow-spotted river turtles (Podocnemis unifilis, Pleurodira, Chelonia, REPTILIA) adult Hb-A and the first refined model for reptilian adult Hb A structure. 2M6Z is an NMR solution structure of HbCO with overall quaternary structure more similar to the X-ray R structure of HbCO A than to the R2 structure. The authors concluded that it is a dynamic intermediate between the R and R2 forms. The dynamical comparison shows that this structure has unique dynamics distinct from the T and R2 states and is slightly closer to the T rather than R2 state (R state is referred below).
Dynamics based 2-D mapping of the tetrameric Hbs was carried out by calculating their average DSS from the T and R2 representatives (Table 1), with the results depicted in Fig 4. The R2 structure 1BBB has a high average R2-DSS and a low T-DSS, as expected. The two proteins that show the highest average R2-DSS are Hb-E structures 1YVQ and 1NQP [71]. The two structures are in the bound state, the former binds to CO and the latter to CN and was also reported to represent the R2 state. As expected, these proteins also have a low average T-DSS. The T structure 1A3N has a high average T-DSS and low average R2-DSS. The structure 1Y7D [72] has the highest average T-DSS and lowest average R2-DSS. This structure represents transitions referred to as T-to-T High transitions between the quaternary-T structure of wild-type deoxyhemoglobin and an ensemble of related T-like quaternary structures that are induced by some mutations in the Trp37β cluster. The R-state structure 3OO4 [73] is located between the R2 and T state with average R2-DSS greater than the average T-DSS. The map shows a continuous path from the R2 to R to T states. The two outliers 3AT6 and 2M6Z show low average T-and R2-DSS, another such structure is 1CG8 [74]. 1CG8 is an Hb structure from Dasyatis akajei, a stingray that is one of the most distantly related vertebrate Hbs to human HbA. Larger structural deviations between Dasyatis akajei Hb and human HbA are observed in various parts of the molecule, even in the E and F helices. The average T-and R2-DSSs for all Hbs in dataset2 is provided in the S3 Table. PCA analysis is a common method for 2-D mapping of the structural space. The resulting principal components (orthogonal eigenvectors) describe the axes of maximal variance of the distribution of structures. The 2D map of the tetrameric Hbs along the first two principal components is presented in Fig 5. Similarly, to the dynamics based 2-D mapping, the R2-state (1BBB) and T state are on opposite sides of the map and the R state is located closer to the R2 state than the T state. The main difference between the two maps is the clear gap between the cluster of structures that are at the vicinity of the T state and the rest of the structures. While, the dynamic based 2-D map shows intermediate states linking the two states. Dynamics based clustering of globin family members Dynamics similarity versus structural similarity ANM normal modes are derived from the protein structure. Hence, there is an expected relation between DSSs and structural similarity scores such as the TM-score obtained here during structural superposing of the Hb pairs. Fig 6 presents the DSSs as a function of the TM-score. The figure shows that Hb pairs with high DSS also have a high TM-score. However, the opposite is not always true: Protein pairs with high TM-score may have high or low DSS scores. One example is the 1CG8 structure; many Hb pairs, which include this structure, have a high TM-score and low DSS. Although the global 1CG8 structure is very close to other Hb structures, significant mutations and/or conformational changes are observed between this structure and HbA around the hemes, in the C-terminal region of the β-subunit, in the α1β2 interface, and in the organic phosphate-binding site of HbA [74]. These changes affect the DSS more than the TM-score.
Another way to compare between the DSS and the TM-score is a comparison of the derived dendrograms, the TM-score based dendrogram of dataset2 is presented in S2 Fig. Similarly, to the DSS-based dendrogram, the Hbs are divided into two major groups, one includes the Tstate and the other include the R2-state. Branching of the major groups is different. The nearest neighbors of the T and R2 state according to the TM-score based dendrogram are listed in Table 2. The overlap with the DSS based nearest neighbors (Table 1) of the R2-state is higher than the T-state as the non-overlapping structures 1NQP and 1YVQ are in a close branch of the DSS-based dendrogram. Thus, the different scores resulting in differences in the dendrograms.

Discussion and conclusion
A novel methodology for dynamics-based clustering of proteins is presented here. The method performs local ANM mode alignments of n slowest modes, using an algorithm based on the local sequence alignment algorithms of Smith-Waterman [39]. Dynamical comparison of the globin members is challenging since the proteins are in different multimeric states ranging from monomers to 24-mers. Hence, some of the modes were inter-subunits and others intrasubunit motions. Despite this difficulty, we were able to develop a fully automated procedure that identifies the best matches between the slowest n modes of each protein without the need to select reference modes. Recently, Ponzoni et al. classified the LeuT-fold superfamily of  secondary active transporters using the three slowest ANM modes calculated for the monomers/protomers [75]. This methodology necessitates the choice of a reference state and is harder to implement when the protomer in its multimeric state has significantly different dynamics. Clustering of globin chains shows that Mb, Hb-α and Hb-β are clustered in different groups, indicating the unique dynamics of each group within the globin family. The dynamical difference between Mb and Hb subunits is expected as there is a functional difference between the two proteins; Mb has a higher affinity for oxygen than Hb and lacks allosteric and cooperative function. The distinguishable dynamics between α and β subunits of Hb indicate that the two subunits have an asymmetric role in agreement with the following studies. PELE molecular dynamics simulations of Hb showed that the ligand exit paths for the β-subunit are considerably different from those for the α-subunit [76]. Experimentally, geminate ligand recombination reaction studies showed dissimilar behavior of these two subunits, pointing to their nonequivalent role [77][78][79].
The clustering results of the globin chains reflect the quaternary structure of the chains. However, even clustering of tetrameric Hb, i.e. globins with the same quaternary structure, results in a meaningful dendrogram and a 2D dynamic similarity map of the different structures of Hb. The gap in Fig 4. of the dynamical based 2-D map between the T and R2 states suggests that there is an indirect path from the R2 to T states through the R, with the R state closer to R2. The map reflects the ability of the DSS to distinguish between the different Hb structures and correctly classify them. In the case of a compact molecular structure such as  that of a globular protein, the shape of the molecule plays a predominant role in determining the eigenvectors of low-frequency normal modes [20,21]. Therefore, the use of only one or two slow modes for calculating DSS can affect the ability of the algorithm to distinguish between similar structures. Based on the correlation analysis of DSSs calculated using different number of slow modes (Fig 1), the number n = 12 of slowest modes was chosen as the optimal one. This number is in agreement with the algorithm of Zen et al. [34] for dynamics-based alignment of proteins, which uses the 10 slowest modes. Although elastic network model normal modes of motion are derived from protein structures, dynamics-based clustering is different from structure-based clustering. This is due to the fact that the modes of motions are calculated for the entire biological assembly even when we compare only a single chain. Therefore, we can take into account the effect of the quaternary structure on the intrinsic dynamics of the examined chain. In addition, as shown in Fig 6, proteins with high DSS have a high TM-score; however, proteins with a high TM-score may have high or low DSS. The present methodology has few limitations, first, it can only be used with structurally similarity proteins, since structural superimposing is necessary prior to ANM mode calculation. Second, normal mode analysis assumes a harmonic potential, but not all protein motions can be correctly described by harmonic potentials [80]. Thereby, the dynamics here only describe protein behaviors near the energy minimum. Proteins with moderate structural conservation can be compared by increasing R c . Dynamics comparison of structurally dissimilar proteins can be performed by comparing GNM modes since these modes do not depend on the specific pose of a structure in cartesian space [36,81]. The information embedded in the GNM modes is more limited than ANM modes, hence, comparison of GNM vs. ANM modes can be viewed as low vs. high resolution dynamical comparisons [58]. Combination of the two methods will be necessary to achieve a broader applicability of this approach. As proteins depend on their dynamics to execute their function, dynamics-based alignment and clustering algorithms are expected to deepen the understanding of the structure-dynamics-function relation of proteins.
Supporting information S1 Table. Dataset1. 117 randomly selected globin chains. The first 4 letters of each protein correspond to the PDB code of the structure and the fifth letter to the chain name. (DOCX) S2