Difference contact maps: From what to why in the analysis of the conformational flexibility of proteins

Protein structures, usually visualized in various highly idealized forms focusing on the three-dimensional arrangements of secondary structure elements, can also be described as lists of interacting residues or atoms and visualized as two-dimensional distance or contact maps. We show that contact maps provide an ideal tool to describe and analyze differences between structures of proteins in different conformations. Expanding functionality of the PDBFlex server and database developed previously in our group, we describe how analysis of difference contact maps (DCMs) can be used to identify critical interactions stabilizing alternative protein conformations, recognize residues and positions controlling protein functions and build hypotheses as to molecular mechanisms of disease mutations.


Introduction
Protein structures have complex three-dimensional shapes and are most often visualized as cartoons depicting their overall arrangement of secondary structure elements and neglecting interaction details. Such cartoons were popularized by Jane Richardson [1] and gained wide popularity thanks to programs such as PyMol [2] (see Fig 1A). Other visualization styles: topology diagrams [3], distance [4] or contact [5] maps are also used as each of them highlights aspects of structure that are difficult to see in the other representations, but their popularity doesn't compare to that of ribbon diagrams, which became de facto standards in presenting protein structures in manuscripts and books.
Widespread use of such images to depict protein structures, often combined with wording that talks about "the structure" when referring to entities illustrated by such images, may incorrectly suggest that protein structures are unique and static. In fact, protein structures are far from static and, as any physical system in constant temperature, can assume any of the conformations from the canonical ensemble describing the system [6]. This point is well known and accepted among biophysicists and is the subject of many reviews [7]. Protein functions often include cycling through various functional isoforms that correspond to different neighborhoods in the conformational ensemble. For many proteins, single conformations representing different functional forms have been captured experimentally and are available as  [8]. Differences between such alternative conformations are difficult to illustrate by ribbon diagrams and are often described verbally or shown in detail only for the most relevant, but small section of the structure, such as for instance an active site. The most often used measure of structural difference between protein structures is the root mean square deviation (RMSD) between Cα atoms [9]. While useful for classifying and rank ordering of (dis-)similarity of protein structures, it is a global measure that doesn't give much information about the details of the differences and treats on equal footing a protein pair with significant, but localized differences in one loop with a pair with subtle, but distributed differences. Similar to other popular protein structure similarity/dissimilarity measures, such as TM-score [10], RMSD is useful for identifying the most similar (or divergent) structures from a group, but not to describe the details nor mechanisms of the divergence.
Protein structure visualizations that directly focus on interactions stabilizing it may be better suited for this purpose and were indeed quite popular in the early days of structural biology, but mostly fell out of favor with the growing popularity of ribbon diagrams. For instance, difference distance (Fig 1B) or contact ( Fig 1C) maps can be used to compare protein structures and analyze the details of differences between functional states [11] (See Fig 1E and 1F, respectively). But as historically the main focus of structural biology was the exploration of the protein universe, classification and initial characterization of novel structures was a priority and tools and visualizations useful for that purpose became popular. Now structural biology is increasingly focusing on details of protein function rather than on initial structural characterization of novel proteins and we are in need of tools for the in-depth analysis of differences between different conformations of otherwise similar or identical proteins, not supplanting, but enhancing standard visualizations. Difference distance or contact maps are ideal for this purpose, and indeed recent years saw their reemergence, either as general tools [12,13] or in applications to specific systems [14]. An increasing percentage of protein coordinate sets deposited to the PDB consists of multiple depositions of independently solved structures of the same protein, either in different functional states or simply in different conditions, resolution etc., resulting in apparently redundant deposition. For instance, in the latest PDB release (data from Jan 25, 2019) there were 478,448 independent protein chains with 70,567 unique (at 100% identity level) sequences-for an over 6-fold redundancy (this would further fall to 56,973 i.e. 8.4-fold redundancy at the 95% threshold similarity level, used in this project). Many global analyses of structures in the Protein Data Bank try to remove this redundancy, selecting a single chain to represent each group of alternative conformations and both specialized servers [15] and the PDB itself provide lists such of such non-redundant structures. This approach, perhaps inadvertently, reinforces the unique structure paradigm by treating all information about structural variation within a conformational ensemble of individual proteins as noise to be removed from analysis. This surprisingly large variation was studied previously in our group [16] and by others [17] and to study it further, we developed PDBFlex [18], a server that illustrates and performs automated analysis of all clusters of coordinates sets for almost-identical (95% sequence identity) proteins. However, PDBFlex, as well as other similar resources [19], is based almost entirely on using the language of RMSD and global comparisons, thus focusing on the question of classifying and describing the structural divergence within the cluster-the "what" question. Here we would like to approach the "why" question-which interactions define the structural variations, how it can be influenced by mutations, etc. We believe that to answer these questions we need to get away from global descriptions such as RMSD and develop a new language to describe the differences between protein conformations, a language anchored in interactions.

Results
We decided to characterize the differences between two conformations of a protein in a language of residue interactions and analyze differences between contact maps of proteins. Such an approach was used in the past by us and other groups to optimize protein structural alignment [5], to classify protein structures [20] or as a general tool for protein structure comparison [21]. Finding an optimal overlap between protein contact maps became a popular algorithmic problem [22]. We hypothesize that such analysis would allow us to identify residues that differentially stabilize alternative conformations (Differentially Stabilizing Residues or DSRs). We hypothesize that DSR-based description of differences between alternative protein conformations, in contrast to more popular RMSD-based descriptions, is a natural way to identify factors that may control the natural balance between functional/structural states of a protein and which, if modified by disease mutations or interactions with drugs, may disrupt this balance.
To identify the possible DSRs and evaluate this approach on a large set of proteins, we compared residue contact maps for conformations representing the most divergent structure models of all proteins with "redundant" structures deposited in the Protein Data Bank [8]. The starting point for this approach was the PDBFlex [18] database developed previously in our lab.

Non-redundant structures in the PDB illustrate the scope of conformational differences between alternative protein conformations
The PDBFlex database is based on clusters of PDB entries with >95% sequence identity, representing independently solved structural models of the same protein, with the 95% sequence identity threshold adapted to allow for variations in construct boundaries and/or single mutations. From each cluster, the two sets of coordinates with largest RMSD were selected to represent the two most divergent conformations of the protein in the cluster, although this approach can be applied to study the differences between any two conformations. Fig 2 shows a histogram of maximum RMSD for all clusters in the PDBFlex database (snapshot from 30 April 2018). The main peak around 0.4Å RMSD (9079 clusters with RMSD � 0.5 Å) shows that for most clusters the structural differences are small. We hypothesize that such fluctuations describe movements around the mean structure, like in a short molecular dynamics simulation [23] or protein normal mode analysis [24]. However, the distribution also shows a long tail towards higher RMSDs (20934 clusters with RMSD > 0.5 Å), which correspond to significant structural differences, such as the difference between the active and inactive conformation of BRAF (6.2 Å RMSD) and the open and closed forms of adenylate kinase (7.3 Å RMSD). This tail is seen more clearly when the y-axis is log-scaled (Fig 2 inset).
This picture is qualitatively similar to one presented in a previous analysis from our group [16], showing that despite doubling of the number of entries in the PDB and significant Distribution of maximum RMSD shows the same trend for all clusters with a large peak at 0.4Å and a long tail toward higher RMSDs. Inset: The same distribution on a log scale. Black represents the distribution for all clusters, blue represents clusters with two ligand or complex -bound structures, red represents structures with one bound and one unbound structure, and green represents clusters with two unbound structures. The maximum RMSD data is taken from the PDBFlex and the data on complex formation from the PDB bioassembly information. The ligand binding information was obtained from the BioLiP database [25]. https://doi.org/10.1371/journal.pone.0226702.g002

PLOS ONE
Difference contact maps in the analysis of the conformational flexibility of proteins differences in the protocol between the two papers, the general observation of the PDB providing information on the conformational ensemble of proteins remains the same. It is important to note that our analysis provides the lower estimate of the structural divergence for every protein, since introduction of new structures that may be determined in the future can make the maximum RMSD only higher.

Binding events are not the only cause for conformational changes
Several studies have shown that ligand binding or complex formation does not "induce" structural changes, but results in the stabilization of one of many possible conformations and a shift in the conformational equilibrium towards that conformation [26,27]. In order to determine to what extent the structural changes between alternative conformations were caused by ligand binding/complex formation, we classified the clusters into three classes, based on whether one or both of the structures in the maximum RMSD pair were or were not bound to a ligand or part of a complex. If ligand binding results in the stabilization of and shift towards one of many possible conformations, this would mean that a protein can sample the bound conformation even in the absence of a ligand, and vice-versa. Alternatively, if ligand binding induces a conformational change, then a large RMSD would be observed only when comparing the ligand bound and unbound structures of a protein. In  Fig 2, we see that the green and blue distributions show a long right tail, similar to the red distribution, indicating that there are many clusters in these two groups that show a large RMSD between their two representative structures. Since these clusters are represented by two bound (blue distribution) or two unbound (green distribution) structures, this supports the claim that these proteins can show conformation changes regardless of their binding state. If a long right tail had been seen only for the red distribution (clusters represented by one bound and one unbound structure), it would have suggested that ligand binding is the only cause for conformational changes. However, this is not the case and this result is in line with previous studies [26,27].
To assess the contribution of different crystallization conditions/experiments to the structural differences, we classified the clusters into two groups based on whether the two most divergent structural models in the cluster were different chains/coordinate sets in the same PDB entry (representing the same crystallization experiment) or were from different PDB entries (and would therefore be likely to represent different experimental conditions). We found that while overall the two distributions looked similar, the maxima were shifted by 0.2 Å and the ratio between both groups reversed between the small and larger RMSD values (See Fig 3). For instance, the most divergent pair in the~66% of the clusters (6000/9079 clusters) with RMSD � 0.5Å came from the same PDB entry. For clusters with RMSD > 0.5Å, this ratio was reversed with~46% of clusters (9674/20934 clusters) represented by models from the same PDB entries (Fig 3 inset). These results show that different crystallization experiments can explore a broader range of conformational flexibility of proteins and further underscores the observation that the redundancy in the PDB can be used to study protein conformational diversity.
We hypothesized that differential contacts could be modulating the equilibrium between alternative conformations seen in the PDB. To verify this, we looked at the fraction of the total number of contacts that have changed between two conformations (as seen in the filtered differential contact maps) for three classes of clusters: those represented by one bound and one free (or holo-and apo-)) structure (Fig 4). In order to see some examples of which residues are identified as the DSRs, we applied this approach to two examples (presented below) of proteins that bind substrates/form complexes.

Adenylate kinase
Adenylate kinase (ADK) is an enzyme that catalyzes the transfer of one phosphate group from ATP to AMP, to form two ADP molecules. ADK consists of three domains: the NMP bind domain, the LID domain and the CORE domain [28]. Substrate-binding and catalysis by this enzyme involves a conformation change wherein the NMP bind and LID domains move from an "open" to a "closed" conformation ( Fig 5). However, this enzyme has been known to sample both conformations even in the absence of the substrate [29,30].
We created contact maps for the closed and open conformations based on the PDB models with codes 4jzkB and 4akeA [28], respectively. The final differential contact map showed a total of 49 contacts unique to only one conformation, 31 to the closed form and 18 to the open form (Fig 5). Six contacts were seen between the NMP bind and LID domains, nine between the NMP bind and CORE domains and eight contacts were between the LID and CORE domains. The remaining contacts were intra-domain contacts. We hypothesized that the inter-domain contacts would be responsible for modulating the conformational change seen in this protein.
One of the six contacts between the NMP bind and CORE domains was formed by Glu170 and Leu58 and seen only in the closed conformation. It has been previously shown that a hydrogen bond exists between these residues in the closed conformation and that the disruption of this bond (by mutation of Glu170 to Alanine) resulted in a shift of the conformational equilibrium towards the open form in the absence of the substrate [29]. In order to assess whether this mutation could be impacting the differential stability of the two conformations, we used MODELLER [31] to build two structures for the WT sequence and two structures for the mutated sequence, using 4akeA and 4jzkB as templates. We then calculated the DOPE scores for these modeled structures and found that the open conformation had a higher DOPE score for both the WT and mutated structures (Table 1). However, the difference in the scores for the open and closed conformation was less in the mutant structure (Table 1), which suggests a shift in the equilibrium to the open form [29].

BRAF
BRAF is a protein kinase and one of the components of the Ras-Raf-MEK-ERK signaling pathway that plays a central role in cell proliferation and differentiation. Activation of BRAF allows Black represents the distribution for all clusters, blue represents clusters with two ligand-bound structures, red represents structures with one bound and one unbound structure, and green represents clusters with two unbound structures. This distribution has a similar shape for all categories, with a long right-tail. https://doi.org/10.1371/journal.pone.0226702.g004

PLOS ONE
Difference contact maps in the analysis of the conformational flexibility of proteins it to phosphorylate MEK via its kinase domain [32]. The BRAF kinase domain consists of a small N-terminal lobe and a large C-terminal lobe. The N-terminal lobe includes the glycine rich P-loop (residues 463-471) and has a major α-helix termed the αC-helix. The C-terminal

PLOS ONE
Difference contact maps in the analysis of the conformational flexibility of proteins lobe includes the activation segment/AS beginning with the DFG motif (residues 594-596) and ending with the APE motif (residues 621-623) [33].
We created contact maps for BRAF based on the models with PDB codes 4mneB (in which it is complexed to MEK1) [34] and 4h58C (in which BRAF is bound to an inhibitor) [35]. The final differential contact map showed 82 contacts, 51 of which are between residues from the activation segment and the rest of the protein (Fig 6). This included the contact between R575 and E611, that has been described to stabilize the active conformation of the protein [34].
Given its role in cell growth and division, BRAF is frequently mutated in many different cancers. One of the most common mutations seen in this protein is the V600E mutation, that accounts for 80% of all BRAF mutations [36]. This residue forms a contact with K507 that is responsible for providing additional stability to the active conformation in the mutant form of the protein [34]. This contact was seen in our final differential contact map, demonstrating a possible application of this method to exploring the role of various disease-causing mutations. It is interesting to note that one of the applications of a difference contact map to the analysis of protein structural/functional conformations [14] analyzed in depth the same example as above, and despite different approach and contact definition, arrived at similar results.

Discussion
In this manuscript we have explored a contact map based description of protein structures with a goal of identifying residues that preferentially stabilize alternative conformations, potentially controlling the balance between different functional isoforms. Exploring the redundancy in the Protein Data Bank we realized that, for most proteins, multiple coordinate sets deposited in PDB provide some level of sampling of these proteins' conformational ensemble. We have shown that while for most proteins the differences between various coordinate sets available in the PDB is small, for a significant number of them the differences are not trivial and, most likely, sample different functional states. Somewhat surprisingly, these differences are not necessarily related to ligand binding or complex formation and even models of unliganded proteins solved in different crystal forms provide a significant amount of information about the shape of the conformational ensemble.
We developed an automated protocol to identify residues involved in stabilizing alternative conformations of proteins and integrated this information with the PDBFlex database, previously developed in our group. As PDBFlex is integrated directly with the PDB database and website, this provides information about such differentially stabilizing residues (DSRs) for hundreds of proteins to a broad group of users. We explored the DSRs identified by our protocol on few examples of clinically important proteins and confirmed that they indeed overlap with residues previously identified to stabilize alternative conformations and/or mutated in disease states.

PLOS ONE
Difference contact maps in the analysis of the conformational flexibility of proteins Differential contact map visualization and identification of DSRs is now implemented on the PDBFlex server, with thousands of examples open for user analysis. We believe that this resource can be used as a hypothesis-generation tool to identify the residues that could be modulating the conformational equilibrium of any given protein.

Calculation of contact maps and identification of DSRs
We have developed a method based on contact maps to identify residues that could be modulating the conformational flexibility of a protein by forming contacts that differentially stabilize two different conformations of the protein. This method is described below and illustrated by

Selection of structural models and alignment of sequences
We used structural models deposited in the Protein Data Bank clustered at 95% sequence identity, as done in PDBFlex [18]. Each cluster represents various conformations of a single protein (within the 95% sequence identity threshold). For each cluster, we selected the pair of structural models with maximum RMSD as representative of the two most divergent conformations of the protein (Fig 7A and 7B). However, this method could be performed for any two structural models of the protein. The sequence alignment for the two structural models, based on the PDB SEQRES sequences, was modified to remove any residues that did not have coordinates in the model. This alignment was further modified to remove any residues that were aligned to a gap in either of the sequences.

Calculation of contact maps
For each structural model, PDB files were downloaded and the coordinates of the residues in the alignment, corrected as described above were extracted. Only residues with an empty insertion code and hetero-flag were used. (If more than 5% of the extracted residues differed between the two structural models, or the total number of residues extracted was not the same, then the cluster was skipped). Based on the coordinates for these residues, contact maps were created for the two structural models (Fig 7C and 7D). A pair of residues was considered to be in contact if the distance between any two side-chain heavy atoms from the residues was less than 5Å. Residues that did not have coordinates for any heavy atoms were considered to not form contacts and for glycines, the Cα atom was used.
Many different definitions of a residue-residue contact have been used in the literature. The differences are either in the choice of the atom used to measure inter-residue distances (either the Cα or Cβ atoms, center of side chain mass or a side-chain heavy atom) or the distance threshold used to determine a contact. We use side-chain heavy atoms to measure inter-residue distances in order to prioritize side-chain interactions, a definition that has been previously shown to be optimal for threading and protein structure comparisons [5]. Independently it was shown that direct contacts in proteins fall in the range between 3-5Å and longer distances start to include indirect interactions (for instance through water molecules) [37]. Here, following our earlier papers [5], we use the distance threshold at the end of the direct interactions range to increase contact statistics.
Calculation and filtering of differential contact maps A differential contact map was then created by subtracting one contact map from the other. This map therefore, represents the set of contacts seen in only one conformation, but not the other (Fig 7E). We further filtered this differential contact map to remove any residue pairs that showed less than (or equal to) 5Å difference in inter-residue distance in the two conformations ( Fig 7F). This was done in order to consider only the residues that could be modulating large scale conformation changes and to exclude contact changes seen as a result of small local fluctuations in the structure. The residues forming the contacts seen in this final differential contact map were considered to be differentially stabilizing residues (DSRs) that could be modulating the change between the two conformations considered.

Calculation of DOPE scores
MODELLER v 9.21 [31,38] was used to create structure models for the WT and mutated sequences of BRAF and ADK, based on template structures representing the two alternative conformations of each protein. Alignments between the sequence and the template structure were created using MODELLER's align2d function. Models were created using the automodel class, with the DOPE score as the assessment method. One model was created per template, for each sequence.
For adenylate kinase, the SEQRES sequence for chain A from the PDB file for 4ake was used to create a sequence file in the MODELLER .ali format. This sequence was manually edited to create a sequence corresponding to the E170A mutation. The WT and mutated sequence were modeled using 4akeA and 4jzkB as templates.
For BRAF, the SEQRES sequence for chain C from the PDB file for 4h58C was used. 4mneB and 4h58C were used as template structures. Since the PDB ATOM records for both 4mneB and 4h58C start from residue D449 (D448 in 4h58C), the sequence file was manually edited to start the sequence from the same residue and reduce the number of gaps in the resulting alignment. The same procedure was followed for the V600E mutated sequence.
Supporting information S1 File. README provides detailed information about the format of the following files. (TXT) S2 File. GetFlexResFromCM.py is the python script used to calculate the contact maps for each PDBFlex cluster. (PY) S1 Table. Diff_Stab_Residues_public.csv contains information on the differentially stabilizing contacts for different clusters/coordinate sets and can be used to create the filtered differential contact maps as shown in the paper. (CSV) S2 Table. ContactMaps_metadata_public.csv contains information on the differentially stabilizing residues and other metadata associated with each cluster. (CSV)