Pockets as structural descriptors of EGFR kinase conformations

Epidermal Growth Factor Receptor (EGFR), a tyrosine kinase receptor, is one of the main tumor markers in different types of cancers. The kinase native state is mainly composed of two populations of conformers: active and inactive. Several sequence variations in EGFR kinase region promote the differential enrichment of conformers with higher activity. Some structural characteristics have been proposed to differentiate kinase conformations, but these considerations could lead to ambiguous classifications. We present a structural characterisation of EGFR kinase conformers, focused on active site pocket comparisons, and the mapping of known pathological sequence variations. A structural based clustering of this pocket accurately discriminates active from inactive, well-characterised conformations. Furthermore, this main pocket contains, or is in close contact with, ≈65% of cancer-related variation positions. Although the relevance of protein dynamics to explain biological function has been extensively recognised, the usage of the ensemble of conformations in dynamic equilibrium to represent the functional state of proteins and the importance of pockets, cavities and/or tunnels was often neglected in previous studies. These functional structures and the equilibrium between them could be structurally analysed in wild type as well as in sequence variants. Our results indicate that biologically important pockets, as well as their shape and dynamics, are central to understanding protein function in wild-type, polymorphic or disease-related variations.


Introduction
Conformational ensembles are nowadays increasingly used to understand protein function [1]. However, most of those studies use backbone coordinates to define open and close conformers neglecting the importance of cavities, pockets and tunnels (gates of enzymes). In the present work, pockets and cavities are considered to better characterise the alternative conformations in the kinase region of the Epidermal Growth Factor Receptor (EGFR). This tyrosine kinase receptor is one of the main tumor markers in many cancer types [2]. Its cytoplasmic region is composed by a juxtamembrane (JM), a Tyr-kinase domain, and a C-terminal intrinsically disordered tail (C-tail) target of auto-phosphorylations which triggers signals involved in different cell processes [3,4]. In different types of cancers, an increase in kinase activity and PLOS ONE | https://doi.org/10.1371/journal.pone.0189147 December 11, 2017 1 / 17 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 its resultant deregulation are observed [5,6]. Several single amino acid substitutions (SASs) as well as insertions and deletions located in the kinase region and detected in patients affected with different cancers, mainly non-small cell lung cancer (NSCLC), have been proposed as cause of this kinase activity enhancement [7]. In the case of the EGFR kinase domain, as in other kinases, the native state is mainly composed of two populations of conformers called active and inactive or dormant structures [8,9]. Moreover, it has been proposed that the stabilisation of the EGFR kinase active conformation is mediated by the formation of an asymmetric dimer (interface between the C-lobe of one subunit with the N-lobe of the other, Fig 1). Several works have reported kinase activity assays and/or their corresponding structures, allowing the generalisation of some structural and sequence features that are typical of active conformations [10][11][12][13]. Unfortunately, some of these shared structural traits are, expectedly, not easily detected in inactive conformations due to their intrinsic structural variability when compared with their active counterparts. However, in the case of kinases, some inactive conformations share common traits that have been repeatedly observed [14][15][16].
The identification of specific structural features of active and inactive conformations is relevant to improve our understanding of the deregulation of enzyme activity, as well as to gain knowledge on the specificity and selectivity of inhibitors [17,18]. This distinction is also important to better evaluate the impact of sequence variants. Briefly, sequence variants may cause active conformation enrichment at equilibrium due to the structural stabilisation of the active conformation [19]. Alternatively, sequence variants may also have a destabilising effect on inactive conformations, changing in both cases the ΔG barriers between conformers, with the consequent enrichment of the active form at equilibrium [20,21]. Moreover, an alteration of the inter-monomer interaction can also change enzyme activity due to equilibrium perturbation. In the case of EGFR, the study on the effects of many reported sequence variants has promoted a lot of research work in response to targeted therapy treatment decisions [22,23]. Phenotypic and clinical outcomes of several activating variants are well-known, but new ones are frequently reported as a consequence of the currently progressive extension of the sequencing of patient samples [24,25]. Thus, when it is not possible to perform an activity assay, the characterisation and eventual classification of each new sequence alteration using just sequence, structural and evolutionary information is of great interest [26]. These analyses could also, for each case, delimit the group of most appropriate inhibitors [17]. Thus, a structural description based on experimental data or derived from homology modelling, in silico structural analysis or docking studies may improve our understanding of the structural and/or functional effect of different reported variants [27][28][29].
In addition to the effect on kinase activity due to the enrichment of active conformer in the equilibrium caused by a sequence variation, small molecule kinase inhibitors show different conformer dependent mechanisms of binding. Thus, inhibitors of type I bind to active conformations, while Types I ½ and II to inactive ones. Several non-covalent inhibitors interact with the kinase ATP-binding pocket, a structure with different characteristics depending on the conformer type while others are bivalent or are allosteric [30,31], and protein allosterism also depends on the conformational ensemble of the protein [32]. The selectivity and specificity of these inhibitors also depend on the kinase sequence, structural or conformational differences being a challenge in the recognition of the specific characteristics of each particular kinase [33][34][35]. Briefly, and to bring the present work into focus, several distinctive characteristics of active and inactive conformations are presented, following, in all descriptions, human EGFR canonical amino acid sequence numbering (Universal Protein Resource, UniProtKB accession P00533, isoform 1, 1210 amino acids in length). Two main structural elements are usually analysed to distinguish between active and inactive kinase typical conformations. Firstly, the αC helix (positions 753-767, N-lobe) orientation: rotated inward against the N-lobe and towards the active site, this is characteristically observed in active conformers, and is crucial for kinase activity. This αC helix disposition shorts the distance between E762 and K745, allowing a stabilising ion-ion interaction (salt-bridge) between E762 of the αC helix and K745 in the β3 strand (740-747, N-lobe; a detailed description is found in Jura et al. 2011 and the references therein [10]) which interact with the α and β phosphates of ATP to anchor and orient the ATP. Secondly, in the activation segment (855-884), the Asp-Phe-Gly (DFG) motif at the beginning exhibits its aspartate in an active state conformation pointing into the ATP-binding site and coordinating a Mg +2 ion (only one per monomer can be observed so far, in all known crystal structures of EGFR kinase). This organisation is accompanied by an open and extended conformation of the activation loop, that is, a part of the activation segment, and is known as a DFG-in conformation. As a counterpart, several inactive kinase conformations show a DFG motif flipping towards the orientation known as DGF-out, with an almost reciprocal change in the relative orientation of D and F. In the out form, F is in the position previously occupied by aspartic acid. The change in the αC helix towards the position known as the out state has been proposed, as a general kinase activation mechanism, to be mediated by intermediate orientations, making the . Key structural elements are coloured magenta (Gly-rich loop), blue (αC helix), violet (activation segment), green (catalytic loop), and sites in red (K745 in the AxK motif), orange (T790 gatekeeper), yellow (catalytic spine residues) and cyan (regulatory spine residues). Ion pair (saltbridge) between K745 and E762 is depicted with a dashed green line. (C) Comparison of these elements in active (PDB 2GS6) and inactive (PDB 3W32) monomers following the same colour scheme, but in lighter tones. Main transitions of these elements from active to inactive are depicted with black arrows. establishment of active or inactive αC helix orientation limits no easy task [11]. These elements are shown in Fig 1b and 1c for two representative conformations of the EGFR kinase domain (inactive PDB 3W32, active PDB 2GS6). Apart from these elements, there are others important to the stabilisation of the ATP-active site interaction, also shared by different kinases, such as the triad HRD (positions 835-837) in the catalytic loop [36]and different proposed amino acid networks [37][38][39][40]. Of these networks, two are proposed to be involved in the regulation of kinase activity: a catalytic spine (C-spine) and a regulatory spine (R-spine) [41].
Here, we examined the above described structural parameters in all human EGFR kinase domains deposited in structural databases and previously characterised in bibliography as active or inactive conformations. While several structures fulfilled all these structural criteria and, consequently, were easily classified as active or inactive, others showed both active and inactive features and, consequently, could not be unambiguously classified. Moreover, some well-characterised structures with variants proposed for a long time to constitutively stabilise the active conformation were, controversially, later reported as inactive [42]. At this point, considering the observed structural differences between conformers, in order to address some of the previously reported controversies or ambiguous conformation classifications, and to recognize the relevance of pockets, cavities and tunnels in protein function, we focused on their differential features as observed in conformational comparisons. Pockets, cavities and tunnels are structures that connect the protein surface with buried active or binding sites in proteins, and that are essential for biological activity in most proteins [43]. Their conformational changes define, for example, differential binding constants that may explain biological function, substrate specificity and important regulatory processes such as allosterism [44,45]. A slight rotation of certain given residues, usually called gatekeepers (e.g. bottleneck dynamics [46]) or larger conformational changes (e.g. conformational gating [47] and malleability [48]) are the main mechanisms controlling the transit of substrates and products to and from the protein inside. The dynamic nature of the native state, at the expense of structural differences between conformers, and their relation to changes in tunnels, pockets or cavities are essential for a complete description of protein function. Their consideration could shed light in the sometimes ambiguous conformer characterisation in proteins in general and in EGFR in particular.
In this work, a quantitative structural comparison of the pocket containing the active site (main pocket) allowed the correct discrimination of EGFR kinase conformations (active/inactive) taking also into account atypical conformer grouping. This comparison was performed using a hierarchical clustering based on the root mean square deviation (RMSD) of α-carbons belonging to positions of this main pocket. Even though there are several works considering pockets related to the kinase active site, quantitative structural-derived comparisons are presented here for the first time [49]. Interestingly, our findings indicate that 53 main pocketbelonging positions hold structural conformer-specific information when compared with non-pocket positions. Additionally, the mapping and characterisation of reported cancer-associated variants were also studied resulting in a notorious proportion of all the 153 kinase position-holding variants (101 [% 65%]) which belong or are in close contact with this main pocket. Finally, it is interesting to highlight the importance of these main pocket-shared positions in reflecting their backbone spatial constraints to regulate protein function.

Structural conformations and sequence variants
Three-dimensional coordinates of the EGFR kinase domain conformers were retrieved from CoDNAS (http://www.codnas.com.ar/ [50]) and PDB (http://www.pdb.org [51]). Multimeric crystals were split into individual chains, resulting in a total of 103 conformers with a crystal resolution less than or equal to 3.00 Å. Available structures not already published with crystal oligomeric structures different from well-known EGFR dimeric forms or involved in heterooligomers were also removed.
Missing atoms of lateral chains were completed with the complete_pdb routine of Modeller [51,52]. Sequence variants of the EGFR kinase domain for all types of cancer were obtained from COSMIC (Catalog of Somatic Mutations in Cancer, http://cancer.sanger.ac.uk/ cancergenome/projects/cosmic/ [7,53]), specifically from the targeted screen (curated) data set (version 78, September 2016).

Pocket calculation, structural alignments and data set building
Pockets, cavities and tunnels predictions on active and inactive conformers were performed using Fpocket program [54]. Pockets related to the active site of the kinase domain, the main pocket, were manually selected by visual inspection of the active conformers PDB ids 1M14 (apo form) and 2GS6, and by considering all EGFR residues within a 5 Å radius from each atom of the ATP analog substrate-peptide conjugate in 2GS6 [55]; in this way, the main pocket of the active site was defined including 53 positions. Also, a neighborhood of close contact residues was delimited, considering residues with atoms within a 5 Å radius from each pocket amino acid. The value of 5 Å as limit to define a contact was chosen as a reasonable balance of the energetic contribution of each type of noncovalent interaction at a given distance between interacting atoms or residues [56][57][58][59][60].
As the rest of the pockets, cavities and tunnels were not shared by all the conformations, even within active or inactive groups, this study was centered on the main pocket. All versus all pairwise α-carbon structural alignments of kinase regions of all retrieved structures with a resolution equal or better than 3 Å and with the exclusions previously mentioned were performed with MAMMOTH [61]. To reduce redundancy and to avoid conformer over-representations, structures derived from the same work and with a global α-carbon RMSD equal or less than 0.50 Å were removed. This value is close to the estimated crystallographic method error [62]. The final set consisted of 58 structures. Also, all versus all pairwise structural alignments were performed using the main pocket positions for each structure. In addition to this, to study the biological information content of the 53 main pocket positions in the active-inactive conformer division, 1000 resamplings were done, choosing randomly 53 non-pocket positions each time. For each sample, the 53 selected positions were pairwise α-carbon structurally aligned, obtaining 1000 all vs. all RMSD matrices.

RMSD-based clustering
In order to explore the biological information content of the main pocket, hierarchical clusterings were performed over all vs. all α-carbon RMSD values obtained taking into consideration pocket and kinase positions. Neighbor-Joining and UPGMA clustering methods taken from the Phylip package were used (Phylogeny Inference Package, version 3.7 a) [63]. Also, to study the contribution of the 53 main pocket positions to the active-inactive conformer division, 1000 α-carbon RMSD hierarchical clusterings were estimated. They were then used to find the Majority Rule Extended (MRE) [63] and Majority Rule (MR) [64] consensus clustering (data not shown) using also the Phylip package.

Sequence variants were extracted from COSMIC and CLINVAR databases
A total of 17234 samples containing EGFR kinase sequence variants were extracted from COS-MIC [65]. A comparison with Clinvar [66] information reported no differences. Of those, 16117 corresponded to NSCLC samples. Sequence variants taken from COSMIC, included 153 positions involved in SASs (missense substitutions), 47 different kinds of deletions and 25 different kinds of insertions in the kinase region. A detailed description of these sequence variants is included in S2 Table. Sequence variants in the Juxtamembrane Segment and C-tail are also included.

Mutations mapping, DFG orientation, salt-bridge distances
The visualisation of conformers and pockets, mutation mapping and DFG orientation analysis were performed using PyMOL in the active, inactive, monomeric and dimeric conformers of the kinase domain. K745-E762 distance measurements of the N-O ion pair were performed with ad hoc scripts. Cut-off distances for the salt-bridge range were taken from the works of J. Thornton and R. Nussinov [67][68][69].

Functional structures in EGFR
As previously mentioned, active conformations have their own structural particularities and several common features that can also be extracted from inactive conformations (Fig 1). The pocket limited by the two lobes houses the active site and is central in our comparisons. It is evident at a glance, as well as from structural alignments, that some active and inactive conformations are different; however, not all conformations are easily distinguished as active or inactive. Specific examples of conformations that exhibit several structural traits of classical active conformations and others that show inactive conformations are described in the next section. Moreover, several specific characteristics appear in some groups. The established structural parameters used to discriminate active from inactive kinase conformations were evaluated in all available EGFR kinase structures, and are: DFG orientation, the distance between the N atom of the epsilon amine group and K745 and the distance between the two oxygen atoms of the gamma carboxyl group and E762. This analysis defines more than two conformer groups, allowing different classification schemes. S1 Table includes the complete set of all available experimental EGFR kinase structures, together with the current structural parameters used to describe alternative conformations. Moreover, alternative orientations of DFG motive lateral chains and distance range between K745-E762 were defined: three orientations for D and 6 for F lateral chains together with three intervals for ion-ion distances. S1 Fig shows these alternatives graphically. As seen in this figure, structural differences can be obtained by a structural comparison of the backbone. The appearance of more than two conformer groups thus does not allow a reliable differentiation of only one active and one inactive group. This impossibility motivated us to search for an alternative structural criterion for comparisons aimed at discriminating active from inactive conformations. Pocket comparisons were consequently performed.

EGFR main pocket definition, alignment and clustering
The main pocket, which involves 53 positions, was defined using structural and biological information, as described in Materials and Methods. Unlike other pockets, tunnels or cavities detected in the kinase region of the EGFR, this main pocket is present and clearly distinguishable in all conformations. Because of that, the other pockets as well as different cavities and sporadic or short tunnels were not considered. Additionally As already mentioned, the previous classification of conformations as active or inactive was taken from bibliography. The different colours reflect different groups of conformers, as defined in S1 Table, according to their structural particularities.

Describing EGFR kinase conformations by pockets
It is interesting to note that node A using pocket-based clustering mainly divides active from inactive EGFR conformations. Alternatively, node B divides conformations leaving structures 5HG7, 5HG5 and 5HG8 grouped with the active ones. In order to explore the biological information content of the main pocket positions, as explained in Materials and Methods, we performed 1000 resamplings selecting at random 53 non-pocket positions that were later structurally aligned and clustered. The1000 clusters that were obtained were used to find the Majority Rule Extended (MRE) and Majority Rule (MR) consensus clustering (data not shown). Node A shows a statistical support of 0.39 and node B of 0.21, even lower than node A support. Moreover, these nodes are absent in Majority Rule (MR) Consensus clustering (data not shown). These results highlight the biological information contained in main pocket positions in reference to their capacity to differentiate active from inactive conformations.
As it was previously mentioned, different groups of conformers have particular characteristics, sharing structural features both with classical active or inactive conformations or having their own particularities. For example, the group represented by the structures reported by Cheng et al. [17], PDB ids 5HG7, 5HG5, 5HG8 and 5HG9, exhibit a classic DFG-out conformation in the presence of ligands; however, their activation loop is in a state nearly identical to the one in the active form. In pocket clustering groups, these conformers are separated from the rest of the inactive group and also from the rest of the active conformations. Nevertheless, in complete sequence clustering, they appear close to and share a node with three (uncommon) inactive conformations, 3IKA_B, 3GOP_A and 3W2R_A, but do not share the orientation of the lateral chains of D845 and F846. 3IKA_B is the activator or donor chain in the asymmetric dimer and it packs its C-lobe against the N-lobe of 3IKA_A. As it was already mentioned, it has been proposed that the stabilisation of the EGFR kinase active conformation is mediated by the formation of this asymmetric dimer [16]. However, unfortunately, 3IKA is the only asymmetric dimer representant that we could include in RMSD clustering; because of  lower resolutions, 2JIT, 2JIU, 4G5P and 4LL0 were discarded. This donor chains, also show clear differences from all the other conformers when superposing their backbones (data not shown). Unfortunately, even if these PDBs are the only ones showing this complete configuration as asymmetric dimers in an asymmetric unit of a crystal, they all harbor the T790M variation. To explore the influence of these conformations in the native state in wild type as well as in T790M and other possible variants, good resolution asymmetric wild type dimer crystals are needed.
A significant number of disease-related EGFR kinase SASs belong to the main pocket A total of 17234 samples containing EGFR kinase sequence variants were extracted from COS-MIC [65]; a comparison with Clinvar data [66] did not report changes. From these, 16117 correspond to NSCLC samples. In terms of the different variations that are represented, sequence variants taken from COSMIC include, in the kinase region, 153 positions involved in SASs (single amino acid or missense substitutions), 47 different kinds of deletions and 25 different kind of insertions. A detailed description of these sequence variants is included in S2 Table,

Discussion
The well-recognized importance of protein dynamics, the existence of an ensemble of conformations in the native state of a protein [1], and changes in pockets' structural features [43] to improve our understanding of protein function make them essential aspects to take into account in normal as well as in disease-related states. In terms of health care, for some wellcharacterised pathologies at the molecular level, patient exon sequencing is, nowadays, conducted much more frequently than in the past and provides very valuable, but not always conclusive, information. Increasing our knowledge on protein function underlying mechanisms would certainly have an impact on our understanding of sequence variants effects on patients.
Although there are well-characterised sequence variants in terms of drug response in EGFR, serious limitations in treatment decisions appear in some cases as a result of the incomplete characterisation of those previously unreported or with controversial classification. In the present work, we studied the structures of previous experimentally determined EGFR kinase domain structures, including the analysis of pockets and cavities. Moreover, all the cancer related sequence variants included in well-curated databases were structurally mapped in different EGFR kinase domain conformations. We found that it is possible to discriminate previously reported conformations as active or inactive, as well as subsets with structural particularities, by performing a main pocket structural comparison using α-carbon RMSD-based hierarchical clusterings. Additionally, %65% of kinase positions with reported variants in  [4,5-b]azepine-derived inhibitor (PDB id 3W32), active chain in asymmetric dimer with WZ4002 irreversible inhibitor (PDB id 3IKA_A), and inactive chain in asymmetric dimer (PDB id 3IKA_B). The most frequently affected sites (those in red, violet and dark blue) map in the catalytic pocket and are found in particular regions: the activation segment (with the classical L858R), the Gly-rich loop, and the region that connects both lobes of the kinase. Interestingly, most of the residues that serve as the docking site for the substrate peptide are not affected by variations (circled with a dashed black line). The total number of patients affected by cancer are in, or in close contact to, this main pocket. The enrichment in disease-related variants of the main pocket position and its contacts compared to the rest of the kinases reflects their functional importance and their putative effect on protein activity in disease. Fig 5 includes a map of cancer-related variations in the EGFR kinase, reflecting how these are enriched in the main pocket.
Previously established conformation classification criteria were sometimes not conclusive due to the presence of intermediate positions, distances, orientations or angles [72]. It is also interesting to note that hierarchical clusterings, taking into account all kinase or main pocket positions, are similar but with some differences. However, the most significant finding of the present work is that only 53 main pocket positions contain the structural information necessary to discriminate active from inactive conformations, also allowing the grouping of atypical conformations. This finding reflects both the biological meaning and the importance of pockets in protein function and their relationship with reported disease-related sequence variants. Thus, only 53 specific positions entail functional significant information, sustained by the low support of nodes A and B as a result of the resampling analysis of 1000 replicates performed after taking 53 positions at random over a total of %220 kinase non-pocket positions. Moreover, of these 53 main pocket positions, an important fraction, 44 (% 83%), are structurally defined in all conformers. This observation reflects, together with the fact that the main pocket was the only recognizable one in all the conformers, the structural importance of these positions and their structural constraints. The information content rests, in addition, in backbone coordinates.
It is noticeable that our results agree with the conclusions of previous works. Firstly, conformations with a structural organisation that is intermediate between classically active or inactive exist [17,73]. Secondly, several conformations with sequence variants, L858R and/or T790M, belong to the inactive group according to the work of Gajiwala et al. [74]. In their work, thermodynamic stability analysis of these structures supported their conclusions. These sequence variants could be activating because of conformational equilibrium displacement not being necessary to assume that the kinase adopts an active conformation (constitutively) to explain their impact on kinase activity. A ligand's presence, its concentration and the environmental physicochemical conditions should also alter conformer populations, displacing the equilibrium of alternative conformations of L858R and/or T790M variants which, by themselves, would not be able to significantly shift the conformational equilibrium towards active forms.
Even though several groups have studied pockets related to the kinase active site, this work presents, for the first time, quantitative calculations using main pocket structural comparisons to allow a better discrimination of conformations. This work extends the quantitative study of conformers with different activities. Moreover, this analysis may help to better structurally characterise and, consequently, distinguish different sequence variants which could impact on decisions related to patient treatment and the design and selection of inhibitors for disease.  Describing EGFR kinase conformations by pockets 3GT8 and 2GS7, as orange and yellow balls respectively) and active chains in asymmetric hetero-dimers (PDB ids 4RIX, 4RIY, 4RIW, as green balls). (B) Different D855 orientations (Dfg). IN, pointing to Mg2+ in active site. OUT and UP, two different orientations pointing away from the active site. (C) Alternative F856 positions (dFg). Numbers 0 to 5 represent the different categories used to divide conformations (a more detailed description is included in S1 Table).