Recognition of sites of functional specialisation in all known eukaryotic protein kinase families

The conserved function of protein phosphorylation, catalysed by members of protein kinase superfamily, is regulated in different ways in different kinase families. Further, differences in activating triggers, cellular localisation, domain architecture and substrate specificity between kinase families are also well known. While the transfer of γ-phosphate from ATP to the hydroxyl group of Ser/Thr/Tyr is mediated by a conserved Asp, the characteristic functional and regulatory sites are specialized at the level of families or sub-families. Such family-specific sites of functional specialization are unknown for most families of kinases. In this work, we systematically identify the family-specific residue features by comparing the extent of conservation of physicochemical properties, Shannon entropy and statistical probability of residue distributions between families of kinases. An integrated discriminatory score, which combines these three features, is developed to demarcate the functionally specialized sites in a kinase family from other sites. We achieved an area under ROC curve of 0.992 for the discrimination of kinase families. Our approach was extensively tested on well-studied families CDK and MAPK, wherein specific protein interaction sites and substrate recognition sites were successfully detected (p-value < 0.05). We also find that the known family-specific oncogenic driver mutation sites were scored high by our method. The method was applied to all known kinases encompassing 107 families from diverse eukaryotic organisms leading to a comprehensive list of family-specific functional sites. Apart from other uses, our method facilitates identification of specific protein interaction sites and drug target sites in a kinase family.


Introduction
Protein kinases, as key regulators of cellular functions, are among the largest and most diverse protein superfamilies known [1,2]. On account of their phosphotransfer function to a Ser / Thr / Tyr residue in eukaryotes, they are also known as STY kinases. Congruent to their function as molecular switches that determine outcome at critical decision points of cell signalling pathways, their indispensable nature is reflected by the conservation of at least 51 unique kinase families across phyla, from yeast to mammals [3]. During the course of divergent evolution, the broad catalytic function and the 3-dimensional fold are well conserved [4]. Despite the monotony, kinases exhibit high diversity in terms of differences in activating triggers [5][6][7][8], regulatory mechanisms [9,10], cellular localisation [11][12][13], domain architectures [1] and substrate specificity [14]. In this context of dualism of similarity and differences, the modules responsible for common and preserved features, like, ATP binding [15], phosphotransfer [16] and 3-dimensional conformation of active state [17] are well known. However, the correlates of kinase-specific functional and regulatory attributes, which differentiate one kinase from another, are not completely understood. Indeed, for many kinases the sites of functional specialization is yet unknown.
In the present study, we aim to identify the sites of functional specialisation in all known eukaryotic protein kinase families. These are residues involved in specific protein-protein interactions, cognate substrate recognition, response to specific signals, etc., and thus are the defining and discriminating attributes of the corresponding kinase. Clearly, knowledge of such sites finds immense application in designing kinase-specific inhibitors, protein engineering and recognition of interaction partners. Such sites should ideally be identified by traditional experimental methods like mutation studies and structural analyses using X-ray diffraction; but these are evidently slow processes as in-depth information on family-specific functional sites is so far known only for a few kinase families such as PKA [18], Src [19], MAPK [20] and CDK [21,22]. In this scenario, we have analysed the sequences of all known STY kinases, comprehensively studied the conservation patterns within and across kinase families, devised a unified scheme and identified kinase family-specific functional sites in each of them.
Residues of functional specialisation in a particular kinase family, say PKA, are by definition, crucial for PKA-specific functions and regulatory mechanisms, and thus are expected to be conserved in all PKAs [23]. Additionally, since the associated function / regulation itself is PKA-specific, evolutionary pressure for conservation of these sites exists selectively in PKA kinases. As a result, such sites also possess the discriminatory ability to distinguish PKA from other kinase families like PKC and Src. Following this rationale, we use two cardinal properties of family-specific functional sites, viz., (i) differential conservation and (ii) discriminatory ability, to identify them.
In the past, several studies have attempted to delineate functionally characterised residues in a family of homologous proteins [24][25][26]. Some methods like evolutionary trace analysis [27] and energetics-based predictions [28] rely on protein structure to identify protein-ligand and protein-protein interfaces. Other methods perform hierarchical analysis [29,30], statistical analysis [31][32][33][34], ortholog and paralog investigations [35,36], calculation of rate of evolution [37] and log-likelihood analyses [38] of protein sequences to identify specificity determinants [39]. These studies may measure absolute conservation of amino acids [27,40], conservation of physicochemical properties [29], correlated mutations [41,42], Shannon entropy and mutual information [35,36,[43][44][45][46][47][48][49][50], and probability [51,52], among others [30,46,[53][54][55][56][57][58]. However, these methods followed an all-or-none approach, in which a residue is labelled either functional or unimportant. The emerging picture of modularity, within and outside of a protein domain, is increasingly pointing towards a continuum of functional importance of residues and regulatory features [18]. Further, these studies carry the limitations of the individual quantification methods used. Most protein kinase studies have considered the entire superfamily of protein kinases as one cluster [59], while others looked into specificity determining residues at the group level [60]. In the current study, we propose an integrated scheme which uses the advantages of several methods (conservation of physicochemical property, Shannon entropy and random probability distribution) and scores the sites on a continuous scale of their functional / regulatory specificity at the family level.
We systematically compiled a dataset of 5488 kinase catalytic domain sequences belonging to 107 distinct kinase 'families' [61,62]. After aligning them into a single multiple sequence alignment, we comparatively analysed the amino acid distributions in topologically equivalent positions of different families. Based on 3 different analytical measures, we identified familyspecific functional sites that are differentially conserved in each of the 107 families. By maximising the discriminability between the kinase families, we integrated the results of the three measures and devised a unified scoring scheme called ID_score. We assessed the competence of this method by testing its ability to (i) cluster kinase sequences into groups and families, (ii) aid a linear classifier in predicting the family of the kinase, and (iii) identify experimentally determined kinase-specific functional sites like protein-protein interaction sites in CDK, substrate recognition sites in MAPK and specific cancer-causing driver mutation sites. Finally, we recognise the sites of functional specialisation in all known kinase families and demonstrate one of the applications of this method in the prediction of specific protein-protein interaction sites. In summary, we developed an integrated discriminatory method to identify regions of functional specialisation, validated the results for known cases and applied the method to all known kinase families to present an exhaustive list of sites of functional specialization in all the kinases involved in this study.

Results
The results of the study are organised into four major sections: (i) dataset curation, explaining the method of selection and organisation of STY kinase catalytic domain sequences, (ii) method development, detailing the rationale and protocol to identify differentially conserved sites in kinases and maximise the discriminability among them, (iii) method assessment and validation, elucidating its performance by application to known sites of functional specialisation in a few well studied kinases, and (iv) application, demonstrating a feasible avenue for practical use of the method and applying it to all known kinases.

Dataset curation
The primary rationale behind our method is to recognize sites in the kinase catalytic domain that are conserved uniquely within a family of kinases [27]. This passively assumes the existence of a reliable system of classification of all known STY kinases into families. Such an empirical and curated system of kinase classification developed after a series of comprehensive studies [3,[63][64][65][66], KinBase (KB), is illustrated in Fig 1A. This system of hierarchical classification clusters Depicted is a schematic of the work flow followed to achieve a master alignment of all known kinase the STY kinases broadly into 'groups' and more finely into 'families'. For demonstration purpose, 7 randomly selected kinase families (cask, camk-tt, musk, utk, sgk495, kin6 and czak) are depicted as coloured circles (yellow, orange, light blue, light green, dark green, dark blue and brown respectively) among other families (gray circles) (Fig 1A). Kinases belonging to these families, as identified by KB, are enlisted inside the corresponding circles.
We note that the KB system of classification, and thus the KB_sequence-to-family mapping (Fig 1A), is available for kinases from as many as 15 species. However, in the exigency of the study, it is vital to construct a dataset with kinase sequences as diverse as possible within each family so as to distinguish sites that are truly conserved through the course of evolution from those that accommodate variation without affecting the stability and function of the protein. Thus, we aim to augment the KB_sequence-to-family mapping ( Fig 1A) with additional sequences from other organisms/phyla. To this end, we first identified the genes of every sequence in the existing mapping by individual BLASTs [67] against UniProt [68], and curated a gene-to-family mapping ( Fig 1B). Gene names of kinases belonging to different families are enlisted inside the corresponding family circles in Fig 1B. We note that gene names could not be identified for a few uncharacterised sequences (e.g., sequences from family camk-tt) in the KB mapping. We then enriched the KB_sequence-to-family mapping with sequences of corresponding genes from other phyla / organisms ( Fig 1C). While doing so, care was taken to include only non-fragment sequences of eukaryotic lineage with kinase domain annotation (Pfam IDs: PF00069 and/or PF007714) [69]. In the cases of ambiguous association of the same gene to multiple families in KinBase, which is rare, kinase sequences of the corresponding gene from all organisms were eliminated from the dataset. This resulted in UniProt_ID-tofamily mapping of 34,881 kinase sequences into 164 families (Fig 1C), consisting of sequences originally present in KB as well as their orthologues in other species (full dataset available as S1 File). In Fig 1C, the UniProt IDs of kinase sequences belonging to different families are enlisted in the corresponding family circles.
The dataset was further subjected to filters and constraints (see Dataset sub-section in Methods section) to eliminate ambiguity and extract the kinase catalytic domain region from full length sequence, as described below. After eliminating sequences with ambiguous associations with multiple families, a template sequence was chosen for every family. Choice of the template sequence was based on (i) availability of information on the boundary (region) of kinase catalytic domain in the full length sequence [70], and (ii) structurally well-studied nature of the sequence as reflected by the highest number of available crystal structures for the kinase when compared to other kinases within the family. In case of absence of crystal structures for a kinase family, a sequence with known boundary of kinase catalytic domain was randomly chosen as the template. Next, all sequences in a family were aligned using MAFFT [71] and the kinase catalytic domains were extracted for all sequences based on the boundary of the catalytic domain of the template sequence. Kinase catalytic domain sequences thus extracted were clustered at 90% sequence identity [72] to remove any bias or redundancy in the dataset. In Fig 1D, the unbiased dataset of kinase domain sequence-to-family mapping is illustrated, wherein the UniProt IDs and the boundaries of kinase domain are enlisted in the corresponding family circles.
Finally, all the kinase domain sequences across families were aligned into a single multiple sequence alignment (See Alignment section in Methods) of 5488 sequences from 107 families of 7 distinct groups, which serves as an input to our method. During the alignment process, a input to the ID_score method. The final dataset of 5488 kinases of 107 families in the multiple sequence alignment are represented by their UniProt IDs and kinase domain extents in (E). few sequences that could not be aligned confidently were discarded and the kinase catalytic domain boundary was further pruned in order to trim the flanking gap regions in the termini. In Fig 1E, the UniProt IDs and the boundaries of kinase domains, as present in the final multiple sequence alignment, are enlisted in the corresponding family circles (full alignment available as S2 File).

Development of a method to predict sites of functional specialization
By parsing the alignment generated in the previous step, we set out to pinpoint the uniquely conserved sites that maximise the discriminability of the family from the rest. This rationale calls for a systematic position-wise comparison of the residues populating the family of interest with those in the other families in a quantitative manner. In the past, several attributes were shown to be useful measures to quantify the residue distributions [27,29,35,40,43,44,49,52,53]. We used 3 such attributes, conservation of physicochemical property (pc), Shannon's entropy (ent) and statistical probability (prob), to measure the similarities and differences in residue distributions between families. Later, we integrated the results of the 3 measures and devised a single scheme (ID_score) that scores the kinase residues in a manner that is reflective of the uniqueness of the site to the family.
Pipeline of the method. The master alignment of 5488 sequences belonging to 107 families, indexed from f1 to f107, is schematised in Fig 2A depicting parts of sequences of families f1 (PKA, blue), f2 (PKG, red), . . ., f107 (CDK, green). It should be noted that the alignment depicted in Fig 2A is only illustrative, and does not reflect the entire master alignment. Also, for purpose of clarity, we outline the pipeline of the method that identifies family-specific sites for a family of interest (FOI) f1. Although the description in the section and illustration in Figs 2 and 3 are limited to f1, the entire procedure was repeated considering every family (f1, f2, . . ., f107) as the FOI in order to identify family-specific sites in each of them.
An overt means to identify family-specific sites would be to compare the residues at every alignment position p in the FOI f1 with those at the same position in all the other families. However, f1 may be evolutionarily and functionally more closely related to a few families (say, f2 and f3) than to others (say, f4, f5, . . ., f107). Thus, some attributes of p would be commonly shared with that of the closely related families alone. As a result, upon comparison of p in the FOI f1 with that of all the other families put together, f2-f107, quantification of the uniquely conserved nature of p is not straightforward. We resolved this by performing pair wise comparisons of the FOI with every family in the dataset (nFOIs), viz., comparisons f1-f1, f1-f2, . . ., f1-f107, as illustrated in Fig 2B. The justification is that a truly family-specific position in f1 will be differentially conserved in f1 in most / all of the pair wise comparisons. On the other hand, a position in f1, say, from the ATP binding loop, whose function is globally conserved across all kinase families, will be differentially conserved only in a few / none of the pair wise comparisons. Thus, the fraction of pair wise comparisons in which p is differentially conserved give a direct weight, on a scale of 0 to 1, indicating the magnitude of uniqueness of the position to the FOI. Thus, every alignment position of the FOI was pair wise-compared with the corresponding position of each of the 107 families in the dataset (nFOIs) as shown in Fig 2B. Comparison of a position p in FOI f1 with that of, say, an nFOI f2 (Fig 2B, boxed position) was made by analysing the residues that populate the family of interest (FOI), the family being compared with (nFOI) and both the families considered together (FOI+nFOI), as illustrated in Fig 2C. It should be noted that, in order to correct for any inaccuracy in the alignment, the residues in position p in a family at a frequency less than 8.5% were disregarded before subsequent analyses. The conservation of physicochemical properties in the 3 categories, FOI, nFOI and FOI+nFOI, is quantified (See Physicochemical measure section) and is denoted as pc_FOI, pc_nFOI and pc_FOI+nFOI respectively. These were then consolidated and a binary decision, as to whether (1) or not (0) the position p is differentially conserved in the FOI f1 when compared to the nFOI f2, with respect to conservation of physicochemical properties of the residues, was taken. Since the FOI f1 was pair wise compared with each of the 107 families in the dataset, we achieved 107 binary decisions (1 or 0) for an alignment position p, one for each pair wise comparison. This procedure was repeated considering every position in the master alignment as p, giving rise to pc_decision (Fig 2D). Likewise, ent_decision and prob_decision were also calculated by measuring Shannon entropy and random probability attributes of the residue distributions respectively (See Entropy measure and Probability measure sections). The 3 methods (pc, ent and prob) weigh the FOI against nFOIs using different sets of rules, and thus might result in different binary decisions for the same position.
We note that pc_decision, like ent_decision and prob_decision, of FOI f1 is simply a matrix of 1s and 0s of size 107xN, where N is the total number of alignment positions in the master alignment. This matrix contains a binary value for each pair wise comparison at every alignment position, indicating whether (1) or not (0) the particular position is differentially conserved when compared with a particular family. The binary values across the 107 pair wise comparisons were then averaged for every alignment position in f1 to get a pc_score (Fig 2E). In other words, pc_score of FOI f1 is a vector of length N, with values ranging between 0 and 1, with 0 indicating differential conservation in the position in none of the pair wise comparisons and 1 indicating differential conservation in all the pair wise comparisons. Likewise, ent_score and prob_score for FOI f1 were calculated, based on ent_decision and prob_decision respectively. Thus, we scored every alignment position in f1 on a scale of 0 to 1 using 3 different measures ( Fig 2E). The 3 individual scores (pc, ent and prob) at every alignment position was then linearly combined (See Integrated discriminatory score section) to achieve an integrated score, or ID_score. ID_score for family f1 is a vector of length N, containing scores ranging between 0 and 1 for each alignment position, with 0 indicating no family-specificity and 1 indicating high family-specificity (data available as S3 File). The pipeline described above was repeated, considering each of the 107 families as FOI to identify family-specific sites in them. The properties that are uniformly present or absent (green) in every amino acid in FOI, nFOI and FOI_nFOI are counted and called pc_FOI, pc_nFOI and pc_FOI+nFOI respectively (B). In case of higher conservation of physicochemical properties in FOI or differential nature of physicochemical properties conserved in FOI, a decision of uniqueness (pc_decision) is declared to be 1 (C). Shannon's entropy for the distribution of amino acids in FOI, nFOI and FOI+nFOI are determined and called ent_FOI, ent_nFOI and ent_FOI+nFOI respectively (D). (E) In case of lower randomness in FOI or differential nature of low randomness in FOI, the position is declared unique (ent_decision = 1). The probability of drawing the distribution of amino acids in FOI upon repeated draws from the amino acid sample in nFOI, without replacement, is calculated and called prob_FOI (F). (G) In case the calculated probability is sufficiently low, the position is considered unique (prob_decision = 1). t p , t e and t s are the thresholds operative for the three measures pc, ent and prob respectively (See text for details).

Physicochemical measure (pc_decision).
This section explains how the conservation of physicochemical properties at a given alignment position p was quantified. Ten physicochemical properties were considered [29,73,74]: hydrophobic (I,L,V,C,A,G,M,F,Y,W,H,K,T), polar (C,Y,W,H,K,R,E,Q,D,N,S,T), small (V,C,A,G,D,N,S,T,P), Proline (P), tiny (C,A,G,S), aliphatic (I,L,V), aromatic (F,Y,W,H), positive (H,K,R), negative (E,D) and charged (H,K,R,E,D). Gaps ('-') in the alignment and unknown amino acids (X) were regarded as having none of the 10 properties. The residues in FOI ( Fig 3A) were scrutinised for the presence (Fig 3B, 1 st panel, filled black box) or absence (Fig 3B, 1 st panel, unfilled white box) of each of the 10 physicochemical properties. The number of physicochemical properties that are absolutely conserved in all the residues in FOI, nFOI and FOI+nFOI was counted and referred as pc_FOI, pc_nFOI and pc_FOI+nFOI respectively (Fig 3B, green box). It is to be noted that if a specific property is either uniformly present or uniformly absent in all the residues in a position, it is counted as a conserved property in the spirit that the presence or absence of that property is evolutionarily selected. The three category measures pc_FOI, pc_nFOI and pc_FOI+nFOI, representing the number of absolutely conserved physicochemical properties in a given position p in FOI, nFOI and FOI+nFOI respectively, could each take integral values between 0 and 10.
Based on the physicochemical measure, there are two scenarios in which position p could be declared differentially conserved in FOI when compared to the nFOI ( Fig 3C): (i) when there is conservation of physicochemical properties in FOI but not in nFOI, i.e., if pc_FOI is greater than or equal to a threshold value t p (pc_FOI ! t p ) and pc_nFOI < t p , or (ii) when both FOI and nFOI have conserved physicochemical properties, but the properties conserved in them are different, i.e., pc_FOI ! t p , pc_nFOI ! t p and pc_FOI + nFOI < t p . Thus, where pc_decision(nFOI,p) is the binary decision of whether the position p in FOI is differentially conserved in comparison with the corresponding position in an nFOI; and t p is the threshold number of conserved physicochemical properties (See Determination of threshold values section). Entropy measure (ent_decision). This measure is used to quantify the difference in a given position p between the FOI and nFOI in terms of randomness of the residues populating with the position [43,75,76]. Shannon entropy was calculated for the FOI, nFOI and FOI +nFOI categories by measuring the frequency of occurrence of residues in FOI, nFOI and FOI +nFOI respectively (Fig 3D). These are accordingly called ent_FOI, ent_nFOI, and ent_FOI +nFOI. In this measure, the frequencies of each of the 22 possible values (20 amino acids, unknown (X) and gap ('-')) were considered individually, irrespective of any physicochemical relatedness between them.
PðaÞ nFOI log e PðaÞ nFOI ent FOI þ nFOI ¼ À Based on entropy measure, position p was declared differentially conserved in FOI when compared to the nFOI ( Fig 3E): (i) if there was low randomness in FOI, but not in nFOI, i.e., ent_FOI is smaller than or equal to a threshold value t e (ent_FOI t e )and ent_nFOI > t e , or (ii) if both FOI and nFOI had low randomness, but the residues conserved in them were different, i.e., ent_FOI t e , ent_nFOI t e and ent_FOI + nFOI > t e . ent decisionðnFOI; pÞ ¼ where ent_decision(nFOI, p) is the binary decision of whether the position p in FOI is differentially conserved in comparison with the corresponding position in an nFOI; and t e is the threshold entropy (See Determination of threshold values section). Probability measure (prob_decision). This measure calculates the probability of obtaining the exact set of residues found in position p in FOI when one repeatedly draws, without replacement, from the set of residues in the same position of nFOI ( Fig 3F). This score essentially captures the chance occurrence of residues in p of FOI from those at nFOI, and ranges between 0 and 1. Using classical statistics, we can calculate this probability, prob_FOI, as follows: where {FOI} 6 ¼ is the set of non-redundant residues in position p in FOI, n k À Á represents the number of combinations of k that can be chosen from n; and |FOI| and |nFOI| are the total number of residues in position p of FOI and nFOI respectively. As a separate exercise, probability of drawing residues in FOI from the distribution in nFOI with replacement was also considered (S1 Fig) and was found to not improve the performance of the method.
We argue that if prob_FOI is sufficiently low, i.e., if prob_FOI is smaller than or equal to a threshold value t s , then the distribution of residues in FOI is different from that of nFOI and thus the position is differentially conserved (Fig 3G). Thus, where prob_decision(nFOI, p) is the binary decision of whether the position p in FOI is differentially conserved in comparison with the corresponding position in an nFOI; and t s is the threshold probability (See Determination of threshold values section). Determination of threshold values t p , t e , t s . In the method described above, a binary decision on the uniquely conserved nature of position p is made by comparing the FOI and nFOI category measures with a corresponding threshold value (t p , t e or t s ). By altering the threshold value, we tinker with the degree of uniquely conserved nature that passes off as a family-specific site. In order to deliberate the threshold values that result in meaningful delineation of family-specific sites, we used the discriminatory property of the family-specific sites.
To this end, we tested several possible values for the threshold, and optimally chose the value that maximally distinguished the families from one another. For instance, the physicochemical threshold t p was systematically tested with values from 0 to 10, at intervals of 1, and the corresponding pc_scores were calculated for every FOI, as shown in Fig 2. For a given threshold value, based on the pc_score of the FOI, conformity scores were assigned to the sequences in the FOI (family_scores) and the rest of the families (nonfamily_scores). Conformity scores reflect how well a given sequence conforms to the FOI using proportional weights at the uniquely conserved sites as measured by pc_score (See Calculation of Receiver Operating Characteristic in Methods). This process was repeated, considering every family as the FOI, to define an accumulated set of family_scores and nonfamily_scores for a given value of t p . Obviously, the family_scores are expected to be reliably higher than the nonfamily_scores if the differentially conserved sites are identified accurately. Thus, at the optimal value of t p , pc_score would best discriminate between the families, and thus between the family_scores and nonfa-mily_scores. To quantify this, we calculated the sensitivity, specificity and Receiver Operating Characteristic (ROC) to distinguish the family_scores from nonfamily_scores. For every t p value, we determined the area under the ROC curve, which is a measure of the ability of pc_score to distinguish the family_scores from the nonfamily_scores, and thus discriminate between the families (S2 Fig). The t p value that yielded the pc_score of the best discriminatory capability with maximum area under ROC curve was chosen as the optimal threshold value t p . Fig 4A shows a plot of the calculated area under the ROC curve for all possible t p values. We find that a t p value of 4 resulted in the highest area under the ROC curve of 0.9904. Thus, if (i) 4 or more physicochemical properties are conserved in FOI and less than 4 physicochemical properties are conserved in nFOI, or (ii) 4 or more physicochemical properties are conserved in FOI and nFOI, but less than 4 physicochemical properties are conserved in FOI+nFOI, the position is decided to be uniquely conserved in FOI with respect to nFOI.
The optimisation procedure described above was followed for the determination of t e and t s values as well. In the case of t e , values from 0 to 2.5, at intervals of 0.125, were tested. For each t e value, ent_score was calculated and the corresponding area under the ROC curve was deter- Integrated discriminatory score (ID_score). The final step in the method is to combine the three measures, pc_score, ent_score and prob_score, and achieve an integrated ID_score. To this end, at every alignment position in a family, we linearly combined the 3 corresponding scores, weighing them appropriately. For an FOI, where ID_score p is the ID_score at position p of the FOI; pc_score p , ent_score p and prob_score p are the pc_score, ent_score and prob_score of the FOI at position p respectively; and pc_wt, ent_wt and prob_wt are the corresponding weights, which were optimised to provide the highest discrimination between families. Each weight, pc_wt, ent_wt and prob_wt, was independently varied from 0 to 1; and the corresponding ID_score and area under the ROC curve were calculated as described previously. This is to achieve an optimal integration of the 3 measures such that it maximised the capability of the ID_score to discriminate between kinase families. In Fig 4D, we show a 3-dimensional plot with axes for weight of pc_score (pc_wt), weight of ent_score (ent_wt) and weight of prob_score (prob_wt). The area under the ROC curve observed for each combination of weights is plotted in a coloured scheme, with blue indicating the least and red indicating the largest area under the ROC curve. We found that if the three measures, pc_score, ent_score and prob_score are linearly combined with corresponding weights of 0.615, 0.385 and 0.0, the ID_score had the highest discriminatory ability (area under ROC curve = 0.992). It is to be noted that the inclusion of prob_score in the method decreased the discriminability of FOI from the rest of the families, and thus the prob_score was optimally weighed at 0.0. Thus, we rejected the prob_score, and combined the pc_score and ent_score in a ratio of 0.615:0.385 to achieve the final ID_score.
In summary, through a process of comprehensive measures and rigorous optimisation, we developed a method to score the sites in 107 families of protein kinases that reflects the specific nature of the site to the family (available as S3 File). Fig 4E shows the complete set of ID_scores calculated for all the 107 families at every alignment position. The colour at each position signifies the ID_score in a blue-to-red scheme, with blue indicating the least ID_score and red indicating the highest. As discussed above, ID_score ranges from 0 to 1, where 0 indicates no specificity and 1 indicates high specificity of the site to the family.

Assessment and validation of ID_score
The ID_score depicted in Fig 4E quantifies the sites based on their differentially conserved nature and the ability to differentiate the family from the other families. Thus, regions of absolute conservation in STY kinases conferring global functions like ATP binding, phosphotransfer catalysis, and overall structural stability of the kinase fold are scored poorly. On the other hand, sites involved in family-specific functions and regulations are likely to be scored favourably. We assessed this by testing the ability of the proposed method to discriminate between the kinase families and identify known family-specific functional and regulation sites.
ID_score identified sites cluster sequences into groups better. In Fig 4E, we observe that there are a select few alignment positions which are consistently scored high in most families. In a manner, these positions are possible "specific functional hotspots" that are highly conserved within families, but the nature of conservation differs across families. Due to this property, these positions are hypothesised to have high information content to discriminate families from one another. We tested this hypothesis by first identifying the sites with an ID_score greater than or equal to 0. In the first category, using all the 1094 positions in the alignment as input, we constructed a phylogenetic tree [77] (See Phylogenetic tree in Methods) of all the 5488 sequences in the dataset. In Fig 5A, the tree is illustrated with branches colour-coded according to the KinBase [62] 'group' of the leaf kinase. It can be appreciated from the tree that although complete information available in the sequences is used as an input, kinases of group CAMK (pink) and CMGC (yellow) are split into 3 and 2 separate clusters respectively.
In the second category, using only the 194 alignment positions identified by ID_score, we constructed a similar phylogenetic tree of all the 5488 sequences ( Fig 5B). In theory, since only a fraction of information is used as input to the tree, we expect the tree to be more error-prone in comparison to Fig 5A. On the contrary, we observe good clustering of all the groups except TKL (Fig 5B, dark blue). This is possibly due to successful filtering of the noisy and indiscriminate sites, and use of only the most informative and discriminative positions. Upon closer examination, we find that the smaller cluster of TKL (Fig 5B, dark blue), separated from the primary TKL cluster, is predominantly composed of sequences of the STKR (Serine Threonine Kinase Receptor) family (S5C Fig). This is indeed interesting because although classified within the Tyrosine Kinase-Like (TKL) group, STKRs are the only receptor kinases that phosphorylate Ser / Thr residues on the substrates and thus have the properties of both the Tyr phosphorylating kinases (TK and TKL) and Ser / Thr phosphorylating kinases. The tree built using ID_score identified sites correctly captures this by placing the STKR family between CAMK (pink) and TK (red).
In the third category, we used 194 least gapped positions from the alignment to draw a phylogenetic tree of 5488 sequences ( Fig 5C). As can be noticed, the tree is highly similar to the one constructed using the ID_score predicted sites, with good clustering of all groups except TKLs (S5D Fig). Are ID_scores merely identifying the ungapped positions in the alignment, and therefore not useful? Or, do ID_scores identify sites which contain only as much information as contained in ungapped positions? Are the trees constructed in Fig 2B and 2C truly similar to each other? To answer these questions, we probed the trees at a finer level and analysed the cluster purity in the 3 categories at the level of families, as described below.
ID_score identified sites cluster sequences into families better. We cut the phylogenetic trees (Fig 5A-5C) at different branch lengths from the root. At every cut, we counted the number of clusters obtained and calculated their purity at the level of 'families' (See Cluster purity calculation in Methods). We note that upon cutting the tree at increasing branch lengths from the root, the number of resulting clusters as well as the cluster purity increases. If cut at the extreme terminal of the tree, there would be as many clusters as there are number of sequences, and the purity of the clusters would be 1. In Fig 5D, we plotted the family cluster purity of the trees constructed using all positions, ID_score predicted positions and ungapped positions (Fig 5A-5C) as a function of the number of clusters in blue, green and purple respectively. It is clear from the plot that when the ID_score identified sites are alone used for the construction of a phylogenetic tree, the family clusters are purer (green) than when all the sites (blue) or an equal number of ungapped positions (purple) are used. We conclude that although the ID_score identified sites seemingly performed only as well as ungapped sites in clustering the kinases into groups (Fig 5B and 5C), they outperform the ungapped positions in clustering the kinases more finely into families (Fig 5D). Thus, we interpret that ID_score does not trivially pick the ungapped positions but successfully identifies sites that efficiently discriminate the kinases into groups and families.
ID_scores help classifiers predict the families of sequences. As interpreted from the previous analyses, if ID_score could indeed identify sites that contain information to discriminate between the families, we hypothesised that a simple Linear Discriminant Analysis (LDA) classifier would be able to predict the families of the kinase sequences better if trained with ID_score identified sites. To this end, we trained a pseudolinear classifier (See Classifier analysis in Methods) with the family associations of a random 90% of the sequences in the dataset. The classifier was then tested on the remaining unseen 10% of the sequences. This hold-out training and testing was repeated 10 times. Upon using all the 1094 alignment positions for training and testing, the error rate for the prediction of the family is about 8.5% (Fig 5E, blue). However, when only the 194 sites identified by the ID_score were used for training and testing, the error rate dropped to 6.7% (Fig 5E, green). This improvement in accuracy is in spite of lesser information as input (194 alignment positions as opposed to 1094 positions leads to 82% decrease in information) which should theoretically decrease the performance of the classifier. This can be seen in the orange boxplot of Fig 5E, which shows the error rate (~20%) of the classifier trained and tested on 194 random positions in the alignment. Also, an equal number of least gapped positions in the alignment, when used to train and test the classifier, made it perform poorly (Fig 5E, purple). Taken together, our analyses strongly suggest that ID_score method has successfully scored the sites in a manner that not only maximises the discriminability across families, thus increasing the efficiency of prediction of family, but also clusters the sequences into groups and families in an accurate and meaningful way.
Identification of family-specific protein-protein interaction sites: A case study with CDK. We have so far demonstrated the ability of the proposed method to identify sites that efficiently differentiates the families of kinases. However, a more meaningful and biologically relevant validation of the method is to check if the method identifies the known family-specific functional sites. For case study, the widely studied CDK family of kinases was chosen. Upon analysing iPfam [78], the database of all domain-domain interactions in protein crystal structures, we identified 4 CDK-specific interacting partners: Cyclin_N (Pfam ID: PF00134), Ank_2 (PF12796), CDKN3 (PF05706) and CKS (PF01111). It is known through several experimental studies and structure determination that these interactions are specific to kinases of CDK family and are unknown to occur in other families of STY kinases. We have illustrated the structural complexes of these interactions in Fig 6A-6D, where CDK is represented in cartoon and the interacting partner is shown in green sticks. The residues in CDK are coloured red or gray depending respectively on whether or not they interact with the partner in the complex. The distance criterion for interaction is considered as 6 Å. In Fig 6E, cartoon representation of CDK, whose residues are coloured in a gray-to-red colour scheme based on their ID_score, is depicted. We observe that the sites known to be involved in family-specific protein-protein interactions (Fig 6A-6D, red) are indeed scored high by the proposed method (Fig 6E, orange and red). As seen in Fig 6F, the residues involved in family-specific interactions have ID_scores significantly higher than all residues (p-value < 0.001), non-interaction residues, random residues and non-specific functional sites (ATP binding loop, catalytic loop, salt bridge residues, DFG and APE motifs). It should be noted that not all the interacting residues, as identified by the distance criterion from the crystal structures, may contribute equally towards the interaction or binding energy. Vice versa, not all residues with high ID_scores ought to be involved in a specific protein-protein interaction. Some of them may play roles in other family-specific functions / regulations. However, we quantified the same and conclude that, as a general trend, the proposed method identifies sites involved in family-specific protein-protein interactions.
Identification of family-specific substrate recognition sites: A case study with MAPK. MAPK is a family of CMGC group of STY kinases, whose substrate recognition sites and motifs are well studied. It is known through several biochemical and structural studies that MAPK stabilises the kinase-substrate complex through interactions at distal docking sites in the kinase domain [79][80][81][82][83]. These distal sites are away from the kinase active site, and recognise specific sites on the cognate substrates. Two such docking sites, D-site and F-site, predominantly dictate the substrate specificity of MAPKs (Fig 6G), apart from the P+1 site. Consolidated from several crystal structures of MAPK complexes with substrate peptides and mimics, we mapped the D-site and F-site residues that interact with substrates (Fig 6G, red). Fig 6H shows the MAPK fold in which the residues are colour coded in a gray-to-red scheme based on the ID_score of the sites. It can be seen that the sites experimentally known to recognise substrates in MAPK (Fig 6G, red) are scored high by the current method (Fig 6H, orangered) as well. This excellent agreement between the known substrate recognition sites and ID_score identified family-specific sites is quantified in Fig 6I, where the substrate recognition sites have significantly higher ID_scores than the other sites in comparison (p-value < 0.001).
The ability of our method to identify the family-specific functional sites in CDK and MAPK, defined from X-ray crystal structures (Fig 6A-6D and 6G, red), was then compared with that of a previously published method referred to as real-value evolutionary trace (rvET) [44]. rvET is a hybrid method that combines entropy and phylogeny information, further drawing from experimental structures. The phylogeny aspect in rvET can be adjusted to cluster the nodes at different distances from the root, thereby making it possible to isolate kinase families from one another at suitable thresholds. The sensitivity and specificity of identifying the family-specific functional sites were calculated at a series of score thresholds and plotted in Fig  6J. From the plot, it can be appreciated that ID_score identifies the family-specific functional sites with better accuracy than rvET.
Identification of family-specific oncogenic driver mutation sites. Being implicated in numerous cancers, STY kinases have an extensive literature on the cancer causing mutation sites [84,85]. We hypothesised that if a mutation disrupts the activity of the kinase by activating or inactivating it, by direct or indirect means, the residue ought to have been crucial for the functionality of the kinase. In other words, the cancer causing mutation sites, if family-specific, The residues in kinases are coloured red or gray depending on whether or not they form contact interface with the binding partner respectively. (E) CDK is represented in cartoon, and the individual residues are colour coded in a gray-red scheme depending on the ID_score of the sites, with red representing high ID_score and gray representing low ID_score. (F) The distribution of ID_scores of the known specific interaction sites is compared with that of all the residues, non-interaction sites, random sites and nonspecific functional sites in CDK. It is seen that the specific interaction sites are scored significantly higher by the ID_score than the other comparisons. (G) Residues experimentally known to be involved in cognate substrate (green sticks) recognition in MAPK (cartoon) is shown in red. (H) MAPK is illustrated in cartoon representation, with individual residues coloured according to their ID_scores. (I) The ID_scores of specific substrate interaction sites are significantly higher than that of all residues, non-interaction sites, random sites and nonspecific functional sites in MAPK. (J) The performance of ID_score method (orange) in detecting the family-specific functional sites in CDK and MAPK, in terms of sensitivity and specificity, is plotted along with that of real-value evolutionary trace (rvET, gray) method. Ã indicates a pvalue < 0.001. are most likely family-specific functional sites and thus will be identified by the present method. From KinDriver [86], a database of all known cancer causing driver mutations in kinases, we identified the missense mutation sites that were recorded only in a specific family of kinases with a relative frequency of >2. The identified family-specific cancer causing driver mutation sites are represented as spheres in Fig 7A-7H in families ACK, ALK, EGFR, FGFR, JAK, RAD53, PDGFR and RET respectively. The mutation sites, as represented by spheres, are coloured in a gray-to-red scheme, representing the ID_score of the site. We can appreciate that most of the sites are scored highly by the ID_score method and are thus represented in the orange-red range. Quantitatively, we find that the known family-specific driver mutation sites were scored high by the ID_score method (Fig 7I) when compared to other residues.

Application
So far, we have discussed the development of the ID_score method and its successful identification of known family-specific functional sites in various families. In this section, we present the application of the method in falsifiable prediction of family-specific functional sites for all known protein kinase families and discuss the predictions made for PKG and PKC families.
Prediction of family-specific protein-protein interaction sites: A case study with PKG and PKC. We parsed BioGrid [87], a database of all experimentally known domain-domain interactions, and identified all interactions of kinases that were verified by at least two different techniques. From this highly confident set, we identified that the WD Repeat domain 77 (WDR77) specifically interacts with the kinases of the PKG family. Although several studies have shown the interaction [88], the site of interaction in the kinase domain is unknown. Cartoon and surface representations of a kinase fold is depicted in Fig 8A and 8B respectively, in which the residues are coloured in a gray-to-red scheme based on the ID_score of the sites. We observe a cluster of solvent accessible, high ID_score sites in close proximity to each other at the conjunction of αE, αF and αH helices (Fig 8A and 8B, green circle). We predict that this region is the potential WDR77 interaction region in PKG. Likewise, we also identified that C1QBP (Complement 1, Q Subcomponent Binding Protein) specifically interacts with kinases of PKC family [89]. As can be seen in the ID_score-colour-coded cartoon and surface representations of PKC, a cluster of family-specific functional sites are seen around the αG helix (Fig 8C and 8D, green circle). We predict that αG helix is involved in the binding of PKC kinases with C1QBP. For the purpose of visualisation, we illustrate the PKG-WDR77 (Fig 8E) and PKC-C1QBP (Fig 8F) complex models docked at the interface predicted by ID_score.
Prediction of sites of functional specialisation in all known kinases. Encouraged by the ability of the method to identify accurately the functionally specialised residues and cancer driver mutation sites in some of the well-studied kinase families, we present the predicted sites of functional specialisation in all the 107 kinase families used in the study (predicted sites are documented in the S3 File). Each site in every family is given a score that ranges between 0 and 1, where 0 indicates no functional specificity and 1 indicates high functional specificity. We emphasise that our method identifies sites of functional specialisation in a family and not global functional regions like catalytic loop and ATP binding site.
To our knowledge, this is the first and only available resource that provides the functional sites in a large scale to the kinase community. We hope that this serves as a useful starting point for biochemical and structural studies as well as insilico drug targeting studies. The users are encouraged to set arbitrary cut-off values of the ID_score to identify the sites to pursue. For instance, for an uncharacterised kinase family under study, the users may set a cut-off of, say, 0.4 and restrict their preliminary analyses to sites with an ID_score of ! 0.4 in the family. This lets the user work with a tangible set of residues, which may further be guided by other known features of the kinase. The current study also opens up a consequent scope to find the functional implications of the identified sites. As already noted, the sites predicted by our ID_score identifies family-specific oncogenic driver mutation sites. Driver mutation sites that are specific to ACK, ALK, EGFR, FGFR, JAK, RAD53, PDGFR and RET (A-H) are represented as spheres in the corresponding kinase folds, and coloured according to their ID_scores. (I) It is seen that the specific mutation sites are scored significantly high by the ID_score method, when compared to all residues, non-mutation residues and random residues in the families. Ã indicates a p-value < 0.001.

Discussion
An emerging picture of signalling networks strongly suggests a complex system of organisation with regulatory orchestration at multiple levels [90]. Traditionally, this complex scheme of  A-B, green  circle). This region that is formed at the junction of αE, αF and αH-helices is a putative candidate region involved in specific protein-protein interaction, potentially with WDR77. PKC kinase is shown in cartoon (C) and surface (D) representations, with the residues coloured according to their ID_scores. A cluster of high ID_scored residues that are solvent accessible and spatially proximal to each other is marked (C-D, green circle). This region that is formed at the junction of αG and αH-helices is a putative candidate region involved in specific protein-protein interaction, potentially with C1QBP. Models of (E) PKG-WDR77 and (F) PKC-C1QBP complexes docked at the ID_score predicted interfaces are presented.
https://doi.org/10.1371/journal.pcbi.1005975.g008 control has been studied in a reductionist approach, attributing modularity to the interacting proteins, associated domains, scaffold proteins, and hubs in the network [91][92][93][94]. If we extend the theory of modularity and reductionism to within a domain, we expect division of the multitude of specific functions of the domain to certain subparts, conventionally known as functional motifs. Supporting evidence of this theory includes alteration and loss of specific function upon mutation of certain residues. Additionally, regardless of whether function of a protein is modular or emergent, it is well known that subtle functional differences between proteins reside in a few key residues.
On an average, an STY kinase catalytic domain is 250 residues long. Within this region exists information for (i) global attributes like stable structure formation and catalysis of phosphotransfer, and (ii) specific attributes like recognition / phosphorylation of its cognate substrate, interactions with specific binding partners and response to specific signal. Identification of functional sites / motifs conferring global attributes is feasible through experimental mutagenesis studies and in silico detection of conserved sequence patterns across all kinases. For instance, global functions like ATP binding and catalysis are attributed to the GXGXXG and HRDLXXXN motifs respectively. On the other hand, specific functional attributes are not only challenging to detect using experimental methods, but also difficult to identify in silico. In the present study, we have identified the family-specific functional sites in all known families of eukaryotic protein kinases.
Previous studies that attempted to identify differentially conserved family-specific functional sites relied heavily on the measurement of a single property and lacked an integrated approach. In the present study, we have for the first time used physicochemical, entropy and probability measures to identify family-specific functional sites in the most meaningful manner. The pc_score, which scores a position based on the conservation of physicochemical properties, tolerates an Arg to Lys substitution more than an Arg to Leu substitution. The ent_score, on the other hand, tolerates certain other substitutions better, and detects conservation of a position that could be missed out by pc_score. For instance, if a position in an FOI is populated with 2 physicochemically dissimilar amino acids, and the equivalent position in nFOI is populated with 6 different hydrophobic amino acids, pc_score would disregard the position, but ent_score classifies it as differentially conserved in FOI. During the course of the study, many metrics including mutual information, that measures correlated mutations between residue positions, were considered to score the uniqueness of sites. We found that the area under the ROC for the mutual information score was not better than that obtained through the individual Shannon entropy (ent_score). This is because, unlike methods that stringently mandate absolute conservation of an amino acid or physicochemical property, ent_score allows for limited number of substitutions at a residue position, thereby not punishing the correlated mutation sites. Although the correlations between residue positions is unidentified by our method, the individual sites harbouring correlated mutations are still scored high by the ID_score. This is evident by the fact that ID_score performs better than the rvET method, which considers covariances.
The third measure, prob_score, calculates the probability that the exact set of amino acids at a position in FOI can be drawn from a pool of amino acids in nFOI. This is a highly stringent score that punishes dissimilarity heavily and gives a high score only if the amino acids in the FOI and nFOI are similar in composition. One could imagine why such a stringent measure failed to add to the discriminability across kinases in the integrated method. Skew and lack of variance in the distribution of prob_scores made the discriminability between FOI and the rest of the families poorer. As an extreme example, the absence in nFOI of one of the amino acid residue types found in FOI will directly lead to a poor score even if the other amino acids residue type distributions matched well.
We note that ID_score primarily uses the multiple sequence alignment and prior classification of sequences into families as input to calculate the specificity determinants. As a result, error in the alignment, especially in cases of large and diverse superfamilies with sequences less than 20% sequence identity, can affect the accuracy of the method. Similarly, incorrect classification and subgrouping will also propagate through the method and result in reduced accuracy.
A couple of previous studies have looked at group-specific functional signatures and specificity determinants in kinases [60,95]. For the purpose of the current study, we intended to understand the family-specific functional sites in kinases, which will be of immense value to suppress drug cross-reactivity. Using the same method to identify group-specific functional sites is an interesting proposition that warrants a separate study by itself. Group-specific features like C-terminal tail binding of AGC-group kinases and SH2 domain interaction of TKgroup kinases may be predicted if ID_score is used at the group-level.
After identifying the family-specific sites in all the 107 families of kinases by maximising the discriminability across families, we might ask if all the identified sites are indeed functional. Although, we have addressed the question by showing excellent agreement between the ID_scores and known family-specific functional sites, answering the question in its entirety is difficult. Nevertheless, it is advantageous to know the bare essential set of sites that renders PKA different from, say, Src. In fact, if needed, pair wise comparisons of families can be easily extracted from the ID_score method, and the specific set of residues that differ between the two families can be identified. This leads us to an interesting thought experiment: if we replaced the bare essential sites in PKA that differentiates a PKA from Src with those of Src, would we expect the PKA to inherit some of the properties of Src kinase? If the modularity of functionality within the kinase domain is valid and we have identified the family-specific functional sites, including the redundant ones, one should expect so. Extending this thought experiment, if one were to identify the functions of each of the family-specific sites identified by the method, would it be possible to build a synthetic protein kinase with customised functions?
Although the literature is ripe with studies of cancer causing mutations in kinases [85], mechanistic / functional reasoning behind why a mutation derails the functionality of a kinase is understood more in retrospect than in advance. This is because a complex network of interactions and abundant redundancy in the protein's functionality makes it an extremely challenging task. In this study, we show that those cancer causing mutation sites, which occur within specific families, are identified with high scores by the ID_method. Furthermore, we also show biologically relevant clustering of kinase families, when aided by ID_score method. Taken together, our analyses suggest that ID_score can successfully predict the family-specific functional sites. Using ID_scores, we predict the interaction sites in PKG and PKC for binding WDR77 and C1QBP respectively. Although developed and demonstrated for the STY kinase superfamily, the method is inherently transferable to other protein superfamilies, and is expected to aid identification of functional sites and characterisation of ambiguous / new families.
In summary, we curated an unbiased and non-redundant dataset of 5488 sequences of kinase catalytic domains from diverse phyla belonging to 107 families of 7 distinct groups. After careful alignment of all the curated sequences, we developed an integrated approach to detect differential conservation of residues in kinase families. Using measurements of physicochemical properties, Shannon entropy and probability, we scored the selectively conserved nature of the all the sites in the kinase families. Furthermore, by maximising the discrimination across families, we succeeded in optimising the threshold criterion for each method that passes a differentially conserved position as a family-specific functional site. Finally, we integrated the 3 scores to attain a unified ID_score that scores the sites in kinase families depending on their functional specificity, characteristic to the family. We have assessed and validated the ability of ID_score to (i) discriminate and cluster the kinase families in a meaningful way and (ii) identify family-specific functional sites. Further, we demonstrate the application of the method in prediction of protein-interaction sites. Taken together, we developed an integrated method and successfully identified the family-specific functional sites in all known eukaryotic kinases.

Dataset
After UniProt_ID-family mapping was established (Fig 1C), the kinase catalytic domains were to be excised from the full length sequences. To this end, families in which information on kinase domain boundary [70] is not known for any sequence were eliminated. Then, within every family, the sequence with the most number of experimentally solved structures was identified as the most well studied STY kinase of the family, or the template sequence of the family. In case of absence of solved crystal structures for a family, a sequence in which the kinase domain boundary is known was randomly chosen as the template sequence of the family. The sequences within every family were multiply aligned [71] and the kinase catalytic domains were extracted from the sequences by mapping the topologically equivalent residues corresponding to the kinase catalytic domain of the template sequence. Thus derived kinase catalytic domain sequences of a family were further filtered to contain residue length in the range of 150 to 350 residues and clustered at 90% sequence identity [72] to remove redundancy and bias in the dataset. After this step, if a family contained less than 5 kinase catalytic domain sequences, it was discarded; and if a family contained more than 200 sequences, it was clustered at progressively higher sequence identity thresholds until it contained less than 200 sequences. In total, clustering threshold of 90% sequence identity resulted in more than 200 sequences in only 6 of the 107 families. The families and their threshold identities are MAPK (80%), STE20 (80%), MLK (80%), CDK (70%), IRAK (60%) and CAMKL (50%). Since only less than 6% of the kinase families were clustered at lower that 90% sequence identity and such families still retained an average of 184 sequences per family, the contribution of error due to this is considered minimal. This procedure was repeated for every family and the kinase domain sequence-family mapping was established (See S1 Text in Supplementary Information and Fig 1D).

Alignment
An accurate multiple sequence alignment (MSA) of the catalytic domain sequences in the dataset is a prerequisite to probe the family-specific functional sites in the sequences. However, given the large divergence in the dataset, precise error-free alignment of all the sequences in a single MSA is challenging. The approach used in the study is to first align [71,96] the sequences within families and obtain a consensus sequence / profile for each family [97]. Subsequently, the consensus sequences of all families within a group were multiply aligned using sequence and structure information. This crucial step was feasible because reliable hierarchical classification of STY kinases into groups / family and crystal structures in active conformations belonging to different families within a group were available (See list of PDB IDs and hierarchy used in S6 Fig). Superposition of the crystal structures, with not more than one structure per family, if available, was used to guide the alignment of profiles of families within groups. This resulted in profile or consensus sequence for each group, which were then multiply aligned using sequence and structure information. Again, superposition of structures, with not more than one structure per group, if available, guided the alignment of across-group profiles, resulting in a profile / consensus sequence for the entire STY kinase dataset. The above described method for hierarchical alignment of profiles to arrive at a consensus sequence for STY kinases was implemented using a python script called Fammer [98]. Considering the consensus sequence of the STY kinase as the template, all the sequences in the dataset were aligned to it in a statistical method [99] to get the final MSA as described in previous studies [52]. During this procedure, a few families / sequences that did not align well within themselves or the entirety of the kinase database were manually eliminated. The final alignment consisted of 5488 STY kinase domain sequences of 107 families.

Calculation of Receiver Operating Characteristic (ROC)
For a threshold value, say, t p , the corresponding pc_score was calculated as described in the Results section. For every FOI, pc_score is a vector of length N, where N is the length of alignment positions, containing values in the range of 0 and 1 with 0 representing no family-specificity of the position and 1 representing maximum family-specificity. All sequences in the database are given a conformity score with respect to the pc_score of a family of interest (FOI), say f1. The conformity score of a sequence is the sum of all pc_score values at those positions in which the sequence conforms to the sequences of the family f1. A sequence is considered to conform to f1 sequences at position p if the amino acid in the sequence at p is one of those that populate the position in the sequences of f1. For instance, if at an alignment position p, where, say, pc_score of FOI f1 at p = 0.4, sequences in f1 are populated with A, L, I and M, and a sequence to be conformity scored contains I, then the conformity score of the sequence is increased by 0.4. On the other hand, if the sequence to be conformity scored contains V at p, then, conformity score is unchanged. conformity score ðs;FOIÞ ¼ X N p¼1 pc score FOI;p ½s p 2 fFOI p g 6 ¼ where s is the sequence for which score is to be evaluated according to its conformity to FOI; N is the total number of positions in the alignment; pc_score FOI,p is the pc_score of FOI at position p; and [s p 2 {FOI p } 6 ¼ ] is the conditional clause as to whether the amino acid at position p in s belongs to a set of non-redundant amino acids in position p of FOI. In this manner, each of the 5488 sequences in the dataset is each given a conformity score with respect to the pc_score of FOI f1. These conformity scores are divided into 2 categories: (i) family_scores, when s is a sequence of the FOI and (ii) nonfamily_scores, when s is a sequence of an nFOI. This process is repeated, considering each family (f1, f2, . . ., f107) as FOI, and the family_scores and nonfamily_scores are augmented.
Similarly, for optimisation of the entropy threshold t e , were carried out. The set of family_scores is expected to be reliably higher than the nonfamily_scores upon accurate determination of the threshold value. Thus, we calculated the sensitivity, specificity and the ROC for the two scores. In essence, the area under the ROC curve implies the discriminability between families based on the pc_score (ent_score or prob_score) derived using a specific threshold t p (t e or t s ).

Phylogenetic tree
A phylogenetic tree was constructed by considering all the 1094 positions of the master sequence alignment using FastTree [77]. This resulting tree was mid-point rooted and made ultrametric by extension of all the terminal branches to a constant distance from root. Finally, if the tree had multiple originating branches at any given node, it was bifurcated and thus converted to a binary tree [100]. This tree is shown as a circular cladogram in Fig 5A. The same protocol was used for the construction of trees in Fig 5B and 5C, with the exception that specific chosen positions (as identified by ID_score and those containing the least number of gaps respectively) of the alignment were used as input to for the construction of the tree.

Cluster purity calculation
The phylogenetic tree that is rooted, ultrametric and binary was cut at different distances from the root resulting in different number of clusters. For each cut, cluster purity of the resulting clusters was calculated as follows: where n is the total number of leaf sequences; k is the number of clusters generated; c i is the set of leaf sequences in the i th cluster; f j is the family classification that has the maximum number of leaves in cluster c i .

Classifier analysis
We trained and tested a simple pseudolinear classifier to understand its ability to ascertain family classification to a kinase sequence. In a hold-out approach, we used a random 90% of the sequences to train the classifier and the remaining 10% to test, repeating 10 times. The sequences were in aligned format and the corresponding family associations of the sequences were used to train the classifier. The k-fold loss in the performance of the classifier in the test set was quantified [101]. The classifiers were trained and tested using all the alignment positions ( Fig 5E, blue), positions identified by the ID_score (Fig 5E, green) or the positions with the least number of gaps (Fig 5E, purple). This analysis was performed using the Statistics and Machine Learning toolbox of MATLAB [102]. The threshold for the prob measure was optimised such that the corresponding prob_score had the highest ability to discriminate between kinase families. This was achieved by quantifying how well the family_scores were separable from the nonfamily_scores at every threshold value in terms of area under the Receiver Operating Characteristic (ROC) curve. t s , the threshold probability of obtaining the exact set of amino acids in position p in FOI when one repeatedly draws, with replacement, from the set of amino acids in the same position of nFOI was systemically tested for all possible values, and the corresponding area under the ROC curve is plotted. The maximum area under the curve (0.970) is achieved at a t s value of 0.02. This is lower than that of the t s calculated without replacement.