Table 1.
Composition of the validation data set.
Table 2.
Examples of domain swapping in protein kinases.
Figure 1.
Sequence alignment of selected protein pairs.
A: Sequences ENSP00000266970 and ENSP00000293215. B: Sequences ENSP00000281821 and ENSP00000350896. Identities are indicated by a black background. Pfam domains are indicated by colored boxes: red = catalytic domains, magenta = domains detected in both proteins, blue = domains detected in only one protein. Abbreviations used: Ephrin_lbd = Ephrin receptor ligand binding domain, GCC = GCC2 and GCC3 domain, fn3 = fibronectin type III domain, Pkinase = protein kinase domain, SAM = sterile alpha motif domain (type SAM_1 is detected in ENSP00000350896 and type SAM_2 is detected in ENSP00000281821). Global sequence alignments were obtained using the Needleman-Wunsch algorithm. For each pair of sequences, the different distances are indicated at the bottom of the alignment. Image generated using ESPript software [52].
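As an illustration of the global alignment step mentioned in this caption, the following is a minimal sketch of the Needleman-Wunsch algorithm. The scoring values (match = 1, mismatch = -1, linear gap = -1) and the helper `identity_distance` (one minus the fraction of identical aligned columns) are assumptions made for the example; the scoring scheme and the exact definition of the identity distance used for the figure are not given here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (aligned_a, aligned_b, score).

    The scoring parameters are illustrative assumptions, not the paper's settings.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner along one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def identity_distance(a, b):
    """Hypothetical identity distance: 1 - identical columns / alignment length."""
    al_a, al_b, _ = needleman_wunsch(a, b)
    same = sum(1 for x, y in zip(al_a, al_b) if x == y and x != "-")
    return 1.0 - same / len(al_a)
```

Protein alignments would normally use a substitution matrix such as BLOSUM62 and affine gap penalties; the simple scores above only keep the sketch short.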
Figure 2.
Comparison between the different distances computed between protein sequences of the validation data set.
LMScat: LMS distances between catalytic domains, LMSfull: LMS distances between full-length sequences, IDcat: identity distances between catalytic domains, BLOSUMcat: BLOSUM distances between catalytic domains, BLOSUMfull: BLOSUM distances between full-length sequences. The lower panel reports the Spearman rank correlation coefficients between different distances.
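The rank correlations reported in the lower panel can be illustrated with a small sketch. It assumes each distance measure is available as a symmetric matrix (a list of lists) over the same sequences, and that the correlation is taken over the upper-triangle pairs; the function names `upper_triangle`, `average_ranks` and `spearman` are hypothetical helpers, not names from the paper.

```python
def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Toy matrices (made-up values): the second is a monotone transform of the first.
d1 = [[0.0, 0.2, 0.7], [0.2, 0.0, 0.5], [0.7, 0.5, 0.0]]
d2 = [[0.0, 0.04, 0.49], [0.04, 0.0, 0.25], [0.49, 0.25, 0.0]]
rho = spearman(upper_triangle(d1), upper_triangle(d2))
```

Because only ranks are compared, two distances related by any monotone transformation give a coefficient of 1, which is why this measure suits distances on different scales.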
Figure 3.
Assessment of different distances to detect homogeneous clusters in the validation data set.
A: each distance matrix is used as input to hierarchical clustering; clusters are extracted from the resulting trees and assessed by the biological homogeneity index (BHI); B: BHI as a function of the number of clusters. LMScat: LMS distances computed from catalytic domains, LMSfull: LMS distances computed from full-length sequences, BLOSUMcat: BLOSUM distances computed from catalytic domains, BLOSUMfull: BLOSUM distances computed from full-length sequences, IDcat: identity distances computed from catalytic domains. The horizontal and vertical lines indicate BHI = 1 and a number of clusters equal to 17, respectively.
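To make the homogeneity assessment concrete, here is a hedged sketch of a BHI computation for the case where each sequence carries a single class label. It assumes the common pairwise definition (for each cluster, the fraction of sequence pairs sharing a class, averaged over clusters) and skips singleton clusters; both choices are assumptions, since the exact formula is not given in this caption. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def biological_homogeneity_index(cluster_ids, class_labels):
    """Average, over clusters, of the fraction of within-cluster pairs
    whose two members share the same class label.

    Assumption: clusters with fewer than two members are skipped.
    """
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    per_cluster = []
    for labels in members.values():
        if len(labels) < 2:
            continue  # no pairs to evaluate in a singleton cluster
        pairs = list(combinations(labels, 2))
        agree = sum(1 for x, y in pairs if x == y)
        per_cluster.append(agree / len(pairs))
    if not per_cluster:
        raise ValueError("no cluster contains at least two sequences")
    return sum(per_cluster) / len(per_cluster)


# Toy example with made-up kinase group labels.
clusters = [0, 0, 0, 1, 1]
groups = ["TK", "TK", "AGC", "AGC", "AGC"]
bhi = biological_homogeneity_index(clusters, groups)
```

A value of 1 means that every cluster contains sequences from a single class, which corresponds to the horizontal line in panel B.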
Figure 4.
AUC distributions obtained on the human kinome.
A: AUC obtained using the iterative procedure starting from BLOSUM full-length distances, B: AUC obtained using the iterative procedure starting from LMS full-length distances. The vertical red line indicates the cut-off for the detection of hybrid kinases.
Figure 5.
Computation of Local Matching Score (LMS) between two sequences without alignment.
Figure 6.
Detection of outliers in a pre-classified data set.
1: the distance matrix and initial weights are used to compute AUC values for each sequence using equation 6; 2: sequence weights are updated using equation 7; 3: the procedure is iterated until convergence; 4: the final AUC values are used to compute a histogram; 5: the histogram shape is used to detect outliers.
Figure 7.
Examples of classification curves.
A: a putatively well classified sequence, B: a putatively misclassified sequence. AUC denotes the area under the curve.
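As a rough illustration of how an area under a per-sequence classification curve might separate the two cases, the sketch below computes an unweighted ROC-style AUC for one query sequence: the probability that a sequence of the same class lies closer to the query than a sequence of another class. This is an assumption made for illustration only; it does not reproduce equation 6 and ignores the sequence weights of the iterative procedure. The name `sequence_auc` is hypothetical.

```python
def sequence_auc(distances, same_class):
    """ROC-style AUC for one query sequence (unweighted, illustrative).

    distances[k]  : distance from the query to sequence k
    same_class[k] : True if sequence k belongs to the query's class
    """
    pos = [d for d, s in zip(distances, same_class) if s]
    neg = [d for d, s in zip(distances, same_class) if not s]
    if not pos or not neg:
        raise ValueError("need at least one sequence inside and outside the class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p < q:
                wins += 1.0   # same-class sequence is closer: correct ordering
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up distances: the first query sits close to its own class, the second does not.
well_classified = sequence_auc([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
misclassified = sequence_auc([0.8, 0.9, 0.1, 0.2], [True, True, False, False])
```

Under this reading, an AUC near 1 corresponds to a sequence whose nearest neighbors share its class, while a low AUC flags a sequence that sits closer to other classes.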