Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome
Fig 4
Segment Embedding Evaluation and Visualization of Domain Types and Sub-Domains.
(A, C) Normalized confusion matrices for the 1-nn assessment; precision is shown along the diagonal, and n is the number of ZPS segments. (A) shows the 20 most commonly occurring ProRule domains in the ZPS segments, and (C) shows clusters a, b, c, and d of the protein kinase domain (as labelled in panel B). (B) A 2-dimensional UMAP of segment embeddings, coloured by the domain each segment overlaps with. Clusters a, b, c, and d of the protein kinase domain segments are defined by their position in the UMAP. Each of the protein segments shown here was used in the 1-nn assessment. (D) The AlphaFoldDB structure of KAPCA (UniProt ID: P17612) with the protein kinase domain shown in colour: sub-domains 1-5 (red), 5-7 (blue), 8-9 (orange), and 10-12 (green). Sub-domains 1-5 form the small lobe and 5-12 the large lobe, as defined in [42]. (E) AlphaFoldDB structures of KAPCA, CDK2 (UniProt ID: P24941), and PLK1 (UniProt ID: P53350), coloured by cluster of protein kinase segments: cluster c (red), cluster d (blue), cluster b (orange), and cluster a (green).
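To illustrate how a confusion matrix can carry precision on its diagonal, the following is a minimal sketch of a 1-nn assessment on toy data. It is not the authors' pipeline: the leave-one-out scheme, the Euclidean distance, the row/column orientation (rows are true labels, columns are predicted labels), and the names `one_nn_predict` and `precision_normalized_confusion` are all assumptions made for illustration.

```python
import numpy as np


def one_nn_predict(embeddings, labels):
    """Leave-one-out 1-nn (assumed scheme): each segment receives the
    label of its nearest *other* segment in embedding space."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Pairwise squared Euclidean distances (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a segment may not match itself
    return y[dist.argmin(axis=1)]


def precision_normalized_confusion(y_true, y_pred, classes):
    """Confusion matrix with rows = true label, columns = predicted label,
    each column divided by its total so the diagonal holds precision."""
    index = {c: i for i, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1
    col_totals = counts.sum(axis=0, keepdims=True)
    # Columns with no predictions are left as zeros.
    return np.divide(counts, col_totals,
                     out=np.zeros_like(counts), where=col_totals > 0)


# Toy 2-D "embeddings": three segments per domain, plus one segment
# labelled "a" that sits next to the "b" group.
X = [[0, 0], [0, 1], [1, 0],
     [10, 10], [10, 11], [11, 10],
     [9, 9]]
y = ["a", "a", "a", "b", "b", "b", "a"]

pred = one_nn_predict(X, y)
matrix = precision_normalized_confusion(y, pred, classes=["a", "b"])
print(matrix)
```

In this toy case the stray "a" segment is assigned to "b", so the precision for "b" on the diagonal drops to 0.75 while the precision for "a" stays at 1.0.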