Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome

doi:10.1371/journal.pcbi.1012929

Fig 1.

Zero-Shot Protein Segmentation Summary and Comparison to the Literature.

(A) A per-residue protein embedding from ProtT5 for the human protein FUS (UniProt ID: P35637). The embedding is visualized as a heatmap in which the x-axis represents each amino acid along the protein's sequence (this axis is maintained across the panels in this figure) and the y-axis is the ProtT5 embedding space. The embedding space is ordered by hierarchical agglomerative clustering to make patterns in the embedding space more visible. RNP1 and RNP2 motifs are indicated with a red asterisk above the protein embedding. The white box indicates the portion of the heatmap shown in panel D. (B) Segments of FUS found using Zero-shot Protein Segmentation (ZPS). The colour of each protein segment was obtained by reducing the 1024-dimensional segment embeddings to 3 dimensions, which were then scaled to RGB colours (see Methods). (C) FUS annotations from the literature (see Methods) and UniProt. MobiDB and ProRule annotations were retrieved from UniProt. (D) A small portion of the protein embedding heatmap showing the first RGG region in FUS (amino acids 210-262). This panel highlights patterns in this section of the protein embedding that emphasize arginine (R, highlighted in yellow) and glycine (G, highlighted in green).
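
The colour assignment described for panel B can be illustrated with a minimal sketch. The caption does not state which dimensionality reduction was used, so this example starts from coordinates that have already been reduced to 3 dimensions; the per-axis min-max scaling, the function name `scale_to_rgb`, and the toy values are assumptions made for illustration, not the authors' implementation.

```python
def scale_to_rgb(coords):
    """Min-max scale each of the 3 axes to 0-255 and return hex colour strings.

    `coords` is a list of 3-tuples, one per segment, assumed to be the output
    of some 3-D reduction of the 1024-dimensional segment embeddings.
    """
    lows = [min(c[i] for c in coords) for i in range(3)]
    highs = [max(c[i] for c in coords) for i in range(3)]
    colours = []
    for c in coords:
        channels = []
        for i in range(3):
            span = highs[i] - lows[i]
            # A constant axis carries no information; map it to mid-range.
            frac = 0.5 if span == 0 else (c[i] - lows[i]) / span
            channels.append(round(frac * 255))
        colours.append("#{:02x}{:02x}{:02x}".format(*channels))
    return colours


# Hypothetical 3-D coordinates for three segments (illustrative values only).
reduced = [(-1.0, 0.0, 2.0), (0.0, 0.5, 2.0), (1.0, 1.0, 2.0)]
print(scale_to_rgb(reduced))
```

Under this scheme, segments with similar embeddings receive similar colours, which is what allows colours to be compared across proteins when all segments are scaled together.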

Fig 2.

Segment Embeddings and Segment Colours of RNA-Binding Proteins.

(A) Segment embedding boundaries and colours compared to annotations from the literature (see Methods) for the FET family (FUS, EWSR1 (UniProt ID: Q01844), and TAF-15 (UniProt ID: Q92804)), HnRNPA1 (UniProt ID: P09651), and TDP-43 (UniProt ID: Q13148). (B) Segment embeddings shown in panel A were ordered by hierarchical agglomerative clustering and visualized as a heatmap. Corresponding colours of the segment embeddings are shown to the right of the heatmap and labelled with annotations from the literature. (C) AlphaFoldDB structure of TDP-43 showing segment colours. Abbreviations for folded domains labelled here include RNA Recognition Motifs (RRM), N-terminal domain (NTD), and Zinc Fingers (ZF). Abbreviations for IDR sub-regions labelled here include canonical Nuclear Localization Signal (cNLS), Low Complexity (LC) regions, PY-motif Nuclear Localization Signals (PY-NLS), and RGG (arginine-glycine-glycine motif) regions. G-rich and SYGQ-rich regions describe compositional biases that define sub-regions of IDRs. (D) Arginine embeddings from the FET family, TDP-43, and HnRNPA1 proteins were ordered by hierarchical agglomerative clustering and shown in a heatmap. To the right of the heatmap, from left to right, we show: the numbered clusters; which arginines are annotated as methylation sites in UniProt (black for methylated); which arginines are contained within a ProRule domain annotation in UniProt (black for domain); which protein each arginine originated from (FUS in green, EWSR1 in red, TAF-15 in blue, TDP-43 in yellow, and HnRNPA1 in purple); the colour of the ZPS segment that each arginine originated from; and the names of the domains and regions that the arginines in each cluster originated from.

Table 1.

Segmentation Evaluation for Human Proteome Segment Annotations from UniProt.

Table 2.

Boundary Evaluation for Human Proteome Segment Annotations from UniProt.

Fig 3.

Segment Embedding Evaluation and Visualization of IDRs, Domains, and Compositional Biases.

(A, B) Normalized confusion matrices for the 1-nn assessment of (A) ProRule domains compared to MobiDB IDRs (disorder consensus predictions) and (B) compositional biases from MobiDB. In each normalized confusion matrix, we report 1-nn precision along the diagonal and state the number of ZPS segments for each label as n (see S2 Table for confidence intervals). (C, D) 2-dimensional UMAPs of segment embeddings labelled with (C) ProRule domains compared to MobiDB disorder consensus and (D) compositional biases from MobiDB. Each point in a UMAP is a protein segment, and each segment shown in a UMAP was used in the respective 1-nn assessment.
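
As a rough illustration of how a 1-nn precision of this kind can be computed, the sketch below predicts each segment's label from its single nearest neighbour (excluding itself) and then reports, per label, the fraction of predictions that were correct. The Euclidean distance, the leave-one-out setup, the function name `one_nn_precision`, and the toy data are all assumptions for illustration; the caption does not specify these details.

```python
import math
from collections import Counter


def one_nn_precision(embeddings, labels):
    """Leave-one-out 1-nn: predict each item's label from its nearest other item.

    Returns a dict mapping each predicted label to its precision, i.e.
    correct predictions of that label / all predictions of that label.
    """
    predicted = []
    for i, query in enumerate(embeddings):
        # Nearest neighbour by Euclidean distance, never the query itself.
        best_j = min(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(query, embeddings[j]),
        )
        predicted.append(labels[best_j])
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for p, t in zip(predicted, labels) if p == t)
    return {lab: correct_counts[lab] / predicted_counts[lab] for lab in predicted_counts}


# Hypothetical 2-D "segment embeddings" with toy labels.
toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (1.5, 1.0)]
toy_labels = ["domain", "domain", "IDR", "IDR", "IDR"]
print(one_nn_precision(toy_embeddings, toy_labels))
```

In the toy data, the last segment is labelled IDR but sits closest to a domain segment, so it lowers the precision for the domain label while leaving the IDR label unaffected.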

Fig 4.

Segment Embedding Evaluation and Visualization of Domain Types and Sub-Domains.

(A, C) Normalized confusion matrices for the 1-nn assessment. Precision is shown along the diagonal, and n is the number of ZPS segments. (A) shows the 20 most commonly occurring ProRule domains in the ZPS segments, and (C) shows clusters a, b, c, and d of the protein kinase domain (as labelled in panel B). (B) A 2-dimensional UMAP of segment embeddings coloured by the domain that each segment overlaps with. Clusters a, b, c, and d of the protein kinase domain segments are defined by their position in the UMAP. Each of the protein segments shown here was used in the 1-nn assessment. (D) The AlphaFoldDB structure of KAPCA (UniProt ID: P17612) with the protein kinase domain shown in colour, with sub-domains 1-5 (red), 5-7 (blue), 8-9 (orange), and 10-12 (green). Sub-domains 1-5 are the small lobe and 5-12 are the large lobe, as defined in [42]. (E) AlphaFoldDB structures of KAPCA, CDK2 (UniProt ID: P24941), and PLK1 (UniProt ID: P53350), where colours are defined by clusters of protein kinase segments, including cluster c (red), cluster d (blue), cluster b (orange), and cluster a (green).

Fig 5.

Segment Embedding Evaluation and Visualization of IDRs.

(A) 1-nn precision for DisProt functional annotations, reported for ZPS (blue) and 3-mers (red). (B) 1-nn precision for ProtGPS localization annotations, reported for ZPS (blue) and 3-mers (red). Error bars represent the binomial confidence intervals, which can be found along with the precision values in S1 Table. (C) The normalized confusion matrix for the 1-nn assessment of the 20 most common annotations that overlap with MobiDB IDRs. Precision is shown along the diagonal, and n is the number of ZPS segments. (D) UMAP of the segments used in the 1-nn assessment, labelled with the annotations that overlap MobiDB IDRs.
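
For readers unfamiliar with binomial confidence intervals on a precision value, the sketch below computes one common variant, the Wilson score interval, from a count of correct predictions and a total. The caption does not say which interval method was used, so the choice of Wilson, the 95% level (z = 1.96), the function name `wilson_interval`, and the example counts are assumptions for illustration only.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: 45 correct nearest-neighbour predictions out of 50.
low, high = wilson_interval(45, 50)
print(f"precision = {45 / 50:.2f}, interval = ({low:.3f}, {high:.3f})")
```

The interval narrows as the number of segments grows, which is why labels with small n tend to carry wider error bars.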

Fig 6.

Identification of Mitochondrion Related Cluster and SYGQ-Rich Prion-Like Domains Similar to FUS.

(A) UMAP of segment embeddings of the entire human proteome. Disordered segments (defined by MobiDB) are shown in a warmer shade of grey to give context to the UMAP. The Mitochondrion Related cluster (orange) was defined by Leiden clustering of unannotated protein segments (see Methods). Segment embeddings of FUS are shown along with the top 10 nearest neighbours to the segment embeddings of FUS's SYGQ-rich regions. The rest of the human proteome is shown in light grey. (B) AlphaFoldDB structures and amino acid sequences of COX18 (UniProt ID: Q8N8Q8, amino acids 1-51), ADCK2 (UniProt ID: Q7Z695, amino acids 1-87), CA5A (UniProt ID: P35218, amino acids 1-40), and HTRA2 (UniProt ID: O43464, amino acids 1-30) coloured by pLDDT confidence score (very low in orange, low in yellow, high in cyan, and very high in blue), showing UniProt mitochondrion targeting signal annotations (red) and ZPS segments (green). COX18 has a UniProt mitochondrion targeting signal annotation of unknown length starting at the N-terminus, and ADCK2 has no UniProt mitochondrion targeting signal annotation but is known to localize to the mitochondrion [51].

Fig 7.

Zero-Shot Protein Segmentation of Proteins with Segments Similar to FUS's SYGQ-Rich Region and PLAAC Scores.

Prion-Like Amino Acid Content (PLAAC) scores adapted from the web tool described in [55], segment embedding visualization (ZPS) as shown in Figs 1 and 2, and the amino acid sequence of the query segments from FUS and query result segments (S, Y, G, and Q shown in red).
