Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome
Fig 6
Identification of a Mitochondrion-Related Cluster and SYGQ-Rich Prion-Like Domains Similar to FUS.
(A) UMAP of segment embeddings of the entire human proteome. Disordered segments (defined by MobiDB) are shown in a warmer shade of grey to give context to the UMAP. The Mitochondrion-Related cluster (orange) was defined by Leiden clustering of unannotated protein segments (see Methods). Segment embeddings of FUS are shown along with the 10 nearest neighbours of the segment embeddings of FUS’s SYGQ-rich regions. The rest of the human proteome is shown in light grey. (B) AlphaFoldDB structures and amino acid sequences of COX18 (UniProt ID: Q8N8Q8, amino acids 1-51), ADCK2 (UniProt ID: Q7Z695, amino acids 1-87), CA5A (UniProt ID: P35218, amino acids 1-40), and HTRA2 (UniProt ID: O43464, amino acids 1-30), colored by pLDDT confidence score (very low in orange, low in yellow, high in cyan, and very high in blue), showing UniProt mitochondrion targeting signal annotations (red) and ZPS segments (green). COX18 has a UniProt mitochondrion targeting signal annotation of unknown length starting at the N-terminus. ADCK2 has no UniProt mitochondrion targeting signal annotation but is known to localize to the mitochondrion [51].
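The nearest-neighbour lookup mentioned for panel A can be illustrated with a minimal sketch. The caption does not state which distance metric was used, so cosine similarity is an assumption here, and the function names, segment identifiers, and toy 3-dimensional vectors are all hypothetical stand-ins for real segment embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_neighbours(query, embeddings, k=10, exclude=()):
    """Return the ids of the k segments most similar to `query`.

    `embeddings` maps a segment id to its embedding vector.
    Ids listed in `exclude` (e.g. the query segment itself) are skipped.
    """
    scored = [
        (cosine_similarity(query, vec), seg_id)
        for seg_id, vec in embeddings.items()
        if seg_id not in exclude
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg_id for _, seg_id in scored[:k]]


# Hypothetical toy data: ids and vectors are illustrative only.
toy_embeddings = {
    "FUS_seg1": [1.0, 0.0, 0.0],
    "segA": [0.9, 0.1, 0.0],
    "segB": [0.0, 1.0, 0.0],
    "segC": [0.7, 0.7, 0.0],
}

neighbours = top_k_neighbours(
    toy_embeddings["FUS_seg1"], toy_embeddings, k=2, exclude={"FUS_seg1"}
)
print(neighbours)
```

With real data, the query would be the embedding of one of FUS’s SYGQ-rich segments and `k` would be 10, matching the caption. A proteome-scale search would normally use a vectorised or approximate nearest-neighbour library rather than this pure-Python loop.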