Physical Motif Clustering within Intrinsically Disordered Nucleoporin Sequences Reveals Universal Functional Features

Bioinformatics of disordered proteins is especially challenging because homologous proteins have high mutation rates and because functionality may not be strongly related to sequence. Here we have performed a novel bioinformatic analysis, based on the spatial clustering of physically relevant features such as binding motifs and charges within disordered proteins, on thousands of Nuclear Pore Complex (NPC) FG-motif-containing proteins (FG nups). The biophysical mechanism by which FG nups regulate nucleocytoplasmic transport has remained elusive. Our analysis revealed a set of highly conserved spatial features in the sequence structure of individual FG nups, such as the separation, localization, and ordering of FG motifs and charged residues along the protein chain. These functionally conserved features provide insight into the particular biophysical mechanisms responsible for regulation of nucleocytoplasmic traffic in the NPC, strongly constraining current models. Additionally, this method allows us to identify potentially functionally analogous disordered proteins across distantly related species.


Exploration of different biophysical motif clusterings for FG nups
A series of clustering possibilities was considered, starting with the hydrophobic/small amino acids (LVIM-CAGSTPFYW) versus the hydrophilic amino acids (EDNQKRH), as defined in the two-letter reduced amino acid alphabet derived by grouping and averaging the full similarity matrix elements of BLOSUM50 [1]. This pairing showed considerable overlap between the two cluster types. The next case considered was the hydrophobic/small amino acids versus the charged amino acids (EDKR), which performed similarly. We next considered the equivalence class of F (the AAs grouped with F in the four-letter BLOSUM-derived alphabet) versus the charged amino acids. This case was considered because F is highly enriched in FG nups and because FG nups tend to have disordered regions with a high density of charged amino acids, which strongly affects polymer behavior. Possibly the most physically relevant property of FG nups is their FG motifs, whose biochemical function is to bind transport receptors. FG motifs were therefore clustered against the charged amino acids, which resulted in the most disjoint clustering of all the groups tested: 70% of FG clusters were more than 90% disjoint from charge clusters, and almost no FG clusters had greater than 90% overlap with charge clusters (Fig. S1). The remaining cases were simplified versions of FG motifs versus charged amino acids, namely reducing the FG motif to F alone and representing charge only by K, the most highly enriched charged amino acid in nups [2]. These yielded lower disjointness, similar to the results obtained with the other generalizations of these motifs.
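The disjointness comparison described above can be illustrated with a minimal sketch. The clustering procedure itself is not specified in this section, so the example assumes a simple one-dimensional gap rule (features closer than `max_gap` residues join the same cluster); the function names, `max_gap`, and `min_size` are hypothetical and are not taken from the original analysis.

```python
import re

CHARGED = set("EDKR")  # charged amino acids as listed in the text


def motif_positions(seq, motif="FG"):
    """Start indices of every occurrence of `motif` (lookahead allows overlaps)."""
    return [m.start() for m in re.finditer(f"(?={re.escape(motif)})", seq)]


def charged_positions(seq):
    """Indices of charged residues."""
    return [i for i, aa in enumerate(seq) if aa in CHARGED]


def gap_clusters(positions, max_gap, min_size=2):
    """ASSUMPTION: gap-based clustering, not necessarily the authors' method.

    A new cluster starts whenever two consecutive positions are more than
    `max_gap` apart. Returns inclusive (start, end) intervals.
    """
    clusters, current = [], []
    for p in sorted(positions):
        if current and p - current[-1] > max_gap:
            if len(current) >= min_size:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(p)
    if len(current) >= min_size:
        clusters.append((current[0], current[-1]))
    return clusters


def disjoint_fraction(cluster, others):
    """Fraction of the residues spanned by `cluster` not covered by any interval in `others`."""
    start, end = cluster
    covered = set()
    for s, e in others:
        covered.update(range(max(start, s), min(end, e) + 1))
    return 1.0 - len(covered) / (end - start + 1)
```

With per-cluster disjoint fractions in hand, a statistic such as "the share of FG clusters that are more than 90% disjoint from charge clusters" is a simple count over all FG clusters. Cluster intervals here end at the start index of the last motif; extending them by the motif length is an equally reasonable convention.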
To conserve computational resources, the exploration of possible biophysical clustering types was limited to FG nup proteins from distantly related Saccharomyces species and Homo sapiens contained in the UniProt (www.uniprot.org) database, while a full exploration of all nucleoporins was done only for FG motifs and charged AAs. The 85 Saccharomyces and Homo genus proteins analyzed are listed in SI Data. In contrast, when the NUP proteins were restricted to those with 400 or more AAs and fragment proteins were removed, two distinct groups of nucleoporins arose naturally: a group with low percentage disorder and low FG density, and another group with relatively high percentage disorder and relatively high FG density. Inspection of the protein names showed that the low-disorder, low-FG-density group consists of karyopherins and structural nups, while the other group contained known FG nups. An exhaustive and systematic labeling of all nucleoporins for Saccharomyces cerevisiae and humans confirmed that known karyopherins and structural nups formed one group, while the other group consisted of known FG nups. We took proteins in the NUP group with greater than 0.15 FG/AA linear motif density and greater than 30% disorder to be the FG nups analyzed in this study.
Fig. 3: Natural split for nups, with NUP restricted to proteins with greater than 400 AA, at roughly greater than 10% FG/AA FG motif density and greater than 30% protein disorder.
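The selection rule described above can be written as a short filter. This is only a sketch under stated assumptions: FG/AA density is taken to be the number of FG occurrences divided by sequence length, the disorder fraction is assumed to come from an external predictor, and the `NupRecord` type and its field names are hypothetical. The default thresholds simply copy the values quoted in the text and are left as parameters so that other cutoffs can be tried.

```python
from dataclasses import dataclass


@dataclass
class NupRecord:
    """Hypothetical container for one candidate nucleoporin."""
    name: str
    sequence: str
    disorder_fraction: float  # fraction of residues predicted disordered, 0 to 1
    is_fragment: bool = False


def fg_density(seq):
    """ASSUMPTION: FG/AA density = number of 'FG' occurrences per residue."""
    return seq.count("FG") / len(seq)


def is_fg_nup(rec, min_length=400, min_fg_density=0.15, min_disorder=0.30):
    """Apply the length, fragment, FG-density, and disorder criteria from the text."""
    if rec.is_fragment or len(rec.sequence) < min_length:
        return False
    return (fg_density(rec.sequence) > min_fg_density
            and rec.disorder_fraction > min_disorder)


def select_fg_nups(records, **thresholds):
    """Return the records that pass `is_fg_nup` with the given thresholds."""
    return [r for r in records if is_fg_nup(r, **thresholds)]
```

For example, `select_fg_nups(records, min_fg_density=0.10)` would apply a 10% density cutoff instead of the default.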
Examining solely the Saccharomyces cerevisiae nucleoporins confirms the split between FG nups and the structural/transport proteins, as seen in Fig. S4.
Fig. 4: Natural split for baker's yeast, with the 400 AA restriction. Yellow circle highlights mark known FG nups, while unhighlighted grey dots represent known structural/transport proteins.

Supplementary Discussion
Examining solely the human nucleoporins confirms the split between FG nups and the structural/transport proteins, as seen in Fig. S5.
Fig. 5: Natural split for humans, with the 400 AA restriction. Yellow circle highlights mark known FG nups, while unhighlighted grey dots represent known structural/transport proteins.

Clustering among individual FG nups
The 1,167 FG nups analyzed were clustered using the PreDeCon algorithm [3] over a five-dimensional space consisting of FG-charge cluster overlap in FG nups, FG-charge cluster polarity, folded-disordered region polarity, the percentage of the disordered region composed of charged AA clusters, and topological complexity. Setting the minimum cluster size to 50 proteins allowed for a large-scale overview of how the 1,167 FG nups organize themselves in this five-dimensional space, which resulted in four major groupings of FG nups.
Fig. 14: FG nucleoporin from Homo sapiens, Nup62, from the yellow group. Amino acid sequence number is shown along the x-axis for all sub-charts. The first chart from the top shows QN amino acids as red vertical lines, with their clusters colored alternately blue and green, as a control. The second chart, colored in the same way, shows charged amino acids, while the third chart shows FG motifs as red vertical lines with clusters represented by alternating purple and cyan regions. The fourth chart displays the propensity for protein disorder at a given AA as predicted by PONDR, with red representing high propensity and yellow representing low propensity. Green circles represent the centers of mass of cluster regions, and the purple arrow indicates the disordered-region-to-folded-region polarity.
Fig. 15: FG nucleoporin from S. cerevisiae, Nup116, from the green group. Amino acid sequence number is shown along the x-axis for all sub-charts. The first chart from the top shows QN amino acids as red vertical lines, with their clusters colored alternately blue and green, as a control. The second chart, colored in the same way, shows charged amino acids, while the third chart shows FG motifs as red vertical lines with clusters represented by alternating purple and cyan regions. The fourth chart displays the propensity for protein disorder at a given AA as predicted by PONDR, with red representing high propensity and yellow representing low propensity. Green circles represent the centers of mass of cluster regions, and the purple arrow indicates the disordered-region-to-folded-region polarity.
Fig. 16: FG nucleoporin from Homo sapiens, Nup98, from the green group. Amino acid sequence number is shown along the x-axis for all sub-charts. The first chart from the top shows QN amino acids as red vertical lines, with their clusters colored alternately blue and green, as a control. The second chart, colored in the same way, shows charged amino acids, while the third chart shows FG motifs as red vertical lines with clusters represented by alternating purple and cyan regions. The fourth chart displays the propensity for protein disorder at a given AA as predicted by PONDR, with red representing high propensity and yellow representing low propensity. Green circles represent the centers of mass of cluster regions, and the purple arrow indicates the disordered-region-to-folded-region polarity.
Fig. 17: FG nucleoporin from S. cerevisiae, Nsp1, from the red group. Amino acid sequence number is shown along the x-axis for all sub-charts. The first chart from the top shows QN amino acids as red vertical lines, with their clusters colored alternately blue and green, as a control. The second chart, colored in the same way, shows charged amino acids, while the third chart shows FG motifs as red vertical lines with clusters represented by alternating purple and cyan regions. The fourth chart displays the propensity for protein disorder at a given AA as predicted by PONDR, with red representing high propensity and yellow representing low propensity. Green circles represent the centers of mass of cluster regions, and the purple arrow indicates the disordered-region-to-folded-region polarity.
Fig. 18: FG nucleoporin from Homo sapiens, Nup153, from the red group. Amino acid sequence number is shown along the x-axis for all sub-charts. The first chart from the top shows QN amino acids as red vertical lines, with their clusters colored alternately blue and green, as a control. The second chart, colored in the same way, shows charged amino acids, while the third chart shows FG motifs as red vertical lines with clusters represented by alternating purple and cyan regions. The fourth chart displays the propensity for protein disorder at a given AA as predicted by PONDR, with red representing high propensity and yellow representing low propensity. Green circles represent the centers of mass of cluster regions, and the purple arrow indicates the disordered-region-to-folded-region polarity.

Spatial Sequence Correlation of FG and GF motifs
Within disordered regions of FG nups we found that the FG and GF motifs are colocalized. After excluding as noise any motif separated from other motifs by more than 100 AAs, we measured, for each motif of a given type, the distance to its nearest neighboring motif of a specified type. Averaged over all motifs in all the FG nups studied in this paper, this measurement gave average nearest-neighbor separations of 18.3 AAs between FG motifs, 24.0 AAs between GF motifs, and 21.7 AAs between GF motifs and FG motifs.
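The nearest-neighbor measurement described above can be sketched as follows. Several details are assumptions rather than the authors' procedure: motif positions are taken as start indices, the 100 AA isolation rule is applied by discarding any nearest-neighbor distance larger than the cutoff, and the sequence passed in is assumed to be already restricted to the disordered region. All function names are hypothetical.

```python
import re
from bisect import bisect_left
from statistics import mean


def motif_starts(seq, motif):
    """Sorted start indices of every occurrence of `motif`."""
    return [m.start() for m in re.finditer(f"(?={re.escape(motif)})", seq)]


def nearest_distance(pos, targets):
    """Distance from `pos` to the nearest entry of sorted `targets`, ignoring `pos` itself.

    Returns None when there is no other target.
    """
    i = bisect_left(targets, pos)
    candidates = []
    if i > 0:
        candidates.append(pos - targets[i - 1])
    # Skip the motif itself when source and target types are the same.
    j = i + 1 if i < len(targets) and targets[i] == pos else i
    if j < len(targets):
        candidates.append(targets[j] - pos)
    return min(candidates) if candidates else None


def mean_nn_separation(seq, source, target, isolation_cutoff=100):
    """Mean distance from each `source` motif to its nearest `target` motif.

    ASSUMPTION: a motif whose nearest neighbor is farther away than
    `isolation_cutoff` residues is treated as isolated noise and skipped.
    """
    targets = motif_starts(seq, target)
    distances = []
    for pos in motif_starts(seq, source):
        d = nearest_distance(pos, targets)
        if d is not None and d <= isolation_cutoff:
            distances.append(d)
    return mean(distances) if distances else None
```

Calling `mean_nn_separation(seq, "FG", "FG")`, `mean_nn_separation(seq, "GF", "GF")`, and `mean_nn_separation(seq, "GF", "FG")` on each sequence, and then pooling the distances over all proteins, would give the three kinds of averages reported in the text. Note that this sketch averages per sequence; pooling across proteins would require collecting the raw distances instead.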