Fig 1.
Regions high in charge are prevalent in the yeast proteome and are low complexity.
a Top: the average frequency of each amino acid in the yeast proteome (dark trace). The light gray trace is the frequency expected from the average nucleotide frequency and the codons that encode each amino acid. Bottom: the difference between the dark and light traces in the top panel. b Cartoon of the algorithm that detects highly charged regions. c Summary of the per-region fraction of charged residues (FCR) and normalized net charge for the identified regions. d Complexity of the charged regions, calculated as the compositional entropy defined in [14]. Gray and black distributions are normalized to the yeast proteome entropy; the purple distribution is normalized to the entropy of a hypothetical charge-enriched proteome. Reference sequences are known low-complexity domains from Sup35 (a typical prion-like low-complexity protein) and Pab1 (an atypical hydrophobic-rich intrinsically disordered region), and a folded sequence (actin). e Normalized hydropathy of the charged regions (purple) and length-matched randomly drawn regions (gray). Reference sequences are the same as those in d.
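The per-region quantities in panels c and d can be illustrated with a minimal sketch. It assumes that D and E count as negative, K and R as positive, and histidine as neutral, and it uses plain Shannon entropy in bits; the exact definition in [14] and the proteome normalization used in panel d may differ. The function names are illustrative and do not come from the original work.

```python
import math
from collections import Counter

# Assumption: histidine is treated as neutral; only K/R and D/E are charged.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def fcr(seq):
    """Fraction of charged residues (FCR) in a region."""
    if not seq:
        raise ValueError("empty sequence")
    charged = sum(1 for aa in seq if aa in POSITIVE or aa in NEGATIVE)
    return charged / len(seq)


def net_charge_per_residue(seq):
    """Net charge normalized by region length: (positive - negative) / length."""
    if not seq:
        raise ValueError("empty sequence")
    pos = sum(1 for aa in seq if aa in POSITIVE)
    neg = sum(1 for aa in seq if aa in NEGATIVE)
    return (pos - neg) / len(seq)


def shannon_entropy(seq):
    """Compositional (Shannon) entropy of a region, in bits.

    Unnormalized; dividing by a reference proteome entropy would be a
    separate step that is not shown here.
    """
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, the toy sequence `KKEEAAAA` has an FCR of 0.5 and a normalized net charge of 0, which shows why the two quantities are reported together: a region can be highly charged and still net neutral.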
Table 1.
Example highly charged regions from the yeast proteome.
Fig 2.
Secondary structure is pervasive in highly charged, low complexity regions.
a The distribution of the fraction of each region that is predicted to be disordered by AlphaFold or is present in DisProt. See Methods for secondary structure scoring. b The number of predicted helical, sheet, and disordered (which includes coil; see Methods) residues for the highly charged regions, randomly selected regions, and experimentally verified disordered regions from DisProt. c (Left) The proportion of the 200 most structured regions whose parent protein is in the PDB. (Right) For those that are in the PDB, the proportion of regions that are helix, sheet, or missing. d Uversky plot of the highly charged regions. The dividing line is from [2].
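As a rough illustration of how a region is placed on a Uversky plot (panel d), the sketch below computes mean hydropathy and mean absolute net charge, then compares them with a linear boundary. It assumes the Kyte-Doolittle scale rescaled to the range 0 to 1 and the commonly quoted boundary ⟨R⟩ = 2.785⟨H⟩ − 1.151; the dividing line taken from [2] in the figure may be parameterized differently. The function names are hypothetical, and sequences containing non-standard residues will raise a `KeyError`.

```python
# Kyte-Doolittle hydropathy values (range -4.5 to 4.5).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean Kyte-Doolittle hydropathy, rescaled so each residue lies in [0, 1]."""
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_abs_net_charge(seq):
    """Absolute net charge per residue (K/R positive, D/E negative)."""
    pos = sum(seq.count(aa) for aa in "KR")
    neg = sum(seq.count(aa) for aa in "DE")
    return abs(pos - neg) / len(seq)


def uversky_side(seq):
    """Return which side of the assumed boundary line a region falls on."""
    h = mean_hydropathy(seq)
    r = mean_abs_net_charge(seq)
    # Assumed boundary, rearranged for hydropathy: <H> = (<R> + 1.151) / 2.785
    boundary_h = (r + 1.151) / 2.785
    return "disordered" if h < boundary_h else "folded"
```

Under these assumptions, a region of low hydropathy and high net charge lands on the "disordered" side of the line regardless of its actual secondary structure. That is the limitation the figure examines.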
Fig 3.
eIF3A, an essential eukaryotic translation initiation factor, contains a conserved, highly charged helix that varies in length but not in secondary structure.
a Alignment of the eIF3A highly charged region (orthologs from all distantly-related species with predicted proteomes in AlphaFold) with negatively charged residues colored red, positively charged residues colored blue, and gaps and all other amino acids in white. Although the highly charged nature of the region is conserved, the sequence itself is variable. b Representative image of the cryoEM structure of yeast eIF3A [42] with the highly charged region (resolved as a helix) shown in purple, and a reference helix shown in black. c Alignment of eIF3A (same as in a) colored by the secondary structure predicted by AlphaFold; S. cerevisiae sequence is highlighted in red. Note that the highly charged region is predicted to be helical in every species represented. d Despite strong secondary structure conservation, the length of the highly charged helix varies significantly more than a reference helix from the same protein.
Fig 4.
Helical regions can be predicted from composition.
a Uversky plot of all regions used to train the LR model. The marginals of the distribution are shown on the plot border. b The coefficients of the logistic regression (LR) model that predicts whether a region is helical or disordered on the basis of amino acid composition. The model was trained on purely helical and purely disordered regions (as predicted by AlphaFold) selected from the S. cerevisiae proteome. Amino acids with a positive coefficient are correlated with helices; those with a negative coefficient are correlated with disordered regions. c The helix propensity from [46] plotted against the LR model coefficients. d Accuracy of the LR model (top), the Uversky dividing line (middle, from [2]), and flDPnn [47] (bottom) on purely helical and disordered regions (held-out data from the training set, n = 3360; left), randomly drawn regions that are predicted by AlphaFold to be majority (but not completely) helical or disordered (n = 3405; center), and the highly charged regions (n = 681; right). e Summarized accuracy for all categories in d. f Summarized accuracy of the LR model with only a subset of coefficients. g (Left) Accuracy of the LR model prediction of regions from other organisms. (Right) Timetree showing the evolutionary divergence of the organisms.
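To make concrete how a composition-based logistic regression scores a region, the sketch below applies per-amino-acid coefficients to a composition vector. It assumes the features are raw amino acid frequencies; the actual feature scaling, intercept, and fitted coefficients are those of panel b and are not reproduced here. The names `composition` and `helix_probability` are hypothetical, and the weights in the usage note are toy placeholders.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Frequency of each of the 20 standard amino acids in a region."""
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


def helix_probability(seq, weights, intercept=0.0):
    """Logistic score from composition.

    `weights` maps amino acid -> coefficient; positive values push the
    prediction toward "helical", negative values toward "disordered".
    Amino acids missing from `weights` contribute nothing.
    """
    comp = composition(seq)
    z = intercept + sum(weights.get(aa, 0.0) * comp[aa] for aa in AMINO_ACIDS)
    return 1.0 / (1.0 + math.exp(-z))
```

With toy placeholder weights such as `{"A": 2.0, "P": -2.0}`, a region rich in A scores above 0.5 and a region rich in P scores below it. Because the score depends only on composition, two regions with the same residues in a different order get the same prediction.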
Fig 5.
Highly charged regions are evolutionarily conserved.
a The distribution of gap frequencies (left) and average position-wise entropy (right) summarizing multiple sequence alignments (MSAs) of the highly charged regions (purple), randomly drawn regions from the rest of the proteins that contain them (black), and IDRs from DisProt (gray). b Summary of the variance (in log-odds space) in the usage of categories of amino acids across all proteomes in the AYbRAH database. c FCR values, averaged across AYbRAH alignments, for the highly charged regions identified in the S. cerevisiae proteome and for length-matched randomly drawn regions, each with its associated AYbRAH MSA. d The average compositional conservation of regions enriched for all sets of four amino acids with the same total frequency as the charged amino acids, plotted as a CDF. Higher values indicate less drift, and lower values indicate more drift (regression to the proteome average).
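The two alignment summaries in panel a can be sketched as follows. The sketch assumes an MSA given as a list of equal-length strings with `-` as the gap character, and it excludes gaps when computing per-column entropy; whether the original analysis treats gaps this way is not stated in the caption. The function names are illustrative.

```python
import math
from collections import Counter


def gap_frequency(msa):
    """Fraction of all alignment cells that are gaps ('-')."""
    total = sum(len(row) for row in msa)
    gaps = sum(row.count("-") for row in msa)
    return gaps / total


def mean_column_entropy(msa):
    """Average per-column Shannon entropy in bits, ignoring gap characters.

    Columns made up entirely of gaps are skipped.
    """
    entropies = []
    for i in range(len(msa[0])):
        column = [row[i] for row in msa if row[i] != "-"]
        if not column:
            continue
        n = len(column)
        counts = Counter(column)
        entropies.append(
            -sum((c / n) * math.log2(c / n) for c in counts.values())
        )
    return sum(entropies) / len(entropies) if entropies else 0.0
```

A low gap frequency together with a high column entropy would describe a region that is retained across species while its sequence varies.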
Fig 6.
Rethinking assumptions of disorder in highly charged regions.
a The same set of predictors that have been associated with disordered regions are also compatible with a fully structured helical region. b Current leading methods for disorder and structure prediction disagree completely and also fail to capture experimental reality. The highly charged region of Rcf1 is predicted to be either near-completely disordered or completely helical, depending on the method; experimentally and biologically, it forms most of the membrane-spanning helices and exposed loops in this dimeric mitochondrial inner membrane protein.