Hydrodynamic Radii of Intrinsically Disordered Proteins Determined from Experimental Polyproline II Propensities

The properties of disordered proteins are thought to depend on intrinsic conformational propensities for polyproline II (PP II) structure. While intrinsic PP II propensities have been measured for the common biological amino acids in short peptides, the ability of these experimentally determined propensities to quantitatively reproduce structural behavior in intrinsically disordered proteins (IDPs) has not been established. Presented here are results from molecular simulations of disordered proteins showing that the hydrodynamic radius (R h) can be predicted from experimental PP II propensities with good agreement, even when charge-based considerations are omitted. The simulations demonstrate that R h and chain propensity for PP II structure are linked via a simple power-law scaling relationship, which was tested using the experimental R h of 22 IDPs covering a wide range of peptide lengths, net charge, and sequence composition. Charge effects on R h were found to be generally weak when compared to PP II effects on R h. Results from this study indicate that the hydrodynamic dimensions of IDPs are evidence of considerable sequence-dependent backbone propensities for PP II structure that qualitatively, if not quantitatively, match conformational propensities measured in peptides.


Introduction
Many proteins, and protein domains, that perform critical biological tasks have disordered structures under normal solution conditions [1][2][3]. These proteins are referred to as intrinsically disordered [4] and, accordingly, molecular models of disordered protein structures are needed to understand the physical basis for the activities [2,3], roles regulating key signaling pathways [5], and relationships to human health issues [6][7][8][9] that have been linked to intrinsically disordered proteins (IDPs).
The properties of disordered protein structures are often associated with conformational propensities for polyproline II (PP II) helix [10][11][12] and charge-based intramolecular interactions [13][14][15]. PP II propensities are locally determined [16] and intrinsic to amino acid type [17][18][19], while charge-charge interactions seem to be important for organizing disordered structures owing to both long- and short-range contacts [13][14][15][20,21]. Since chain preferences for PP II increase the hydrodynamic sizes of IDPs [22,23], and Coulombic interaction energies are distance-dependent, it could be argued that charge effects on IDP structures are modulated locally by intrinsic PP II propensities. A number of issues with that hypothesis, however, are apparent. First, it has not been established if PP II propensities measured in short peptide models of the unfolded states of proteins [17][18][19] translate to IDPs. It could be that PP II propensities are negligible and unimportant in IDP systems. Second, methods capable of separating the impact of weak to possibly strong local conformational propensities and charge-charge interactions in the context of flexible and disordered protein structures have not been demonstrated, but are required for testing any potential interdependence.
To investigate such issues, a computer algorithm [22][23][24] based on the Hard Sphere Collision (HSC) model [25] was developed for parsing the contributions of intrinsic PP II propensities and charge to the structures of IDPs, as represented by the hydrodynamic radius (R h). An HSC model was chosen since PP II propensities and charge effects could be added separately and in steps, to isolate contributions to simulated IDP structures. R h was chosen since experimental values are available for a wide range of IDP sequences, allowing direct comparisons to model-simulated R h.
Here we demonstrate that R h for disordered proteins trends with chain propensities for PP II structure by a simple power-law scaling relationship. Using experimental PP II propensities for the common biological amino acids from Kallenbach [17], Creamer [18], and Hilser [19], this relationship was tested against experimental R h from 22 IDPs [23][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42] ranging in size from 73 to 260 residues and net charge from 1 to 43. We observed that the power-law scaling function was able to reproduce IDP R h with good agreement when using propensities from Hilser, while the Kallenbach and Creamer scales consistently overestimated R h. The ability to describe R h from just intrinsic PP II propensities associated with a sequence was supported by simulation results showing that charge effects on IDP R h are generally weak. Relative to the effects of PP II propensities, charge effects on IDP R h were substantial only when charged side chains were separated in sequence by 2 or fewer residue positions and if the sequence had higher than typical bias for one charge type (i.e., positive or negative). Overall, these results demonstrated that two seemingly disparate experimental datasets, IDP R h and intrinsic PP II propensities, are in qualitative agreement, providing evidence for considerable sequence-dependent conformational preferences for PP II structure in the disordered states of biological proteins.

Results
Computer simulation of R h dependence on PP II propensity
R h for IDPs are sensitive to site-specific and general structural perturbations such as amino acid substitutions [23], changes in net charge [13,14], charge rearrangements [15], and temperature changes [22,43,44]. Fig 1 shows that IDP R h differ substantially from R h for folded proteins [22,45,46] that have similar residue length, N. R h from modeling proteins with no strongly preferred conformations [22], which is referred to as a random coil [47], is also provided for comparison to the experimental values. The solid line representing coil R h was determined from computer simulation of randomly configured polypeptide chains using an HSC model [22]. Owing to favorable native contacts that promote stable globular structures, folded proteins have R h that are compacted relative to the R h of simulated random coils. In contrast, the data in Fig 1 indicate that R h from IDPs are generally larger than random coil estimates.
The dependence of R h on N for chemically denatured proteins follows a power-law scaling relationship,
$$R_h = R_o \cdot N^{v} \qquad (1)$$
where R o is 2.2 Å and v is 0.57 [45]. In Eq (2), which describes the R h dependence on N for IDPs, f PRO is the fractional number of PRO residues and |Q| the absolute net charge determined from sequence [49]. Since PRO residues have strong propensities for PP II helix, which is an extended structure [50], and repulsive interactions between charged groups likewise favor extended conformations to minimize unfavorable energetics, a simple molecular interpretation of Eq (2) can be offered whereby the R h dependence on N for IDPs follows a baseline trend of $R_h = (2.17\ \text{Å}) \cdot N^{0.509}$ (i.e., R o with f PRO and |Q| set to zero) with sequence-dependent increases in R h from this baseline owing to chain propensities for PP II and repulsive charge-charge interactions. Simulated R h for random coils were observed to trend with N by $R_h = (2.16\ \text{Å}) \cdot N^{0.509}$ [22], supporting this hypothesis (and reproduced in Fig 1). The effects of ALA to GLY substitutions on IDP R h also indicated that chain propensities for PP II structure modulate IDP R h and not simply PRO content [23].
To model the effects of PP II propensities on coil R h , a sampling bias for PP II structure was applied to random coil simulations and the relationship between R h , N, and fractional number of residues in the PP II conformation, f PPII , was determined [22,23]. This is shown in Fig 1 by stippled lines to demonstrate that increases in f PPII cause increases in coil R h . These results were generated from simulations that modeled PP II bias by applying an identical sampling bias for PP II structure at each residue position in a polypeptide chain and, accordingly, did not include effects that could be caused by position-specific variations in PP II propensity.
To test for effects on coil R h owing to PP II propensity variations within a polypeptide chain, conformational ensembles for N = 15, 25, 35, 50, and 75 were generated for poly-ALA with the algorithm modified to allow position-specific sampling rates for PP II structure. It was shown previously that the effects of N on R h were mostly insensitive to amino acid sequence in HSC model simulations of disordered proteins [22] and thus poly-ALA was chosen as a computational simplification. Variations in PP II propensity among residue positions were simulated by applying a sampling bias for PP II structure (S PPII ) at every position, every second position, every third position, every fourth position, or every fifth position in the poly-ALA chains. S PPII at values of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 were tested at the indicated residue locations. This PP II sampling strategy resulted in 225 separate simulated ensembles (5 N lengths X 5 patterns X 9 S PPII values).
A set of simulations using randomly determined position-specific bias for PP II structure was also modeled using poly-ALA chains. These additional simulations used N = 15, 25, and 35, with each residue position assigned a different random value for S PPII . Position-specific random assignments were repeated 3 times for S PPII ranging from 0 to 1, 0 to 0.5, 0.25 to 0.75, and 0.5 to 1, resulting in an additional 36 simulated ensembles (3 N lengths X 3 distributions of random position-specific PP II biases X 4 applied ranges in PP II sampling bias).
The ensemble-averaged fractional number of residues in the PP II conformation (i.e., the propensity) can be different from S PPII in these simulations since randomly generated structures containing van der Waals contact violations are removed from the calculation. Differences between the applied sampling rate (i.e., S PPII ) and the observed ensemble-averaged rate (i.e., f PPII ) at S PPII -targeted positions followed the same Gaussian relationship that was established previously for whole-chain S PPII and f PPII comparisons [22] and thus straight-forward conversion between applied and observed bias rates was available (S1 Fig). f PPII determined from simulation for residue positions with no applied S PPII was 0.012 ± 0.004.
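The conversion between applied and observed bias rates can be sketched in Python using the Gaussian relationship quoted in the Fig 6A inset description, f PPII = S PPII − 0.062·exp(−(S PPII − 0.63)^2/(2·0.28^2)) [22]. The function names and the bisection-based inversion are illustrative assumptions, not the authors' implementation.

```python
import math

def observed_fppii(s_ppii):
    """Expected ensemble-averaged f_PPII for an applied sampling bias S_PPII,
    using the Gaussian correction quoted from ref [22]."""
    return s_ppii - 0.062 * math.exp(-((s_ppii - 0.63) ** 2) / (2 * 0.28 ** 2))

def required_sppii(target_f, tol=1e-9):
    """Invert observed_fppii by bisection on S_PPII in [0, 1].

    The correction term is small, so the mapping increases monotonically
    over this interval and bisection converges.
    """
    lo, hi = 0.0, 1.0
    if not observed_fppii(lo) <= target_f <= observed_fppii(hi):
        raise ValueError("target f_PPII is outside the reachable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if observed_fppii(mid) < target_f:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for target in (0.1, 0.5, 0.92):
        s = required_sppii(target)
        print(f"target f_PPII = {target:.2f} -> apply S_PPII = {s:.3f}")
```

The largest correction occurs at S PPII = 0.63, where the expected f PPII is 0.63 − 0.062 = 0.568.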
Fig 1. doi:10.1371/journal.pcbi.1004686.g001
The exponential term, v, was dependent on S PPII while R o was mostly independent of S PPII with an averaged value of 2.16 Å [22]. Fig 2A shows results for the simulated ensembles, with R h and f PPII,chain calculated by Eqs (3) and (4), respectively. In Eq (3), <L> = ∑ L i ·P i , where L i is the maximum Cα-Cα distance calculated for state i, P i is the Boltzmann probability for state i, and the summation was over all states i of an ensemble. In Eq (4), f PPII,chain = <N PPII >/N with <N PPII > = ∑ N PPII,i ·P i , where N PPII,i is the number of residues in the PP II conformation for state i. The distinction of "chain" given to f PPII in Eq (4) was provided to limit confusion between f PPII calculated for a whole chain versus f PPII calculated for specific residue positions.
The relationship between v and f PPII,chain for all simulations followed a logarithmic trend that was fit to the equation,
$$v = v_o + \beta \cdot \ln(1 - f_{PPII,chain}) \qquad (5)$$
using the Levenberg-Marquardt method of nonlinear least squares [51,52]. The parameters v o and β were found to be 0.503 ± 0.002 and -0.11 ± 0.003, respectively. Fig 2B shows that R h determined from f PPII,chain (Eq (4)) and N by combining Eqs (1) and (5) (see Eq (6) below) correlated strongly with R h calculated directly from a simulated ensemble (Eq (3)). Not all possible patterns of position-specific PP II bias were tested in our computer trials. Results in Fig 2 predict, however, that in general a quantitative relationship exists for disordered proteins between R h , N, and the ensemble-averaged per-residue chain propensity for PP II structure (f PPII,chain ).
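As a minimal sketch of how Eqs (1) and (5) combine, the Python below predicts R h from N and f PPII,chain. It assumes the logarithmic trend has the form v = v o + β·ln(1 − f PPII,chain) with the fitted values quoted above and R o = 2.16 Å; the function names are hypothetical. With this form, the null-model value f PPII,chain = 0.012 gives about 20.5 and 21.7 Å for N = 87 and N = 97, which matches the random coil range quoted for those lengths in the Discussion.

```python
import math

R_O = 2.16    # Å, averaged pre-factor from the simulations [22]
V_O = 0.503   # fitted exponent at zero PP II propensity
BETA = -0.11  # fitted slope of the logarithmic trend

def scaling_exponent(f_ppii_chain):
    """Exponent v as a function of chain PP II propensity (assumed form of Eq (5))."""
    if not 0.0 <= f_ppii_chain < 1.0:
        raise ValueError("f_PPII,chain must be in [0, 1)")
    return V_O + BETA * math.log(1.0 - f_ppii_chain)

def predict_rh(n_residues, f_ppii_chain):
    """Hydrodynamic radius in Å from the power law R_h = R_o * N**v."""
    return R_O * n_residues ** scaling_exponent(f_ppii_chain)

if __name__ == "__main__":
    for n in (87, 97):
        print(n, round(predict_rh(n, 0.012), 1))
```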

Test of model using experimental PP II propensities
Results from HSC model simulations that are summarized in Figs 1 and 2 can be interpreted as an ideal relationship between R h and N that includes the general effects of sterics and PP II propensities but is absent other intrinsic and intramolecular factors. Contributions from Coulombic interaction energies to IDP R h will be discussed below and added to this model. First, the simulation-derived relationship between R h , N, and f PPII,chain is tested by applying experimental PP II propensities to the sequences of IDPs in Fig 1. The identity, sequence, and experimental R h for each IDP are given in Supporting Information (S1 and S2 Tables).
Amino acid PP II propensities reported by Kallenbach [17], Creamer [18], and Hilser [19] for disordered proteins are reproduced in Table 1 and were used for testing the relationship,
$$R_h = (2.16\ \text{Å}) \cdot N^{\,0.503 - 0.11 \cdot \ln(1 - f_{PPII,chain})} \qquad (6)$$
These propensity scales were chosen since weak correlations are observed among the group (S2 Fig), indicating a potential for yielding different results when each set is used separately with Eq (6) for a given IDP sequence. A physical explanation for the different PP II propensity values reported for the amino acids is not given here (e.g., the reported ALA PP II propensities are very different when compared), other than to note that their measurements used host peptide sequences that were also very different (Table 1). Kallenbach measured PP II propensities in the background of a GLY-rich host peptide, whereas the scale reported by Creamer was determined for positions flanked on both sides by PRO residues. The propensity scale from Hilser was measured for positions located in between PRO and valine (VAL). Other PP II propensity scales were not included in these tests due to similarities to the Kallenbach, Creamer, or Hilser reported values. For example, a PP II propensity scale from Zondlo [53] correlated with the Creamer values (coefficient of determination, R 2 = 0.58), likely owing to the use of a host peptide that also flanked the guest position with PRO residues.
Inspection of Table 1 shows that PP II propensities for tryptophan (TRP) and tyrosine (TYR) were not reported by Creamer. For these amino acids, we used the averaged Creamer-reported value calculated from the 18 other amino acids (0.58). In the Hilser set, TRP and TYR had lower than average PP II propensity. In contrast, TRP and TYR had higher than average PP II propensity in the Kallenbach set. Using the Creamer average was a compromise that likely had low significance in our tests since TRP and TYR had very low representation among the IDP sequences; 0.008 and 0.012, respectively. PP II propensities were not reported for PRO and GLY by Kallenbach. Here, we used 1 for PRO since it is generally accepted that PRO has the highest propensity for PP II structure [10,12,[17][18][19]. This gave PRO a larger value than ALA (0.818), which was the amino acid with the highest reported propensity in the Kallenbach set. GLY was given a propensity of 0.50, which is lower than the Kallenbach average (0.626) but higher than the lowest value (0.428). This also was a compromise from observing that GLY had the lowest value in the Hilser set (0.13), but an average value in the Creamer set (0.58).
f PPII,chain was calculated for each IDP by using the amino acid PP II propensity given in Table 1, summing over the IDP sequence, and dividing by N. Fig 3A shows that the experimental scales predict different chain propensities for PP II structure for each IDP sequence. The scale from Kallenbach gave f PPII,chain ranging from 0.628 to 0.746, whereas the Creamer and Hilser … are smaller than observed for each IDP. In contrast, PP II propensities from Kallenbach and Creamer, which report relatively large f PPII,chain values, predict R h that are larger than observed for each IDP. Experimental propensities from Hilser predict R h that trend with the identity line, showing good agreement, but also showing scatter relative to that line (average error was 2.5 Å). In an attempt to reduce prediction error, a composite PP II propensity scale that used the Hilser values by default but the Kallenbach values for residues located between GLY (i.e., GLY-X-GLY) and Creamer values for residues located between PRO (i.e., PRO-X-PRO) was tested. This context-specific composite propensity scale (identified as "Composite" in Table 2 and Fig 3B) caused only small changes in predicted R h , with no significant improvement in prediction capabilities relative to using only the Hilser reported PP II propensities.
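The chain-averaged propensity and the context-specific composite scale can be sketched as follows. This is an illustrative Python outline only: the function names are hypothetical, the scales are passed in as dictionaries keyed by one-letter residue code, and the numbers in the usage example are either the three values quoted in the text for the Kallenbach set (ALA 0.818, assigned GLY 0.50, assigned PRO 1) or toy placeholders, not Table 1 data.

```python
def chain_propensity(sequence, scale):
    """f_PPII,chain: sum of per-residue propensities divided by N."""
    return sum(scale[res] for res in sequence) / len(sequence)

def composite_chain_propensity(sequence, default, gly_context, pro_context):
    """Chain propensity with context-dependent scale selection.

    default      scale used unless a context rule applies (Hilser in the text)
    gly_context  scale for residues in GLY-X-GLY (Kallenbach in the text)
    pro_context  scale for residues in PRO-X-PRO (Creamer in the text)
    """
    total = 0.0
    last = len(sequence) - 1
    for i, res in enumerate(sequence):
        prev_res = sequence[i - 1] if i > 0 else None
        next_res = sequence[i + 1] if i < last else None
        if prev_res == "G" and next_res == "G":
            total += gly_context[res]
        elif prev_res == "P" and next_res == "P":
            total += pro_context[res]
        else:
            total += default[res]
    return total / len(sequence)

if __name__ == "__main__":
    # Values quoted in the text for the Kallenbach set (GLY and PRO were assigned).
    kallenbach_subset = {"A": 0.818, "G": 0.50, "P": 1.0}
    print(chain_propensity("APGAP", kallenbach_subset))

    # Toy placeholder scales, for illustration of the context rules only.
    toy_default = {"A": 0.2, "G": 0.1, "P": 0.9}
    toy_gly = {"A": 0.8, "G": 0.5, "P": 1.0}
    toy_pro = {"A": 0.6, "G": 0.6, "P": 0.6}
    print(composite_chain_propensity("GAGPAP", toy_default, toy_gly, toy_pro))
```

In the toy example, the first ALA sits in a GLY-X-GLY context and the second in a PRO-X-PRO context, so each draws its value from a different scale.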
Since R h increases with N (Fig 1), prediction error was normalized for peptide length by,
$$\text{normalized error} = \frac{\text{predicted } R_h - \text{experimental } R_h}{\text{random coil } R_h} \qquad (7)$$
Random coil R h was calculated using Eq (6) with f PPII,chain = 0.012, the null model value. Average normalized error is given in Table 2 for each propensity scale. Fig 4 shows trends in the normalized error with N and net charge density, determined as the absolute net charge normalized for peptide length,
$$\text{net charge density} = \frac{|Q|}{\text{random coil } R_h} \qquad (8)$$
S1 Table gives net charge and N for each IDP. No obvious bias with peptide length (i.e., N) was observed in the normalized error for the Hilser and composite propensity scales. Normalized error clearly increased with N when using Kallenbach and Creamer values, indicating that these PP II propensities may be over-estimated when applied to IDP sequences to predict R h . Since the exponent in Eq (6) becomes larger with increasing f PPII,chain , a set of propensity values that systematically are too large would cause normalized errors that increase with N.
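The two normalizations can be sketched in Python as below. The sketch rests on stated assumptions: Eq (6) is taken as R h = 2.16 Å · N^(0.503 − 0.11·ln(1 − f PPII,chain)), and Eq (7) is taken as (predicted R h − experimental R h) divided by random coil R h, a sign convention under which positive values mean over-prediction. Function names are hypothetical.

```python
import math

NULL_F_PPII = 0.012  # null-model chain propensity from the simulations

def rh_from_propensity(n_residues, f_ppii_chain):
    """Assumed form of Eq (6), in Å."""
    return 2.16 * n_residues ** (0.503 - 0.11 * math.log(1.0 - f_ppii_chain))

def random_coil_rh(n_residues):
    """Random coil R_h: Eq (6) evaluated at the null-model propensity."""
    return rh_from_propensity(n_residues, NULL_F_PPII)

def normalized_error(predicted_rh, experimental_rh, n_residues):
    """Assumed form of Eq (7); positive when R_h is over-predicted."""
    return (predicted_rh - experimental_rh) / random_coil_rh(n_residues)

def net_charge_density(net_charge, n_residues):
    """Eq (8): absolute net charge divided by random coil R_h."""
    return abs(net_charge) / random_coil_rh(n_residues)

if __name__ == "__main__":
    n = 90
    # Hypothetical inputs, for illustration only.
    print(normalized_error(predicted_rh=28.0, experimental_rh=26.5, n_residues=n))
    print(net_charge_density(net_charge=-12, n_residues=n))
```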
It is interesting to note that normalized error correlated with net charge density for each experimental propensity scale (Fig 4B and Table 2), suggesting that prediction error was caused partially by charge effects on R h that were not included in the model. This is not surprising since Marsh and Forman-Kay demonstrated that increases in net charge correlate with increases in IDP R h [49] and the trend we observed of decreasing normalized error with increased net charge density is consistent with their conclusions. Extrapolating this trend to zero net charge density for the Hilser and composite propensity scales yields positive normalized errors suggesting that, in the background of no net charge contributions to R h , the PP II propensities reported by Hilser may also be slightly too large when using Eq (6) to predict R h . While this analysis of experimental PP II propensities indicated that one of the scales was capable of reproducing experimental R h with good agreement for a set of IDPs, it is important to recognize that comparative tests based on Eq (6) may not be suitable for affirmation. Since R h in this model depends only on N and chain averaged propensity for PP II structure, contrived scales that predict IDP R h with similar agreement in terms of the average prediction error are simple to generate. For example, each IDP could be given a sequence-independent f PPII,chain value of 0.364, which was determined by converting experimental R h to an apparent f PPII,chain using Eq (6) and then averaging over the IDP dataset. Using this static f PPII,chain to predict IDP R h gives an average prediction error (identified as "Static" in Table 2) that is close to the error obtained when using the experimental scale from Hilser. Correlations between predicted and observed R h and between normalized error and net charge density for the contrived static scale, however, decreased relative to the correlations that were observed with the experimental scales, suggesting that static representations of f PPII,chain may not fully capture some molecular dependencies that are inherent to IDP R h .
Fig 4. Correlation of normalized error in predicted R h to N and net charge density. Normalized error and net charge density were calculated for each IDP using Eqs (7) and (8), respectively. In both panels, red triangles show normalized error from R h predicted using the Kallenbach reported propensities, blue squares from Creamer reported propensities, open circles from Hilser reported propensities, black dots from the composite propensity scale, and X is the null model. Lines are linear fits to the five prediction sets colored as the symbols (Kallenbach scale was red; Creamer was blue, Hilser was stippled black, composite was solid black, and null was dotted black).
To further investigate the capabilities of Eq (6) for relating IDP R h and PP II propensity, random sets of amino acid scales were generated following a two-step protocol and analyzed. First, a random number between 0 and 1 was used to target an average propensity for a scale. Then, random scales were generated, where each amino acid was assigned a different random value between 0 and 1, until a set was found whose average for the 20 amino acids matched the target determined in the first step (±0.05). The goal from using two steps to generate scales was to ensure that chain averaged propensities in the high, medium, and low range were evenly sampled. This sampling scheme was repeated until 100,000 random scales were generated. Each propensity scale was then used to predict R h from Eq (6) and the results are summarized in Fig 5. It was observed that randomly generated scales gave average prediction errors for the IDP dataset ranging from 1.9 to 239.8 Å, correlations between predicted and observed R h ranging from 0.02 to 0.88, and correlations between normalized error and net charge density from 0 to 0.81. Optimal values for these metrics (i.e., highest correlations coupled with lowest average error), seem to focus toward values of R 2 and average error that are obtained when using experimental PP II propensities from Hilser. This result shows that experimental R h of the IDP dataset are in good qualitative agreement with experimental PP II propensities reported by Hilser, and vice versa, giving evidence that the molecular properties of IDPs that link R h , N, and f PPII,chain are well-approximated by the simple power-law scaling relationship of Eq (6).
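The two-step protocol for generating random scales can be sketched with rejection sampling, as below. This is an illustrative Python outline, not the authors' code: the function name, the use of uniform draws, and the `max_tries` cap are assumptions. The cap is included because a plain rejection loop accepts very rarely when the target average lies far from 0.5.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 common amino acids

def random_scale(rng, target, tolerance=0.05, max_tries=100_000):
    """Step 2: draw uniform random propensities for all 20 amino acids until
    the scale average falls within +/- tolerance of the target average.

    Returns a dict keyed by one-letter code, or None if no scale was accepted
    within max_tries draws.
    """
    for _ in range(max_tries):
        values = [rng.random() for _ in AMINO_ACIDS]
        if abs(sum(values) / len(values) - target) <= tolerance:
            return dict(zip(AMINO_ACIDS, values))
    return None

if __name__ == "__main__":
    rng = random.Random(2016)
    target = rng.random()  # Step 1: random target average between 0 and 1
    scale = random_scale(rng, target)
    if scale is None:
        print(f"no scale accepted for target {target:.2f}")
    else:
        mean = sum(scale.values()) / len(scale)
        print(f"target {target:.2f}, accepted scale average {mean:.2f}")
```

Each accepted scale could then be applied to the IDP sequences to obtain f PPII,chain and a predicted R h from Eq (6).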

Effects of Coulombic interaction energies on R h
In the HSC model used for this study, a computer algorithm generates polypeptide structures by random conformational search until R h (Eq (3)) converges to a stable ensemble-averaged value [22]. A structure-based energy function parameterized to solvent-accessible surface areas that has been tested extensively [54][55][56][57][58][59][60][61][62] is used to population-weight each randomly generated structure. To approximate charge effects on ensemble populations, the energy function was modified to include Coulombic interaction energies by,
$$E_{Coulomb} = \sum_{i} \sum_{j>i} \frac{332 \cdot Z_i Z_j}{D_{H2O} \cdot R_{ij}} \, e^{-\kappa R_{ij}} \qquad (9)$$
where the constant 332 converts the energy into units of kilocalories per mole at 25°C, D H2O is the dielectric of water, Z is the charge at site i or j, R ij is the distance between two charged sites i and j (in Å), κ (the Debye parameter) accounts for screening from solution ionic strength, and the sums are over all charge-bearing sites. The Debye parameter was calculated by Eq (10), where I is ionic strength (in molarity, M). D H2O used was 78.3 [63] and I was 0.1 M to represent normal conditions. Since the simulations used poly-ALA chains, charged residues were modeled with a positive or negative charge located at the coordinates of the Cβ atom to denote the approximate location for flexible and charged side chains. Coordinates for the backbone N and O atoms of the first and last residues were used to assign positive and negative charge, respectively, to N- and C-termini. Simulations were limited to 25 residue poly-ALA chains to establish trends for the effects of charge on R h in this model. For each ensemble, an identical S PPII was applied at each residue position. S PPII was varied among the different simulations to target ensemble-averaged f PPII,chain ranging from 0.1 to 0.92.
Fig 6A shows that introducing charge at N- and C-termini had no effect on simulated R h for poly-ALA chains. Modeling negative charge at the Cβ position of each residue, or positive charge (S3 Fig), caused large increases in R h from repulsive electrostatic intramolecular interactions. Identical charge at every other residue position caused smaller increases in R h , while identical charge at every third position gave R h that were mostly similar to R h of poly-ALA modeled with no charges. These data predict that the effects of charge on IDP R h should weaken as charged residues separate in sequence, as expected. Fig 6B shows the ensemble-averaged distance between "charged" Cβ atoms that were closest in sequence for each ensemble in panel A, indicating repulsive charge-charge interactions at distances ≥9 Å had only minor effects on R h . The Debye length for the modeled conditions (i.e., 1/κ) was 9.6 Å, which is the distance where interactions between charged groups become negligible at a given ionic strength. The simulation results thus trend with expected outcomes for fully solvated charges. It was also observed that, for polypeptides with each residue position charged, f PPII,chain calculated for an ensemble was larger than expected based upon the applied S PPII (Fig 6A inset). This result predicts that repulsive charge-charge interactions between side chain groups preferentially select for the extended PP II structure to minimize unfavorable interaction energies.
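The screened Coulombic term can be sketched in Python as a pairwise sum over charged sites. Several points are assumptions rather than details taken from the text: the Debye parameter is computed with the standard expression κ = 50.29·sqrt(I/(D·T)) Å⁻¹ (which gives a Debye length of about 9.6 Å at I = 0.1 M, D = 78.3, and 298.15 K, in line with the value quoted above), each pair is counted once, and the function names are hypothetical.

```python
import math

D_H2O = 78.3          # dielectric of water used in the text [63]
TEMPERATURE = 298.15  # K, i.e. 25 °C

def debye_parameter(ionic_strength, dielectric=D_H2O, temperature=TEMPERATURE):
    """Debye parameter kappa in 1/Å (standard expression; assumed form)."""
    return 50.29 * math.sqrt(ionic_strength / (dielectric * temperature))

def coulomb_energy(charges, coords, ionic_strength=0.1):
    """Screened Coulomb energy in kcal/mol.

    charges  list of site charges Z (e.g. +1 or -1)
    coords   list of (x, y, z) site positions in Å, e.g. C-beta coordinates
    Each pair of charged sites is counted once.
    """
    kappa = debye_parameter(ionic_strength)
    energy = 0.0
    for i in range(len(charges)):
        for j in range(i + 1, len(charges)):
            r_ij = math.dist(coords[i], coords[j])
            energy += (332.0 * charges[i] * charges[j]
                       * math.exp(-kappa * r_ij) / (D_H2O * r_ij))
    return energy

if __name__ == "__main__":
    print(f"Debye length at 0.1 M: {1.0 / debye_parameter(0.1):.1f} Å")
    sites = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    print(f"like charges 5 Å apart: {coulomb_energy([-1, -1], sites):.3f} kcal/mol")
```

Because the exponential factor decays on the scale of the Debye length, pairs separated by more than roughly 9 to 10 Å contribute little at this ionic strength, consistent with the trend described for Fig 6B.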
To test the effects of clusters of charge on R h , polypeptides with patterns consisting of three consecutively charged residues were also simulated (Fig 7). Similar trends were observed, whereby the effects of charge on R h weaken as charged groups (i.e., clusters) were separated in sequence. Charge clusters, however, affected R h when modeled with 4 intervening non-charged residues, with weaker effects persisting at even larger separation distances between the clusters. This contrasts with the simulation results for non-clustered charged residues that exhibited negligible effects on R h when charges were separated by as little as 2 intervening uncharged residue positions (Fig 6A).
Fig 5. R h prediction from random PP II propensity scales. Random scales were generated as described in the text and used to predict R h for each IDP by Eq (6). Shown is the correlation (R 2 ) obtained for each scale between observed and predicted R h plotted against the correlation obtained between the normalized error (n. error) and the net charge density (ncd). Shown by color is the average prediction error of each scale. Random scales giving average prediction error larger than 75 Å were omitted to emphasize differences at lower error values.
Fig 6A. Inset: comparison of observed f PPII,chain (shown as obs f PPII ) to f PPII,chain expected from the applied S PPII (shown as applied f PPII ; calculated as $f_{PPII} = S_{PPII} - 0.062 \cdot \exp\left(-(S_{PPII} - 0.63)^2 / (2 \cdot 0.28^2)\right)$ [22]). Note that filled circles trend higher than other plotted data. Inset symbols match panel representations.
Since IDPs, in general, contain both positive and negative charges, simulations with opposite charge at adjacent residue positions were also performed. Fig 8A shows that repeating patterns of opposite charge had minimal effects on R h in these simulations, even when each residue position was charged. This was mostly the case for charge clusters too (Fig 8B) with the exception that the simulation would sporadically generate ensembles with compacted R h , whereby "compacted" is used to indicate R h smaller than what was observed for non-charged poly-ALA coils of identical N. Overall, the amount of R h compaction owing to favorable interactions between oppositely charged residues (or clusters) was small when compared to increases in R h that were observed owing to unfavorable interactions between identically charged residues (or clusters).
Fig 7. As in Fig 6A, R h was calculated from poly-ALA simulations with N = 25. A charge cluster was defined as three consecutive residues with negative charge modeled at the Cβ atoms. Charge clusters separated in sequence by two uncharged residues (no charge modeled at Cβ) are shown with filled squares whereas charge clusters separated by four uncharged residues are shown with filled triangles. X and + symbols represent charge clusters separated by six and eight uncharged residues, respectively. Inset: comparison of observed f PPII,chain to f PPII,chain expected from the applied S PPII (following Fig 6A inset description). Inset symbols match panel representations.
Fig 8. In panel A, filled circles have each residue modeled with charge at the Cβ atom (first residue negative, second residue positive, third residue negative, etc.). Filled squares have every other residue modeled with charge (first residue negative, third residue positive, etc.), filled triangles have every third residue modeled with charge, and X represents every fourth residue modeled with charge. In panel B, each residue in a cluster had identical charge while clusters adjacent in sequence had opposite charge. Filled circles are poly-ALA with every residue charged (i.e., residues 1-3 having negative charge, residues 4-6 with positive charge, residues 7-9 with negative charge, etc.). Charge clusters separated in sequence by two uncharged residues are shown with filled squares (i.e., residues 1-3 with negative charge, residues 4-5 uncharged, residues 6-8 with positive charge, etc.) whereas charge clusters separated by four uncharged residues are shown by filled triangles. X and + symbols represent charge clusters separated by six and eight uncharged residues, respectively. Insets: comparison of observed f PPII,chain to f PPII,chain expected from the applied S PPII (following Fig 6A inset description). Inset symbols match panel representations. doi:10.1371/journal.pcbi.1004686.g008
The results in Figs 6-8 from modeling charge effects on R h indicate that, in general, the strongest effects on R h should occur owing to identical charges at sequentially-adjacent residue positions (Figs 6 and 7) and for polypeptides with the least amount of mixing of positive and negative charge types (Fig 8). To test these two general observations, the IDP dataset was analyzed to determine the net number of adjacent charges in each IDP sequence. This was calculated by first summing the number of ASP residues that had GLU or ASP immediately next or prior in sequence with the number of GLU residues that had GLU or ASP immediately next or prior in sequence to determine the total number of negative charges with an adjacent negatively charged neighbor. A similar calculation was performed using LYS and ARG to determine the number of positive charges with an adjacent positively charged neighbor. The net number of adjacent charges for an IDP was then the absolute value in the difference between the positive and negative adjacent charge numbers (provided in S1 Table). Fig 9A shows that normalized error in predicted R h for the IDP dataset trends with the net adjacent charge density (i.e., net adjacent charge normalized for peptide length), similar to the correlation that was observed between normalized error and net charge density (Fig 4B). This should be expected since net charge and net adjacent charge correlate with R 2 = 0.64 in the dataset.
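The net adjacent charge count described above can be sketched in Python as follows. Function names are hypothetical, sequences are assumed to use one-letter codes (D/E for ASP/GLU, K/R for LYS/ARG), and the normalization to a density is left out because the text does not spell out its exact form.

```python
NEGATIVE = frozenset("DE")  # ASP, GLU
POSITIVE = frozenset("KR")  # LYS, ARG

def count_with_like_neighbor(sequence, group):
    """Number of residues in `group` whose previous or next residue
    in sequence is also in `group`."""
    count = 0
    for i, res in enumerate(sequence):
        if res not in group:
            continue
        has_prev = i > 0 and sequence[i - 1] in group
        has_next = i + 1 < len(sequence) and sequence[i + 1] in group
        if has_prev or has_next:
            count += 1
    return count

def net_adjacent_charges(sequence):
    """Absolute difference between positive and negative adjacent-charge counts."""
    positive = count_with_like_neighbor(sequence, POSITIVE)
    negative = count_with_like_neighbor(sequence, NEGATIVE)
    return abs(positive - negative)

if __name__ == "__main__":
    # Toy sequence for illustration: D,E are adjacent negatives (2);
    # K,K,R form an adjacent positive run (3); isolated K and D do not count.
    print(net_adjacent_charges("DEAKAKKRD"))  # 1
```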
The set of IDPs was also split according to the amount of mixing of positive and negative charge types in a given sequence. To do this, a "charge bias" was calculated for each IDP as the simple ratio of total negative charges (sum of ASP and GLU residues) to total positive charges (sum of LYS and ARG residues), or vice versa, depending on which ratio gave a value greater than 1. As a metric for separating IDPs with "high" and "low" charge bias, a "typical" charge bias was calculated for the entire dataset by the concatenated sequence and found to be 1.9. The average IDP charge bias, found to be 4.2, was not used to separate IDPs since: 1) ratio-based distributions are skewed, 2) only 7 IDPs would have been in the "high" charge bias set, and 3) 4 of these 7 were sequences derived from the p53 protein. Using the charge bias of the concatenated sequence gave 12 IDPs in the high charge bias set and 10 IDPs in the low charge bias set. Fig 9B shows that correlations between net adjacent charge density and normalized error in predicted R h persisted in the set of IDPs with high charge bias and mostly disappeared for IDPs with low charge bias, seeming to agree with the simulation prediction that significant mixing of positive and negative charge types in a sequence should reduce charge effects on R h . Applying this analysis to net charge density gave different results (S4 Fig). Correlations between net charge density and normalized error in predicted R h decreased for both the high and low charge bias sets. This could be owing to trends shown in Fig 6, whereby net charge effects on R h depended strongly on the distance between the charged groups. Overall, these results seem to indicate that charge effects on IDP structures are highly dependent on sequence; however, charge effects on R h can be weakened substantially by mixing negative and positive charge types or by slight increases in the distances between charged groups in sequence.
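The charge bias and the split around the "typical" value can be sketched in Python as below. The function names are hypothetical, and two details are assumptions: a sequence lacking one charge type is given an infinite bias, and sequences whose bias exceeds the concatenated-sequence value are placed in the high set.

```python
import math

def charge_bias(sequence):
    """Ratio of the more abundant charge type to the less abundant one,
    so the value is always >= 1 (ASP+GLU versus LYS+ARG counts)."""
    negative = sum(sequence.count(res) for res in "DE")
    positive = sum(sequence.count(res) for res in "KR")
    larger, smaller = max(negative, positive), min(negative, positive)
    if smaller == 0:
        return math.inf  # assumption: ratio undefined, treated as maximal bias
    return larger / smaller

def split_by_typical_bias(sequences):
    """Split sequences into high and low charge bias sets using the bias
    of the concatenated dataset as the threshold."""
    typical = charge_bias("".join(sequences))
    high = [seq for seq in sequences if charge_bias(seq) > typical]
    low = [seq for seq in sequences if charge_bias(seq) <= typical]
    return typical, high, low

if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = ["DDEK", "KKRRDE", "DKAA"]
    typical, high, low = split_by_typical_bias(toy)
    print(typical, high, low)
```

In the toy example the individual biases are 3, 2, and 1, while the concatenated sequence has equal numbers of each charge type and a bias of 1, which illustrates how the concatenated value can differ from the average of per-sequence ratios.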
The hypothesis that charge effects on R h may be generally weak for IDPs is supported by data in Fig 3B showing that R h could be predicted without specific consideration of charges when provided an appropriate amino acid scale for intrinsic PP II propensities. Fig 1 shows that experimental R h for IDPs are much larger than computational predictions based on random coil modeling of the R h dependence on N. Numerous studies have demonstrated the importance of Coulombic effects for regulating IDP structural preferences [13][14][15]. Thus, it could be surprising to note that sequence effects on IDP R h can be predicted with good agreement from sequence differences in PP II propensity, even when other intramolecular factors are ignored. Predictions of R h from IDP sequence and Eq (6) seemed to work best when using an experimental PP II propensity scale from Hilser and colleagues [19], or a composite scale that combined the Hilser, Kallenbach [17], and Creamer [18] propensities, giving an average error of ~2.5 Å for an IDP dataset covering a wide range of residue lengths, net charge, and sequence composition. As examples of sequence differences in this dataset, the fractional number of PRO residues (f PRO = (# PRO residues)/N) varied from 0 to 0.24, SER from 0.02 to 0.20, GLU from 0.06 to 0.31, and ALA from 0 to 0.16, indicating significant sequence diversity among the IDPs that were tested.

Discussion
If it were established that molecular descriptions for R h depend mostly on PP II propensities for disordered proteins, this would have important implications. First, R h well above random coil estimates would indicate non-trivial preferences for PP II structure. Fig 1 shows this to be the case for many IDPs. Second, large variations in R h for IDPs with similar N would indicate large differences in propensity for PP II structure among the biologically common amino acids. Observed differences in amino acid propensity for PP II [17][18][19]53] are thus consistent with the observed differences in R h for IDPs with similar N. For example, consider that R h varied from 24.5 Å to 32.4 Å for IDPs with N = 87-97 in Fig 1. The average prediction error in R h for these 8 IDPs from using Eq (6) and the composite propensity scale was only 1.7 ± 0.7 Å, though net charge ranged from 4 to 29 for these proteins. In contrast, predictions using random coil values give R h from 20.5 to 21.7 Å with an average error of 6.4 ± 2.7 Å.
The simulation-derived relationship between R h , N, and f PPII,chain appears to be surprisingly simple for disordered proteins. As noted above, Eq (6) should be interpreted as an ideal relationship that excludes many molecular factors known to regulate structural preferences in proteins (e.g., electrostatic effects, cis-trans isomerization rates). Observed deviations from this "ideal" behavior can then be interpreted in terms of factors that were not modeled, as shown (Fig 4B). We recognize that exclusive use of poly-ALA for computational modeling may prove to be unjustified with further studies. Poly-ALA was used as a simplifying step since the effects of N on R h were mostly independent of amino acid sequence in previous HSC-based simulations and agreed with general IDP trends determined from a literature survey [22,49]. As shown here, this simulation-derived relationship provides a straightforward molecular explanation for R h variations among IDPs. The R h dependence on f PPII,chain also predicts heat-induced compaction of IDP R h since the enthalpy of unfolding PP II structure is positive [16,64]. Many studies have demonstrated R h compaction caused by elevated temperatures for IDPs [22,43,44].
As noted above, the simulation results presented here could be interpreted as indicating that charge effects on R h are generally weak for IDPs, relative to the effects of intrinsic PP II propensities. These data demonstrate, however, that certain sequence patterns of charge can modulate R h substantially (see Fig 6). Specifically, the charged groups with the largest effects are those separated by distances averaging less than the solution Debye length, of identical charge type (i.e., positive or negative), and located within a region showing higher than typical charge bias. These general rules are in qualitative agreement with results from Pappu and colleagues showing that simulated hydrodynamic sizes for highly charged and disordered polypeptides, with every residue modeled as GLU or LYS, depend strongly on the mixing of negative and positive charge types [15]. In that study, mixing of charge types in a sequence caused structural compaction relative to biased charge distributions, similar to our own conclusions. The observation that unfavorable charge-charge interactions between side chain groups can promote PP II structure (Figs 6A and 7 insets) has also been noticed in computational studies from other researchers [14,65]. This result predicts multiple mechanisms for charge-mediated regulation of IDP structure, possibly owing to both the accumulation of charge and local modulation of PP II propensities. Overall, these data demonstrate the importance of sequence context for understanding the structural properties of IDPs and for describing quantitatively how disordered protein structures respond to discrete perturbations such as changes in charge state and amino acid substitutions.

Computer generation of polypeptide structures
A detailed description of the computer algorithm that was used is provided elsewhere [22,24]. Briefly, simulations of disordered protein structures were limited to poly-ALA polypeptides. Main chain atoms of poly-ALA were generated using the standard bond angles and bond lengths [66] and a random sampling of the dihedral angles Φ, Ψ, and ω. The dihedral angle ω was given a Gaussian fluctuation of ±5° around the trans value of 180°. To sample conformational space efficiently, (Φ,Ψ) values were restricted to the allowed Ramachandran regions [67]. Of the two possible positions of the side chain Cβ atom, the one corresponding to L-alanine was used throughout the studies. To calculate state distributions typical of protein ensembles, a structure-based energy function parameterized to solvent-accessible surface areas was used to population-weight the generated structures [54][55][56][57][58][59][60][61][62].
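The dihedral sampling step described above can be illustrated with a short Python sketch. It is not the published algorithm [22,24]: the rectangular (Φ,Ψ) boxes are hypothetical placeholders for the allowed Ramachandran regions, the ±5° fluctuation in ω is assumed to be a standard deviation, and neither the construction of atomic coordinates nor the population weighting by the energy function is attempted.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical placeholder boxes (phi_min, phi_max, psi_min, psi_max), in
# degrees. These are illustrative only and are NOT the regions used in the study.
ALLOWED_BOXES = [
    (-180.0, -45.0, 100.0, 180.0),  # extended / PP II-like region
    (-180.0, -45.0, -80.0, -10.0),  # right-handed helical region
    (45.0, 90.0, 0.0, 90.0),        # left-handed helical region
]


def is_allowed(phi: float, psi: float) -> bool:
    """True if (phi, psi) falls inside any placeholder allowed box."""
    return any(
        p0 <= phi <= p1 and s0 <= psi <= s1 for p0, p1, s0, s1 in ALLOWED_BOXES
    )


def sample_residue(rng: random.Random, omega_sd: float = 5.0) -> Tuple[float, float, float]:
    """Draw one (phi, psi, omega) triple by rejection sampling."""
    while True:
        phi = rng.uniform(-180.0, 180.0)
        psi = rng.uniform(-180.0, 180.0)
        if is_allowed(phi, psi):
            break
    # omega fluctuates around trans (180 degrees); assumption: 5 degrees is
    # the standard deviation. Values are not wrapped into [-180, 180].
    omega = rng.gauss(180.0, omega_sd)
    return phi, psi, omega


def sample_chain(n_residues: int, seed: Optional[int] = None) -> List[Tuple[float, float, float]]:
    """Sample backbone dihedral angles for a chain of n_residues."""
    rng = random.Random(seed)
    return [sample_residue(rng) for _ in range(n_residues)]


chain = sample_chain(50, seed=1)
```

Rejection sampling is used here only because it is simple to read. A production implementation would more likely draw directly from the allowed regions and would apply position-specific PP II sampling rates.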
Supporting Information
S1 Fig. Comparison of f PPII and S PPII . In this figure, S PPII is the average applied sampling rate for PP II for residues with S PPII ≠ 0 in a simulation, while f PPII was the observed per-position average PP II rate, also excluding residues with S PPII = 0. Open circles are from ensembles where position-specific S PPII followed the pattern specified in the text (i.e., different simulations had different S PPII ranging from 0.1 to 0.9 in 0.1 increments applied to each residue, every other residue, every third residue, etc.), which is why circles align at S PPII = 0.1-0.9 in 0.1 increments. Blue circles give the average f PPII for each applied S PPII . Open squares represent this calculation performed on simulations using randomly assigned position-specific S PPII . Stippled line is the identity; solid line is the relationship between f PPII and S PPII established previously for S PPII applied at constant values across all residues [22]. In general, f PPII trends with S PPII by: f PPII = S PPII − 0.062·exp(−(S PPII − 0.63)² / (2·0.28²)). This gives the algorithm the ability to target specific f PPII from the applied value of S PPII . (TIF)
Shown are correlations (R²) between normalized error and net charge density for all IDPs, IDPs in the high charge bias group (labeled as "high bias"), and IDPs in the low charge bias group (labeled as "low bias"). Red columns are correlations from using the Kallenbach propensity scale to predict R h , blue from using the Creamer propensities, white the Hilser propensities, and black the composite propensity scale. (TIF)
S1