Marked Variability in the Extent of Protein Disorder within and between Viral Families

Intrinsically disordered regions in eukaryotic proteomes contain key signaling and regulatory modules and mediate interactions with many proteins. Many viral proteomes encode disordered proteins and modulate host factors through the use of short linear motifs (SLiMs) embedded within disordered regions. However, the degree of viral protein disorder across different viruses is not well understood, so we set out to establish the constraints acting on viruses in terms of their use of disordered protein regions. We surveyed predicted disorder across 2,278 available viral genomes in 41 families, and correlated the extent of disorder with genome size and other factors. Protein disorder varies strikingly between viral families (from 2.9% to 23.1% of residues), and also within families. However, this substantial variation did not follow the established trend among their hosts, where disorder increases across eubacteria, archaebacteria, protists, and multicellular eukaryotes. For example, among large mammalian viruses, poxviruses and herpesviruses showed markedly differing disorder (5.6% and 17.9%, respectively). Viral families with smaller genome sizes have more disorder within each of five main viral types (ssDNA, dsDNA, ssRNA+, dsRNA, retroviruses), except for negative single-stranded RNA viruses, where disorder increased with genome size. However, surveying over all viruses, which compares tiny and enormous viruses over a much bigger range of genome sizes, there is no strong association of genome size with protein disorder. We conclude that there is extensive variation in the disorder content of viral proteomes. While a proportion of this may relate to base composition, to extent of gene overlap, and to genome size within viral types, there remain important additional family and virus-specific effects.
Differing disorder strategies are likely to impact on how different viruses modulate host factors, and on how rapidly viruses can evolve novel instances of SLiMs subverting host functions, such as innate and acquired immunity.

content, the asymmetric replication of viral strands frequently gives rise to biases in base composition in which G and C are not at equal frequencies. Intrinsically disordered proteins are less complex and are dominated by certain residues (E, K, R, G, Q, S, P, A) [1,2,3,4]. Mutational pressure has been the dominant evolutionary driving force in determining the different codon usage preferences in ssRNA viruses [5]. For example, the coding sequences of all negative-stranded RNA viruses are biased toward high A, and their genomes toward high T, suggesting that RNA viruses with different genome polarity are under different mutational pressure [6].
We investigated whether intrinsic disorder is determined by GC content in these viral sequences. Indeed, intrinsic disorder shows an overall significant positive correlation (ρ = 0.36, p < 2.

Pushker et al supporting information

GC content and experimentally observed disorder
We wished to check whether the relationship between GC content and predicted disorder in viral sequences is reflected in a similar relationship between GC content and experimentally determined disorder. A number of viral proteins with an appreciable extent of disorder have been structurally investigated. We investigated 17 viral proteins from the DISPROT [7] database of disordered proteins of known structure (Table S6). We anticipated that there would be a higher GC content among those viral proteins for which a greater proportion of the residues are observed to be disordered. We noted that IUPred strongly underpredicted the extent of disorder in the shorter proteins (Table S3). Among the longer proteins, the one whose gene had the highest GC content (HHV2 alpha trans-inducing protein) had an IUPred prediction that matched the DISPROT observation exactly, suggesting that GC content does not necessarily bias IUPred predictions upwards. Selecting the nine proteins that are more than 250 residues long, there is a positive correlation between observed disorder and GC content (Pearson's correlation coefficient of 0.38), but the sample size is too small to draw any general inference. However, we note that this correlation is similar to that seen for predicted disorder. Thus, we have no reason to believe that IUPred is overpredicting disorder in GC-rich proteins, since the observed disorder in this small dataset of larger proteins shows a similar level of correlation.
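As an illustration of the comparison described above, the following minimal sketch computes the GC content of a coding sequence and Pearson's correlation coefficient between two lists of values. The function names and the toy numbers are hypothetical and are not taken from Table S6; this is a sketch of the general calculation, not the analysis pipeline used here.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T/U)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at
    return gc / total if total else 0.0


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy values only (hypothetical): GC fraction of each gene and the
    # percentage of residues observed to be disordered in each protein.
    gc_values = [0.41, 0.48, 0.52, 0.60, 0.67]
    observed_disorder = [12.0, 30.0, 18.0, 35.0, 28.0]
    print(round(pearson(gc_values, observed_disorder), 2))
```

With only nine proteins, as in the subset above, a coefficient computed this way carries wide uncertainty, which is why no general inference is drawn from it.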

Base composition is a partial predictor of genome size in families
We explored the individual correlations between genome size and base composition. Genome size in viral genomes is positively correlated with the proportion of A (ρ = 0.09) and G (ρ = 0.0001) and negatively correlated with T (ρ = -0.068) and C (ρ = -0.064). However, the p-value associated with G was not significant (p = 0.99). We performed a multiple linear regression to determine the relationship between genome size and base composition. Overall, only 0.3% of the variance in genome size could be explained by base composition. However, within 19 families with ten or more viral genomes, between 15% (Podoviridae) and 76% (Microviridae) of the variance in genome size could be explained by base composition (p < 0.01; Table 2). In some families, such as Microviridae (76%), Caliciviridae (74%), Anelloviridae (54%), Virgaviridae (53%) and Rhabdoviridae (52%), more than 50% of the variance in genome size is associated with base composition. Poxviridae and Herpesviridae have bigger genomes, but the variance explained by base composition is 15% and 3%, respectively. Geminiviridae (N = 254) is the biggest family of viruses, but only 3% of the variance in genome size is explained by base composition. Thus, we conclude that base composition could be a partial predictor of genome size within viral families.
Therefore, since disorder is also related to base composition, we need to account for this when investigating the relationship between disorder and genome size.
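The "variance explained" figures quoted above correspond to the R² of a multiple linear regression. The sketch below shows one way such a value can be computed with ordinary least squares in plain Python. It is an illustration under stated assumptions, not the procedure used for Table 2: the function names are hypothetical, the data are toy values, and the choice to use only three of the four base proportions as predictors (because the four proportions sum to one and would otherwise be collinear with the intercept) is an assumption of this sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(predictors, response):
    """R^2 of an ordinary least-squares fit that includes an intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    if ss_tot == 0:
        raise ValueError("response has no variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy data (hypothetical): proportions of A, C and G per genome
    # (T is omitted because A + C + G + T = 1), and genome size in kb.
    base_props = [(0.30, 0.20, 0.22), (0.28, 0.24, 0.23), (0.33, 0.18, 0.20),
                  (0.25, 0.26, 0.27), (0.31, 0.21, 0.19), (0.27, 0.23, 0.26)]
    genome_kb = [5.4, 6.1, 4.9, 7.8, 5.0, 7.1]
    print(round(r_squared(base_props, genome_kb), 3))
```

Fitting the same model separately within each family and across all genomes pooled is what allows the within-family values to be high while the pooled value stays near zero.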

Smaller genomes with greater disorder often have more overlapping genes in ssRNAp and the ssDNA viruses.
Disordered regions are often found in overlapping regions of viral proteins [8]. This may arise in part because the constraints on out-of-frame proteins alter the amino acid preferences, in part because many overlapping proteins are short (shorter proteins tend to be more disordered), and in part because overlapping proteins may be accessory proteins without a key conserved structural role. The negative correlation seen for most viral types may in some instances relate to family-specific effects within the type.
Among the ssRNAp, the negative correlation can largely be accounted for by the large size and low disorder of the Coronaviridae (with large non-segmented RNA genomes; Figure S4), contrasted with the high disorder of a subset of the smaller families, primarily the Luteoviridae, the Tymoviridae, and the Alphaflexiviridae (Figure S5). In terms of understanding the biology of these proteins, the latter three families have a greater degree of overlap in their genomes, in contrast with the Coronaviridae.
Overlapping proteins may in some circumstances be more disordered [8]. Among the ssDNA, the Anelloviridae are short and disordered, while the Nanoviridae and the Inoviridae are longer and ordered (Figure S6). For this group, as for the ssRNAp, the short and disordered families appear to have more overlapping open reading frames (Figure S7). There is a very large number of dsDNA viruses, but the negative correlation in part reflects a very small genome size and high disorder for the Papillomaviridae (Figure S8). Again, these viruses exhibit a large degree of viral gene overlap. Among the dsDNA-RT, the shorter Hepadnaviridae have more disorder than the longer Caulimoviridae (Figure S9). Again, it is striking that the Hepadnaviridae display more overlapping open reading frames than any other.

Greater disorder among larger genomes of ssRNAn viruses unlikely to reflect differences in extent of overlapping genes.
We were interested as to why ssRNAn viruses should show an opposite (positive) correlation between genome size and disorder compared to the relationship seen among the other viral types. One feature that distinguishes this type is that it has the lowest mean disorder (µD = 7.4%; Table 1). A second feature of this type is that it has a low variability in genome size (Table 1). However, there could be other explanations. One question is whether the individual families within this type show very different patterns.
While the correlation could indeed be explained to some degree by the fact that the Paramyxoviridae have larger genomes and more disorder than the Bunyaviridae, Rhabdoviridae and Arenaviridae, there was still a positive correlation among the Paramyxoviridae (Table 2; Figure S10). In this type, overlapping sequences are found both within the more disordered Paramyxoviridae and within the less disordered, smaller Bunyaviridae (Figure S11). Thus, the trend is likely to reflect other aspects of viral function. This may relate to additional proteins, or in the case of Filoviridae to the addition of a disordered region to the NCAP protein [9], which may partly explain the observation.
Thus, overall we see a general pattern emerging within each major type. Typically, the more disordered viral proteomes have more overlapping proteins. Strikingly, in one viral type, the larger genomes are more disordered, but there is no relationship between disorder and overlapping genes. If we put this one unusual type to one side, we are left with a general trend that the viruses with more disorder have smaller genomes with more overlapping genes. There are two potential explanations for this association. The first is that the overlapping segments are themselves more disordered. The second is that overlapping proteins and greater disorder are independent responses to the problem of how to fit more coding functionality into the viral proteome encoded by a spatially constrained genome.
To address this, we plotted the predicted disorder across some representative proteins containing overlapping segments. For Hepatitis B, some of the overlapping gene regions of protein P appear disordered, either in protein P, or in protein S. A more systematic analysis of this relationship is complicated, since many overlapping proteins are very short, and very short proteins are often very disordered simply because they are too short to support an ordered structure. Thus, it is difficult to determine whether the overlapping regions have a much higher disorder than expected, given their short length, compared with other similarly short regions.

Considering cleaved polypeptides produced similar results
We also investigated the effect of polyproteins that are cleaved to produce a number of polypeptides. When we incorporated these into the prediction of percent disorder for the whole genome, we found that both methods produced very similar values of percent disorder (ρ = 0.87, p < 2.2 × 10^-16; Figure S3).
We repeated the complete analyses, substituting percent disorder with the new percent disorder obtained from cleaved peptides, and found similar results. For individual proteins, investigators need to consider carefully whether they are interested in the disorder in the precursor polyprotein or, as is more usually of interest, in the mature post-cleavage products, and it is important to ensure that the predictions are carried out on the appropriate state, since disorder is greater at protein termini. However, for this overall survey it is unlikely that the true cleavage states of all protein products are known. Our overall analysis will tend towards under-predicting the disorder of proteins found within polyproteins, since in general incomplete viral genome analysis under-predicts cleavage. However, the overall analysis appears insensitive to the minor biases introduced by this under-prediction.

Effect of alternative approaches to estimating disorder
We noted that short and long disorder predictions from IUPRED gave highly correlated results (Supplementary Figure S13). There was a reasonable correlation with predictions from an alternative method, E-spritz [10], which is a machine learning predictor trained on known datasets, in contrast to the biophysical prediction of IUPRED (Supplementary Fig. S13). While we focused on predictions for individual residues, it must be noted that disorder predictions are strongly correlated with the predictions for adjacent residues, so that many of these residues fall into regions of more extended disorder. Supplementary Table S4 indicates that estimating a viral proteome's disorder propensity based on regions rather than on residues generated results that were broadly correlated. We opted to focus on residue predictions, to avoid the increase in sampling variance created by surveying the presence or absence of larger regions across very short proteins and smaller proteomes.
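The distinction between residue-based and region-based estimates can be made concrete with a short sketch. It assumes per-residue scores between 0 and 1, with residues scoring above 0.5 called disordered (the conventional IUPred cut-off); the minimum region length and the function names are hypothetical parameters chosen for illustration and are not the settings used for Supplementary Table S4.

```python
def percent_disorder_residues(scores, threshold=0.5):
    """Percentage of residues whose score exceeds the threshold."""
    if not scores:
        raise ValueError("empty score list")
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


def disordered_regions(scores, threshold=0.5, min_length=30):
    """Half-open (start, end) index pairs for runs of consecutive
    disordered residues that are at least min_length long."""
    regions = []
    start = None
    # A sentinel below any threshold closes a run that reaches the C-terminus.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions


def percent_disorder_regions(scores, threshold=0.5, min_length=30):
    """Percentage of residues that lie inside sufficiently long regions."""
    if not scores:
        raise ValueError("empty score list")
    covered = sum(end - begin for begin, end in
                  disordered_regions(scores, threshold, min_length))
    return 100.0 * covered / len(scores)


if __name__ == "__main__":
    # Toy profile (hypothetical): a long disordered tail and a short
    # disordered stretch that the region-based measure ignores.
    scores = [0.2] * 50 + [0.7] * 5 + [0.3] * 20 + [0.8] * 40
    print(percent_disorder_residues(scores))
    print(percent_disorder_regions(scores, min_length=30))
```

For a short protein, a single stretch falling just above or below the minimum length switches the region-based value between zero and a large percentage, which illustrates the extra sampling variance mentioned above.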

Which proteins display strong disorder?
We selected the most disordered proteins from each family in order to sample and illustrate typical potential roles and functional relevance. We looked among viruses longer than 10kb (all of which are dsDNA and ssRNA) at all viral proteins with more than 200 residues and with high disorder (D > 50%). Table 4a shows the most disordered protein identified in each family. These proteins are discussed further below.
C4 from Callitrichine herpesvirus 3 is predicted to be 100% disordered. This viral protein was identified in this marmoset virus, but displayed no sequence homology to previously identified proteins [11]. This is not unusual for disordered proteins, which often evolve much faster than ordered proteins. It does present a particular challenge in elucidating the role of these proteins in viruses, since homology to ordered domains often gives insights into their functional roles. The Porcine adenovirus A protein 22K (DUF2890) has homologues across many adenoviruses, but the role of this disordered protein has not been established. While 100% predicted disorder for the herpesvirus protein is clearly high, the average disorder for Herpesviridae (Table 2) is also relatively high, at 17.9%. The observation that another protein (CPXV136) from the large Cowpox virus is 73% disordered is particularly striking, given that the mean disorder for this genome is only 5.6% (Table 2). It will be very interesting to determine whether this protein interacts with host components.
The D protein of human parainfluenza virus 3 is derived from an internal alternative reading frame within protein P [12], so it is possible that the level of disorder may have been affected by the coding constraints of the P gene. The 64.6kDa ascoviral protein (Table 4) shows a strong amino acid compositional bias, being highly basic [13]. This is likely to relate to the particular, and unidentified, function of this protein. The Lymantria dispar MNPV mucin-like protein (putatively involved in horizontal gene transfer [14]) identified, but such a highly disordered protein may play a role in interaction with its host bacterium.
Interestingly, this protein displays suggestive similarity to a protein from its host (27% identity over 162 residues to the protein BURPS1710b_2246 from Burkholderia pseudomallei), raising the possibility that this disordered protein may have been hijacked from host to phage, or vice versa. As these phages have been investigated as therapeutic treatments for bacterial infection [15,16], it may be of interest to determine the role of this disordered phage protein. Two of the predicted highly disordered proteins from larger viruses encoded collagen-like proteins (Lymphocystis disease virus 1, accession NP_078660; and Acanthamoeba polyphaga mimivirus YP_142550). If these proteins do indeed form collagen-like triple-helical structures, they are clearly not disordered proteins while forming that structure. The fact that the mimivirus encodes enzymes capable of hydroxylating lysine increases the likelihood that this is indeed the case, and it has been speculated that the collagen-like fibres may form part of the fibre layer surrounding the Mimivirus particle [17]. If this is the case, such proteins are not strictly disordered in their typical state, and therefore this represents an incorrect prediction of the IUPRED method, and we have accordingly omitted them from Table 4. However, it is worth considering that these proteins may also perform biological functions in their monomeric disordered states.
While disordered proteins from the larger viruses discussed above may benefit from the availability of greater genomic space in which they can evolve and acquire functions, we found that the smaller viruses showed a similar range of functions for their longer disordered proteins (Table 4b). The overlapping/movement protein from Turnip yellow mosaic virus was particularly long and disordered, given the smaller genome size (L = 628, D = 92%). Movement proteins are encoded by many plant-infecting viruses, and are important because they allow movement from the initially infected cell to neighboring cells [18]. The E4 protein of Human Papillomavirus has been found to harbor cyclin-binding motifs, supporting a role for this disordered protein in interacting with the host proteome to support genome amplification [19]. NP1 of Bocavirus is also important in DNA replication [20]. A similar nuclear role is implicated for human Torque Teno Virus ORF2/2, which has been postulated to play a role in binding nucleic acids during either transcription or replication [21,22]. ORF-X of Finch polyomavirus is an apparent accessory protein not present in other polyomaviruses [23], whose role is as yet unclear.
Thus, disordered proteins from both small and large genome viruses can play roles in viral structure and function, including interactions with host proteins.