Markov Models of Amino Acid Substitution to Study Proteins with Intrinsically Disordered Regions

Background Intrinsically disordered proteins (IDPs) or proteins with disordered regions (IDRs) do not have a well-defined tertiary structure, but perform a multitude of functions, often relying on their native disorder to achieve the binding flexibility through changing to alternative conformations. Intrinsic disorder is frequently found in all three kingdoms of life, and may occur in short stretches or span whole proteins. To date most studies contrasting the differences between ordered and disordered proteins focused on simple summary statistics. Here, we propose an evolutionary approach to study IDPs, and contrast patterns specific to ordered protein regions and the corresponding IDRs. Results Two empirical Markov models of amino acid substitutions were estimated, based on a large set of multiple sequence alignments with experimentally verified annotations of disordered regions from the DisProt database of IDPs. We applied new methods to detect differences in Markovian evolution and evolutionary rates between IDRs and the corresponding ordered protein regions. Further, we investigated the distribution of IDPs among functional categories, biochemical pathways and their preponderance to contain tandem repeats. Conclusions We find significant differences in the evolution between ordered and disordered regions of proteins. Most importantly we find that disorder promoting amino acids are more conserved in IDRs, indicating that in some cases not only amino acid composition but the specific sequence is important for function. This conjecture is also reinforced by the observation that for of our data set IDRs evolve more slowly than the ordered parts of the proteins, while we still support the common view that IDRs in general evolve more quickly. The improvement in model fit indicates a possible improvement for various types of analyses e.g. de novo disorder prediction using a phylogenetic Hidden Markov Model based on our matrices showed a performance similar to other disorder predictors.


Introduction
Contrary to the traditional sequence-structure-function paradigm, the function of a protein is not determined solely by its stable 3D structure. Today it is known that naturally unfolded or so called intrinsically disordered proteins (IDPs) fulfill a multitude of functions, such as signaling and regulation. While some proteins are completely unstructured, others may contain only short disordered regions. Current estimates suggest that more than 30% of eukaryotic proteins contain long intrinsically disordered regions (IDRs), but IDRs are also frequently found in prokaryotes [1]. According to estimates from Ward et al. [2], on average 10% of proteins are fully unstructured, while half of all proteins contain at least one long IDR. IDPs (or proteins with IDRs) often depend on structure instability for their function [3][4][5]. The absence of a stable 3D or secondary structure makes IDPs more flexible when binding and forming protein complexes, providing important advantages over ordered proteins [6]. Compared to ordered proteins, IDPs often participate in molecular recognition, signaling processes, cell-cycle regulation and modulating gene expression or chaperone activity [7]. Due to their flexibility, IDPs are more resistant to perturbations in the molecular interactions environment and tend to act as hubs in molecular interaction networks [8]. Proteins with IDRs are increasingly associated with diseases such as cancer and neurodegeneration [7]. For example, the CREB transcription factor is crucial in neuronal plasticity and long-term memory formation in the brain; malfunctions of CREB may contribute to the development of Huntington's disease and some types of cancers. Other famous examples include prion protein and tumor suppressor proteins p53 and BRCA1 [7,9].
To date most studies contrasting the differences between ordered and disordered proteins focused on simple summary statistics, such as sequence complexity and amino acid composition [10,11]. For example, regions of low sequence complexity are likely to be disordered [10]. IDRs usually have few large hydrophobic residues but favor polar and charged amino acids. Such sequence composition properties are often used by computational methods of disorder prediction (see [12]).
Brown et al. [11] estimated separate Markov amino acid substitution models for ordered and (wholly) intrinsically disordered proteins at three levels of sequence similarity. These models were used to compare amino acid frequencies and average rates of evolution. The authors concluded disordered proteins having a generally higher rate of evolution than ordered. Midic et al. [13] published a scoring matrix for the alignment of protein sequences with disordered regions. This study also confirms a higher rate of evolution in IDRs and shows differences in amino acid substitution patterns between ordered and disordered parts of proteins.
Here, we take an evolutionary approach and study multiple sequence alignments of homologous proteins with IDRs using Markov amino acid substitution models in the maximum likelihood (ML) framework. Based on a large set of homologous groups with experimentally annotated IDRs, we estimate two empirical amino acid substitution models, each describing the evolution either in ordered or disordered regions. An expectationmaximization (EM) algorithm is used to obtain ML estimates of model parameters [14]. A new method is suggested to evaluate whether components of two inferred substitution models are significantly different. This test shows that our models are indeed significantly different and capture the essential features of ordered and disordered regions. As using the new model with a priori known IDRs significantly improves the fit to data, the new model may be recommended for other downstream evolutionary analyses. For example, using the new two component orderdisorder model we define a phylogenetic Hidden Markov Model (phylo-HMM) and apply it as a de novo predictor of intrinsic disorder in multiple sequence alignments of homologous proteins. Our predictor demonstrates the potential of achieving a competitive accuracy-power balance compared to other disorder prediction methods.
Further, the estimated empirical models were used to contrast patterns and rates of the evolution in disordered and the corresponding ordered protein regions. It is typically thought that disordered regions are in general faster evolving than structured regions of the same protein. While we confirm that IDRs tend to have a higher rate of evolution compared to the rest of the protein, we find a significant number of protein groups where the reverse is the case. We present examples of proteins where the evolutionary rates at IDRs are significantly slower (or faster) compared to the corresponding ordered regions. Finally, we discuss other properties of IDPs such as their distribution among functional categories and biochemical pathways and their preponderance to contain repeats in tandem (another important property correlating with enhanced protein binding [15]).

Assembling homologous protein groups with IDRs
Previous analyses [16] relied on the computational prediction of intrinsic disorder [13,[17][18][19][20][21]. Here, we decided not to use computational prediction methods. One one hand this drastically reduces the amount of data available for model estimation. On the other hand we avoid introducing unforeseeable biases due to prediction inaccuracies.
Instead our analyses were based on the DisProt database [22] as a starting point of data acquisition. This database comprises about 500 proteins annotated with a total of about 1000 experimentally verified intrinsically disordered regions.
DisProt was scanned for the presence of homologous proteins using the BLASTCLUST program, which finds pairs of sequences with statistically significant matches (using the BLAST algorithm) and groups them based on single-linkage clustering. This program is part of the BLAST suite [23]. We found that the DisProt database contains very few homologs. Consequently, we expanded the set of IDPs through further searches for homologous proteins in SwissProt and Pfam-based PANDIT databases. The similarity threshold was set to be sufficiently stringent to assume structural homology so that disorder annotations could be propagated to all homologous positions. A more detailed description of this procedure is provided below.

PANDIT data set
Each DisProt sequence entry was mapped to a representative homologous group in the PANDIT database [24] based on pairwise local alignment [25] score with BLOSUM62 [26]. The score threshold was set to 100 which corresponds to an E-value of 10 {6 . PANDIT consists of Pfam protein families [27] together with multiple sequence alignments and inferred phylogenetic trees based on protein-coding DNA and amino acid sequences.
When multiple DisProt sequences mapped onto a single homologous group in PANDIT, the group was successively bisected by its longest branch until the mapping became injective, so that there was only one disorder annotation per group. Groups with no mapping or with v3 taxa were discarded. The corresponding multiple sequence alignments were restricted to the homologous sites as determined by the pairwise alignments to the reference sequence from DisProt. To avoid noise in the matrix estimation, distant sequences were filtered out based on the alignment score. The final set contained 223 homologous groups with a total of 1805 sequences with 54233 disordered and 254308 ordered residues.
The PANDIT data comprises a set of reliable alignments but due to its limited size it imposes considerable uncertainty on model estimation. Thus we use this data set mainly for verification of our results and for functional analyses.

SwissProt data set
To improve the reliability of our model estimation, we constructed a second larger data set of homologous protein groups with IDRs based on the SwissProt database [28]. For each DisProt entry an initial homologous group from SwissProt sequences was built from pairwise alignments. Multiple sequence alignments were constructed from pairwise homologies and trimmed to sites present in the reference sequence. Further, the groups were refined by removing distant sequences so that each sequence had a distance v100 PAM to the reference sequence. The resulting data set included 373 homologous protein groups with a total of 15490 sequences with 1043845 disordered and 3986493 ordered residues and was used as the main source for the estimation of our Markov model of evolution. To overcome potential biases due to errors in the group-wise multiple sequence alignments and estimated phylogenies, we compare the separate model estimates for both data sets.

The estimation of Markov amino acid substitution models
The evolution of amino acids was described by a Markov process with the generator matrix Q~(q ij ) defining the instantaneous rates of changes from amino acid i to j. As usual, the substitution process was assumed to be reversible so that p i q ij~pj q ji , where p i are the equilibrium amino-acid frequencies. For a reversible process the instantaneous rates of change from i to j can be expressed as q ij~sij p j , a product of equilibrium (or stationary) amino acid frequency p j and the exchangeability s ij between residues i and j. We further refer to the matrix S~(s ij ) as the amino acid exchangeability matrix. For a multiple sequence alignment the substitution process flows along a phylogeny relating the sequences in a sample. The transition probability matrix over time t is computed as P(t)~e Qt . On this basis a likelihood function can be constructed for each site for a given tree. The total likelihood of the alignment is calculated as a product of site likelihoods based on the site-independence assumption (for computational reasons).
We estimated separate amino acid substitution models for ordered and disordered regions, each described with instantaneous substitution matrices D and O, respectively. Overall, the mixed DO model describes the evolution of a priori annotated IDRs using matrix D, while structured regions are described using matrix O.
Model parameters were estimated using an EM algorithm [14] on our two assembled training sets. The EM approach finds the ML estimates of substitution model parameters, with the substitution histories and counts being unobserved latent variables. The EM iteratively estimates parameters and latent variables in an alternating manner until convergence. Each model (both for ordered and disordered regions) required estimating 190 exchangeability and 19 amino acid frequency parameters.
For the training set based on PANDIT groups we used phylogenies provided by the PANDIT database. For the SwissProt data set phylogenies were built using PhyML3.0 [29] with LGzC 4 zI [30,31], thereby estimating evolutionary rates per site.
We followed the procedure described by Le et al. [30] and separated the alignment columns by their most likely rate class, as estimated by PhyML, to normalize for among-site heterogeneity of evolutionary rates.

Evaluating the significance of differences between models estimated for ordered and disordered regions
The significance of differences in estimated amino acid frequencies was evaluated using two likelihood-ratio tests on the estimated amino acid counts computed by XRate [14]: Pearson's x 2 -test and the G-test. Both tests compare the null hypothesis that the two count vectors arose from a common distribution against the alternative hypothesis where each vector originates from a distinct distribution. Similarly, exchangeability rates were compared using the estimated substitution counts.
In addition to Pearson's x 2 and the G tests, confidence intervals for model estimates were computed by a bootstrapping technique. For each homologous group replicate data sets were generated by bootstrap on alignment columns and by jackknife on rows. For each replicate, substitution models for ordered and disordered regions were re-estimated using the EM-based procedure identical to that applied to the original data. The resulting distributions of model estimates were used to estimate empirical variances for exchangeabilities and amino acid frequencies ( Figure 1).
In particular, we investigated whether the IDRs may be characterized only by the bias in amino acid composition, or if a bias in exchangeability between different classes of amino acids (order and disorder promoting) may also be observed. To achieve this we computed the substitution rates between order and disorder promoting residues for ordered and disordered regions separately: where S O~f C,F ,I,L,V ,W ,Y g and S D~f A,E,G,K,P,Q,R,Sg are the sets of order promoting and disorder promoting amino acids, respectively. In order to compare these terms between ordered and disordered regions we normalized the terms by frequencies of occurrences of amino acids in sets S O and S D : which rendered these terms independent of the target and source amino acid frequencies. To detect bias in exchangeabilities with regard to order and disorder promoting residues, we compared the ratios s S D =s S O between ordered and disordered regions.

Comparison of evolutionary rates in ordered and disordered regions
To compare average rates of evolution we computed the groupwise total tree lengths (sum of branch lengths) for the SwissProt data set from pairwise distances and least-squares distance trees estimated with Darwin [32], because for a given set of taxa tree lengths are expected to be proportional to the average rates of evolution. We will refer to the estimated evolutionary rates as r D for IDRs and r O for ordered regions.
The ordered and disordered portions of multiple alignments were bootstrapped separately, and the significance was computed with the Mann-Whitney-U-Test.

Prediction of IDRs using phylogenetic Hidden Markov Models (phylo-HMMs)
The estimated empirical models for ordered and disordered regions may be used to define a phylo-HMM for predicting IDRs. We applied XRate [14] in annotation mode to obtain a prediction of order/disorder for each alignment column in the testing set compiled from PANDIT. This was done using the model estimates for order and disorder trained on either the PANDIT or the SwissProt sets. The HMM consists of 4 hidden states: start, end, and states for emitting ordered and disordered alignment columns. The emission probabilities are defined by the estimated evolutionary model and the transition probabilities were trained simultaneously from data. To correct for the differences in group size we divided the error statistics by the corresponding number of sequences in the homologous group.
The quality of prediction of intrinsic disorder for our phylo-HMM was compared with the quality of two sequence-based disorder predictors: VSL2 [20] and iupred [21]. VSL2 was used with two different parameter sets. One version of VSL2 uses auxiliary information from PSI-Blast PSSM and PSI-Pred secondary structure prediction, while the ''fast'' version is executed without this additional data. Iupred was used with its ''long'' and ''short'' presets.

Results
The new DO model requires twice as many parameters to be estimated from data compared to a standard empirical amino acid model that does not distinguish between order and disorder. Despite this, the model significantly improved the model fit to data with a priori annotated IDRs. For example, for the SwissProt data set the AIC decreased by 1916 (with an increase in log-likelihood of Dl~1167). Consequently, we used the DO model to analyze differences between ordered and disordered regions in terms of amino acid composition and exchangeabilities, evolutionary rates, and content of tandem repeats. We also tested whether the two components of DO may be used for disorder prediction from multiple alignments of homologous protein sequences.

Comparison of model estimates
We compared the ML estimates for the disorder and order components of the DO model in the PANDIT and SwissProt data sets. The majority of the ML estimates were similar between the two data sets. Only for a few amino acids the estimated equilibrium frequencies differed significantly between the data sets, based on variances estimated with bootstrap and jackknife resampling (Supplementary Figure S1). This may reflect a heterogeneity of gene and lineage composition. On the other hand, no significant differences were observed between exchangeability estimates in the two training sets. Such stability of our estimates is reassuring. The uncertainty in model estimates for the SwissProt training set was lower compared to that for the PANDIT set. Moreover, we observed a lower variance in the estimates for the ordered regions compared to the disordered model. This may be explained by the amount of data available for each estimation, since the SwissProt set is larger than the PANDIT set, and since we have more residues in ordered regions compared with IDRs. Consistent with this explanation, we observe high variance in exchangeabilities between rare amino acids.
Next, we contrasted model estimates for ordered and disordered regions in the SwissProt data set. The estimates of model parameters for IDRs are shown in figure 2 and can be downloaded as supplementary datasets S3 and S4 in a format compatible with PAML [33]. The amino acid equilibrium frequencies and exchangeabilities are displayed and compared separately. The components of O and D matrices for ordered and disordered models were found to be significantly different based on Pearson's x 2 -test and the G-test (pv0:01 for both tests) applied to estimated substitution counts.
Based on the estimates of amino acid frequencies for ordered and disordered regions (Figure 2), we observed that orderpromoting amino acids I, L, V (large and hydrophobic) and W, Y, F (aromatic) appeared in IDRs at a lower frequency. In addition, IDRs contained a low frequency of the non-polar amino acid C. On the other hand, IDRs contained high frequencies of disorder-promoting amino acids: positively charged R and K, polar E and Q, and small A, G, S and P. Our estimates of amino acid frequencies were largely in agreement with other empirical observations [10,11]. Our observations held for both data sets with only minor differences.
We clearly observed that IDRs are enriched with disorderpromoting amino acids while ordered regions are enriched with order-promoting amino acids. Further, significant differences in the amino acid exchangeability patterns between the models inferred for ordered and disordered regions were found (Figure 1).
In IDRs we observed relatively fewer substitutions between disorder promoting residues compared to ordered regions (s SD =s SO~0 :41 in IDRs and 0:66 in ordered regions). In addition, in IDRs the exchangeability rates are higher between order-promoting residues, whereas in the ordered regions the exchangeability rates tend to be higher between disorderpromoting residues and between residues from the two classes (order or disorder promoting). Thus, it may be concluded that IDRs are characterized not only by the compositional bias but also by exchangeability biases between the classes of order and disorder-promoting residues.

Performance of HMMs for de novo disorder prediction
Using a test data set compiled from PANDIT, we compared the performance of our phylo-HMM based disorder predictor with two well established sequence-based predictors. Table 1 summarizes the numbers of correctly and incorrectly annotated sites with different methods tested. In our tests, VSL2 exposed the best performance in marking sites as disordered, while Iupred was too conservative, annotating too many sites as ordered. According to precision and recall values (Table 1), our phylo-HMMs outperformed the simple sequence based Hidden Markov Models based only on amino acid frequencies and yield results comparable to iupred and VSL2. It should be noted that VSL2 and iupred preformed similar or even better on the test set compared to predictions on DISPROT (results not shown).

Comparison of evolutionary rates in ordered and disordered protein regions
It is typically thought that IDRs evolve at a higher rate compared to proteins with stable 3D structure [11,34]. Here, we tested this consensus view using the rate estimates from our inferred models.
For about 60% of the PANDIT data set the evolutionary rates were significantly higher in IDRs compared to rates in ordered regions ( Table 2). For example, the tumor suppressor protein p53 (PF00870) was found to evolve significantly faster in its IDR (r D =r O~4 :2). Another example from this class is discussed below in more detail. However, for 25% of our homologous groups the estimated rate of evolution in IDRs was significantly lower than in respective ordered regions (pv0:05). The full results are available in supplementary dataset S1 (r D wr O ) and S2 (r D vr O ). This may indicate that the contribution of IDRs to the overall function of the protein may vary significantly, and which is confounded with a multitude of other factors including the properties of the primary sequence.
Further we explored the distribution of functional categories among groups with either significantly higher or lower evolutionary rates between ordered and disordered residues. For this task we used the PANDIT dataset since the information on functional categories (GO [35] terms) and biochemical pathways (KEGG [36]) was already available from PanditPlus [37]. For each protein group we parsed GO terms from the highest hierarchical level down to collect all relevant ancestral terms. The class with higher rate in IDRs (r D wr O ) was enriched with proteins from the functional categories 'nucleotide binding' (p~0:0162, p-values before multiple testing correction), and especially 'adenyl nucleotide binding' (p~0:0423) and 'ATP binding' (p~0:0569).
In the other class with r D vr O the cellular component 'membrane part' (p~0:0241) and the biological process 'regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process' (p~0:0671) were overrepresented. Due to the small number of homologous groups we had, these were the only terms close to a significant level. However, functional categories such as binding and regulation  were reported to be enriched in other studies, which used disorder predictors for such analyses [38,39]. For KEGG pathways we found that proteins with less conserved IDRs (r D wr O ) tend to be involved in 'proteasome' (p~0:0370),'apapoptosis' (p~0:0569) and 'colorectal cancer' (p~0:0906) pathways. Proteins with conserved IDRs (r D vr O ) were found to be highly significantly overrepresented in 'tyrosine metabolism' (p~0:00765) as well as 'pyruvate metabolism' (p~0:0242), 'valine, leucine and isoleucine degradation (p~0:0188), 'urea cycle and metabolism of amino groups' (p~0:0188), '1-and 2-methylnaphthalene degradation' (p~0:0290), 'fatty acid biosynthesis' (p~0:0290), and '3-chloroacrylic acid degradation' (p~0:0241) pathways. Interestingly, according to KEGG most of these pathways fall into the same larger category or/and are related.
Note that the estimates of tree lengths were robust to the model choice (estimates for new model DO and LG differed by only +5%), and thus had little influence on our conclusions regarding the comparisons of the evolutionary rates r D and r O (Table 3).

Relationship between intrinsic disorder and protein repeats in tandem
It has been suggested that about 50% of the protein regions with tandem repeats may be intrinsically disordered [40][41][42], implying a higher incidence of IDRs in proteins with tandem repeats compared to their average frequency among all proteins. Here, we examined whether the reverse observation may be made, i.e. proteins with IDRs are more likely to contain tandem repeats. To assess whether IDPs are enriched with tandem repeats, we examined the frequency of tandem repeats in our homologous groups. For each group in our SwissProt data set and for each DisProt sequence, protein repeats were detected using a recent algorithm based on a k-means clustering approach [43]. We found that 69% (362=522) of the sequences in DisProt contained predicted repeats and that 76% (285=373) of our SwissProt groups contained at least one sequence with tandem repeats. This is significantly higher than what is typically observed among all proteins (reportedly 9% in SwissProt and 14% in a GenBank-based protein census [44]).
This analysis demonstrated that tandem repeats tend to occur more frequently in intrinsically disordered regions (pv0:05; Table 4). Furthermore, in our data proteins with tandem repeats tended to have higher rates of evolution in IDRs (r D wr O ) more frequently compared to proteins without tandem repeats ( Table 4).

Examples of proteins with IDRs
r D wr O in mouse SOCS3. Significantly higher rate of evolution in the IDR compared to the ordered portion of the protein was found in the homologous protein group constructed for the DISPROT sequence DP00446. This protein is a suppressor of cytokine signaling (SOCS3) in mouse. The IDR between the SH2 domain and the C-terminal SOCS box (Figure 3) is believed to be a PEST-like sequence and is not required for primary function (phosphothyrosine binding) [45]. Instead it is likely to have an enhancing effect in protein degradation.
We conducted a more thorough analysis of this protein and assembled a superset of this homologous group from the OMA project [46]. By doing so we obtained a group of 27 sequences and a total alignment length of 225 amino acids with 36 disordered columns. Babon et al. [45] note that the IDR of this protein is highly conserved in mammals. Despite this, our analysis confirmed a significantly higher substitution rate in the IDR compared to the rest of the protein. For this data set the estimated average tree length measured in expected substitutions per site was 1:2 in ordered regions but 8:4 for the IDR, with highly significant Mann-Whitney-U-Test. Further, protein-coding DNA sequences were analyzed using codon models with variable selection pressure over sites (models M0, M1, M2, M3, M7 and M7 in PAML [33]). No positive selection was detected on this protein, but the purifying selection pressure was less stringent in the IDR compared to the ordered part of the protein -the trend consistent with our observation of r D wr O .
r D vr O in rat GNMP. The Glycine N-methyltransferase (GNMP) is an example of a protein where the rate of evolution is significantly lower in the IDR compared with the ordered regions of the protein -contrary to the predominant view. This protein creates a tetrameric complex shaping a molecular basket. The 40 unstructured N-terminal residues of each subunit regulate access to the active site by filling the core of this basket (Figure 4). In presence of AdoHcy these IDRs unclog the core and give access to the active site [47].
GNMP is involved in the following GO biological processes: 'adenosylhomocysteine metabolic process', 'S-adenosylmethionine metabolic process', 'folic acid metabolic process', 'protein homotetramerization'. Further it is annotated with the molecular functions 'folic acid binding', 'glycine N-methyltransferase activity', and 'glycine binding'. The protein is part of the KEGG pathway 'Glycine, serine and threonine metabolism'.
Similar to SOCS3, we expanded the original homologous group containing GNMP (around the DisProt sequence DP00031) with additional sequences from OMA. Thus this group was extended  from 5 (originally) to 43 taxa and a total alignment length of 292 amino acids with 40 disordered columns. The analysis of the extended dataset resulted in the estimated average tree length of 4:7 for the ordered regions versus 2:0 for the IDR, with highly significant Mann-Whitney-U-Test. The analyses with codon models (as for SOCS3) reported higher purifying selective pressure in the IDR. This again confirms our previous result suggesting that the IDR in GNMP protein is more conserved although it does not contain the active site.

Discussion
Here we estimated an empirical Markov amino acid substitution model for IDRs and ordered regions of proteins, which provided a significant improvement in model fit to data (as measured by AIC).
Based on the a priori annotated alignments, the mixed DO model succeeded at detecting several significant distinctions between evolutionary patterns in IDRs and the corresponding structured parts of the protein. First, the stationary amino acid distribution was found to be significantly skewed towards disorder promoting amino acids, which confirmed previous empirical observations [10,11]. Moreover, the exchangeability rates in IDPs were also biased, with significantly higher rates between order promoting residues. At the same time, the exchangeability rates for other types of changes were lower compared to what was observed in ordered regions. Probably, in IDRs disorder promoting amino acids are under higher functional constraints than order promoting residues. As a result, the DO model may better reflect the biological reality for IDPs and therefore may improve the accuracy of inferences for various types of analyses, such as maximum likelihood phylogeny inference with mixture models [48], ancestral reconstruction, and sequence alignment. As an example, we used our model to construct a phylo-HMM to predict intrinsic disorder from a multiple sequence alignment of IDPs based on the difference in evolutionary patterns. The phylo-HMM based on the estimated models was shown to be competitive compared with other sequence-based predictors. Combining this approach with the use of summary statistics, such as energy calculations or the inclusion into a meta-predictor may improve   The first 2 columns contain numbers of ordered or disordered characters in DisProt which are predicted to be inside or outside of tandem repeats. Tandem repeats are significantly more frequent in disordered regions (pv0:001). The last 3 columns represent a comparison of evolutionary rates between ordered and disordered columns in the SwissProt data set restricted to groups with or without tandem repeats, respectively. Each cell contains the number of homologous groups which pass a test of significance at pv0:05 or the number of those with indistinguishable rates. doi:10.1371/journal.pone.0020488.t004 the prediction even further. One limitation of our study was the relatively small sample of IDPs that are currently known to contain structural disorder based on experimental work. Recently however, the natively unfolded proteins have been under spotlight [49]. The increased attention to IDPs is likely to increase the amount of structural information available for this kind of proteins. Larger sets of homologous alignments of IDPs may be used in the future to re-estimate the two empirical components of our DO model. Such new estimates should be more accurate and have smaller variances. However, the composition of IDRs may depend on their relevance to the function of the protein and the specifics of performed function. Given sufficient data, different DO models may be estimated for different classes of IDPs, where IDRs play different functional roles. At the moment this is not foreseeable due to lack of both structural and functional data. Our analyses suggest that for the majority (59%) of IDPs the unstructured regions are indeed less conserved than the rest of the protein, as is typically thought. However, this is not a general rule and many exceptions exist. For 15% of IDPs the rates of evolution in IDRs and ordered regions were not significantly different. Moreover, a large proportion of IDPs in our set (27%) had higher conservation in their disordered parts, contradicting the common view (one such example, the GNMP protein, was presented above). Our functional enrichment analyses of this protein class showed that IDPs with r D vr O tend to be involved in pathways responsible for amino acid and carbohydrate metabolism or related. In particular, the amino acid metabolism function involved the metabolism of order promoting residues (but not disorder promoting) -namely Tyrosine, Valine, Leucine and Isoleucine. We hypothesize that this may be related to our observation of higher exchangeability rates between orderpromoting residues in IDRs compared to ordered regions. Overall, IDPs with slower evolving IDRs (compared to their structured parts) seem to exhibit a preferential involvement in certain biochemical pathways. Indeed, proteins whose IDRs are directly involved in function, or are crucially important for function, may be expected to evolve more slowly due to additional functional constraints. In addition, IDRs abundantly found in alternatively spliced regions [50,51] may evolve slower with respect to other regions due to additional constraints for functional proteins in different alternative frames.
A recent study [42] found that tandem protein repeats are enriched with IDRs. Here, we found that the reverse statement also may be made, i.e., proteins with IDRs are enriched in tandem repeats. So the presence of tandem repeats in a protein should have strong correlation with the presence of IDRs. This supports the theory that at least some of IDRs originate via repeat expansion [40]. This evolutionary mechanism provides a means of interactome scaling, where certain nodes in the interaction network increase their fitness by incorporating intrinsic disorder and repeats [9]. Sandhu [52] is also supportive of this view in his study of chromatin remodeling proteins that frequently contain IDRs. The IDRs resulting from repeat expansion may enable reversible binding to different interacting partners, which overall contributes to functional diversity and specialization of chromatin remodeling complexes. Moreover, Jorda et al. [42] found that the level of repeat perfection correlates with the amount of intrinsic disorder. If the repeat perfection is representative of recent evolutionary origin (rather than due to functional importance), then this finding is in a perfect agreement with the hypothesis that repeat expansion drives the origin of new IDRs. With time the repeat perfection should be decreased, especially that in our study we found that most IDPs with repeats evolve significantly faster in their IDRs compared to the structured regions. This may be also indicative that IDPs with tandem repeats fall into particular functional classes, a premise that should be studied when more structural and functional data (especially on IDPs) becomes available.