We have constructed the clustered Protein Data Bank and obtained clusters of chains at different levels of identity within each cluster (http://bioinfo.protres.ru/st_pdb/). Using simple selection rules, we have compiled the largest database of disordered patterns (141) from the clustered PDB in which identity between chains inside a cluster is larger than or equal to 75% (version of 28 June 2010). The results of these analyses should help to further our understanding of the physicochemical and structural determinants of intrinsically disordered regions that serve as molecular recognition elements. We have analyzed the occurrence of the selected patterns in 97 eukaryotic and 26 bacterial proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. We have calculated the matrix of correlation coefficients between the numbers of proteins in which a disordered pattern from the library of 141 disordered patterns appears at least once, for 9 kingdoms of eukaryota and 5 phyla of bacteria. As a rule, the correlation coefficients are higher within a kingdom than between kingdoms. The patterns that occur most frequently in proteomes have low complexity (PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, QQQQQP), and the types of patterns vary across proteomes (http://bioinfo.protres.ru/fp/search_new_pattern.html).
Citation: Lobanov MY, Galzitskaya OV (2011) Disordered Patterns in Clustered Protein Data Bank and in Eukaryotic and Bacterial Proteomes. PLoS ONE 6(11): e27142. https://doi.org/10.1371/journal.pone.0027142
Editor: Niall James Haslam, University College Dublin, Ireland
Received: June 28, 2011; Accepted: October 11, 2011; Published: November 4, 2011
Copyright: © 2011 Lobanov, Galzitskaya. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The work was supported by the Russian Foundation for Basic Research (grant № 11-04-00763), Russian Academy of Sciences (programs “Molecular and Cell Biology” (01200959110) and “Fundamental Sciences to Medicine”), as well as a grant from the Federal Agency for Science and Innovations (#02.740.11.0295) and a grant from the Federal Agency for Science and Innovations (№16.512.11.2204). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Intrinsically disordered regions serve as molecular recognition elements and play an important role in the control of many cellular processes and signaling pathways. It is useful to be able to predict positions of disordered regions in protein chains. Prediction methods are aimed at identifying disordered regions through the analysis of amino acid sequences using mainly the physicochemical properties of amino acids or evolutionary conservation.
Many examples of proteins with intrinsically disordered regions that exhibit coupling between folding and binding have been described in the literature. Nevertheless, the universality of this phenomenon and the functional importance of many disordered regions remain unclear.
A database of continuous protein fragments (Molecular Recognition Features or MORFs) was compiled from the Protein Data Bank; it includes short protein chains (with fewer than 70 residues) bound to larger proteins. It has been argued that MORFs participate in the coupling of binding and folding, a hypothesis that was supported by the analysis of the composition and predicted disorder of MORF segments. A database of protein structures (ComSin) has been constructed to study the subtle structural differences of the same proteins in bound (Complex) and unbound (Single) states in relation to their intrinsic disorder.
Recently several computational tools for identifying linear motifs and minimotifs in protein-protein interactions have been published. Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure, whereas minimotifs are short functional peptide sequences obtained from the analysis of known protein-protein interactions.
Low-complexity regions attract our attention since they are regions of a protein in which a particular amino acid, or a small number of different amino acids, is enriched. Single amino acid repeats (homorepeats) belong to these regions. Homorepeats play important roles in some biological processes and may play a more important role in human diseases than was previously recognized.
In the current study we search for sequence patterns consisting of a number of consecutive residues along the polypeptide chain that are nearly always associated with disordered segments. It has been found that two types of patterns appear to be recurrent: a proline-rich pattern and a positively or negatively charged pattern. It should be noted that the old and new versions of our libraries include patterns enriched in proline and charged residues.
The statistical analysis of disordered residues was done considering 34 464 unique protein chains taken from the PDB database. In this database, 4.95% of residues are disordered (i.e. invisible in X-ray structures). The statistics were obtained separately for the N- and C-termini as well as for the central part of the protein chain. It has been shown that the frequencies of occurrence of disordered residues of the 20 types at the termini of protein chains differ from those in the middle part of the protein chain.
A clustered PDB is needed because it simplifies the filtering of protein structures for analysis and the search for general structural characteristics among non-identical proteins. It is also important for the analysis of up-to-date data.
In this work we constructed a clustered PDB and used clusters of protein chains in which identity between chains inside the cluster is at least 75% (version of 28 June 2010). Combining motif discovery with the identification of disordered protein segments in the clustered PDB allows us to create the largest library of disordered patterns. At present the library includes 141 disordered patterns. Such an approach is new and promising for further studying and understanding the functional role of the obtained patterns in different proteomes. Taking the library of disordered patterns into consideration should help improve the accuracy of predicting whether residues inside a given region are structured or unstructured. The previous version of the library included 109 disordered patterns and had restrictions on the minimal length of the patterns. Using simpler rules without a restriction on pattern length, and the clustered PDB of the same version, we constructed the largest library of disordered patterns.
The patterns occur more often as short fragments. Patterns four to six residues long are the most frequent (105 out of 141) among the disordered patterns of the library. It should be noted that six-residue patches affect the folding/aggregation features of proteins, and they are important “words” for the understanding of protein dynamics. Moreover, nucleation sites are constrained by patches of approximately six residues. There is evidence that the minimum length necessary for a peptide to elicit an allergenic response and molecular mimicry (a patch of a protein eliciting an immune response equivalent to the entire protein) is about six residues. All these facts suggest the existence of a fragment of biologically meaningful information located along approximately six residues.
With the library of disordered patterns taken into account, it would be easier to improve accuracy of prediction of ordered/disordered residues inside the given region.
Proteome-wide calculations place our work in a larger, evolutionary frame. In this paper we are interested in the occurrence of the 141 disordered patterns in 97 eukaryotic proteomes, since eukaryotic proteomes include more disordered regions than other proteomes, and, for comparison, in 26 bacterial proteomes. A comparative analysis of the number of proteins containing the 141 selected disordered patterns in these proteomes has been performed. The disordered patterns that occur most frequently in eukaryotic and bacterial proteomes have low complexity.
It should be noted that each proteome has a specific set of disordered patterns, and this results in different correlation coefficients between the numbers of proteins in which a disordered pattern appears at least once. After analyzing the occurrence of disordered patterns in 123 proteomes, we observed that the correlation coefficient is higher within a kingdom or a phylum than across kingdoms or phyla. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. One can suggest that such short similar motifs are responsible for common functions in nonhomologous, unrelated proteins from different organisms.
Materials and Methods
Construction of clustered PDB
We have considered all protein structures published in the PDB (version of June 28, 2010) that were determined by X-ray analysis with a resolution better than 3 Å and contain at least 40 amino acid residues; these structures comprise 116 997 protein chains (51 048 PDB entries). At the first step these 116 997 chains were divided into 34 464 classes. We call these classes clusters with 100% identity. This means that chains from the same cluster have the same amino acid sequence, whereas the sequences of chains from different clusters differ in at least one position. In total these 34 464 different sequences contain 9 085 893 residues. At the second step we created clusters of chains with identity inside each cluster ≥75%.
Identity is calculated by using equation (1), where I is the number of identical residues, and L1 and L2 are the numbers of amino acid residues in each of the two proteins considered. For the calculation of identity we used BLAST with default parameters.
At the beginning the pair of chains with maximal identity was combined, then another pair of chains, or a chain and a cluster, again with maximal identity, and so on. When a chain was combined with a cluster, or two clusters were combined, the average identity between them was considered. If the identity of at least one pair of chains from different clusters was less than 75%, the clusters were not combined. The procedure was repeated as long as there were clusters that could be combined. At the second step of grouping we obtained 18 775 clusters of chains with identity inside each cluster ≥75%. Then the clusters C75 were combined into clusters with identity ≥50%, etc. Figure 1 shows the dependence of the number of clusters on the identity between chains inside the cluster. In what follows we consider the identity level of 75% because most of the grouping occurs below 90% identity.
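The grouping procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `cluster_chains`, the toy identity table, and the use of fractions instead of percentages are hypothetical, and the pairwise identities are assumed to be precomputed (for example from BLAST alignments).

```python
from itertools import combinations

def cluster_chains(chains, identity, threshold=0.75):
    """Greedy agglomerative grouping (illustrative sketch).

    `identity(a, b)` is assumed to return the pairwise identity as a
    fraction in [0, 1]; it is a hypothetical callable, not the authors' code.
    """
    clusters = [[c] for c in chains]

    def average_identity(x, y):
        # Average identity over all cross pairs of two clusters.
        return sum(identity(a, b) for a in x for b in y) / (len(x) * len(y))

    def mergeable(x, y):
        # Clusters are not combined if any cross pair is below the threshold.
        return all(identity(a, b) >= threshold for a in x for b in y)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if not mergeable(clusters[i], clusters[j]):
                continue
            score = average_identity(clusters[i], clusters[j])
            if best is None or score > best[0]:
                best = (score, i, j)
        if best is None:          # nothing left that can be combined
            return clusters
        _, i, j = best            # i < j, so deleting j keeps i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

# Toy identities between four hypothetical chains.
toy = {("A", "B"): 0.98, ("A", "C"): 0.80, ("B", "C"): 0.70,
       ("A", "D"): 0.20, ("B", "D"): 0.20, ("C", "D"): 0.20}

def toy_identity(a, b):
    return toy.get((a, b), toy.get((b, a)))

result = cluster_chains(["A", "B", "C", "D"], toy_identity)
```

In the toy example chain C is 80% identical to A but only 70% identical to B, so it stays outside the A/B cluster, which illustrates the rule that a single pair below 75% blocks the merge.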
Construction of the library of disordered patterns
Among the 116 997 chains, approximately 4.5% of residues are disordered, i.e. not resolved by X-ray analysis. To reveal such residues, we compared (for each protein chain) the SEQRES and ATOM records in the corresponding PDB file. Residues that were present in the SEQRES records but had no coordinates of the Cα atom in the ATOM records were considered unstructured (disordered).
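As a minimal sketch of this comparison, the snippet below marks each SEQRES position as disordered or ordered. It assumes that the mapping between SEQRES positions and ATOM records has already been established, so that `resolved_positions` (a hypothetical name) holds the 0-based SEQRES indices of residues with Cα coordinates; parsing of the PDB file itself is not shown. The example sequence is invented.

```python
def disorder_mask(seqres, resolved_positions):
    """Return 'U' for residues without C-alpha coordinates and '-' otherwise."""
    return "".join("-" if i in resolved_positions else "U"
                   for i in range(len(seqres)))

def disordered_fraction(mask):
    """Fraction of positions marked as disordered."""
    return mask.count("U") / len(mask) if mask else 0.0

# Hypothetical chain: the first four residues are missing from the ATOM records.
seqres = "GSHMKTAYIAKQRQ"
resolved = set(range(4, len(seqres)))
mask = disorder_mask(seqres, resolved)
fraction = disordered_fraction(mask)
```

The 'U' and dash symbols follow the notation used for Figure 2.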
Below we consider only clusters with ≥75% identity between any pair of chains inside each cluster, because most of the grouping occurs below 90% identity. At this level of identity we have created the Clustered Disordered Residues Data Base (CDRDB), whose elements are 18 775 clusters of protein chains. Figure 2 illustrates two clusters with 100% identity combined into one cluster with 75% identity. One can see that the sequences from the two clusters differ at one position, 110, where serine is replaced by cysteine; the weights of the chains from the first and from the second cluster are given by equation (2). Analogously, the weight of each chain from any cluster is calculated using equation (3), where NC100 is the number of chains in the cluster with 100% identity and MC75 is the number of clusters with 75% identity. It should be noted that the sum of weights inside one cluster with 75% identity is equal to one. The weight of a residue is taken to be the same as the weight of its chain, because at the level of 75% identity a cluster may include protein chains of different lengths.
The sequences from the two clusters differ at only one position (110), where serine is replaced by cysteine. U denotes disordered residues in the chain and a dash denotes ordered residues.
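The weighting can be illustrated with a short sketch. The exact form of the weight is not reproduced in this text, so the code assumes w = 1 / (NC100 · M), where M is read as the number of 100%-identity clusters merged into the given 75%-identity cluster; this reading is an assumption, chosen because it makes the weights inside one 75% cluster sum to one, as stated above. The chain identifiers are hypothetical.

```python
def chain_weights(c75_cluster):
    """Assign a weight to every chain of one 75%-identity cluster.

    `c75_cluster` is a list of 100%-identity clusters, each a list of
    chain identifiers. Assumed formula: w = 1 / (N_C100 * M).
    """
    m = len(c75_cluster)              # number of 100% clusters merged together
    weights = {}
    for c100 in c75_cluster:
        for chain in c100:
            weights[chain] = 1.0 / (len(c100) * m)
    return weights

# Hypothetical 75% cluster made of two 100% clusters (3 chains and 1 chain).
example = [["1abcA", "1abcB", "2xyzA"], ["3defA"]]
weights = chain_weights(example)
total = sum(weights.values())
```

With this choice every distinct sequence contributes equally to its 75% cluster, however many times it was deposited.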
Our goal is to create a database of disordered patterns, i.e. amino acid sequences that are likely to be found in disordered parts of protein chains, using the CDRDB and applying simpler rules for the creation of the library than in our previous work. Let P be a protein chain and A be a pattern of length L. The database was compiled using a two-stage procedure. At the first stage we created a list of candidate patterns. To become a candidate, a pattern should be disordered in half of the cases among the chains from the cluster with 100% identity. In total, 855 775 candidate patterns were gathered. At the second stage the desired disordered patterns were selected from the candidate list.
We say that pattern A matches chain P at position s if the following conditions hold:
- the two residues at each end of the pattern coincide with the corresponding residues of the chain;
- substitutions occur at no more than L/5 positions r in the middle of the pattern.
This means that for patterns with a length of L≤5 no change is possible, for 5<L≤10 – only 1 change, for 10<L≤15 – 2 changes, etc. The occurrence is terminal if it belongs to the first 40 residues (“N-terminal”) or last 40 residues (“C-terminal”) of the chain. The other occurrences are called internal ones.
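The matching rule can be written down as a small sketch. The function names are hypothetical, positions are 0-based, and the number of allowed substitutions is taken as (L−1)//5, which reproduces the limits listed above (0 for L≤5, 1 for 5<L≤10, 2 for 10<L≤15). The terminal classification uses one possible reading of the 40-residue rule, namely the distance between the pattern edge and the chain end.

```python
def max_substitutions(length):
    # L <= 5 -> 0, 5 < L <= 10 -> 1, 10 < L <= 15 -> 2, ...
    return (length - 1) // 5

def matches_at(pattern, chain, s):
    """True if `pattern` matches `chain` starting at 0-based position s."""
    L = len(pattern)
    if s < 0 or s + L > len(chain):
        return False
    window = chain[s:s + L]
    mismatches = [i for i in range(L) if window[i] != pattern[i]]
    # The two residues at each end must coincide exactly.
    end_positions = {0, 1, L - 2, L - 1}
    if any(i in end_positions for i in mismatches):
        return False
    return len(mismatches) <= max_substitutions(L)

def occurrence_type(pattern, chain, s, terminal=40):
    """Classify an occurrence by its distance from the chain ends (assumed reading)."""
    if s < terminal:
        return "N-terminal"
    if len(chain) - (s + len(pattern)) < terminal:
        return "C-terminal"
    return "internal"

# One substitution in the middle of a 7-residue pattern is tolerated.
ok_with_change = matches_at("EEEEVEE", "AAEEEEAEEAA", 2)
# A 4-residue pattern has no middle positions, so any change breaks the match.
broken = matches_at("HHHH", "GSHHAHGS", 2)
```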
If the distance between the edges of the pattern and the chain is less than 40 residues the pattern is considered to match these residues. The pattern length is not limited in this paper. Further we consider the following terminology: Nu is the sum of weights (wchain) of disordered residues matched by the pattern; Nf is the sum of weights (wchain) of ordered residues matched by the pattern; Cu is the number of clusters with identity 75% (C75) in which Nu>Nf; Cf is the number of clusters with identity 75% (C75) in which Nu≤Nf. Protein P has an occurrence of pattern A if A matches P at position s.
There are 16 918 patterns meeting conditions C1, C2, and C3. The longest pattern is 45 amino acid residues long (HHHHHHSSGLVPRGSGMKETAAAKFERQHMDSPDLGTDDDDKAMA), and the shortest has 2 residues (HH). In the next step we selected disordered patterns from the candidate list using the following iterative greedy procedure. From the 16 918 patterns we chose the pattern with the maximal value D = Nu−Nf. Then, for the remaining patterns, the values of Nu, Nf, Cu, and Cf were recalculated without taking into account the residues matched by the first pattern. The remaining patterns were again checked against conditions C1, C2, and C3, and among those meeting the conditions the pattern with the maximal D value was chosen. The procedure stops when no pattern meets conditions C1, C2, and C3; in our case it was stopped when 390 patterns had been selected (D>0). Finally, we were interested in the patterns for which D≥10 and D≥25 (the value 25 corresponds to the summed weights of 5 whole disordered patterns of 5 residues in length in 5 clusters, without neighboring regions or terminal occurrence). The numbers of such patterns are 249 and 141, respectively (see Dataset S1). The lengths of the patterns are in the range 4≤L≤24. In what follows we consider only the set of patterns meeting the condition D≥25.
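The iterative greedy selection can be sketched as follows. The data layout is hypothetical: each candidate pattern maps to a list of (residue id, weight, is_disordered) triples for the residues it matches. Conditions C1 to C3 are not spelled out in this text, so they are represented by a placeholder callable `accept` that accepts everything by default; the cluster counts Cu and Cf are omitted for brevity.

```python
def greedy_select(candidates, d_min=0.0, accept=lambda pattern, nu, nf: True):
    """Repeatedly pick the pattern with maximal D = Nu - Nf (sketch).

    `accept` is a stand-in for conditions C1-C3, which are not defined here.
    """
    covered = set()       # residues already matched by selected patterns
    selected = []
    remaining = dict(candidates)
    while remaining:
        best_pattern, best_d = None, d_min
        for pattern, residues in remaining.items():
            # Recompute Nu and Nf, ignoring residues that are already covered.
            nu = sum(w for rid, w, dis in residues if dis and rid not in covered)
            nf = sum(w for rid, w, dis in residues if not dis and rid not in covered)
            if accept(pattern, nu, nf) and nu - nf > best_d:
                best_pattern, best_d = pattern, nu - nf
        if best_pattern is None:      # no pattern with D above the threshold
            break
        selected.append((best_pattern, best_d))
        covered.update(rid for rid, _, _ in remaining.pop(best_pattern))
    return selected

# Toy candidates with invented residue ids and weights.
toy_candidates = {
    "HHHH": [(1, 1.0, True), (2, 1.0, True), (3, 1.0, True), (4, 1.0, True)],
    "HHHHH": [(1, 1.0, True), (2, 1.0, True), (3, 1.0, True), (4, 1.0, True),
              (5, 1.0, False)],
    "GSHM": [(10, 0.5, True), (11, 0.5, True), (12, 0.5, False)],
}
selection = greedy_select(toy_candidates)
```

In the toy run HHHHH is never selected: once HHHH has been chosen, the only residue HHHHH still matches is ordered, so its recalculated D is negative.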
Significance of disordered occurrences
We have studied the statistical significance of the selected patterns from two points of view. First, we were interested in whether disordered fragments are overrepresented among the occurrences of each pattern and, second, whether the patterns are overrepresented in the database. These features are described with the corresponding Z-scores, called Zdisorder and Zoccur, respectively. To estimate the significance of the number of disordered occurrences of a pattern we implemented the following procedure. First, we determined the fraction of disordered fragments among all fragments of the given length, taking into account the weight of the disordered residues in each case (equation 4), where N is the number of chains in the CDRDB, Li is the length of the considered chain, n is the fragment length, wi is the chain weight, and the indicator term is equal to 1 if the fragment with adjoining regions is disordered in more than half of its positions and 0 otherwise. For each pattern we know the number of clusters Cu in which the pattern is disordered in more than half of the cases, and the number of clusters Cf in which it is folded in more than half of the cases (see Dataset S1, columns J and K). We then calculate the probability P(Y) that the number of successes is larger than or equal to Cu for the given number of trials Y = Cu+Cf.
In other words, this is the probability of obtaining the given or a larger number of successes in Y trials (equation 5), where p is the probability of success in one trial (see above). The significance of disordered occurrences is estimated with the Z-score (equation 6).
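A small sketch may make the two quantities concrete. The tail probability is the standard binomial sum for at least Cu successes in Y = Cu + Cf trials. The Z-score formula is not reproduced in this text; the form used below, (Cu − Yp) / sqrt(Yp(1 − p)), is an assumption based on the usual binomial standardization. The value p = 0.05 is a hypothetical per-trial probability used only for illustration.

```python
from math import comb, sqrt

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def z_disorder(cu, cf, p):
    """Assumed Z-score: deviation of Cu from its binomial expectation."""
    y = cu + cf
    return (cu - y * p) / sqrt(y * p * (1 - p))

# Illustration with Cu = 5, Cf = 4 and a hypothetical p = 0.05.
tail = binomial_tail(5, 9, 0.05)
z = z_disorder(5, 4, 0.05)
```

Because Y is small in such cases, the exact binomial tail is a more informative measure than a normal approximation based on the Z-score alone.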
Statistical significance of the observed number of occurrences of pattern X in proteomes
The probability of finding a pattern with possible changes is equal to the sum of probabilities over all sequences compatible with the given pattern (equations 7 and 8), where the sum runs over the sequences compatible with the given pattern (see the rules of coincidence; for example, there are 39 such sequences for n = 6), p(X) is the probability that pattern X occurs in a sequence, and pi are the probabilities of occurrence of amino acids in the considered proteome. We calculated the probability p(X, N) that pattern X with n amino acid residues occurs in a sequence of length N (equation 9).
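The count of 39 compatible sequences for n = 6 follows directly from the matching rules: the two residues at each end are fixed, two middle positions remain, and one change is allowed, which gives 1 + 2·19 = 39. The sketch below computes this count and the summed probability p(X) for given amino acid frequencies. The function names and the uniform frequency table are hypothetical, independence of positions is assumed, and the length-dependent probability p(X, N) is not implemented because its formula is not reproduced in this text.

```python
from itertools import combinations
from math import comb, prod

def max_substitutions(length):
    # L <= 5 -> 0, 5 < L <= 10 -> 1, 10 < L <= 15 -> 2, ...
    return (length - 1) // 5

def count_compatible(length, alphabet_size=20):
    """Number of sequences that match a pattern of the given length."""
    middle = max(length - 4, 0)                 # positions that may change
    k = min(max_substitutions(length), middle)
    return sum(comb(middle, j) * (alphabet_size - 1) ** j for j in range(k + 1))

def pattern_probability(pattern, freq):
    """Sum of probabilities over all sequences compatible with `pattern`."""
    L = len(pattern)
    middle = list(range(2, L - 2))
    ends = [i for i in range(L) if i not in middle]
    p_ends = prod(freq[pattern[i]] for i in ends)
    k = min(max_substitutions(L), len(middle))
    total = 0.0
    for j in range(k + 1):
        for changed in combinations(middle, j):
            # A changed position carries any residue except the pattern's one.
            total += prod((1 - freq[pattern[i]]) if i in changed
                          else freq[pattern[i]] for i in middle)
    return p_ends * total

# Hypothetical uniform amino acid frequencies.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
n_six = count_compatible(6)
p_six = pattern_probability("EDEREE", uniform)
```

Under uniform frequencies the probability equals the number of compatible sequences times 0.05 to the sixth power, which is a convenient consistency check between the two functions.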
The probability distribution on protein sequences is assumed to be binomial.
The statistical significance of pattern X is estimated with the Z-score (equation 10), where S is the number of sequences containing at least one occurrence of pattern X, R is the number of proteins in the considered proteome, and N is the average length of proteins in the considered proteome.
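Since the distribution is assumed to be binomial, one plausible form of this Z-score is (S − R·p) / sqrt(R·p·(1 − p)), with p = p(X, N). This form is an assumption and is not taken from the text. It can be checked against the numbers quoted in the Results for pattern GGSGGGGSGGG in T. spiralis (7 observed proteins, expected occurrence 0.0004, 16040 proteins, reported Zoccur of 353).

```python
from math import sqrt

def z_occur(observed, n_proteins, p_per_protein):
    """Assumed binomial Z-score for the number of proteins containing a pattern."""
    expected = n_proteins * p_per_protein
    return (observed - expected) / sqrt(expected * (1 - p_per_protein))

# Values quoted in the text for GGSGGGGSGGG in T. spiralis; the per-protein
# probability is back-calculated from the rounded expected occurrence.
n_proteins = 16040
expected_occurrence = 0.0004
z_example = z_occur(7, n_proteins, expected_occurrence / n_proteins)
```

The result is of the same order as the reported value; the small difference is compatible with the rounding of the expected occurrence.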
Statistical significance of the observed number of occurrences of pattern X in two different proteomes
Let ni and nj be the numbers of proteins with the given pattern X in proteomes i and j, and let Ni and Nj be the total numbers of proteins in the two proteomes. The scoring function (equation 11) is built from the frequency of proteins with the given pattern and its standard deviation; Li and Lj are the average lengths of proteins in proteomes i and j. We consider the difference significant if the absolute value of its Z-score exceeds 3 or 5. These values correspond to probabilities of 3×10⁻³ and 6×10⁻⁷, respectively.
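The correspondence between the two thresholds and the quoted probabilities can be verified with the two-sided tail of the standard normal distribution. The helper name below is hypothetical; the calculation itself uses only the complementary error function from the Python standard library.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided tail probability of a standard normal variable."""
    return erfc(abs(z) / sqrt(2))

p_at_3 = two_sided_p(3)   # close to 3e-3
p_at_5 = two_sided_p(5)   # close to 6e-7
```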
Database of proteomes
We considered 3279 proteomes from the EBI site (ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/uniprot/proteomes/). Since the patterns with the frequent occurrence in proteomes have low complexity we did a preliminary analysis. The analysis showed that the number of proteins with at least one occurrence of homorepeats of 6 residues long is less than 500 for proteomes with an overall number of residues below 2500000. Even so, only 22 proteomes out of 3156 have more than 100 proteins with at least one occurrence of 6-residue homorepeats. The data gave grounds for our research involving only proteomes with an overall number of residues exceeding 2500000.
We obtained 123 proteomes taking into account the length of proteomes representing 9 kingdoms of eukaryotes and 5 phyla of bacteria (see Table 1 and Dataset S2). Unfortunately, only three kingdoms of eukaryotes (Metazoa, Viridiplantae, and Fungi) are given at http://www.ncbi.nlm.nih.gov/Taxonomy/. In other cases, the rank of kingdom is missing. In such situations, we chose the highest taxonomic category proceeding from the subkingdom of eukaryotes instead of the kingdom. We chose 97 out of 120 eukaryotic proteomes, and a small number of bacterial proteomes. The smallest eukaryotic proteome belongs to Hemiselmis andersenii, class Cryptophyta. It is evident that 498 proteins with an overall number of 167452 of amino acid residues are not sufficient for reliable statistics. Historically, the superkingdom of bacteria is divided into phyla but not kingdoms. We preferred to consider such phyla separately.
Among the 97 eukaryotic proteomes, 17 belong to the kingdom Metazoa (animals). Homo sapiens (51778 protein sequences), Bos taurus (18405), Mus musculus (42120), Rattus norvegicus (28166), Gallus gallus (12954), Danio rerio (21576), and Tetraodon nigroviridis (27836) belong to the phylum Chordata; Drosophila melanogaster (15101), Drosophila pseudoobscura (16000), Aedes aegypti (16042), Anopheles darlingi (11437), and Anopheles gambiae (12455) to Arthropoda; Caenorhabditis briggsae (18531), Caenorhabditis elegans (23817), Loa loa (16271), and Trichinella spiralis (16040) to Nematoda; and Nematostella vectensis (24435) to the phylum Cnidaria.
Results and Discussion
Library of disordered patterns
Following the procedure described in the Materials and Methods section, we constructed the clustered PDB (CDRDB) at the identity level of 75% (http://bioinfo.protres.ru/st_pdb/) and obtained a library of disordered patterns. The dataset includes 141 patterns (see Dataset S1). Figure 3 shows the distribution of the patterns according to their lengths. The patterns occur more frequently as short fragments (105 out of 141 patterns are 4–6 residues long). The longest pattern satisfying the condition D≥25 consists of 17 amino acid residues (HHHHHHSSGLEVLFQGP). It is interesting that the strong pattern is HHHH, and not HHHHHH as in the previous version of the library. We suggest that the residues matched by these patterns will be disordered in new protein chains because more than half of the residues in these patterns are disordered (see conditions C2 and C3 in the Materials and Methods section).
The statistical significance of disordered occurrences of the selected patterns was estimated with the Z-score (see Materials and Methods). We calculated the probability that the number of successes is larger than or equal to Cu for the given number of trials Cu+Cf (for each pattern we know the number of clusters Cu in which the pattern is disordered in more than half of the cases, and the number of clusters Cf in which it is ordered in more than half of the cases). This probability is less than 7×10⁻⁵ for all 141 disordered patterns.
All 141 patterns have Zdisorder>6.4, which corresponds to a P-value of 7×10⁻⁵ and is in good agreement with the procedure used to determine the disordered patterns. The worst case is Cu = 5, Cf = 4, with a pattern length of 6. There are four such cases: SVAESS, ASIGQA, PPSGSP, and DSDVSL (see Dataset S1, columns O and P).
Comparison of the new and the previous libraries of disordered patterns
After construction of the new library, the question arises of how similar the two databases (previous and new) are. To answer it, the previous patterns were matched against the clustered PDB (CDRDB) and the sum of weights was calculated in the same way as for the new patterns. Then we calculated the sum of weights for residues matched by both a previous and a new pattern (intersections, I12). The number of clusters with identity of 75% in which both new and previous patterns occur was calculated, as well as the number of intersections. The degree of coincidence was calculated using equations (13) and (14):
$$F_1 = \frac{I_{12}}{N_1 + N_2 - I_{12}} \qquad (13)$$
$$F_2 = \frac{I_{12}}{\min(N_1, N_2)} \qquad (14)$$
where I12 is the sum of weights for intersections (coincidences), and N1 and N2 are the weights of the two single patterns. We considered only pairs where F2>0.1, F2(C75)>0.1, I12≥3, and the number of clusters where the two patterns appear together is C12≥3 (see Dataset S3). The measure F1 reflects the coincidence between the two considered patterns, whereas the measure F2 shows the level of inclusion of the pattern with the smaller N into the one with the larger N. A large difference between N1 and N2 results in a wide difference between F1 and F2.
For example, the sequence GSSHHHHHHSSGLVPRGSHM occurs in 393 clusters on the N-termini, where it is disordered more than half in 387 clusters. This sequence is matched by pattern GSHM, and its beginning is matched by the HHHH pattern. If we have a test database with one protein where there is such a sequence at the N-end, then NGSHM = 20. N is the weight of a pure pattern with the neighboring part, in this case this is the length of the whole N-terminal fragment, NHHHH = 9, I12 = 9, F1 = 9/20 = 0.45, F2 = 9/9 = 1. In a real situation in the whole CDRDB NHHHH = 29 560.4, NGSHM = 8 452.0, I12 = 3 163.1, F1 = 0.09, F2 = 0.37. It should be noted that real F2 is less than test F2. This occurs because GSHM appears usually in sequence GSSHHHHHHSSGLVPRGSHM or in similar sequences. Yet sometimes GSHM appears alone.
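The numbers in this example can be reproduced with a short calculation. The sketch takes F1 as I12 / (N1 + N2 − I12) and F2 as I12 / min(N1, N2); these forms are inferred from the worked values quoted above (both the test case and the whole-CDRDB case), so they should be read as an assumption rather than as a transcription of the original equations. The function name is hypothetical.

```python
def overlap_measures(n1, n2, i12):
    """Return (F1, F2) for two patterns with weights n1, n2 and overlap i12."""
    f1 = i12 / (n1 + n2 - i12)    # overlap relative to the union of both patterns
    f2 = i12 / min(n1, n2)        # inclusion of the smaller pattern in the larger
    return f1, f2

# Test database with a single N-terminal fragment: N_GSHM = 20, N_HHHH = 9.
test_f1, test_f2 = overlap_measures(20, 9, 9)

# Whole CDRDB: N_HHHH = 29 560.4, N_GSHM = 8 452.0, I12 = 3 163.1.
real_f1, real_f2 = overlap_measures(29560.4, 8452.0, 3163.1)
```

Rounded to two decimals, the second call gives the values 0.09 and 0.37 quoted in the text.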
The result of intersecting the two libraries (the previous library includes 109 patterns and the new one includes 390 patterns if D>0) is presented in Fig. 4. One can see that there are 16 exactly coinciding patterns: ENLYFQ, ASMTGGQQMGR, GSSHHH, WSHPQFEK, EGGSHHHHH, RRGKKK, PTTENLYFQGAM, SHHHHHHSQDP, HHHHHMA, SMTGGQQMGRGS, KKGEKK, SRSHHHH, ENLYFGGS, GGRHHH, HHHGSM, GSHMSQ, and 8 with inexact coincidence, for example HHHHHH and HHHH (Dataset S3).
The measure F1 points to the coincidence of protein regions covered by the considered patterns.
It is interesting that some patterns appear in a protein together with other patterns (57 out of 141). Such pairs can be seen in Dataset S4. Also we calculated the number of patterns which appear in proteins together with the considered pattern (see Fig. 5). Pattern HHHH occurs more often with other patterns in proteins. It should be noted that there are several patterns which appear alone in the CDRDB (see Fig. 5, Dataset S4). We used the same criteria as for the intersections of the two libraries.
Occurrence of disordered patterns in 97 eukaryotic and 26 bacterial proteomes
After creating the library of disordered patterns taken from the CDRDB, another interesting question arises: how often do the obtained patterns occur in proteomes? Since eukaryotic proteomes include more disordered regions than other proteomes, we compared 97 eukaryotic proteomes and 26 bacterial ones (see Table 1, Dataset S1, and Materials and Methods).
We considered two cases of coincidence. In the first case we calculated the number of proteins in which a pattern matches a polypeptide chain fragment exactly. In the second case we analyzed the coincidence according to the definition suggested here and elsewhere. According to the rule given in the Materials and Methods section, for patterns with a length of L≤5 no change may occur, for 5<L≤10 only 1 change may take place, for 10<L≤15 2 changes, etc.
Among the 141 disordered patterns, 17 occur (with precise coincidence) in the PDB but are very sparse in the 123 proteomes (see Dataset S5). Such patterns as RASQPELAPEDPED, SMTGGQQMGRGS, SHHHHHHSQDP, PTTENLYFQGAM, HHHHHHSSGLEVLFQGP, EQKLISEEDLN, and ASMTGGQQMGR do not appear in the analyzed proteomes in either case (precise coincidence, or exact coincidence of two terminal residues with mismatches allowed at L/5 positions) (see Figure 6). This suggests that such patterns are artificial additions to proteins from the CDRDB, introduced to facilitate their purification.
(A) H. sapiens, Chordata phylum; (B) D. melanogaster, Arthropoda phylum; (C) C. elegans, Nematoda phylum; (D) N. vectensis, Cnidaria phylum. The blue color corresponds to precise coincidence of the considered patterns with a fragment of the polypeptide chain; the aqua color corresponds to exact coincidence of the two terminal residues at both termini with mismatches allowed at L/5 positions.
From Figure 6 it is evident that the homorepeats occur very often in eukaryotic proteomes. The patterns with the most frequent occurrence in the eukaryotic proteomes have low complexity: PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, and QQQQQP. From Tables 2 and 3 it is evident that the disordered patterns with the most frequent occurrence in the eukaryotic and even in bacterial proteomes are patterns with low complexity GGGGG, PPPPP, TTTPTT, GGGGSGG, KKKKK, etc.
Following earlier work, we suggest that these patterns will be disordered in most cases. It should be noted that low-complexity regions can additionally include ordered structural proteins or proteins with strong structural propensity, like collagens, coiled-coils or fibrous proteins. Recently, it has been demonstrated that an increased number of perfect tandem repeats correlates with a stronger tendency to be unstructured. Moreover, a strong association between homorepeats and unstructured regions was shown elsewhere. Such patterns as GGGGSGG, EEEEVEE, EDEREE, APIPAP, and PSRSPS (see Table 2) often occur in the considered 17 animal proteomes.
It should be noted that poly H fragments are artificial parts of proteins in the PDB which have been added for better purification of proteins, but in eukaryotic proteomes such a repeat is likely to have a biological function. The locations of poly-H fragments can be found in different proteomes from our site, http://bioinfo.protres.ru/fp/search_new_pattern.html.
We calculated the statistical significance of the observed patterns in 123 proteomes using equation (10) (see Materials and Methods). It should be noted that the average length of proteins in the considered proteomes (about 400 residues) is larger than the average length of proteins in the PDB (about 260 residues). On the one hand, the number of patterns with Zoccur≤0 varies from 40 for the human proteome to 91 for the bacterial proteome of B. xenovorans. On the other hand, the number of patterns with Zoccur>5 varies from 65 in the rice proteome (O. sativa) to 8 in the bacterial proteome of B. xenovorans. Several examples deserve attention. For instance, the number of occurrences of pattern GGSGGGGSGGG varies from 7 in T. spiralis (the expected occurrence is 0.0004) to 149 in humans (the expected occurrence is 0.013), with Zoccur values of 353 and 1291, respectively. Such patterns as MSLN and SNAM appear more sparsely than expected (Zoccur<0) in all 17 considered animal proteomes, although the first pattern occurs 100 times (which is not rare) in the human proteome and the second 61 times. At the same time, pattern HHHH appears more often than expected (from 10 for the human to 4 for the sea anemone N. vectensis), with Z values of 68 and 12, respectively.
We calculated the frequencies of occurrence of the 141 disordered patterns in 123 proteomes. To state that a given pattern X occurs more often in proteome i than in proteome j, we introduced a scoring function for the difference between the occurrences of the pattern in two proteomes (equation (11), see Materials and Methods). According to the central limit theorem, this scoring function should have a normal distribution. We considered the difference in occurrence of the 141 patterns in some pairs of proteomes (see Dataset S5) and illustrate here the example of the eukaryota and bacteria superkingdoms. It turns out that the occurrences of 55 patterns in the two superkingdoms do not differ significantly at the level of 10⁻⁷. A negative value of the scoring function indicates that the frequency of the given pattern is higher in bacteria than in eukaryota. For example, pattern APIPAP occurs 1.5 times more frequently in the 26 bacterial proteomes than in the 97 eukaryotic proteomes (score −20.4). It should be added that the HHHH and QQQQQP patterns occur more often in Arthropoda proteomes than in Chordata proteomes (scores −38.4 and −34.7, respectively) (see Table 2 and Dataset S5).
For each proteome we calculated a set of 141 values, one per pattern in the library, each giving the number of proteins that contain the pattern at least once. Then, for all possible pairs of proteomes, the correlation coefficients between the two sets of 141 values were calculated, resulting in the matrix of correlation coefficients. The correlation coefficient was calculated for each pair of proteomes separately (see Table 4), and the coefficients were then averaged within each kingdom and phylum (see Table 5). As a rule, the correlation coefficients are higher within a given kingdom or phylum than between them.
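The procedure can be sketched as follows, assuming the Pearson correlation coefficient (the text does not name the type of coefficient). The proteome names and the five-value profiles are hypothetical toy data standing in for the real 141-value sets.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """Symmetric matrix (as a dict keyed by name pairs) over all proteome pairs."""
    names = list(profiles)
    matrix = {(name, name): 1.0 for name in names}
    for a, b in combinations(names, 2):
        matrix[(a, b)] = matrix[(b, a)] = pearson(profiles[a], profiles[b])
    return matrix


def mean_correlation(matrix, group_a, group_b):
    """Average coefficient over pairs of distinct proteomes from two groups.

    Pass the same group twice for a within-group average.
    Returns None when no pair exists (e.g. a group with a single proteome).
    """
    values = [matrix[(a, b)] for a in group_a for b in group_b if a != b]
    return sum(values) / len(values) if values else None


# Hypothetical toy profiles: number of proteins containing each of 5 patterns.
profiles = {
    "chordate_1": [10, 8, 6, 4, 2],
    "chordate_2": [9, 8, 5, 4, 1],
    "bacterium_1": [1, 3, 2, 5, 4],
}
matrix = correlation_matrix(profiles)
chordates = ["chordate_1", "chordate_2"]
bacteria = ["bacterium_1"]
within = mean_correlation(matrix, chordates, chordates)
between = mean_correlation(matrix, chordates, bacteria)
print(f"within chordates: {within:.2f}, chordates vs bacteria: {between:.2f}")
```

Returning None for a single-proteome group reflects the fact that no within-group pair exists in that case, as for a phylum represented by only one proteome.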
From Table 4, four clusters can be distinguished among the 17 animal proteomes, each with high pairwise correlation coefficients between the numbers of proteins in which the considered patterns appear. The first cluster corresponds to the phylum Chordata (7 proteomes), the second to Arthropoda (5 proteomes), the third to Nematoda (4 proteomes), and the fourth to Cnidaria (only 1 proteome). In Tables 4 and 5, bold numbers indicate a correlation higher than 75%, normal-size numbers a correlation from 50% to 75%, and smaller-size numbers a correlation below 50%. From Table 4 it is evident that the number of proteins from the human proteome correlates less strongly with that from the chicken and fish proteomes than with that from the bovine, rat, and mouse proteomes. At the same time, the correlation with the number of proteins from the Chordata proteomes is high for such proteomes as C. briggsae and C. elegans. High correlation coefficients are also observed for such pairs as T. spiralis with the Arthropoda proteomes, and N. vectensis with the Chordata proteomes.
Combining motif discovery with the identification of disordered protein segments in the clustered PDB allowed us to create the largest library of disordered patterns. At present the library includes 141 disordered patterns. Such an approach is promising for further study and understanding of the functional role of the obtained patterns in different proteomes. We came to some general conclusions after the analysis of 123 proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. We can conclude that the occurrence of disordered patterns is more uniform within the same kingdom (phylum) than between kingdoms (phyla). One can suggest that such short similar motifs are responsible for common functions in nonhomologous, unrelated proteins from different organisms.
Number of proteins and residues for each out of 123 proteomes.
Comparison of the new and the previous libraries of disordered patterns.
Pairs of patterns which appear in the same protein from the whole clustered PDB.
Conceived and designed the experiments: OVG. Performed the experiments: MYL. Analyzed the data: MYL OVG. Contributed reagents/materials/analysis tools: OVG. Wrote the paper: OVG. Designed the software used in analysis: MYL OVG.