Disordered Patterns in Clustered Protein Data Bank and in Eukaryotic and Bacterial Proteomes

We have constructed the clustered Protein Data Bank and obtained clusters of chains of different identity inside each cluster, http://bioinfo.protres.ru/st_pdb/. We have compiled the largest database of disordered patterns (141) from the clustered PDB where identity between chains inside a cluster is greater than or equal to 75% (version of 28 June 2010) by using simple rules of selection. The results of these analyses would help to further our understanding of the physicochemical and structural determinants of intrinsically disordered regions that serve as molecular recognition elements. We have analyzed the occurrence of the selected patterns in 97 eukaryotic and in 26 bacterial proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. The matrix of correlation coefficients between the numbers of proteins in which a disordered pattern from the library of 141 disordered patterns appears at least once has been calculated for 9 kingdoms of eukaryota and 5 phyla of bacteria. As a rule, the correlation coefficients are higher within a kingdom than between kingdoms. The patterns that occur most frequently in proteomes have low complexity (PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, QQQQQP), and the type of patterns varies across different proteomes, http://bioinfo.protres.ru/fp/search_new_pattern.html.

Many examples of proteins with intrinsically disordered regions which exhibit coupling between folding and binding have been described in the literature [4][5][6][21][22][23]. Nevertheless, the universality of this phenomenon and functional importance of many disordered regions remain unclear.
A database of continuous protein fragments (Molecular Recognition Features or MORFs) was compiled from the Protein Data Bank which includes short protein chains (with fewer than 70 residues) bound to larger proteins [24,25]. It has been argued that MORFs participate in the coupling of binding and folding, a hypothesis that was supported by the analysis of the composition and predicted disorder of MORF segments. As a result of studying the subtle structural differences of the same proteins in bound (Complex) and unbound (Single) states in relation to their intrinsic disorder the database of protein structures (ComSin) has been constructed [26].
Recently several computational tools for identifying linear motifs [27] and minimotifs in protein-protein interactions [28] have been published. Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure [27], whereas minimotifs are short functional peptide sequences obtained after analysis of known protein-protein interactions [28].
Low-complexity regions attract our attention since they are regions of a protein in which a particular amino acid, or a small number of different amino acids, is enriched. Single amino acid repeats (homorepeats) belong to these regions. It turned out that homorepeats play important roles in some biological processes [29] and may play a more important role in human diseases than was previously recognized.
In the current study we search for sequence patterns consisting of a number of consecutive residues along the polypeptide chain that are nearly always associated with disordered segments. It has been found that two types of patterns appear to be recurrent: a proline-rich pattern and a positively or negatively charged pattern [30]. It should be noted that the old and new versions of our libraries include patterns enriched by proline and charged residues [31].
The statistical analysis of disordered residues was done considering 34 464 unique protein chains taken from the PDB database. In this database, 4.95% of residues are disordered (i.e. invisible in X-ray structures) [31]. The statistics were obtained separately for the N- and C-termini as well as for the central part of the protein chain. It has been shown that the frequencies of occurrence of disordered residues of the 20 types at the termini of protein chains differ from those in the middle part of the protein chain [31,32].
A clustered PDB is necessary because it simplifies the filtering of protein structures for analysis and the search for general structural characteristics among nonidentical proteins. It is also important for the analysis of up-to-date data.
In this work we constructed a clustered PDB and used clusters of protein chains where identity between chains inside the cluster exceeds 75% (version of 28 June 2010). Combining motif discovery with the identification of disordered protein segments in the clustered PDB allows us to create the largest library of disordered patterns [31]. At present the library includes 141 disordered patterns. Such an approach is new and promising for further studying and understanding the functional role of the obtained patterns in different proteomes. Taking the library of disordered patterns into consideration will help to improve the accuracy of predicting whether residues inside a given region are structured or unstructured. The previous version of the library includes 109 disordered patterns and has restrictions on the minimal length of the patterns. Using simpler rules without restriction on the pattern length, and a clustered PDB of the same version, we constructed the largest library of disordered patterns.
The patterns occur more often as short fragments. Patterns four to six residues long occur most frequently (105 out of 141) among the disordered patterns of the library. It should be noted that six-residue patches affect the folding/aggregation features of proteins, and they are important "words" for the understanding of protein dynamics [33]. Moreover, nucleation sites are constrained by patches of approximately six residues [34,35]. There is evidence that the minimum length necessary for a peptide to elicit an allergenic response and molecular mimicry (a patch of a protein eliciting an immune response equivalent to the entire protein) is about six [36]. All these facts suggest the existence of a fragment of biologically meaningful information located along approximately six residues [33].
With the library of disordered patterns taken into account, it would be easier to improve the accuracy of prediction of ordered/disordered residues inside the given region.
Proteome-wide calculations are a great way to place our work in a larger, evolutionary frame. In this paper we are interested in the occurrence of the 141 disordered patterns in 97 eukaryotic proteomes, since eukaryotic proteomes include more disordered regions than other proteomes [17,37,38], and, for comparison, in 26 bacterial proteomes. A comparative analysis of the number of proteins containing the 141 selected disordered patterns in these proteomes has been performed. The disordered patterns with the most frequent occurrence in eukaryotic and bacterial proteomes have low complexity.
It should be noted that each proteome has a specific set of disordered patterns, and this results in different correlation coefficients between numbers of proteins where a disordered pattern appears at least one time. We came to some important observations of a higher correlation coefficient within a kingdom or a phylum than across kingdoms or phyla after analysis of occurrence of disordered patterns in 123 proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. One can suggest that such short similar motifs are responsible for common functions for nonhomologous, unrelated proteins from different organisms.

Construction of clustered PDB
We have considered all protein structures determined by X-ray analysis with a resolution better than 3 Å and a protein size of at least 40 amino acid residues, published in the PDB (version of June 28, 2010); the structures contain 116 997 protein chains (51 048 PDB entries). At the first step these 116 997 chains can be divided into 34 464 classes. We call these classes clusters with 100% identity. This means that the chains from the same cluster have the same amino acid sequence, whereas the sequences of chains from different classes differ in at least one position. In total these 34 464 different sequences contain 9 085 893 residues. At the second step we created clusters of chains with identity inside each cluster ≥75%.
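The first grouping step can be sketched as follows. This is a minimal illustration, assuming the chains are already available as a mapping from chain identifier to sequence; the function name `cluster_identical` and the toy chain identifiers and sequences are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def cluster_identical(chains):
    """Group chain IDs by exact amino acid sequence (clusters with 100% identity)."""
    clusters = defaultdict(list)
    for chain_id, sequence in chains.items():
        clusters[sequence].append(chain_id)
    return dict(clusters)


# Hypothetical toy input: identifiers and sequences are made up for illustration.
toy_chains = {
    "1abcA": "MKTAYIAK",
    "1abcB": "MKTAYIAK",
    "2xyzA": "MKTAYIAR",
}
c100 = cluster_identical(toy_chains)

# Total number of residues over the unique sequences (one per cluster).
total_residues = sum(len(sequence) for sequence in c100)
```

Keying the dictionary by the sequence itself makes two chains fall into the same class only when their sequences are identical at every position.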
Identity is calculated by using an equation in which I is the number of identical residues and L_1 and L_2 are the numbers of amino acid residues in each considered protein. For calculation of identity we used BLAST with default parameters [39]. At the beginning the pair of chains with maximal identity was combined, then another pair of chains, or a chain with a cluster, again with maximal identity, etc. When a chain was combined with a cluster, or two clusters were combined, the average identity of the combined set was considered. If the identity of at least one pair of chains from different clusters was less than 75%, the clusters were not combined. The procedure was repeated as long as there were clusters that could be combined. At the second step of grouping of chains, we obtained 18 775 clusters of chains with identity inside each cluster ≥75%. Then the clusters C75 were combined into clusters with identity Id ≥ 50%, etc. Figure 1 demonstrates the dependence of the number of clusters on identity between chains inside the cluster. Further we consider the identity of 75% because the general grouping has occurred below 90% identity.
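The merging procedure described above can be sketched as a greedy agglomerative loop. This is only an illustration under stated assumptions: the `identity` callback stands in for the BLAST-based identity (here replaced by a hypothetical position-wise comparison of equal-length toy sequences), and the names `merge_clusters` and `toy_identity` are not from the paper.

```python
from itertools import combinations


def merge_clusters(items, identity, threshold=75.0):
    """Repeatedly join the two clusters with the highest average pairwise
    identity, never joining clusters that contain a cross pair below threshold."""
    clusters = [[item] for item in items]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [identity(x, y) for x in clusters[a] for y in clusters[b]]
            if min(pairs) < threshold:
                continue  # at least one pair is too dissimilar: do not combine
            average = sum(pairs) / len(pairs)
            if best is None or average > best[0]:
                best = (average, a, b)
        if best is None:
            return clusters  # nothing left that can be combined
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b


def toy_identity(x, y):
    """Hypothetical stand-in for BLAST identity: percent of matching positions."""
    matches = sum(p == q for p, q in zip(x, y))
    return 100.0 * matches / len(x)


toy_sequences = ["AAAAAAAA", "AAAAAAAC", "AAAACCCC"]
c75 = merge_clusters(toy_sequences, toy_identity, threshold=75.0)
```

In this toy case the first two sequences (87.5% identical) are merged, while the third stays alone because its identity to both others is below 75%.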

Construction of the library of disordered patterns
Among 116 997 chains, approximately 4.5% of their residues are disordered, i.e. are not resolved by X-ray analysis. To reveal such residues, we compared (for each protein chain) the SEQRES and ATOM records in the corresponding PDB file. Residues which were present in the SEQRES records but whose Cα-atom coordinates were absent from the ATOM records were considered as unstructured (disordered).
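The comparison of SEQRES and ATOM records can be illustrated with a small sketch. It assumes that the mapping between SEQRES positions and ATOM records has already been done, so that the set of positions with Cα coordinates is known; the function name `disorder_mask` and the toy chain are hypothetical. The output uses U for disordered residues and a dash for ordered ones.

```python
def disorder_mask(seqres, resolved_positions):
    """Return 'U' for residues without C-alpha coordinates and '-' otherwise.

    seqres: full chain sequence taken from the SEQRES records.
    resolved_positions: set of 1-based SEQRES positions that have a C-alpha
        atom in the ATOM records (assumed to be aligned beforehand).
    """
    return "".join(
        "-" if position in resolved_positions else "U"
        for position in range(1, len(seqres) + 1)
    )


# Hypothetical chain in which the first seven residues have no coordinates.
seqres = "MHHHHHHSSGLVPR"
resolved = set(range(8, 15))
mask = disorder_mask(seqres, resolved)
disordered_fraction = mask.count("U") / len(mask)
```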
Below we consider only clusters with ≥75% identity between any pair of chains inside each cluster because the general grouping has occurred below 90% identity. Considering this level of identity, we have created the Clustered Disordered Residues Data Base (CDRDB); its elements are 18 775 clusters of protein chains. Figure 2 illustrates two clusters with 100% identity combined in one cluster with 75% identity. One can see that the sequences from the two clusters differ in one position (110), where serine is changed for cysteine. Each chain is assigned a weight; for example, the weight of a chain from the second cluster is w_2a4gA = 1/(2×2). Analogously, the weight of each chain from any cluster is calculated by using the equation:

$$w_{chain} = \frac{1}{N_{C100} \cdot M_{C75}}$$

where N_C100 is the number of chains in the cluster with 100% identity and M_C75 is the number of clusters with 100% identity combined in the cluster with 75% identity. It should be noted that the sum of weights inside one cluster with 75% identity is equal to one. The weight of a residue is taken to be the same as the weight of its chain, since at the level of 75% identity a cluster may include protein chains of different lengths.
Our goal is to create a database of disordered patterns, i.e. amino acid sequences that are likely to be found in disordered parts of protein chains, using the CDRDB and applying simpler rules for the creation of the library than in our previous work [31]. Let P be a protein chain and A be a pattern of length L. The database was compiled using a two-stage procedure. At the first stage, we created a list of candidate patterns. To become a candidate, a pattern should be disordered in half of the cases among the chains from the cluster with 100% identity. In total, 855 775 candidate disordered patterns were gathered. At the second stage, the desired disordered patterns were selected from the candidate list.
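The chain weighting can be sketched as follows, assuming that each chain receives the weight 1/(N_C100 · M_C75), with M_C75 read as the number of 100%-identity clusters merged into the 75%-identity cluster (an interpretation chosen so that the weights in one C75 cluster sum to one). The function name `chain_weights` and the chain identifiers other than 2a4gA are hypothetical, as is the cluster membership.

```python
def chain_weights(c75_cluster):
    """Assign each chain the weight 1 / (N_C100 * M_C75).

    c75_cluster: list of C100 clusters, each given as a list of chain IDs.
    """
    m_c75 = len(c75_cluster)  # number of C100 clusters inside this C75 cluster
    return {
        chain: 1.0 / (len(c100) * m_c75)
        for c100 in c75_cluster
        for chain in c100
    }


# Hypothetical membership: two C100 clusters, the first with two chains.
toy_c75 = [["2a4gA", "chainB"], ["chainC"]]
weights = chain_weights(toy_c75)
total_weight = sum(weights.values())
```

With two C100 clusters and two chains in the first of them, a chain from that cluster gets 1/(2×2), and the three weights add up to one.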
We say that pattern A matches chain P at position s if the following conditions are valid: 1) two residues from each end coincide, i.e. A[r] = P[s+r] for r = 1, 2, L−1, L; 2) substitutions, i.e. positions r in the middle of the pattern at which A[r] ≠ P[s+r], occur at no more than L/5 positions. This means that for patterns with a length of L ≤ 5 no change is possible, for 5 < L ≤ 10 only 1 change, for 10 < L ≤ 15 2 changes, etc. An occurrence is terminal if it belongs to the first 40 residues ("N-terminal") or the last 40 residues ("C-terminal") of the chain. The other occurrences are called internal.
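The matching rule can be written compactly in code. The sketch below is an illustration, not the authors' implementation: it uses 0-based start positions, and the allowed number of substitutions is expressed as (L − 1) // 5, which reproduces the stated limits (none for L ≤ 5, one for 5 < L ≤ 10, two for 10 < L ≤ 15). The function names are hypothetical.

```python
def max_substitutions(length):
    """Allowed substitutions: 0 for L <= 5, 1 for 5 < L <= 10, 2 for 10 < L <= 15."""
    return (length - 1) // 5


def matches(pattern, chain, start):
    """Check whether pattern matches chain at the 0-based position start."""
    length = len(pattern)
    fragment = chain[start:start + length]
    if len(fragment) < length:
        return False  # the pattern would run past the end of the chain
    # Two residues from each end must coincide exactly.
    if fragment[:2] != pattern[:2] or fragment[-2:] != pattern[-2:]:
        return False
    # Count substitutions in the middle part of the pattern.
    mismatches = sum(a != b for a, b in zip(pattern[2:-2], fragment[2:-2]))
    return mismatches <= max_substitutions(length)


def occurrences(pattern, chain):
    """Return all 0-based start positions at which the pattern matches."""
    return [s for s in range(len(chain) - len(pattern) + 1)
            if matches(pattern, chain, s)]
```

For example, `matches("QQQQQP", "AAQQAQQPAA", 2)` accepts the fragment QQAQQP because both ends coincide and only one middle position differs, whereas a five-residue pattern such as PPPPP must coincide exactly.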
If the distance between the edges of the pattern and the chain is less than 40 residues, the pattern is considered to match these residues. The pattern length is not limited in this paper. Further we use the following terminology: N_u is the sum of weights (w_chain) of disordered residues matched by the pattern; N_f is the sum of weights (w_chain) of ordered residues matched by the pattern; C_u is the number of clusters with identity 75% (C75) in which N_u > N_f; C_f is the number of clusters with identity 75% (C75) in which N_u ≤ N_f. Protein P has an occurrence of pattern A if A matches P at position s.

Figure 1. Dependence of the number of clusters on identity between protein chains. Inside each cluster at the given identity between chains the identity is larger than the considered identity between clusters. doi:10.1371/journal.pone.0027142.g001

Figure 2. Example of two clusters with 100% identity combined in one cluster with 75% identity. The sequences from the two clusters differ only in one position (110), where serine is changed for cysteine. U denotes disordered residues in the chain and a dash denotes ordered residues. doi:10.1371/journal.pone.0027142.g002
Fragment A = P_j[s+1, s+L] of chain P_j is considered as a candidate disordered pattern if it meets conditions C1, C2, and C3. There are 16 918 patterns meeting conditions C1, C2, and C3. The longest pattern has a length of 45 amino acid residues (HHHHHHSSGLVPRGSGMKETAAAKFERQHMDSPDLGTDDDDKAMA), and the shortest pattern has 2 residues (HH). In the next step we selected disordered patterns from the candidate list using the following iterative greedy procedure. From the 16 918 patterns we chose the pattern with the maximal value D = N_u − N_f. Then for the remaining patterns the values of N_u, N_f, C_u, and C_f were recalculated without taking into account the residues matched by the first pattern. Again all the remaining patterns were checked against conditions C1, C2, and C3, and among those meeting the conditions the pattern with the maximal D value was chosen. If there were no patterns meeting conditions C1, C2, and C3, the procedure was stopped. The iterative procedure was stopped when 390 patterns were selected (D > 0). Finally, we were interested in the patterns for which D ≥ 10 and D ≥ 25 (the value 25 corresponds to the summation of weights of 5 whole disordered patterns of 5 residues in length in 5 clusters without neighboring regions, or terminal occurrence). The numbers of such patterns are 249 and 141, respectively (see Dataset S1). The lengths of patterns are in the range 4 ≤ L ≤ 24. Further we will consider only the set of patterns meeting the condition D ≥ 25.
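The iterative greedy selection can be sketched in a simplified form. This illustration keeps only the core idea (pick the pattern with maximal D = N_u − N_f, then ignore the residues it already matched); the check of conditions C1, C2, and C3 and the recalculation of C_u and C_f are deliberately omitted. The function name `greedy_select`, the residue identifiers, and the toy candidates are hypothetical.

```python
def greedy_select(candidates, weight, disordered, d_min=0.0):
    """Select patterns greedily by D = N_u - N_f over not yet covered residues.

    candidates: dict mapping pattern -> set of matched residue IDs.
    weight: dict mapping residue ID -> chain weight of that residue.
    disordered: set of residue IDs that are disordered.
    """
    covered = set()
    selected = []
    remaining = dict(candidates)

    def score(pattern):
        free = remaining[pattern] - covered
        n_u = sum(weight[r] for r in free if r in disordered)
        n_f = sum(weight[r] for r in free if r not in disordered)
        return n_u - n_f

    while remaining:
        best = max(remaining, key=score)
        d_value = score(best)
        if d_value <= d_min:
            break  # no remaining pattern has D above the cutoff
        selected.append((best, d_value))
        covered |= remaining.pop(best)
    return selected


# Hypothetical toy data: ten residues with unit weight, six of them disordered.
toy_weight = {r: 1.0 for r in range(10)}
toy_disordered = {0, 1, 2, 3, 4, 5}
toy_candidates = {"A": {0, 1, 2, 3, 6}, "B": {2, 3, 4, 5}, "C": {6, 7, 8}}
picked = greedy_select(toy_candidates, toy_weight, toy_disordered)
```

In the toy data, pattern B is taken first (D = 4), after which pattern A only counts its residues not yet covered by B, and pattern C is rejected because it matches ordered residues only.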

Significance of disordered occurrences
We have studied the statistical significance of the selected patterns from two points of view. First, we were interested in whether the disordered fragments are overrepresented among the occurrences of each pattern and, second, whether the patterns are overrepresented in the database. These features are described with the proper Z-scores, called Z_disorder and Z_occur, respectively. To estimate the significance of the number of disordered occurrences of pattern P we have implemented the following procedure. First, we determined the fraction p of disordered fragments among all fragments of the given length, taking into account the weight of the disordered residues in each case. This fraction is computed from N, the number of chains in the CDRDB; L_i, the length of the considered chain; n, the fragment length; w_i, the chain weight; and d_ik, which is equal to 1 if the fragment with adjoining regions is disordered in more than half of its positions, and 0 otherwise.
For each pattern we know the number of clusters C_u where this pattern is disordered in more than half of the cases, and also the number of clusters C_f where this pattern is folded in more than half of the cases (see Dataset S1, columns J and K). We calculate the probability P(Y) that the number of successes will be greater than or equal to C_u at the given number of trials Y = C_u + C_f:

$$P(Y) = \sum_{k=C_u}^{Y} \binom{Y}{k} p^{k} (1-p)^{Y-k}$$

where p is the probability of success of one trial (see above). The significance of disordered occurrences is estimated with the Z-score Z_disorder.

Statistical significance of the observed number of occurrences of pattern X in proteomes
The probability p(X) of finding a pattern with possible changes is equal to the sum of probabilities over all sequences X'_i compatible with the given pattern X (see the rules of coincidence; for example, there are 39 such sequences for n = 6), where the probability of each sequence is determined by p_i, the probabilities of occurrence of amino acids in the considered proteome. We calculated the probability p(X, N) that pattern X with n amino acid residues occurs in a sequence of length N. The probability distribution on protein sequences is assumed to be binomial. The statistical significance of pattern X is estimated with the Z-score Z_occur, which depends on S, the number of sequences containing at least one occurrence of pattern X; R, the number of proteins in the considered proteome; and N, the average length of proteins in the considered proteome.
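The quantities in this subsection can be illustrated with a short sketch. Several parts are assumptions rather than the paper's equations: the probability of occurrence in a sequence of length N is taken as 1 − (1 − p(X))^(N − n + 1), and the Z-score is taken as the usual binomial form (S − R·p)/sqrt(R·p·(1 − p)). All function names are hypothetical. The count of compatible sequences follows the matching rule (two fixed residues at each end, at most (n − 1) // 5 substitutions in the middle).

```python
from itertools import combinations
from math import comb, prod, sqrt


def n_compatible(n):
    """Number of sequences compatible with a pattern of length n."""
    middle = max(n - 4, 0)
    return sum(comb(middle, k) * 19 ** k for k in range((n - 1) // 5 + 1))


def pattern_probability(pattern, freqs):
    """Sum of probabilities over all sequences compatible with the pattern."""
    n = len(pattern)
    total = 0.0
    for k in range((n - 1) // 5 + 1):
        for substituted in combinations(range(2, n - 2), k):
            # A substituted position may hold any amino acid except the original.
            total += prod(
                (1.0 - freqs[a]) if i in substituted else freqs[a]
                for i, a in enumerate(pattern)
            )
    return total


def sequence_probability(p_x, n, seq_length):
    """Assumed form: chance of at least one occurrence in a sequence of length N."""
    return 1.0 - (1.0 - p_x) ** max(seq_length - n + 1, 0)


def z_occur(s_observed, r_proteins, p_seq):
    """Assumed binomial Z-score for S observed proteins out of R."""
    expected = r_proteins * p_seq
    return (s_observed - expected) / sqrt(expected * (1.0 - p_seq))


# Hypothetical uniform amino acid frequencies, for illustration only.
uniform = {a: 0.05 for a in "ACDEFGHIKLMNPQRSTVWY"}
p_six = pattern_probability("AAAAAA", uniform)
```

Here `n_compatible(6)` gives 39, in agreement with the example in the text. Under the assumed Z-score form, 7 observed cases against an expected 0.0004 give a value of about 350, of the same order as the 353 reported for GGSGGGGSGGG in T. spiralis.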

Statistical significance of the observed number of occurrences of pattern X in two different proteomes
Let n_i and n_j be the numbers of proteins with the given pattern X in proteomes i and j, and N_i and N_j the total numbers of proteins in these proteomes. Then p = n/N is the frequency of proteins with the given pattern and σ = √n/N is its standard deviation. L_i and L_j are the average lengths of proteins in the considered proteomes i and j. The scoring function Z_diff is built from these quantities. We consider the difference significant if the absolute value of its Z-score exceeds 3 or 5. These values correspond to the probabilities 3·10^−3 and 6·10^−7, respectively. The correlation coefficient (r) between variables x and y was calculated with S_x and S_y denoting their standard deviations.
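The correspondence between the Z-score thresholds and the quoted probabilities can be checked numerically. The sketch assumes that the probabilities refer to the two-sided tail of the standard normal distribution; the function name `two_sided_p` is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided tail probability of a standard normal variable at |z|."""
    return erfc(abs(z) / sqrt(2.0))


p_at_3 = two_sided_p(3.0)  # close to 3e-3
p_at_5 = two_sided_p(5.0)  # close to 6e-7
```

The complementary error function is used because it stays accurate for small tail probabilities, where computing 1 minus a cumulative probability would lose precision.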

Database of proteomes
We considered 3279 proteomes from the EBI site (ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/uniprot/proteomes/). Since the patterns with the most frequent occurrence in proteomes have low complexity, we did a preliminary analysis. The analysis showed that the number of proteins with at least one occurrence of homorepeats 6 residues long is less than 500 for proteomes with an overall number of residues below 2 500 000. Even so, only 22 proteomes out of 3156 have more than 100 proteins with at least one occurrence of 6-residue homorepeats. These data gave grounds for restricting our research to proteomes with an overall number of residues exceeding 2 500 000.
We obtained 123 proteomes taking into account the length of proteomes representing 9 kingdoms of eukaryotes and 5 phyla of bacteria (see Table 1 and Dataset S2). Unfortunately, only three kingdoms of eukaryotes (Metazoa, Viridiplantae, and Fungi) are given at http://www.ncbi.nlm.nih.gov/Taxonomy/. In other cases, the rank of kingdom is missing. In such situations, we chose the highest taxonomic category proceeding from the subkingdom of eukaryotes instead of the kingdom. We chose 97 out of 120 eukaryotic proteomes, and a small number of bacterial proteomes. The smallest eukaryotic proteome belongs to Hemiselmis andersenii, class Cryptophyta. It is evident that 498 proteins with an overall number of 167452 of amino acid residues are not sufficient for reliable statistics. Historically, the superkingdom of bacteria is divided into phyla but not kingdoms. We preferred to consider such phyla separately. Among

Results and Discussion
Library of disordered patterns
Following the procedure described in the Materials and Methods section, we constructed the clustered PDB (CDRDB) at the identity level of 75% (http://bioinfo.protres.ru/st_pdb/) and obtained a library of disordered patterns. The dataset includes 141 patterns (see Dataset S1). Figure 3 demonstrates the distribution of the patterns according to their lengths. The patterns occur more frequently as short fragments (105 out of 141 are patterns 4-6 residues long). The longest pattern meeting the condition D ≥ 25 consists of 17 amino acid residues (HHHHHHSSGLEVLFQGP). It is interesting that the strong pattern is HHHH, but not HHHHHH as in the last version of the library [31]. We suggest that the residues matched by these patterns will be disordered in new protein chains because more than half of the residues in these patterns are disordered (see conditions C2 and C3 in the Materials and Methods section).
The statistical significance of disordered occurrences in the selected patterns was estimated with the Z-score (see Materials and Methods). We calculated the probability that the number of successes will be greater than or equal to C_u at the given number of trials C_u + C_f (for each pattern we know the number of clusters C_u where this pattern is disordered in more than half of the cases, and also the number of clusters C_f where this pattern is ordered in more than half of the cases). This probability is less than 7·10^−5 for all 141 disordered patterns.
All 141 patterns have Z_disorder > 6.4, which corresponds to a P-value of 7·10^−5 and is in good agreement with the procedure used to determine the disordered patterns. The worst variant is C_u = 5, C_f = 4, with a pattern length of 6. We have four such cases: SVAESS, ASIGQA, PPSGSP, and DSDVSL (see Dataset S1, columns O and P).
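The worst case above can be explored with a small sketch of the binomial tail probability and a Z-score. The Z-score form used here, (C_u − Y·p)/sqrt(Y·p·(1 − p)), is an assumption (the usual normal approximation for a binomial count), and the value of p is an illustrative choice that is not quoted in the text. Function names are hypothetical.

```python
from math import comb, sqrt


def binomial_tail(c_u, c_f, p):
    """Probability of at least c_u successes in Y = c_u + c_f trials."""
    y = c_u + c_f
    return sum(comb(y, k) * p ** k * (1.0 - p) ** (y - k)
               for k in range(c_u, y + 1))


def z_disorder(c_u, c_f, p):
    """Assumed Z-score: deviation of c_u from its binomial expectation."""
    y = c_u + c_f
    return (c_u - y * p) / sqrt(y * p * (1.0 - p))


# Illustrative assumption: probability that a 6-residue fragment is disordered.
p_assumed = 0.058
worst_tail = binomial_tail(5, 4, p_assumed)
worst_z = z_disorder(5, 4, p_assumed)
```

With this assumed p, the case C_u = 5, C_f = 4 gives a tail probability near 7·10^−5 and a Z-score near 6.4, which is consistent with the limits quoted above.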

Comparison of the new and the previous libraries of disordered patterns
After construction of the new library the question about the similarity of the two databases (previous and new) arises. For this purpose the previous patterns were matched against the clustered PDB (CDRDB) and the sum of weights was calculated analogously to the new patterns. Then we calculated the sum of weights for residues matching both the previous and the new patterns (intersections, I_12). The number of clusters with identity of 75% in which there were both new and previous patterns was calculated, as well as the number of intersections. The degree of coincidence was calculated using equations (13) and (14), where I_12 is the sum of weights for intersections (coincidences), and N is the weight of a single pattern. We considered only pairs where F_2 > 0.1, F_2(C75) > 0.1, I_12 ≥ 3, and the number of clusters where the two patterns appear together, C_12 ≥ 3 (see Dataset S3). The measure F_1 points to the coincidence between the two considered patterns. At the same time the measure F_2 demonstrates the level of inclusion of the pattern with the smaller N into the larger one. A large difference between N_1 and N_2 results in a wide difference between F_1 and F_2. For example, the sequence GSSHHHHHHSSGLVPRGSHM occurs in 393 clusters on the N-termini, where it is disordered in more than half of the cases in 387 clusters. This sequence is matched by pattern GSHM, and its beginning is matched by the HHHH pattern. If we have a test database with one protein where there is such a sequence at the N-end, then N_GSHM = 20. N is the weight of a pure pattern with the neighboring part, which in this case is the length of the whole N-terminal fragment, N_HHHH = 9, and I_12 = 9.
It is interesting that some patterns appear in a protein together with other patterns (57 out of 141). Such pairs can be seen in Dataset S4. Also we calculated the number of patterns which appear in proteins together with the considered pattern (see Fig. 5). Pattern HHHH occurs together with other patterns in proteins most often.
It should be noted that there are several patterns which appear alone in the CDRDB (see Fig. 5, Dataset S4). We used the same criteria as for the intersections of the two libraries.

Occurrence of disordered patterns in 97 eukaryotic and 26 bacterial proteomes
After creating the library of disordered patterns taken from the CDRDB, another interesting question arises: how often the obtained patterns could occur in some proteomes. Since eukaryotic proteomes include more disordered regions than other proteomes [17,37,38] we compared 97 eukaryotic proteomes and 26 bacterial ones (see Table 1, Dataset S1, and Materials and Methods).
We considered two cases for coincidence. In the first case we calculated the number of proteins where the pattern matches a polypeptide chain fragment with precise coincidence. In the second case we analyzed the coincidence according to the definition suggested here and in the paper [31]. According to the rule given in the Materials and Methods section, for patterns with a length of L ≤ 5 no change may occur, for 5 < L ≤ 10 only 1 change may take place, for 10 < L ≤ 15 2 changes, etc. Among the 141 disordered patterns, 17 occur (with precise coincidence) only in the PDB but are very sparse in the 123 proteomes (see Dataset S5). Such patterns as RASQPELAPEDPED, SMTGGQQMGRGS, SHHHHHHSQDP, PTTENLYFQGAM, HHHHHHSSGLEVLFQGP, EQKLISEEDLN, and ASMTGGQQMGR do not appear in the analyzed proteomes in either case (precise coincidence, or exact coincidence of the two terminal residues at each end with mismatches at up to L/5 positions) (see Figure 6). This suggests that such patterns are artificial additions to proteins from the CDRDB made for their better purification.
From Figure 6 it is evident that the homorepeats occur very often in eukaryotic proteomes. The patterns with the most frequent occurrence in the eukaryotic proteomes have low complexity: PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, and QQQQQP. From Tables 2 and 3 it is evident that the disordered patterns with the most frequent occurrence in the eukaryotic and even in bacterial proteomes are patterns with low complexity GGGGG, PPPPP, TTTPTT, GGGGSGG, KKKKK, etc.
According to work [31] we suggest that these patterns will be disordered in most cases. It should be noted that low-complexity regions can additionally include ordered structural proteins or proteins with strong structural propensity, like collagens, coiled coils or fibrous proteins [40]. Recently, it has been demonstrated that an increased number of perfect tandem repeats correlates with their stronger tendency to be unstructured [41]. Moreover, a strong association between homorepeats and unstructured regions was shown elsewhere [42]. Such patterns as GGGGSGG, EEEEVEE, EDEREE, APIPAP, and PSRSPS (see Table 2) often occur in the considered 17 animal proteomes.
It should be noted that poly-H fragments are artificial parts of proteins in the PDB which have been added for better purification of proteins, but in eukaryotic proteomes such a repeat is likely to have a biological function. The locations of poly-H fragments in different proteomes can be found on our site, http://bioinfo.protres.ru/fp/search_new_pattern.html.
We calculated the statistical significance of the observed patterns in 123 proteomes by using equation (10) (see Materials and Methods). It should be noted that the average length of proteins in the considered proteomes is larger (about 400 residues) than the average length of a protein in the PDB (about 260 residues). On the one hand, the number of patterns with Z_occur ≤ 0 varies from 40 for the human proteome to 91 for the bacterial proteome of B. xenovorans. On the other hand, the number of patterns with Z_occur > 5 varies from 65 in the rice proteome (O. sativa) to 8 in the bacterial proteome of B. xenovorans. Several examples deserve our attention. For instance, the appearance of pattern GGSGGGGSGGG varies from 7 cases in T. spiralis (the expected occurrence is 0.0004) to 149 cases in humans (the expected occurrence is 0.013), and the Z_occur values are 353 and 1291, respectively. Such patterns as MSLN and SNAM appear more sparsely than expected (Z_occur < 0) in all 17 considered animal proteomes, although the first pattern occurs 100 times in the human proteome (which is not rare) and the second pattern appears 61 times. At the same time pattern HHHH appears more often than expected (from 10 for the human to 4 for the actinia (N. vectensis)), with Z values of 68 and 12, respectively.
We calculated the frequencies of occurrence of the 141 disordered patterns in 123 proteomes. To state that the given pattern X occurs more often in proteome i than in proteome j, we introduced a scoring function for the difference between occurrences of the pattern in two proteomes (equation (11), see Materials and Methods). This scoring function should have a normal distribution according to the central limit theorem. We considered the difference in occurrence of the 141 patterns in some pairs of proteomes (see Dataset S5) and illustrate here the example of the eukaryota and bacteria superkingdoms. It turns out that the appearances of 55 patterns in the two superkingdoms do not differ significantly at the level of 10^−7. A negative value of the scoring function indicates that the frequency of appearance of the given pattern is higher in the bacteria than in the eukaryota superkingdom. For example, pattern APIPAP occurs 1.5 times more frequently in the 26 bacterial proteomes than in the 97 eukaryotic proteomes (Z_diff = −20.4). It should be added that the HHHH and QQQQQP patterns occur in the Arthropoda proteomes more often than in the Chordata proteomes (Z_diff = −38.4 and −34.7, respectively) (see Table 2 and Dataset S5).
For each proteome we calculated a set of 141 values reflecting the number of proteins containing at least one disordered pattern for each of the 141 patterns from the library. Then considering all possible pairs of proteomes, the correlation coefficients between the 141 values have been calculated resulting in the matrix of correlation coefficients. The correlation coefficient was calculated for each pair of proteomes separately (see Table 4), and then averaging has been done inside each kingdom and phylum (see Table 5). As a rule, the correlation coefficients are higher inside the studied kingdom and phylum than between them.
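The construction of the correlation matrix can be sketched as follows. The sketch assumes the Pearson correlation coefficient and uses made-up counts for three hypothetical proteomes and four patterns instead of the 141 values per proteome; the function names are not from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(counts):
    """Correlation for every pair of proteomes.

    counts: dict mapping proteome name -> list of protein counts, one per
        pattern, with the patterns in the same order for every proteome.
    """
    names = list(counts)
    return {(a, b): pearson(counts[a], counts[b]) for a in names for b in names}


# Hypothetical counts of proteins containing each of four patterns.
toy_counts = {
    "proteome_1": [120, 80, 15, 3],
    "proteome_2": [110, 90, 20, 5],
    "proteome_3": [4, 10, 60, 95],
}
toy_matrix = correlation_matrix(toy_counts)
```

Averaging the entries of such a matrix over the pairs that belong to the same kingdom or phylum, and over the pairs that span two of them, would give the kind of summary presented in Table 5.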
From Table 4 four clusters can be selected with a high correlation coefficient between the numbers of proteins where all considered patterns appear for all pairs between 17 animal proteomes. The first cluster corresponds to phylum Chordata (7 proteomes), the second corresponds to Arthropoda (5 proteomes), the third to Nematoda (4 proteomes), and the fourth to Cnidaria (only 1 proteome). In Tables 4 and 5, bold formatting is used to show a correlation higher than 75%, normal size of numbers to show a correlation from 50% to 75%, and smaller size of numbers to show a correlation smaller than 50%. From Table 4 it is evident that the number of proteins from the human proteome correlates with that from chicken and fish less than with that from the bovine, rat, and mouse proteomes. At the same time, the correlation between the numbers of proteins from proteomes of the Chordata phylum is high for such proteomes as C. briggsae and C. elegans. High correlation coefficients are also observed for such pairs as T. spiralis with the Arthropoda proteomes, and N. vectensis with the Chordata proteomes.

Table 3. Average number of proteins with the most frequent occurrence of disordered patterns in 123 considered proteomes in the case of incomplete coincidence. Columns: Metazoa, Viridiplantae (5), Stramenopiles (1), Choanoflagellida (1), Euglenozoa (4), Alveolata (6), Amoebozoa (2), Diplomonadida (3), Fungi (58), Acidobacteria (1), Actinobacteria (14), Proteobacteria (8), Bacteroidetes (2), Chloroflexi (1).

Combining motif discovery with the identification of disordered protein segments in the clustered PDB allows us to create the largest library of disordered patterns. At present the library includes 141 disordered patterns. Such an approach is promising for further studying and understanding the functional role of the obtained patterns in different proteomes. We came to some general conclusions after the analysis of 123 proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. We can conclude that the occurrence of disordered patterns is more uniform within the same kingdom (phylum) than between kingdoms (phyla). One can suggest that such short similar motifs are responsible for common functions in nonhomologous, unrelated proteins from different organisms.

Supporting Information
Dataset S1 List of 141 disordered patterns.