Systematic Analysis of Compositional Order of Proteins Reveals New Characteristics of Biological Functions and a Universal Correlate of Macroevolution

We present a novel analysis of compositional order (CO) based on the occurrence of Frequent amino-acid Triplets (FTs) that appear much more than random in protein sequences. The method captures all types of proteomic compositional order including single amino-acid runs, tandem repeats, periodic structure of motifs and otherwise low complexity amino-acid regions. We introduce new order measures, distinguishing between ‘regularity’, ‘periodicity’ and ‘vocabulary’, to quantify these phenomena and to facilitate the identification of evolutionary effects. Detailed analysis of representative species across the tree-of-life demonstrates that CO proteins exhibit numerous functional enrichments, including a wide repertoire of particular patterns of dependencies on regularity and periodicity. Comparison between human and mouse proteomes further reveals the interplay of CO with evolutionary trends, such as faster substitution rate in mouse leading to decrease of periodicity, while innovation along the human lineage leads to larger regularity. Large-scale analysis of 94 proteomes leads to systematic ordering of all major taxonomic groups according to FT-vocabulary size. This is measured by the count of Different Frequent Triplets (DFT) in proteomes. The latter provides a clear hierarchical delineation of vertebrates, invertebrates, plants, fungi and prokaryotes, with thermophiles showing the lowest level of FT-vocabulary. Among eukaryotes, this ordering correlates with phylogenetic proximity. Interestingly, in all kingdoms CO accumulation in the proteome has universal characteristics. We suggest that CO is a genomic-information correlate of both macroevolution and various protein functions. The results indicate a mechanism of genomic ‘innovation’ at the peptide level, involved in protein elongation, shaped in a universal manner by mutational and selective forces.


Introduction
Most protein sequences appear to be quite random. Nonetheless, many sequences display various types of ordered patterns, observed in all kingdoms of life [1,2]. These include successive expansion of a single amino-acid (known as 'run' or homopeptide), repetitive sections with various lengths and degree of purity, and more generally, low-complexity amino-acid regions, i.e., sections of high compositional bias manifested by low single amino-acid Shannon entropy [3]. We refer hereafter to the phenomena of ordered patterns in protein sequences as compositional order (CO).
In DNA, ordered patterns appear in both non-coding and coding regions, including minisatellites and microsatellites, or more generally tandem repeats of chains of a few nucleotides. Their generating mechanisms involve replication slippage and recombination effects [4]. These mechanisms, and others such as segmental duplications, may serve as the origin of the CO structures observed in proteins. Proteins containing CO exhibit a wide variety of functions associated with disordered, as well as ordered, 3D structures including extended coiled, helical domains, molten globules, collagen, keratin and zinc-fingers [5-8]. They are involved in DNA binding, alternative splicing, transcription, regulation, protein-protein interaction, tumorigenesis [5,8,9], and formation of novel functions, such as cell envelopes of keratinocytes [10].
Tandem repeats are thought to represent a third type of genomic variation along with single nucleotide polymorphisms (SNPs) and copy number variation (CNV) [11]. This is because they are important not only for protein function but also for the fast evolution of complex traits, including various phenotypic and morphological changes as well as adaptive and social behaviors [12-14]. Variations of repeats in coding DNA sequences were found to be particularly important in some rapid evolutionary processes, such as changes in the size and shape of limbs and craniums of dogs [15], and the fast adaptation of cell wall proteins in yeast to a changing environment, which allows for avoiding capture by the host immune system [16]. In contrast, in human, variations in amino-acid runs have been associated with disease, in particular various cancers [17] and neurodegenerative diseases [18], where related proteins are rich with poly-Q repeats [19], poly-A repeats [20], or multiple runs of various amino-acids as in the case of the Huntingtin protein.
The dual association of repeats with both essential functions and disease promoted the view that repetitive sections are subjected to rapid evolution by fast mutational drive, which facilitates the acquisition of a function soon after a repetitive section came into being [5,9]. The latter presumably did not fit any initial functionality and may have even contained a risk of leading to deleterious effects. This raised the question of which evolutionary forces act on repetitive sequences. Early studies [5,9,21] pointed out that interruptions of amino-acid runs are evidence of mutational forces, and that mutations at the third DNA synonymous site are indicative of selectivity of function. However, the role of selection remained elusive as other studies have shown that repeats are only weakly conserved across species, indicating weak selective pressure, except for some specific genes of the xylanase family, the heat response protein Dnaj, and the ribosomal L10, 12 proteins, which were identified as the origins of amino-acid runs in prokaryotes [22]. Recently it was shown that alternatively spliced exons are enriched in repeats with low codon-diversity [23], and that repeat conservation in vertebrates is three times higher in coding than in non-coding regions, but less in primates [24]. Both observations constitute strong evidence for selection. Evidently, three fundamental evolutionary forces exist: mutation, selection and 'innovation', i.e., the generation of new raw repetitive sequences; however, the balance between these forces is hard to measure and may vary considerably among species and conditions.
The interrelation between CO and evolution has been stressed by Albà et al. [25], who have suggested that repeats may have an important role in organism diversity and macroevolution, i.e., the generation of higher taxa. This is because some developmental genes, like Ubx in insects and HOX in human, are responsible for major organism-specific characteristics and are rich in homopeptides. Indeed, previous analyses of a large ensemble of species revealed that the number of CO proteins is three times larger in eukaryotes than in prokaryotes, independently of protein length [26]. Furthermore, some CO proteins were associated with specific eukaryotic functions such as collagen, calcium binding and keratin. Thus it was suggested that eukaryotes favor the generation of repeats as a source of variability to compensate for their relatively slow evolutionary rate [26,27], indicating that the mechanisms shaping CO are not universal in the super-kingdoms.
In the general framework of evolution, with particular emphasis on eukaryotes, it should be noted that species development is often described in terms of increasing organism complexity [28], which is thought to be reflected by several factors such as the numbers of different tissues, cell types, proteins and their interactions [29]. Attempts to quantify this complexity from genomic sequence suggest that natural selection is a necessary mechanism to explain the seeming increase in biological complexity [30]. Nevertheless, the questions of which evolutionary forces participate in the development of complex traits, what is the balance between them, how it depends on environmental and ecological factors, and whether all this leaves any measurable genomic-information stamp that correlates with the evolutionary path of species complexity, remain unresolved [31,32].
In this study we introduce the concept of Compositional Order (CO), accounting for all types of repetitive and low complexity regions. The novel framework is based on the identification and quantification of Frequent amino acid Triplets (FTs). The biological importance of both amino acid and DNA triplets has been pointed out in various studies, emphasizing their role in the characterization of major bacterial phyla and super-kingdoms [33,34], and the evolutionary importance of their spontaneous expansions in higher taxa [35]. We show that triplets of amino acids are adequate and even optimal building blocks for a systematic characterization of CO. We define and exhibit three measures of CO in proteins: 'regularity', 'periodicity' and 'vocabulary'. Regularity refers to the high multiplicity of amino-acid triplets, and is defined by the relative coverage of a protein's sequence by FTs. This measure is highly correlated with Shannon's entropy; hence it recapitulates the conventional identification of low sequence-complexity regions. Periodicity reflects the relative amount of FT occurrences within a periodic structure observed on the protein sequence. In the case of tandem repeats it may account for basic motif characterization. FT-vocabulary is defined as the number of observed Different Frequent Triplets (DFT) in either a single protein or in a full proteome.
We demonstrate and evaluate the phenomenology of CO in human proteins, quantifying them in detail using the new measures. We explore the functional enrichment of proteins containing CO in several representative species, emphasizing their dependencies on these new measures. We discuss the evolutionary interpretation of these dependencies. A comparative study of human vs mouse proteomes provides new insights into the interplay of CO with evolutionary forces. Last, we concentrate on a large-scale proteomic study, comparing 94 species from all kingdoms of life. This leads to the observation that FT-vocabulary is an important measure. At the proteome level, DFT counts provide clear delineation of vertebrates, invertebrates, plants and fungi from each other, with bacteria and archaea closing the list; we conclude that DFT is a universal proteomic marker of macroevolution. This sheds new light on fundamental questions in the evolution of species and on the nature of the genomic mechanisms involved.

Author Summary
Variations in compositionally ordered (CO) sections of proteins, such as amino acid runs, tandem repeats and low complexity regions, are often considered as a third type of genomic variation along with SNP and CNV. At the microevolutionary scale, they are involved in the rapid evolution of numerous biological functions and the development of novel phenotypic complex traits, including disease in human, in particular neurodegeneration and cancer. At the macroevolutionary scale, the best discriminating proteomic factor between super-kingdoms is the prevalence of CO proteins in eukaryotes. The analysis of CO structures has so far been quite eclectic. Here we introduce a novel unifying methodology, accounting for all types of low-complexity regions and repetitive phenomena, including the existence of large periodic structures in protein sequences. We define new CO measures providing insights into the correlation of CO with protein function and with evolution. In particular, a large-scale analysis of 94 proteomes shows that the CO vocabulary of frequently appearing amino acid triplets serves as a measure of taxonomic ordering separating major clades from each other. It unravels a missing genomic correlate of macroevolution and serves as a novel phylogenetic tool. This suggests that major CO generation occurs during the creation of a completely new species, i.e. during macroevolutionary events.

Results
We define Frequent Triplets (FT) to be those amino-acid triplets that are observed in protein sequences far beyond random (see Methods). Specifically, we search for triplets that occur at least 5 times in a protein. Their statistical significance is discussed in Text S1 (sections 1-2, figures S1, S2, S3, S4, S5). In Methods, we establish that the relative coverage (RC) of FTs in a protein sequence highly correlates with sequence entropy, providing a good tool for estimating 'regularity' (Text S1 - section 4, figure S7). Additionally, the intervals between the consecutive occurrences of an FT provide information about the existence of periodic structures on the protein's sequence. These are identified by the most frequent interval (MFI) encountered in a protein, chosen out of all intervals displayed by FT recurrences on the sequence. The level of 'periodicity' in a protein is then estimated by the relative periodicity (RP): the sum of all FT recurrences at MFI divided by the sum of all FT occurrences. Thus we obtain through FTs independent information about both prevalent composition and prevalent periodicities. A few representative examples are shown in Table 1. Complete detailed information is provided in Methods.
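The definitions of FT, RC, MFI and RP given above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the procedure of Methods: it assumes that triplet occurrences are counted with overlapping windows, that RC is the fraction of residues covered by at least one FT occurrence, and that ties for the MFI are broken in favor of the shortest interval. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def ft_positions(seq, min_count=5):
    """Map each Frequent Triplet (FT) to the start positions of its occurrences.

    Assumption: overlapping windows are counted, so 'QQQQ' holds 'QQQ' twice.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - 2):
        positions[seq[i:i + 3]].append(i)
    return {t: p for t, p in positions.items() if len(p) >= min_count}

def relative_coverage(seq, fts):
    """RC: fraction of residues covered by at least one FT occurrence."""
    covered = set()
    for starts in fts.values():
        for s in starts:
            covered.update((s, s + 1, s + 2))
    return len(covered) / len(seq) if seq else 0.0

def interval_counts(fts):
    """Count the intervals between consecutive occurrences of each FT."""
    counts = Counter()
    for starts in fts.values():
        counts.update(b - a for a, b in zip(starts, starts[1:]))
    return counts

def most_frequent_interval(fts):
    """MFI and its number of recurrences; ties go to the shortest interval."""
    counts = interval_counts(fts)
    if not counts:
        return None, 0
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))

def relative_periodicity(fts):
    """RP: FT recurrences at the MFI divided by all FT occurrences."""
    _, recurrences = most_frequent_interval(fts)
    total = sum(len(starts) for starts in fts.values())
    return recurrences / total if total else 0.0

# A sequence of one M followed by 79 Q (the pattern of a pure glutamine run).
run = "M" + "Q" * 79
run_fts = ft_positions(run)
print(sorted(run_fts), most_frequent_interval(run_fts)[0])
print(relative_coverage(run, run_fts), relative_periodicity(run_fts))
```

For a pure run the only FT is QQQ and the MFI is 1, whereas an exact tandem repeat of a motif of length 5 yields an MFI of 5 under the same assumptions.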

Compositional order (CO) of human proteins
We analyze the Swiss-Prot human proteome (N = 20248) in detail, employing our new measures. The human proteome is composed of CO proteins (N_CO = 5511, 27.2%) and NO proteins (N_NO = 14747); the latter do not contain any FT. Identifying FT occurrences on proteins allows for capturing a large repertoire of order patterns of peptide repeats of different levels of purity. Two outstanding examples of human proteins in Swiss-Prot records are: 1) the pure glutamine run in ATX8, a protein which consists of one M followed by 79 Q. Notably, its DNA consists of an uncorrupted chain of 79 CAG repeats [36]. 2) 40 exact repeats of a peptide of length 20, VTSVPVTRPALGSTTPPAHD, on the protein MUC1. The number and purity of these repeats may differ among individuals [37], and this variability plays an important role in cancer [38]. In Table 1 we present other types of order patterns caught by the FT analysis. One finds single amino-acid short runs of several amino-acid types, which may be distributed on various locations in a protein (Table 1, Table S1). The distribution of MFIs (figure 1a), as well as the distribution of all intervals in the entire CO set (figure 1b), exhibits two leading periodic features in the proteome. One is that of MFI = 1, denoting prevalence of amino-acid runs, and the other is MFI = 28, which is characteristic of many zinc-finger (ZF) proteins. The interval distribution further displays higher harmonics of 56 and 84 on ZF proteins, which can be accounted for by mutation effects on amino acid sections with periodicity 28. Interestingly, the rank-ordered interval distribution (figure 1b) displays behavior close to that of the well-known Zipf Law, a hallmark of linguistic elements (see Discussion). A periodic structure can be defined by requiring a minimal number of interval recurrences at MFI (figure 1c). In human, about 50% of CO proteins can be characterized as periodic with at least 4 interval repetitions, on which we find on average 6 DFTs and 30 recurrences at MFI.
The number of ZF proteins is quite prominent in the CO set (Table 2). It is of the order of 18%, doubling its relative weight compared to the total human proteome. Similar doubling is observed for collagen and keratin. The latter have substantial average values of RC, pointing to high relative coverage of FTs on their sequences, while ZF have high relative periodicity, RP. In contrast to all these examples, proteins annotated as disease-correlated are not significantly enriched within the CO set. This would seem to run against the common understanding that disease-related proteins have high compositional bias. The resolution is explained in the next section.
Functional enrichment and annotation dependencies on measures of compositional order on species chosen from different kingdoms throughout the tree of life
Our principal measures of CO regularity, RC, and periodicity, RP, can be used to sort out functions, cell-localizations and other annotations that are enriched with CO (see Methods). We carry out such analysis on three species: Human, A. thaliana, and S. cerevisiae, which may be viewed as representatives of three major taxonomic groups of eukaryotes: Animalia, Plantae and Fungi. Their proteomes in the Swiss-Prot database contain 20248, 11304, and 5875 proteins respectively. In addition, we have analyzed all the 187740 bacterial enzymes in Swiss-Prot.
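Enrichment of an annotation within a subset of proteins is commonly scored by the upper tail of the hypergeometric distribution. The sketch below illustrates that calculation only; it is not the GOrilla implementation, it omits the FDR correction, and the function name and the toy counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(N, K, n, k):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the proteome, K: proteins carrying the annotation,
    n: proteins in the tested subset (e.g. CO proteins above an RC threshold),
    k: annotated proteins found in that subset.
    """
    upper = min(K, n)
    # Exact integer arithmetic; comb() returns 0 for impossible draws.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy counts (not taken from the paper): 20 proteins, 5 annotated,
# a subset of 6 proteins containing 3 annotated ones.
print(hypergeom_enrichment_pvalue(20, 5, 6, 3))
```

Repeating the test for successively higher RC or RP thresholds gives the kind of threshold-dependent enrichment profile discussed in this section.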
Human. Analysis of human CO proteins using the GOrilla GO (gene ontology) tool [39,40] shows, consistently with previous studies, that human CO proteins exhibit numerous and highly significant functional enrichments (Text S2). Notably, these include regulation, transcription, binding, and various developmental and metabolic biosynthetic processes (hypergeometric P-values < $10^{-8}$, FDR corrected). In figure 2, we further demonstrate, using text search in GO annotations (see Methods), that CO human proteins exhibit a repertoire of enrichment dependencies on CO measures. Some functions depend on RC (figure 2A): keratin (P-value < $10^{-12}$) and collagen (P-value < $10^{-12}$). In this category one also finds filament and cell adhesion related proteins. Other functions depend on RP (figure 2B), such as neuronal (P-value < $10^{-6}$) and immune system related proteins (P-value < $10^{-5}$) and other response proteins. Some depend on both RC and RP (figure 2C), e.g., extracellular proteins (P-value < $10^{-10}$ for RC, and < $10^{-6}$ for RP). There also exist proteins that have a non-monotonic behavior (figure 2D). In this last category we note some outstanding terms that have been previously discussed in the literature in the context of compositional bias: DNA-binding, regulation and transcription. To better understand the association of these annotations with CO measures we further explore how they depend on MFI. We analyzed separately two sub-groups of these proteins (figure 2D): those with significant amino-acid runs (MFI = 1) and the complement (MFI > 1). The subgroup MFI > 1 shows the highest enrichment with RP, indicating that repetitive sections other than runs play important roles in the evolution of these functions. The subgroup MFI = 1 further displays a clear monotonic behavior with respect to the length of runs (figure 2E). The larger the coverage of amino-acid runs, the larger is the portion of proteins associated with these three annotations (P-values < $10^{-12}$, $10^{-7}$, $10^{-12}$ for DNA binding, regulation and transcription, respectively). Disease proteins do not show any clear behavior with respect to RC or RP; however, a monotonic enrichment was found with respect to an increase in length of runs (figure 2F) with a moderate P-value < $10^{-4}$. Last, we note that there are several GO terms for which one finds a monotonic decrease of the portion of related proteins for elevated thresholds of both RP and RC. These include ATP, cell-cycle, signal transduction, proliferation and growth. All these basic functions of living systems presumably evolve without relying on CO-structures, implying strong accumulation of mutations or purifying selection. We will point out below that increasing CO is correlated with organism complexity; hence, the fact that the most basic mechanisms do not require an increased CO is consistent with our analysis.
Arabidopsis thaliana. Some plant genomes contain large repetitive sections whose protective roles in stressful conditions have been suggested previously [41,42]. In A. thaliana we find 1786 CO proteins comprising 15.8% of the proteome (figure 3).
Saccharomyces cerevisiae. Main GO terms enriched in CO proteins (N_CO = 996, 17%) are similar to the ones found in A. thaliana as shown in figure 3 (lower panel), i.e., cell wall (P-value < $10^{-12}$), response (P-value < $10^{-3}$) and specifically stress related genes (P-value < $10^{-4}$) and extracellular region genes (P-value < $10^{-12}$). Nuclear pore proteins are also enriched, although their overall number is low. Furthermore, in both A. thaliana and S. cerevisiae, annotations of DNA-binding, regulation and transcription display similar enrichment patterns to those observed in human for the subgroup of proteins that contain significant runs, MFI = 1 (figure S10). Notable differences between A. thaliana and S. cerevisiae are in response related genes. While in A. thaliana enrichment is with respect to RP (figure 3B), in S. cerevisiae it is with respect to RC (figure 3E). This looks similar to the disappearance of RP-enrichment of extracellular proteins in A. thaliana vs human. We will argue below that it is consistent to assume that major CO generation occurs at macroevolutionary steps. Since the macroevolutionary birth of the S. cerevisiae lineage predates that of A. thaliana, which clearly predates the several macroevolutionary steps in the human lineage, it is consistent to assume that RP dependencies of the former species have washed out during the long periods of microevolution, due to accumulating mutations.
Bacterial enzymes. Since the number of CO proteins in a bacterial proteome is quite small, of the order of a few tens, functional enrichment in a single bacterial species is usually not conclusive. Therefore, we analyzed the ensemble of all the reviewed bacterial enzymes in Swiss-Prot (n = 187740), which contains 6240 CO enzymes (3.3%). Enrichment levels display dominance of cell wall and response proteins (figure S10 C), consistently with previous observations [43,44].
Furthermore, our methodology allows us to pick up extreme examples of CO in bacterial enzymes, namely sections of protein sequences having high RC, which further elucidate how CO sections may accumulate throughout evolution. Two outstanding examples are Lysostaphin enzymes in two different species that belong to the same genus, Staphylococcus simulans and S. staphylolyticus. In both cases these enzymes contain a long repetitive section of 15 x 13 amino acid approximate tandem repeats of AEVETSKAPVENT. This long section serves as a pro-peptide chain which is uniquely associated with these two enzymes. Another example is XYNA_RUMFL, belonging to Ruminococcus flavefaciens. This enzyme has a long midsection which is highly enriched with Asparagine (N) and Glutamine (Q), captured by the FT QNN, and has no homolog elsewhere. As a last example we point to SYE_TRIEI, which has a standard Glutamyl-tRNA synthetase domain on its first 480 amino-acids, followed by a highly repetitive domain for the next 400 amino-acids (containing the FTs ATD, ATT, PVA, TAT, TPV, VAT). This long chain has no homolog anywhere in the known protein world. All these examples serve as a showcase for CO sections of proteins that did not spread to other organisms; however, they survived within these organisms, presumably because they do not have deleterious effects. As such they allow us a glimpse into how very rich CO can emerge. In analogy with paleontological evidence, we may assume that many other CO variations have been tried by nature, and have been either discarded or reshaped into useful novel genes.

Comparative proteomic analysis of compositionally ordered sets of Human and Mouse
In this section we perform comparative proteomic analysis of human and mouse, based on Swiss-Prot data. The relationship between the proteins of these two species is summarized in Table 3. The 20248 human proteins and the 16513 mouse proteins are sorted out according to whether they are CO or not (NO) and, according to whether they are orthologs (V) of each other or not (X) as indicated by Swiss-Prot annotations. Some of the 8 subsets have interesting features that we describe below.
Analysis of the CO orthologous sets. Comparison of RC values in mouse and human for the 3312 orthologous CO proteins is shown in figure 4. Along the diagonal of figure 4 we find high similarity of sequences, FTs and their periodic properties. High RC values are associated with some well-known protein families, Zinc fingers (MFI = 28, 56), Collagen (MFI = 3, 6), Keratin (MFI = 5, 10). Because the lower harmonics are more prevalent, the existence of the higher harmonics suggests the effect of mutations, while the simultaneous conservation of function and high CO in both species suggests that selection played a role in maintaining them.
Pairs that deviate from the diagonal, and have high RC in one species but relatively low RC in the other, also have low sequence identity (< 70%), as measured by Needleman-Wunsch (NW) global alignment. These low sequence similarity proteins comprise 11.5% (380/3312) of the studied set. Interestingly, we find that protein pairs in the upper off-diagonal (i.e., low RC in human and high RC in mouse, including SPR1B, MUC4, ZN239, K1C9, F186A, RPNT, SBSN, ZAN) display similarity between their FTs and MFIs in both species. For example, the protein K1C9 has MFI = 8 in both species and similar prevalent FTs (SGG and GSG). In contrast, the lower off-diagonal pairs (high RC in human and low RC in mouse, including CQ097, FILA, MUC2, MUC20, PGCA, PHGR1, SPRR3, INVO) have low similarity of the CO sections, i.e., usually their FTs and MFIs are different. Two exceptions are PRG4, with MFI = 8 in both species (FT = PTT), and TXND2 with MFI = 15 (FT = PKS).
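For readers unfamiliar with the identity measure used above, the following is a minimal sketch of percent identity derived from a Needleman-Wunsch global alignment. The scoring scheme (match +1, mismatch -1, linear gap penalty -2) and the definition of identity as matches divided by alignment columns are assumptions made for illustration; the paper does not state the parameters it used, and production analyses would normally rely on a substitution matrix and affine gaps.

```python
def nw_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity of a Needleman-Wunsch global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j]: best score aligning the prefixes a[:i] and b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * matches / columns if columns else 100.0

print(nw_identity("ACDE", "ACE"))  # three identical columns out of four
```

Under this definition, an ortholog pair would be flagged as low-identity when the returned value falls below 70.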
Both types of observation are consistent with the fact that the mouse lineage exhibits higher substitution rates [45,46]; hence some CO structures that existed on the common ancestor may more readily wash out in the mouse lineage if they are not needed for functional purposes. A few examples provided in Text S1 (section 7, figure S11) present evidence that the repetitive sections are subjected to mutational forces that lead to an increase in the intervals in mouse and to the creation of harmonics. In Text S1 (section 7, figure S12) we have further quantified the discrepancy in the mouse interval distribution compared to the human orthologs in all proteins that contain periodic structures (MFI > 1), showing that high harmonics are more prevalent in mouse by a factor larger than 2.
Because orthologous proteins in human and mouse originate from a common ancestor, it is of particular interest to study the sets in which the CO property has been lost or gained in one of the species (Table 4, sets H2 and M2). Their average RC is at bare minimum, whereas their average RP is high, compared to the orthologous CO sets. This may indicate that in these cases, while RC decreases (low FT coverage), functional evolutionary constraints retain periodicity, which leads to the high RP. However, we did not find any significant functional enrichment in these subgroups. Therefore, it is possible that high RP based on only a few FTs, reflecting low RC, may simply indicate the degeneration of CO, a phenomenon that we have encountered in the off-diagonal proteins in figure 4. Alternatively, the high RP and low RC in these sets may reflect the generation of new CO sections in the respective species with no evident functional purpose.
Analysis of CO non-orthologous novel protein sets. Another interesting case is that of CO sets which are novel (Table 4, sets H3 and M3), i.e. have been created after the two lineages have separated from each other. The latter are particularly abundant in human: whereas human has 1368 non-orthologous CO proteins (set H3), mouse has only 125 such proteins (set M3). This discrepancy by an order of magnitude is quite astounding, indicating that novel CO proteins have accumulated to a larger extent on the human lineage since its departure from the mouse lineage. Comparable novelty is observed also for all non-orthologous proteins (both CO and NO).
Figure 4. Comparison of RC values of orthologous CO proteins in human and mouse. Each point corresponds to a pair of such proteins (n = 3312). Low homologies are marked by circles. Usually, their CO sections are comparable, however revealing higher harmonics in the mouse (Text S1 - section 7, figure S11).
In view of the large disparity between the numbers in H3 and M3 we have also extracted the number of non-orthologous proteins in human and mouse using various other databases and methods, and re-assessed the ratio between human and mouse novel CO proteins (Text S1 - section 13 and Table S5). The ratios we obtain are in the range of 2 to 5. Thus, although the discrepancy is less pronounced than in the Swiss-Prot set, it is still large and significant. Consistently, all sets show that significantly many of the human novel CO proteins are Zinc Fingers.

Compositional Order vocabulary serves as a signature of macroevolution
In order to study effects of CO for a wide range of species, we have extracted from the NCBI-RefSeq database well annotated proteomes, listed in Table 5, of 39 eukaryotes (including 7 protista), 36 bacteria and 19 archaea, distributed across the tree-of-life [47]. In this table, we ordered the species according to the kingdoms Animalia, Plantae, Fungi, Bacteria and Archaea. The ordering of the eukaryotes follows the tree-of-life, which is also a reasonable ordering of organism complexity. We explore the FT-vocabulary, a measure of proteomic CO richness, which is defined by the total number of Different Frequent Triplets (DFT). This is the count of FT types rather than the number of FT occurrences on the proteome. Proteomic DFT counts displayed in Table 5 are insensitive to redundancy, because two identical proteins in a proteome contribute the same FTs.
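The proteomic DFT count described above amounts to the size of the union of the per-protein FT sets. The sketch below illustrates this under the assumption that FTs are triplets occurring at least 5 times in a single protein, counted with overlapping windows; the function names and the toy sequences are hypothetical.

```python
from collections import Counter

def frequent_triplets(seq, min_count=5):
    """Set of triplets occurring at least min_count times in one protein."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return {t for t, c in counts.items() if c >= min_count}

def proteome_dft(proteins, min_count=5):
    """FT-vocabulary of a proteome: the union of per-protein FT sets."""
    vocabulary = set()
    for seq in proteins:
        vocabulary |= frequent_triplets(seq, min_count)
    return vocabulary

# Toy proteome: a glutamine run, an exact duplicate of it, and an SGG repeat.
toy = ["M" + "Q" * 79, "M" + "Q" * 79, "SGG" * 6]
print(len(proteome_dft(toy)))
```

Because the vocabulary is a set, the duplicated sequence adds nothing to the count, which is the insensitivity to redundancy noted in the text.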
One may discern in Table 5 a general trend of decrease in DFT counts among eukaryotes with increasing evolutionary distance from human, excluding the 7 protista which are added at the end of the list of eukaryotes in arbitrary order because of uncertainties in their phylogenies. Bacteria are ordered by the phylogenetic distance between phyla from firmicutes to proteobacteria, with decreasing DFT counts within each phylum. Archaea are ordered by DFT counts. DFT counts of prokaryotes are mostly in the hundreds, with a few exceptions in the thousands; the latter usually occur because of a few highly ordered long genes, unlike in eukaryotes (see Methods sensitivity analysis, and Text S1 - section 2).
The data of Table 5 are grouped together into major taxonomic divisions in figure 5. The latter exhibit characteristic ranges of DFT counts that distinguish these divisions from one another, allowing for a meaningful and significant hierarchical order differentiating the successive kingdoms (Kolmogorov-Smirnov P-values ≤ $10^{-2}$). The decrease of DFT may also be correlated with a decrease in the complexity of the organism. This correlation is not exact; however, the trend is clear, yielding a decrease by a factor of 5 from human to yeast. Protista are exceptional, with DFT counts overlapping with those of plants and fungi. This is consistent with the conventional view of protista as being a diverse grouping of organisms that may not be closely related via evolution. Among prokaryotes, we find an interesting systematic functional trend of DFT counts: thermophiles exhibit much lower numbers (a few tens) than mesophiles, with a differentiating P-value of $1.4 \times 10^{-4}$. This is in agreement with the analysis of Pe'er et al. [33], which found bacterial thermophiles to be more closely related to archaea. It seems quite natural to expect that the low DFT counts of thermophiles are due to evolutionary pressure, since highly repetitive amino-acid sequences may be less stable under extreme temperatures. The few observed FTs may be important to induce favorable structural changes [48].
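The Kolmogorov-Smirnov comparison of DFT counts between two taxonomic groups rests on the maximal distance between their empirical cumulative distributions. The sketch below computes only that two-sample statistic D; it does not compute a P-value, and the DFT counts in the example are invented placeholders rather than values from Table 5. In practice a library routine such as scipy.stats.ks_2samp would return both the statistic and the P-value.

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_x(t) - F_y(t)|."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        t = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to t.
        while i < n and xs[i] == t:
            i += 1
        while j < m and ys[j] == t:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Placeholder DFT counts for two hypothetical groups of proteomes.
group_a = [5200, 4800, 4500, 3900]
group_b = [900, 700, 650, 400, 300]
print(ks_statistic(group_a, group_b))
```

A value of D equal to 1 means the two groups do not overlap at all, which is the situation the hierarchical ordering of figure 5 approaches for well-separated divisions.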
No hierarchical order of the kind displayed here can be achieved by measures such as the number of proteins, fraction of CO proteins, average protein lengths (Text S1 -section 3, figure  S6) or other genomic characteristics [49]. It is interesting to note the species that seem to possess extreme DFT counts within their particular kingdoms. Fungi with the largest number of DFTs are plant pathogens (Nectria haematococca, Botryotinia fuckeliana). Bacteria and archaea with very large DFT counts live in aquatic or cold environments (Chlorobium chlorochromatii, Cenarchaeum symbiosum) or possess very complex functionalities (Nostoc punctiforme). This may support the view that ecological and environmental conditions, such as decrease in temperature, or inter-species hybridizations as in the case of amphidiploids in plants [50], had shaped DFT distributions in these species. In Text S1 (section 6) we present a reanalysis of the same data using a modified restrictive definition of FTs which further abolishes any length-dependent contribution (see Methods), resulting in figure S9. The same characteristics are obtained with slightly different P-values. We conclude that the proteomic DFT counts lead to a unique correlate of evolution, which is insensitive to the exact FT definition, providing a distinguishing hierarchical order-parameter.   This observation motivated us to analyze the identities of the DFT contents of different proteomes. Defining DFT I to be the set of DFTs in proteome I, we look for the DFT-correlation between different proteomes I and J, defined by the Jaccard index: C IJ = (DFT I >DFT J )/(DFT I <DFT J ). The results are displayed in figure 6, containing all eukaryotes, and figure 7, containing all prokaryotes. In eukaryotes, the divisions between the sets displayed in figure 5 are delineated also through the C IJ . Moreover, mammals (id 1-5) stick out among the vertebrates (id 1-10), having distinct DFT sets of their own. 
Invertebrates (id 11-16) have the largest correlation to vertebrates, and plants (id 17-24) show some correlation with animals, and less with fungi (id 25-32). In contrast, some protista species (id 33-39) show an unexpected correlation which gradually decreases from Animalia to Plantae to Fungi. We note the particular case of zebrafish (id 10), whose DFT count exceeds that of human, yet its correlation with mammals is not particularly high. Hierarchical clustering, using Euclidean average distance of C_IJ, shows that the hierarchical tree largely overlaps the phylogenetic tree. Animals, Plants and Fungi form separate clusters with only little misplacement (figure 6, bottom). Three Fungi that are plant pathogens, Nectria haematococca, Botryotinia and Aspergillus niger, are grouped with Plants. Protista species are distributed across the tree.
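As a minimal sketch of the DFT-correlation used here, the Jaccard index between the FT sets of two proteomes can be computed as follows. The function names and the toy triplet sets are illustrative assumptions, not the authors' implementation, and the subsequent hierarchical clustering step is not reproduced.

```python
def jaccard(dft_i, dft_j):
    """Jaccard index: size of the intersection over size of the union."""
    union = dft_i | dft_j
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(dft_i & dft_j) / len(union)


def dft_correlation_matrix(dft_sets):
    """Return C[I][J] for a dict {proteome name: set of frequent triplets}."""
    names = sorted(dft_sets)
    return {a: {b: jaccard(dft_sets[a], dft_sets[b]) for b in names}
            for a in names}


# Hypothetical toy data, not taken from Table 5.
toy = {
    "sp1": {"AAA", "QQQ", "PPP"},
    "sp2": {"AAA", "QQQ", "GGG"},
    "sp3": {"SSS"},
}
C = dft_correlation_matrix(toy)
```

The resulting symmetric matrix is the kind of input on which a clustering such as the one shown in figure 6 could be run; which linkage routine was actually used is not stated beyond "Euclidean average distance".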
Another evident difference between the kingdoms is the identity of the most abundant FTs in the proteomes. They are presented in Table 6 for several selected species, highlighting those resulting in amino-acid runs. The latter are significantly more abundant in eukaryotes than in prokaryotes (figure S13).
Last, we applied the same technique of DFT correlations (defined at the protein level) to human proteins. The analysis leads to a classification of principal functional groups, notably various metabolic processes of macromolecule biosynthesis, response to unfolded proteins and numerous developmental, morphological and anatomical structure proteins (Text S1 -section 8, figure S14). A compendium of human protein information is presented in figure S15. There we sort all CO proteins according to the clustergram (figure S14) and present the distributions of DFT numbers in proteins, and protein numbers in which each FT occurs. We also zoom in on the 50 leading FTs exhibiting characteristics of co-occurrences.
Table 6) and many DFT are quite rare, i.e. found on few proteins. Similarly, few proteins carry many DFT and many proteins carry much smaller numbers of DFT. Many of the prokaryotes seem to have similar DFT distribution behavior as well, but their variance is much larger. The specific example of E. coli added to figure 8 serves to demonstrate the large variance observed for a single bacterium. Nevertheless, when individual distributions of many prokaryotes are superimposed, they reveal a power-law behavior as well (Text S1 -section 9, figure S17), suggesting universal characteristics. The general character of the power-law distribution is similar to evolutionary genome universals such as the membership in paralogous gene families [1] and node-degree relations in biological networks [51]. Figure 8 may therefore add a new law to the 'laws of genome evolution' [52], this time at the peptide level. Further support for this view comes from observations of spontaneous expansions of triplets in higher taxa [35].
Universality of CO measure characteristics. Studying the behavior of CO measures we find that in single proteins RP and DFT show universal dependence on protein length L (figure 9). As an example, we show in figure 9A that RP for human is negatively correlated with L, while figure 9C shows that DFT for human is positively correlated with L. RC has no significant correlation with L. Figure 9D displays the behavior of RP for all species, while figure 9G is its analog for DFT. The universal behavior of these trends is quite obvious, as shown by the linear regression slopes for the average of these measures over all kingdoms in figures 9E and 9H, respectively.
Because FTs are not expected at random (see Methods), the behavior displayed in figures 9C and G implies that growth of protein length and CO are linked. This suggests the possibility that incorporation of CO may be an element of the mechanism of protein elongation. The decrease of RP, which is defined as the fraction of all FTs that participate in the MFI, may occur in one of two ways: either by increasing FT occurrences, as hinted by the increase in DFT, or by a decrease in the number of FTs participating in the most prominent periodic structure. Assuming that many of the longer proteins may be considered to be of older evolutionary origin [53], the decrease in RP could be blamed on mutations that were accumulated during evolutionary history. A direct analysis of the relationship between RP and protein age shows that high RP is associated with relatively young protein age (Text S1 -section 12, figure S20).
Another feature of universality is provided by the rank-ordered interval distribution functions. The latter is reminiscent of the Zipf-law (Text S1 -section 9, figures S18, S19), a hallmark of many dynamic evolving systems [54], as well as languages where word frequencies follow this law [55].
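To illustrate how a rank-ordered interval distribution can be compared with a Zipf-like law, the sketch below ranks interval probabilities and estimates the power-law exponent by an ordinary least-squares fit in log-log coordinates. The function names are hypothetical, and the fit range and procedure are assumptions that may differ from those used for figures S18 and S19.

```python
import math
from collections import Counter


def rank_ordered_distribution(intervals):
    """Probabilities of the observed intervals, sorted from most to least frequent."""
    counts = Counter(intervals)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def loglog_slope(probabilities):
    """Least-squares slope of log(probability) versus log(rank).

    Needs at least two ranks; a slope close to -1 indicates Zipf-like behavior.
    """
    xs = [math.log(rank) for rank in range(1, len(probabilities) + 1)]
    ys = [math.log(p) for p in probabilities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic intervals whose counts fall off exactly as 1/rank (120, 60, 40, 30, 24).
synthetic = [r for r in range(1, 6) for _ in range(120 // r)]
slope = loglog_slope(rank_ordered_distribution(synthetic))
```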

Discussion
Many efforts to order major taxa according to known genomic measures are inconclusive [49; figure S6]. Aside from clear differences in the karyotypes, the best genomic discriminating factor between eukaryotes and prokaryotes is the prevalence of amino acid repeats in the former [26]. Their role in evolutionary processes, in particular fast evolution of protein function and development of phenotypic complex traits is well accepted [11][12][13][14][15][16]. However, because of the large diversity of repetitive sections, it is difficult to find genomic variants and determinants that may further elucidate their importance. Existing methods are usually tuned to capture a few aspects of the nature of compositional order, but a unifying framework has been missing [56].
Here, we presented such a unifying framework, by generalizing the concept of compositional bias to that of compositional order (CO), which captures all scales of repetitive peptides, from runs to repetitive domains. This is achieved by identifying multiple occurrences of frequent triplets that are not expected at random. As such, their existence on protein sequences allows for detecting various patterns, and provides novel measures of order. We focused in this study on three CO measures: 'regularity', the relative coverage of FTs in proteins (RC), which correlates well with sequence entropy; 'periodicity' of FT recurrences (RP); and FT-vocabulary, the number of different FTs (DFT), representing the richness in the vocabulary of CO on a protein or within a proteome.

RC, RP and evolution
RC and RP provide novel perspectives on the evolution of proteins and proteomes by putting various observations into a common framework and applying comparative analyses. One of the astounding facts is the CO enrichment of novel proteins in human compared to mouse, leading to an increase in RC and RP. To a large extent, the latter is due to the large increase of ZF proteins in the human lineage.
Whereas an increase in RP is correlated with a general increase of the CO component of the proteome, we observe that a decrease of RP reflects the effect of mutations along the microevolution of a lineage. Rapid evolution by high mutation rate will tend to erase the periodic nature of repetitive sections in protein sequences. Thus, when comparing CO orthologous proteins in human and mouse, we find that some of the latter exhibit a clear decrease in RP, which may be blamed on the higher substitution rate along the mouse lineage. Our analysis of interval distributions is consistent with faster evolutionary substitution rate on the mouse lineage [45,46]. A similar conclusion that RP functional enrichment of proteins deteriorates with evolutionary age of the organism follows from the study of response and extracellular proteins, for which RP-enrichment was seen to decrease from human through A. thaliana to S. cerevisiae.
When a protein's RP decreases along evolution this is evidence that the particular period, rather than its harmonics, may be less important to its function. Interesting cases are collagen and keratin proteins in human and cell wall proteins in plants, fungi and bacteria. In these cases prevalent intervals reflect the existence of underlying repetitive motifs, but protein enrichment with respect to RC and not RP indicates the tendency of these sections to rapidly accumulate mutations for functional purposes. Thus RC, rather than RP, correlates with these functions.

Universality of the CO measures
We note that the balance between the three forces of evolution, mutation, selection and innovation, acting on CO sections in proteins is universal. This we conclude from three observations: 1) the power-law distribution of DFTs in proteomes (figure 8).
Interestingly, the accumulation of FTs in the proteome is similar to evolutionary genome universals such as the membership in paralogous gene families [1]. Such power-law behavior is observed in node-degree relations in biological networks [51], suggesting a particular role in protein-protein interaction (PPI) and metabolism, networks of similar functional node-degree architecture. Indeed, various metabolic processes, notably of macromolecule synthesis, are found in our human CO set. Further evidence for the role in PPI comes from the observation of CO enrichment in PPI hub proteins [57], and the association of variations in repetitive sections with the evolution of PPI network topology [58]. 2) the distribution of intervals, which resembles Zipf's law in character (figure 1B). Zipf's law, the rule of word frequency in text, is a hallmark of linguistic structure [55]. The intervals between FT recurrences correlate with the lengths of larger motifs (see, e.g. the ZF protein PRDM9 in table 1), thus their frequency represents motif frequency to a large extent. The appearance of Zipf's law also suggests fast evolution of motifs, such that there is no characteristic length scale of motifs, i.e. the distribution of FT intervals is scale free. In analogy with the conventional interpretation of Zipf's law [59], our analysis suggests fast evolution of immune system and response related proteins in human, as well as cell-wall and response proteins in plants, fungi and bacteria. This strengthens the view that CO vocabulary is a hallmark of evolving diverse functionalities, a consequence of the necessity of some proteins to interact and adapt to fluctuating environmental conditions. 3) the relationships of RP and DFT with protein length (figure 9).
RP decreases while DFT increases with protein length, in all species. The ratio of the power-law exponents of RP and DFT is approximately constant, indicating a balance between repetitive structure degeneration and CO vocabulary escalation. The association of high RP with young protein age (figure S20) implies that insertion of raw repetitive material is a possible mechanism responsible for protein growth. This also further associates RP with relatively new functions.

DFT and macroevolution
The observation that DFT counts increase from archaea to vertebrates, providing a clear delineating hierarchy of major clades of organisms, is a unique case of correlating proteomic information with evolution. Since it relates to major taxonomic groupings, the evolutionary context to which it belongs is macroevolution [60]. Macroevolutionary changes are invariably connected to major genomic changes. Novel taxa and novel functions are marked by gene and chromosome rearrangement [61], and gene duplications [62] which may occur even after speciation [63]. This is also when major effects may occur in CO properties, as reflected by DFT counts. Thus we posit that changes in DFT reflect macroevolutionary events. In other words, we envisage major CO accumulation to occur mostly during macroevolutionary events. The following microevolutionary forces of mutation and selection can diminish or modify the CO, leading to the presently observed structures.
Eldredge & Gould [64] emphasized that long periods of small evolutionary changes are interspersed with relatively short periods of major changes, a phenomenon they called punctuated equilibrium. The Cambrian explosion period is a striking example, where changes from unicellular to multicellular species occurred within a few tens of MY after billions of years dominated by microevolution. More recently, large-scale analysis of various measurements combining data from fossil records [65] showed that macroevolutionary steps indeed occur in rare bursts at time scales > 1 MY, presumably as a consequence of permanent changes in ecological and environmental properties [66]. Thus we should expect that major changes of DFT counts have occurred over relatively short periods of time, while most of evolutionary history accounted for smaller changes that accumulate during microevolution.
Gould [67] pointed out that one should not be influenced by our parochial focus on human, believing that evolution proceeds in the direction of complexification, since speciation may just as well take a turn toward simplification. Nevertheless, even if macroevolution can go both ways, it must still be true that high complexity of an extant organism, as well as a high DFT count, is a good indicator that its lineage has gone through many steps of macroevolution. CO structures that we observe on proteomes had survived while being modified by mutation under selection constraints. This suggests that analogously to birth-death-innovation models of protein domains evolution [68], similar forces shape the evolution of repeats at the peptide level.
In prokaryotes we find that DFT counts do not discriminate between major phyla. We observe, however, a clear distinction between mesophiles and thermophiles, suggesting that CO generation and conservation is also condition dependent. Thermophiles have characteristically lower DFT counts. The ones that we observe presumably have been selected for functional purposes. Evidence for this is the crucial role of CO in the induction of necessary structural changes under extreme conditions [48], and the prevalence of functional peptide motifs in extremophiles [69].
DFT content, rather than DFT counts, serves as another handle on proteomic relationships. Boundaries of prokaryote phyla may be discerned by their DFT content dissimilarity. This result is of particular interest when compared to previous attempts to find amino-acid sequence correlates of kingdom and super-kingdom divisions. Using information about single, double and triple amino-acid distributions, Pe'er et al. [33] observed some separations in a principal component analysis. Triplets turned out to be the best distinguishing elements. In our analysis, also based on triplets, but constraining them further to fit into FTs, we find that DFT correlations are highly significant, exposing DFT contents to be an important sequence correlate of kingdom identity.
Macroevolution is also affected by the landscape of inter-species interaction as in the case of plants and insects [70]. Here, the role of inter-species interaction is insinuated by DFT proximity of species in one kingdom to species in a neighbor kingdom, as exhibited by the similarity of fungal plant-pathogens to plants. Such interaction was previously suggested based on analysis of simple sequence repeats contents [34]. Interestingly, many of the CO proteins are response and immune related. Therefore, it is possible that these proteins coevolved as part of the interaction between the species defense systems. Thus, the fundamental factors that shape evolution of different lineages, i.e. unequal distribution of changes over time and correlation with ecological and environmental properties, also seem to shape CO composition at the kingdom level.
Lynch & Conery [71] studied the ordering of species based on the product of effective population size and mutation rate, N_e u. While this measure is based on comparisons within species, i.e. it is of microevolutionary nature, their results suggest that increasing genomic complexity, associated with transitions from prokaryotes to eukaryotes, is a consequence of magnified random genetic drift.
Comparing N_e u with DFT counts over eukaryotes we find that they anti-correlate (Pearson correlation = −0.6, P-value ≈ 10⁻², Text S1 -section 11, table S4). This suggests that drift plays a central role in shaping FT evolution. Furthermore, we have seen that some CO in bacterial enzymes exhibit de novo creation which has presumably reached fixation through drift, without having any clear functional advantage, although we cannot exclude the possibility that these sections have emerged recently. However, there are various cases that are indicative of the effect of purifying and positive selection. Purifying selection is indicated by the very small numbers of DFT in thermophiles, while evidence for positive selection is provided by functional enrichment of various protein families with respect to increased CO (i.e., RC or RP), and by the mutations in synonymous and non-synonymous sites (Text S1 -section 10, table S3). To better quantify the relative contribution of positive and negative selection in specific protein families, one should resort to studying the traditional dN/dS ratio.
In summary we claim therefore that the FT tools that we have introduced and studied have proved themselves as meaningful measures of biological investigation. Moreover, they turn out to be very useful in providing the means for specifying which features are correlated with different protein annotations, and how the latter can be studied in a comparative genomics perspective. We believe that the highlight of this formalism is the fact that proteomic DFT counts turn out to delineate correctly major biological kingdoms, thus leading us to posit that CO vocabulary is intimately linked with major evolutionary forces.

Methods
There exist known mathematical tools that come to mind for studying compositional order. One is the Shannon entropy [3,5], and the other is the Fourier transform. The first is low when a clear imbalance in multiplicity of different amino-acids occurs, and the second should provide peaks for dominant periodicities. But the question remains what should be the basic variables. Entropy misses out on the co-occurrence of amino-acids in repetitive kmers, while Fourier decomposition is much too noisy to allow for useful analysis. The difficulty and the need in constructing new and more general characterization of repeat patterns have been recently emphasized in a survey of existing methods [56]. Here, we establish a unifying framework for studying all types of compositional order (CO) within protein sequences. The basis for our systematic study is the identification of multiple occurrences of amino-acid triplets that appear far beyond random, which we define as Frequent Triplets (FTs). FTs allow for defining CO observables that facilitate the quantification and identification of structural elements of CO.

Definition and analysis of frequent triplets (FTs)
Triplets of amino-acids represent a set of 8000 elements. The multiple appearance of a single triplet on a protein should be a rare event if the sequence consists of an independent arrangement of amino-acids. The probability of observing any given triplet exactly n times in a sequence of length L is given by the binomial distribution:

$$\Pr(L,p;\, i=n) = \binom{L}{i}\, p^{i}\, (1-p)^{L-i},$$

where p = 1/8000 in a uniform random model. The expected value, E, of the number of different triplets that appear at least n times in a protein is therefore:

$$E = 8000 \sum_{i \ge n} \Pr(L,p;\, i);$$

thus, E/8000 is the P-value for FT misidentification. Numerical search for the occurrences of every one of the possible 8000 triplets on purely random protein sequences perfectly matches this theoretical estimate (figure 10). For the vast majority of proteins n = 5 is sufficiently restrictive for eliminating any random occurrences while maximizing the signal, i.e. the number of identified FTs. However, for very large proteins (L > 8000) E becomes quite large. Although the number of such long proteins is very small (3 in human: TITIN, MUC16 and SYNE1), a more restrictive definition that reduces E can be formulated by requiring that repetitions (≥ n) of a triplet should occur within a section of length M on the protein's sequence. The latter may become useful for the analysis of long proteins.
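The expected number E of triplets that reach the threshold purely by chance can be evaluated numerically. The sketch below is an illustration under the uniform random model only; the function names are hypothetical, the number of trials is taken to be L as in the text (rather than the L - 2 triplet positions), and overlaps between neighboring triplets are ignored.

```python
import math


def binomial_pmf(L, p, i):
    """Probability of exactly i occurrences in L independent trials, via logs."""
    log_pmf = (math.lgamma(L + 1) - math.lgamma(i + 1) - math.lgamma(L - i + 1)
               + i * math.log(p) + (L - i) * math.log1p(-p))
    return math.exp(log_pmf)


def expected_random_fts(L, n=5, alphabet=20, k=3):
    """Expected number of different k-mers seen at least n times by chance."""
    n_kmers = alphabet ** k            # 8000 for amino-acid triplets
    p = 1.0 / n_kmers                  # uniform random model
    tail = sum(binomial_pmf(L, p, i) for i in range(n, L + 1))
    return n_kmers * tail


# E grows quickly with protein length, which motivates the restrictive definition.
for length in (500, 2000, 8000):
    e = expected_random_fts(length)
    print(length, e, e / 8000)         # length, E, misidentification P-value
```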
In Text S1, we present both the uniform random model of Eq. 1 and a unigram random model based on first order statistics of all human proteins (section 1), and provide detailed analysis of the human proteome comparing it to the two random models for various values of M (section 2). The analysis of random models suggests that a regular assignment of a triplet as FT, using n = 5, yields P-values < 10⁻³ for proteins with L < 8000 and is sufficient for any practical use. The fact that the regular definition captures high-order structures in long proteins that are otherwise missed by restrictive definitions provides further justification for the use of regular FTs (Text S1 -section 2, figure S5). Thus, we define Frequent Triplets (FT) of a protein to be those amino-acid triplets that are observed to occur five times or more, not necessarily as tandem repeats, on its sequence. FT of a species proteome will be any FT that appears in at least one protein of this species.
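The regular and restrictive FT definitions, and the proteome-level FT set, can be sketched as follows. This is an illustration rather than the authors' MATLAB code: the function names are hypothetical, and the handling of the window M (n consecutive occurrences must fit entirely inside some section of M residues) is one possible reading of the restrictive definition.

```python
from collections import defaultdict


def frequent_triplets(seq, n=5, k=3, window=None):
    """Set of k-mers occurring at least n times in seq (overlaps allowed).

    If window (M) is given, the n occurrences must lie within a section of
    M residues; this reading of the restrictive definition is an assumption.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    fts = set()
    for kmer, pos in positions.items():
        if len(pos) < n:
            continue
        if window is None or any(
            pos[j + n - 1] + k - pos[j] <= window
            for j in range(len(pos) - n + 1)
        ):
            fts.add(kmer)
    return fts


def proteome_fts(proteins, **kwargs):
    """Return (set of FTs of the proteome, names of CO proteins)."""
    per_protein = {name: frequent_triplets(s, **kwargs)
                   for name, s in proteins.items()}
    co_proteins = {name for name, fts in per_protein.items() if fts}
    dft = set().union(*per_protein.values())
    return dft, co_proteins


# Hypothetical toy proteome; real input would be parsed from FASTA files.
toy = {"p1": "AAAAAAA", "p2": "MKTLLVAGQW", "p3": "AQAQAQAQAQAQ"}
dft, co = proteome_fts(toy)
```

The size of the returned FT set corresponds to the proteomic DFT count, and the fraction of CO proteins is the number of CO proteins divided by the proteome size.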
Setting a threshold n for the FT definition allows comparing CO between different species. With n = 5 a reasonable description of CO can be obtained for all eukaryotes and prokaryotes. Lowering the threshold, one runs into the problem of random effects, while raising the threshold is too restrictive (figure 10, inset), ending up with almost no FT assignments for some prokaryotes. We find that 27% of the human proteins are CO, i.e. contain FTs, while this fraction reduces to 17% in yeast and to 3% in E. coli. The fraction of CO proteins in the proteome is shown in Text S1 (section 3, figure S6b) for the main taxonomic groups we study. Similar orders of magnitude of ordered sequences are obtained by other methods [26]. Our methodology has another important outcome for comparative analysis: it tends to unify the mean compositionally ordered protein length across the tree-of-life (Text S1 -section 3, figure S6 c,d). In other words, CO proteins display higher length similarity between eukaryotes and prokaryotes than average lengths defined over their complete proteomes.

Measures of compositional order (CO)
Many protein sequences show significant periods or quasiperiodic repeats. We wish to distinguish between the two aspects of regularity and periodicity. This is because both can be viewed as orderly elements of a protein sequence, as in many tandem repeats of any k-mer, but they are not necessarily correlated.
Regularity. Sequence regularity is traditionally estimated by the single amino-acid entropy [3,5]. The Shannon entropy can be generalized, using as its basis any k-mer of amino-acids, as follows:

$$S_k = -\sum_{i=1}^{N_k} \frac{n_i}{L} \log_2 \frac{n_i}{L},$$

where N_k is the number of possible k-mers in a sequence, N_k = 20^k; L is the length of the sequence and n_i is the number of occurrences of the i-th k-mer in the sequence. S_k measures sequence regularity, and is low when a clear imbalance in k-mer multiplicity occurs. For comparative analysis, it is useful to normalize the entropy by its maximal value, defining the normalized entropy, nS_k. This describes by how much a given sequence differs from a random sequence of maximum entropy. For k = 1, L ≫ N_1, the maximum of S_1 is log2(20), and the normalized entropy is given by nS_1 = S_1 / log2(20). For k > 2, usually N_k ≫ L, thus the normalized entropy is given by nS_k = S_k / log2(L).
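A short sketch of the k-mer entropy and its normalization is given below. The function names are hypothetical. Two choices are assumptions made for illustration: k-mer frequencies are normalized by the number of sliding windows (L - k + 1) so that they sum to one, and the log2(L) normalization is applied for every k other than 1.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=1):
    """Shannon entropy (in bits) of the overlapping k-mers of seq."""
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(windows))
    return -sum((c / windows) * math.log2(c / windows) for c in counts.values())


def normalized_entropy(seq, k=1, alphabet=20):
    """Entropy divided by its maximum: log2(20) for k = 1, log2(L) otherwise."""
    max_entropy = math.log2(alphabet) if k == 1 else math.log2(len(seq))
    return kmer_entropy(seq, k) / max_entropy


all_twenty = "ACDEFGHIKLMNPQRSTVWY"    # each amino acid exactly once
repeat = "AQAQAQAQAQAQ"                # tandem repeat, low entropy
print(normalized_entropy(all_twenty, k=1), normalized_entropy(repeat, k=3))
```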
Considering all FTs that appear on a given protein sequence, we define their 'relative coverage' RC of this particular protein as the number of distinct amino-acid loci covered by FTs divided by the protein length. This parameter correlates significantly with the normalized entropy nS_3 (Text S1 -section 4): high RC implies very distinctive CO, hence also very low nS_3 entropy. Thus, RC is a good tool for estimating compositional bias. Note that the dynamic range of RC is considerably larger than that of nS_3 (figure S7 C), thus making RC an easier tool to use.
Periodicity. Many CO sequences show significant periodic or quasi-periodic repeats (table 1). Entropy does not provide information about the periodicity of a sequence, neither does RC. We introduce measures based on the intervals between all consecutive occurrences of the same FT, I_FTi. Considering all consecutive intervals of each FT, I = {I_FTi}, one obtains the discrete empirical probability distribution function of all consecutive intervals, P(I). We define the Most Frequent Interval (MFI) of a CO protein as:

$$\mathrm{MFI} = I_{\max} = \arg\max_{I} P(I),$$

where I_max is the interval at which P(I) is maximal. Because MFI is based on the intervals of all FTs, and not on the intervals of the most repetitive single FT, it is a robust estimator. Rarely, there are two or more intervals with equal maximal probability, in which case MFI is taken as the lowest interval, because it is often also the lower harmonic of the higher MFI. The significance of MFI may be further evaluated by considering the number of interval occurrences at MFI (see figure 1c). Based on the MFI we define the 'relative periodicity' RP as the number of FT occurrences within the MFI divided by the total number of FT occurrences:

$$\mathrm{RP} = \frac{\text{number of FT occurrences within the MFI}}{\text{total number of FT occurrences}}.$$

RP measures the relative richness of CO observed within the periodic structure, compared to the overall CO observed in the sequence, and approaches zero as MFI becomes less significant. RP has no significant correlation with either RC or nS_k, and therefore contains independent information (Text S1 -section 4 and figure S7).
Vocabulary. We define a measure of FT-vocabulary as the total number of Different Frequent Triplets (DFT). This is the count of FT types rather than the number of FT occurrences, and can be applied either to single proteins or to a full proteome. The correlation between two DFT sets I, J representing either two proteomes or two proteins is estimated by the Jaccard score C_IJ = |DFT_I ∩ DFT_J| / |DFT_I ∪ DFT_J|, the size of the intersection divided by the size of the union of the sample sets.
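The three measures can be put together in a short sketch for a single protein. The function name and return format are hypothetical. The counting rule for RP is an assumption: an FT occurrence is taken to lie "within the MFI" if its distance to the previous or next occurrence of the same FT equals the MFI.

```python
from collections import Counter, defaultdict


def co_measures(seq, n=5, k=3):
    """Return DFT, RC, MFI and RP for one sequence, or None if it has no FT."""
    if n < 2:
        raise ValueError("n must be at least 2 so that intervals exist")
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    fts = {kmer: pos for kmer, pos in positions.items() if len(pos) >= n}
    if not fts:
        return None

    covered = set()        # residues covered by at least one FT occurrence
    intervals = Counter()  # distances between consecutive occurrences of an FT
    for pos in fts.values():
        for p in pos:
            covered.update(range(p, p + k))
        intervals.update(b - a for a, b in zip(pos, pos[1:]))

    top = max(intervals.values())
    mfi = min(i for i, c in intervals.items() if c == top)  # ties: lowest interval

    total = periodic = 0
    for pos in fts.values():
        total += len(pos)
        marked = set()     # occurrences adjacent to an interval equal to the MFI
        for j, (a, b) in enumerate(zip(pos, pos[1:])):
            if b - a == mfi:
                marked.update((j, j + 1))
        periodic += len(marked)

    return {"DFT": len(fts), "RC": len(covered) / len(seq),
            "MFI": mfi, "RP": periodic / total}


print(co_measures("AAAAAAA"))        # a run of A
print(co_measures("AQAQAQAQAQAQ"))   # a tandem repeat of period 2
```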

The case of overlapping triplets
Triplets may overlap. For example, in a protein containing the sequence AAAAAAA there are 5 occurrences of the single FT AAA, lagged by interval 1 and of total coverage of 7 amino-acids. This defines an MFI = 1 case, here a run of A. Similarly, MFI = 2 is a property of tandem repeats such as AQAQAQAQAQAQ, composed of two FTs (AQA and QAQ) that occur 5 times each, separated by interval 2 and with total coverage of 12. The occurrence of overlapping triplets is highly reduced in random models. Hence, one could have subjected runs to a weaker constraint than n = 5. This has little effect on global analysis, hence we stick to a single definition, always using n = 5 as the minimal number of occurrences of an FT. As a result, proteins which have only short runs (< 7) are excluded from our set of CO proteins.

Relation to other repetitive phenomena and motifs characterization
Several tools have been developed for analyzing tandem repeats, both at the DNA level [72] and at the protein level [2,73]. While such methods may provide detailed local information on repeat properties, such as unit length, purity, and alignment-based repeat-variations, they are applicable to a subgroup of CO proteins with particular periodic structures. Moreover, such methods usually involve more than one stage of filtering and therefore require further choices of internal parameters to be able to account for each repetitive phenomenon such as tandem repeats or cryptically simple repeats. Our CO measures provide global information and are not restricted to a specific type of periodic structure. They span the spectrum of possible high-order structures, and therefore are not always directly comparable to other existing motif characterizations.
Nevertheless, within high RP proteins one finds cases of tandem repeats, for which the CO measures can account for some basic motif properties, such as unit length and purity. In a highly pure motif repeat pattern, for example PRDM9 (Table 1), the identified DFTs compose the motif of unit length 28. Thus, DFT = MFI indicates a highly pure motif. Even if part of the motif is variable, the identified FTs can appear at equally spaced intervals leading to high RP, and MFI still indicates the unit length. Thus, RP is independent of motif purity, which may be accounted for by the difference between DFT and MFI. DFT < MFI indicates a decrease in purity. Note that DFT > MFI corresponds to larger regularity than periodicity, as in some low-complexity regions, and will usually indicate low RP but moderate or high RC.

Functional enrichment
For large groups of proteins we employ the bioinformatics tool provided by GOrilla [39,40]. In addition, to explore possible dependencies on CO measures, enrichment levels were estimated by text search of key-terms in GO annotations. Specifically, we estimate N_tot, the total number of proteins with order measure value > some threshold. For a given threshold, we obtain the number of proteins that are associated with a certain GO term, N_fun. The dependence of the ratio N_fun/N_tot on the threshold provides an enrichment pattern which allows for quantifying dependencies on CO measures also for small functional groups. Enrichment P-values are estimated by the Hypergeometric test, comparing the values of N_fun and N_tot obtained at the maximum of the ratio N_fun/N_tot with their values at the minimal threshold. To avoid small number effects, N_fun is required to be at least 10% of the maximal N_fun (i.e. its value at the minimal threshold), and in any case N_fun > 15, such that there exists sufficient data for statistical inference of enrichment levels.
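The threshold scan and the hypergeometric comparison described above can be sketched as follows. All names are hypothetical and the toy data are invented for illustration; the small-number cut-offs on N_fun are omitted for brevity, and the one-sided tail P(X ≥ N_fun) is an assumed form of the test.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are drawn without replacement."""
    upper = min(successes, draws)
    favorable = sum(comb(successes, x) * comb(population - successes, draws - x)
                    for x in range(k, upper + 1))
    return favorable / comb(population, draws)


def enrichment_pattern(values, annotated, thresholds):
    """Rows (threshold, N_tot, N_fun, N_fun/N_tot) for proteins with value > threshold."""
    rows = []
    for t in thresholds:
        selected = [a for v, a in zip(values, annotated) if v > t]
        n_tot, n_fun = len(selected), sum(selected)
        if n_tot:
            rows.append((t, n_tot, n_fun, n_fun / n_tot))
    return rows


def enrichment_p_value(rows):
    """Compare the row of maximal ratio with the row at the minimal threshold."""
    base = min(rows, key=lambda r: r[0])
    best = max(rows, key=lambda r: r[3])
    return hypergeom_tail(best[2], base[1], base[2], best[1])


# Invented example: RP-like values and a GO-term flag per protein.
values = [0.1, 0.2, 0.3, 0.8, 0.9, 0.95]
annotated = [False, False, True, True, True, True]
rows = enrichment_pattern(values, annotated, thresholds=[0.0, 0.5])
p_value = enrichment_p_value(rows)
```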

Data bases
Full proteomes of 94 species distributed across the tree-of-life were downloaded from NCBI Ref-Seq, and served as the basis of the large-scale analysis of compositional order enrichment. Swiss-Prot reviewed proteomes of several representative species from animalia (human, mouse), plantae (A. thaliana), fungi (S. cerevisiae) and bacteria (E. coli) were downloaded and analyzed as well. For S. cerevisiae we compared in addition the Swiss-Prot proteome to SGD data-base for validation of proteins annotation. Generally, Swiss-Prot contains fewer proteins than NCBI. It provides however high quality information about proteins' biological functions, and contains non-redundant sets. Therefore, we used Swiss-Prot for detailed protein functional analysis, and also to test the sensitivity of CO measures to the choice of data base (see the following sensitivity analysis).

Sensitivity analysis
Interval distributions and the parameters specifying an FT. We have opted for the use of amino-acid triplets, i.e. k-mers with k = 3. Moreover, we have defined an FT using a constant threshold for its number of occurrences, n = 5. In Text S1 (section 5) we have tested other alternatives, i.e. different values of k and different thresholds that depend on protein length. The results, presented as rank-ordered interval distributions in figure S8, show that the regular FT-definition is optimal for capturing large repetitive motifs that are missed by alternative definitions of a frequent k-mer.
Table 5, grouped into kingdoms. There exist outstanding DFT counts in prokaryotes, such as for C. chlorochromatii, where a few extremely long genes are responsible for the FT assignment of more than half the DFT. Such exceptions occur in prokaryotes and account for the outliers in figure 5. However, no such case exists in eukaryotes. Even TITIN, which carries many FTs, is responsible for the FT assignment of less than 2% of all DFT in human. Further analysis of the sensitivity to long proteins is provided in Text S1 (sections 1-2).

Graphic User Interface for practical research of CO
We provide a user-friendly MATLAB GUI package that implements the search of triplets in proteins, to help interested practitioners use our method for further research. The GUI accepts proteins in FASTA format. Both single proteins and full proteomes can be uploaded. The GUI allows changing the key parameters n and M. The output is a list of CO proteins and their CO properties, which is automatically saved as a text file and can be opened with Microsoft Excel. Key properties and figures are presented in the GUI output. The GUI is available at: http://neuron.tau.ac.il/~horn/research.html
This fraction is smaller than 5% for all M. C) N_CO found in a uniform random model compared with N_CO found in the human proteome, showing a minor fraction of < 3%. D) Regular DFTs of human for various M (black bars), compared with those obtained for the uniform random model (red). Two long proteins (TITIN and MUC16, of length 34350 and 22152, respectively) may contribute a large number to the total DFTs found in the proteome. However, excluding them reduces DFT counts considerably in the random model (green) but not in the human proteome (gray). This shows that the human proteome is not sensitive to the contribution of a few proteins, even the very long ones, in contrast to the random model. It also shows that the number of erroneous regular FTs in long proteins may be large and should be investigated separately. E-F) Same as C-D for the human unigram model. (TIF)
Figure S4 Dependence of DFT counts in the proteome on single protein contribution. Proteins were sorted by the number of FTs identified in them according to the regular (A) and restrictive definition (B) in the human proteome (black), in the uniform model (red) and in the unigram model (blue). Long CO proteins were removed one by one (rank-ordered by length) from the set and the DFT count was reassessed. In A, the two long proteins (see text) contribute many DFTs in the random models, but not in human.
Figure S8 Sensitivity tests based on the rank-ordering interval distributions. Sensitivity test of human rank-ordering interval distributions for different k-mer lengths (k) and repetition thresholds (th). The threshold is defined as the ratio of the number of repetitions to the protein length. The proteins found to have k-mers that pass the threshold define a new CO set, which is compared with our original CO set (defined by 5 repeats and k = 3). The percentage of overlap between the two sets is shown. Note that k = 4, th = 0.1% and the regular FT definition (k = 3, th = 5) provide approximately the same CO proteins, and their interval distributions are very similar: they practically overlap, except for their tails. (TIF)

Figure S15 Summary of DFT presence in the human proteome. DFTs are ordered according to their abundance in the human Swiss-Prot proteome (x-axis, main panel) and proteins are ordered according to the classification of the clustergram in figure S14 (y-axis, main panel). The abundance of each FT is shown in the upper panel (blue) and the number of DFTs in each protein is shown in the left panel (blue). On the right, a zoom into the 50 most prevalent FTs shows their co-occurrences in groups of proteins. Note that their rank is slightly different from Table 6, which is based on NCBI-RefSeq.

Figure S18 Linear fits of the rank-ordered probability distribution functions in eukaryotes. Rank-ordered distribution functions (black circles) and the corresponding linear fits over the range 0-6 are shown for human (A), A. thaliana (B) and S. cerevisiae (C) as colored dashed lines. Applying such fits to all eukaryotes, we find that the power-law exponents are close to -1 (D), displaying a universal behavior that is close to the Zipf law. P-values in all cases are practically 0. (TIF)

Figure S19 Slopes of the rank-ordered probability distribution functions. Slopes of all individual interval probability distribution functions as obtained by a linear fit. P-values among all species were lower than 10^-239. (TIF)

Figure S20 Protein age vs RP. The average age of proteins (black) is shown versus elevated RP. Error bars on the mean age are also shown. The statistical significance of the difference between the age distribution for a given RP threshold and the age distribution of the entire CO set was estimated using the Wilcoxon rank-sum test (red).

(TIF)
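The linear fits reported for Figures S18 and S19 can be illustrated with a short Python sketch of a least-squares line in log-log coordinates, whose slope estimates the power-law exponent. It assumes that the fit is made on log(probability) versus log(rank), and that the quoted range 0-6 refers to the logarithm of the rank; neither is stated explicitly here. The function name and the max_log_rank parameter are hypothetical.

```python
import math


def loglog_slope(probabilities, max_log_rank=None):
    """Least-squares slope of log(probability) versus log(rank).

    Values are rank-ordered from largest to smallest. Ranks whose natural
    log exceeds max_log_rank (if given) are excluded from the fit.
    """
    ranked = sorted((p for p in probabilities if p > 0), reverse=True)
    xs, ys = [], []
    for rank, p in enumerate(ranked, start=1):
        x = math.log(rank)
        if max_log_rank is not None and x > max_log_rank:
            break
        xs.append(x)
        ys.append(math.log(p))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic Zipf-like values (not data from the paper): p proportional to 1/rank.
zipf = [1.0 / r for r in range(1, 501)]
slope = loglog_slope(zipf, max_log_rank=6)
```

For an exact 1/rank distribution the fitted slope equals -1 up to floating-point error, which is the Zipf-law reference value the captions compare against.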
Table S1 Full list of all human CO proteins. Detailed sequence information is provided for all identified human CO proteins in the Swiss-Prot record. (XLSX)

Table S2 Characteristics of the 10 longest human proteins. The table lists their length, the DFT counts assigned according to the regular and restrictive definitions, and error estimates on the latter, given in parentheses, for the uniform and unigram random models respectively. In the unigram model the inhomogeneous drop with decreasing length is due to statistical fluctuations. The last column shows that the amendment due to periodicity considerations (see figure S5) considerably increases the number of validated FTs. (DOCX)

Table S3 Mutation and selection in cell wall proteins in yeast by DNA analysis. Analysis of repetitive motifs in cell wall proteins of S. cerevisiae. The first and second columns show the name, function and starting location of the repetitive section. The third and fourth columns show the sequences at the protein and DNA levels respectively. Motifs are divided into identical sections forming groups, which are colored gray (group 1), blue (group 2) and white (group 3). The first group is taken as a reference motif and the remaining groups are compared to it in the following way. Yellow letters indicate mutations that generate a distinct motif, i.e. amino acids different from group 1. Red letters indicate synonymous mutations within each motif group, which do not change the motif composition at the amino-acid level, indicating that some amino acids are protected by selection. (DOCX)

u in descending order. Both measures provide a hierarchical ordering of major clades. They also correlate significantly (correlation coefficient = -0.6, P-value = 0.012). Clade notations are the same as in figure 5 of the main text. (DOCX)

Table S5 Non-orthologous genes and proteins in human and mouse using Ensembl. For non-orthologous genes we compare the numbers of different genes, available protein coding sequences and the number of CO proteins, for five different data sets: 1) known genes with RefSeq Protein IDs; 2) all genes, i.e., including novel and putative, with RefSeq IDs; 3) all genes with UniProtKB/TrEMBL accessions; 4) all genes with Entrez IDs; 5) no filter applied. The ratio between the numbers of CO human and mouse proteins is between 2 and 5. As found in Swiss-Prot, a large fraction of the novel CO human proteins are zinc fingers. (XLSX)

Text S1 Supporting analysis. Description of all the supporting analyses (13 sections) that are mentioned in the main text. (DOCX)

Text S2 GOrilla analysis of human CO proteins. Results of functional enrichments in human CO proteins using the GOrilla web tool. Enrichments of processes, functions and cellular components are shown for three analyses: (i) comparing the human CO proteins ('target') to all human proteins ('background'); (ii) ranking human CO proteins by RC; (iii) ranking human CO proteins by RP. (DOCX)

Text S3 GOrilla analysis of human novel CO proteins. Results of functional enrichments in human novel CO proteins using the GOrilla web tool. Enrichments of processes, functions and cellular components are shown for (i) human novel CO proteins ranked by RC; (ii) human novel CO proteins ranked by RP. (DOCX)