Single Amino Acid Repeats in the Proteome World: Structural, Functional, and Evolutionary Insights

  • Amitha Sampath Kumar,

    Affiliation Centre for Cellular and Molecular Biology, Council of Scientific and Industrial Research, Uppal Road, Hyderabad, 500007, India

  • Divya Tej Sowpati,

    Affiliation Centre for Cellular and Molecular Biology, Council of Scientific and Industrial Research, Uppal Road, Hyderabad, 500007, India

  • Rakesh K. Mishra

    Affiliation Centre for Cellular and Molecular Biology, Council of Scientific and Industrial Research, Uppal Road, Hyderabad, 500007, India


Microsatellites or simple sequence repeats (SSRs) are abundant, highly diverse stretches of short DNA repeats present in all genomes. Tandem mono/tri/hexanucleotide repeats in the coding regions contribute to single amino acid repeats (SAARs) in the proteome. While SSRs in the coding region always result in amino acid repeats, a majority of SAARs arise from a combination of various codons representing the same amino acid and not as a consequence of SSR events. Certain amino acids are abundant in repeat regions, indicating a positive selection pressure behind the accumulation of SAARs. By analysing 22 proteomes including the human proteome, we explored the functional and structural relationship of amino acid repeats in an evolutionary context. Only ~15% of repeats are present in any known functional domain, while ~74% of repeats are present in the disordered regions, suggesting that SAARs add to the functionality of proteins by providing flexibility and stability and by acting as linker elements between domains. Comparison of SAAR-containing proteins across species reveals that while shorter repeats are conserved among orthologs, proteins with longer repeats (>15 amino acids) are unique to the respective organism. Lysine repeats are well conserved among orthologs with respect to their length and number of occurrences in a protein. Repeats of other amino acids, such as glutamic acid, proline, serine and alanine, are generally conserved among the orthologs with varying repeat lengths. These findings suggest that SAARs have accumulated in the proteome under positive selection pressure and that they provide flexibility for optimal folding of functional/structural domains of proteins. The insights gained from our observations can help in the effective design and engineering of proteins with novel features.


The human genome encompasses a large number of repetitive elements that constitute more than 50% of the genome [1]. Simple sequence repeats (SSRs) are found in many organisms, though they are more predominant in eukaryotes than in prokaryotes [2]. They are dynamic elements occurring in diverse patterns and locations in the genome and are almost unique across organisms. SSRs constitute 3% of the human genome. They are considered the “turning knobs” of evolution that offer genetic diversity as structural and functional elements [3–9]. SSRs evolve faster than point mutations in terms of dynamic genetic variability, and there is a bias for the elongation of the repeats rather than their shortening [10]. This points towards functional utility and positive selection of these elements [11, 12]. Among SSRs, tri/hexanucleotide repeats are predominantly present in exons and contribute to single amino acid repeats (SAARs) at the protein level [13].

In addition to being a direct consequence of SSRs in coding regions, SAARs can also be coded by a combination of codons representing the same amino acid. Since proteins have diverse functions, the likely roles of SAARs are similarly diverse. Some of these repeats are tolerated by the proteins while certain others lead to a monogenic disease state. Abnormal SAAR expansion is known to cause many neurodegenerative diseases in humans, with varying severity and inheritability [14–16]. The role of SAARs in most proteins, however, is not yet understood. It has been shown that SAARs may serve as spacer elements, separating functional domains within the protein [17]. Certain SAARs also regulate transcription [18–20] and facilitate protein-protein interactions [21]. The genes that code for proteins with polyalanine stretches show high selectivity for their base composition and mostly encode DNA- and RNA-binding proteins [22]. While some studies suggest that SAARs are present merely to increase the size of the protein [2], one study has shown that the SAARs in proteins are essential for proper envelope targeting in chloroplasts [23]. However, a general understanding of the functional importance and distribution of SAARs remains to be established.

Previous studies have addressed the distribution of SAARs but many questions regarding their function and evolution remain unanswered [20, 24–27]. In this study, we have used in silico analysis to study 22 proteomes of different species of the animal kingdom to explore the evolutionary, structural and functional relevance of SAARs. This study provides clues that may be useful in understanding the organisation and stability of multiple domains in proteins and help in designing proteins with novel structures and functional features.

Materials and Methods

Proteome datasets

All proteomes of various vertebrates and invertebrates analysed in this study are listed in Table 1. The proteome datasets were downloaded from the UniProt database [28]. We considered both reviewed and unreviewed sets of proteins for our analysis. In case more than one isoform of a protein was available, we retained only the largest isoform to eliminate redundancy. Homo sapiens is the primary organism in our study, and most comparisons were done with respect to the human proteome.

Identification of SAARs

A homopolymer of any amino acid with a length of ≥5 residues was considered an SAAR. Every occurrence of an uninterrupted SAAR is considered an event. A protein can thus have more than one event. For each protein in the proteome, the protein ID, the number of events, the amino acids involved in the events, and the repeat length of each event were recorded. To identify and categorize SAARs in the proteomes, a custom Perl script was used.
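The custom Perl script is not reproduced in the text. As a rough illustration only, the definition above (an uninterrupted run of one residue, at least 5 long) could be sketched in Python as follows; the function name `find_saars` and the tuple layout of its output are assumptions made for this example, not the authors' implementation.

```python
import re


def find_saars(sequence, min_len=5):
    """Return (residue, start_index, length) for every uninterrupted
    homopolymer run of at least `min_len` residues in `sequence`.

    Hypothetical helper: the study used a custom Perl script instead.
    """
    # ([A-Z]) captures one residue; \1{n,} requires at least n more copies,
    # so the whole match is a run of min_len or more identical residues.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), len(m.group(0)))
            for m in pattern.finditer(sequence)]


# Made-up sequence with one alanine event and one glutamine event.
events = find_saars("M" + "A" * 5 + "K" + "Q" * 7 + "T")
print(events)
```

Because the regular expression is greedy, each run is reported once at its full length, which matches the notion of one event per uninterrupted repeat.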

We calculated SAAR density and total density for all the amino acids to discriminate their presence as repeats vs. their total occurrence. SAAR density is calculated as the number of residues of an amino acid present as repeats divided by the total proteome size, normalized to one million residues. Similarly, total density is calculated as the total number of residues of any particular amino acid divided by the total proteome size, normalized to one million residues.
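The two densities can be illustrated with a short Python sketch. It assumes the proteome is held as a dictionary of protein ID to sequence, and the name `densities` is hypothetical; it is one way of computing the quantities defined above, not the authors' code.

```python
import re
from collections import Counter

# A run of 5 or more identical residues (the SAAR definition used in the text).
SAAR_PATTERN = re.compile(r"([A-Z])\1{4,}")


def densities(proteome):
    """Return (saar_density, total_density): two dicts keyed by amino acid,
    each expressed per one million residues of the proteome.

    `proteome` is assumed to map protein IDs to sequences.
    """
    total_residues = sum(len(seq) for seq in proteome.values())
    total_counts = Counter()
    saar_counts = Counter()
    for seq in proteome.values():
        total_counts.update(seq)  # every residue counts towards total density
        for match in SAAR_PATTERN.finditer(seq):
            # only residues inside a repeat count towards SAAR density
            saar_counts[match.group(1)] += len(match.group(0))
    saar_density = {aa: n * 1_000_000 / total_residues
                    for aa, n in saar_counts.items()}
    total_density = {aa: n * 1_000_000 / total_residues
                     for aa, n in total_counts.items()}
    return saar_density, total_density


# Toy proteome (made-up sequences) with 10 residues in total.
saar, total = densities({"P1": "MAAAAAK", "P2": "AQT"})
print(saar["A"], total["A"])
```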

Functional and structural annotation

For functional annotation of our proteome datasets, we used the Pfam-A [29] domain database, which provides reviewed domain annotation. The database described in [30] was used for disordered domain annotation. Secondary structure information was annotated using the UniProt database for the human proteome [28]. We also populated the secondary structure annotation for all the proteins with SAARs from the PDB database where complete solved structures were available [31].

Protein orthologs and SAAR conservation

To identify protein ortholog pairs, we used an all vs. all blastp search between human and other proteomes. The blastp hits (human vs. each of the other organisms under study) with highest identity scores for each protein were searched for the conservation of functional domains. If the domains were conserved, the pairs were considered orthologous and were included in the study. Orthologs were grouped into two categories based on the level of repeat conservation. In the first category, we considered ortholog pairs where the order of SAAR events and their lengths are conserved. The second group included the orthologs where the SAAR events are conserved but not the lengths.
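The two conservation categories can be expressed as a small decision rule. The sketch below assumes each protein's SAAR events are available as an ordered list of (residue, length) pairs; the function name `classify_conservation` and the handling of proteins without events are assumptions for illustration, and the authors' exact matching criteria may differ.

```python
def classify_conservation(human_events, ortholog_events):
    """Classify an ortholog pair by SAAR conservation.

    Each argument is an ordered list of (residue, repeat_length) tuples.
    Returns 1 if the order of events and their lengths match,
            2 if the order of events matches but the lengths differ,
            None if the events are not conserved (or either list is empty).
    """
    if not human_events or not ortholog_events:
        return None
    human_residues = [residue for residue, _ in human_events]
    ortholog_residues = [residue for residue, _ in ortholog_events]
    if human_residues != ortholog_residues:
        return None  # different residues or a different order of events
    human_lengths = [length for _, length in human_events]
    ortholog_lengths = [length for _, length in ortholog_events]
    return 1 if human_lengths == ortholog_lengths else 2


# Made-up example: same events in the same order, one length differs.
print(classify_conservation([("P", 5), ("Q", 6)], [("P", 5), ("Q", 9)]))
```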

Gene ontology

Gene ontology (GO) annotations were done using the Panther database [32]. Statistical overrepresentation test for molecular function GO terms was performed with proteins containing SAARs, using the human proteome as background. Only those categories that showed a p-value of <0.05 were retained.

Codon organisation of Single amino acid repeats in the Human genome

To understand the base composition of protein repeats, we analysed the DNA sequence corresponding to each SAAR from the coding region of the human genome hg38. The CDS annotation was retrieved from the UCSC table browser [33]. We retrieved the codon fraction information for the human CDS using the GenScript Codon Usage Frequency Table Tool [34]. The algorithm implemented for the extraction of genomic regions contributing to SAARs ensured maximum inclusion of such regions. For overlapping regions of CDS, the transcript with the highest number of repeat events was taken when the SAAR events were common to both transcripts; otherwise, all unique repeat-associated regions were considered. A tandem trinucleotide repeat of at least 12 nucleotides (4 repeating units) in the coding region was considered as one SSR event.
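The SSR criterion (at least 4 tandem copies of the same trinucleotide) can be illustrated with a minimal Python sketch. It assumes the coding sequence of a SAAR is supplied in frame, so that position 0 starts a codon; the function names are hypothetical and this is not the authors' extraction algorithm.

```python
def longest_codon_run(cds):
    """Length, in codons, of the longest tandem run of one identical codon.

    `cds` is assumed to be an in-frame coding sequence; trailing bases that
    do not complete a codon are ignored.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    best = run = 0
    previous = None
    for codon in codons:
        run = run + 1 if codon == previous else 1
        previous = codon
        best = max(best, run)
    return best


def is_ssr_derived(cds, min_units=4):
    """True if the region contains at least `min_units` tandem copies of a codon."""
    return longest_codon_run(cds) >= min_units


print(is_ssr_derived("CAG" * 5))          # pure CAG tract encoding glutamine
print(is_ssr_derived("CAGCAACAGCAACAG"))  # mixed CAG/CAA codons, same amino acid
```

Both example sequences encode five glutamines, but only the first meets the SSR criterion; the second is a SAAR built from a combination of synonymous codons.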

Statistical analysis

For categorization of the relationship between amino acids based on their SAAR density vs. the total density in the proteome, we performed hierarchical clustering on the Canberra distance of amino acids. Canberra distance is a measure of the distance between paired points in a vector space. For two vectors in an n-dimensional vector space, Canberra distance d can be calculated as:

$$d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} \frac{|p_i - q_i|}{|p_i| + |q_i|}$$

where p and q are two vectors of real numbers.

The Canberra distance of amino acids was computed in R. The SAAR densities and total densities of all amino acids were stored in two vectors such that the nth element of both vectors corresponds to the same amino acid. Both vectors were fed into R’s dist function to calculate the Canberra distance.
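The study computed this distance in R. For readers who want to see the arithmetic, the following Python sketch implements the standard Canberra distance and applies it pairwise to amino acids, under the assumption that each amino acid is represented by the pair (SAAR density, total density). That layout is one plausible reading of the description above, and the names `canberra` and `pairwise_canberra` are hypothetical.

```python
from itertools import combinations


def canberra(p, q):
    """Canberra distance: sum over i of |p_i - q_i| / (|p_i| + |q_i|).

    Terms where both coordinates are zero are skipped to avoid 0/0.
    """
    total = 0.0
    for a, b in zip(p, q):
        denominator = abs(a) + abs(b)
        if denominator == 0:
            continue
        total += abs(a - b) / denominator
    return total


def pairwise_canberra(points):
    """Distance between every pair of labelled points.

    `points` maps a label (e.g. an amino acid) to a tuple of values,
    here assumed to be (SAAR density, total density).
    """
    return {(a, b): canberra(points[a], points[b])
            for a, b in combinations(sorted(points), 2)}


# Made-up densities, purely to show the call pattern.
print(pairwise_canberra({"A": (1.0, 2.0), "Q": (3.0, 2.0)}))
```

The resulting pairwise distances are the kind of input a hierarchical clustering routine takes.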

Hierarchical clustering was performed using the hclust function in R. Linear regression analysis was performed using the lm function of R, and the results were plotted using the ggplot2 package [35]. Fisher’s exact test and Chi-squared tests were performed using GraphPad [36].

Repeat expansion diseases

We looked for repeat conservation in genes associated with repeat expansion diseases [34]. Orthologs of these human genes were searched using the Homologene database for various animal and plant genomes, including Pan troglodytes, Canis lupus familiaris, Bos taurus, Mus musculus, Rattus norvegicus, Oryza sativa, Arabidopsis thaliana, Caenorhabditis elegans, Anopheles gambiae str. PEST and Drosophila melanogaster. We studied repeat conservation in the genes associated with Huntington's disease (HTT), oculopharyngeal muscular dystrophy (PABPN1), and spinocerebellar ataxia type 3 (ATXN3).


1. SAARs in the Human proteome

1.1 Distribution of SAARs in human proteome.

SAARs with a minimum length of 5 residues were searched for in the human proteome using Perl scripts developed in-house. About 14% of the proteins in the human proteome contain SAARs. A total of 11852 SAAR events were found in 9780 proteins, indicating that a significant proportion of the proteins have >1 event. We asked if SAARs are random events linearly correlated to the total density of an amino acid. To answer this, we calculated, for all amino acids, the total density (the abundance of a given amino acid in a proteome normalized to one million residues) and the SAAR density (the abundance of a given amino acid as part of repeats, normalized to proteome size) (see Methods). Compared to the total density of each amino acid, SAAR density showed a varying abundance (Fig 1A), indicating selective enrichment of SAARs. Using hierarchical clustering of mean Canberra distance for each amino acid (see Methods), amino acids could be categorized into three distinct groups based on the ratio of their density in the whole proteome to their SAAR density (Fig 1B). The first group (glutamic acid, proline, alanine, serine, leucine, glycine and glutamine) consists of those amino acids which are abundant in the proteome and also show high SAAR densities. Glutamine, proline, alanine and glutamic acid show particularly high abundance of repeats. The second group comprises lysine, threonine, aspartic acid, arginine and histidine, which show a relatively higher total density but much lower SAAR density. In the last group, we see a set of amino acids that are intolerant to repeats and are also less dense in the whole proteome. This group includes cysteine, phenylalanine, valine, methionine, tyrosine, tryptophan, isoleucine and asparagine. Within the last group, valine, isoleucine and asparagine show a slightly higher density in the whole proteome but remain intolerant to repeats.

Fig 1. Comparison of amino acid and SAAR density in the human proteome.

The amino acid density and SAAR density normalized to 1 million residues were calculated for all the 20 amino acids. (A) The percentage of each amino acid in the whole proteome and SAARs are represented as vertical bars. The black dot represents the SAAR percentage in each bar and the opposite end indicates amino acid percentage in the whole proteome. The bars are grouped by colour to indicate the three distinct patterns observed (see text) (blue—group1, red—group2, green—group3) (B) A distance-based dendrogram was plotted for the values of amino acid density and SAAR density for all the 20 amino acids. A distinct pattern of preference for SAARs vs. proteome density is seen clustered as three groups as described in (A) (see text)

1.2 Physical properties of amino acids in SAARs.

Proteins with stretches of hydrophobic amino acids often fold in such a way that the hydrophobic regions are shielded from solvent access [36]. We therefore asked whether amino acids with polar or non-polar side chains show any trend related to their number or length of SAAR events. To address this question, we grouped the SAARs based on whether the amino acid is hydrophobic or not. In both categories, we analysed the number of SAAR events and the SAAR density (SAAR events per million amino acid residues), the maximum number of events tolerated by a single protein, and the longest repeat in the whole proteome, for every amino acid. The hydrophobicity of an amino acid does not appear to be a factor for the abundance or length of its SAAR events (Fig 2A). However, we observed a relation between the abundance and length of SAAR events; amino acids that show a higher abundance of SAARs are also likely to be tolerated as long repeats. To test the validity of this observation, we compared the maximum number of SAAR events that could be accommodated within a single protein and the longest SAAR length for all the amino acids. A linear regression fit at 95% confidence indicated that most amino acids that contribute to a high number of repeat events in a single protein can also be tolerated as long SAARs (Fig 2B). Exceptions to this are threonine and proline (p < 0.05, linear regression), which exhibit a tendency to be present as shorter but abundant events within a protein. For example, the human protein Mucin-2 (Q02817), which coats the epithelia of many mucous membrane-containing organs and prevents bacteria from entering the inner mucus layer [37], has 112 threonine SAAR events, the most of any human protein.

Fig 2. SAAR repeat lengths and occurrences in the human proteome.

(A) For each amino acid, the number of SAAR events was calculated with a minimum length of 5 residues. The heat map shows the number of events in the various repeat lengths, with each cell indicating the number of events grouped by their physical properties (Hydrophobic: orange, Hydrophilic: blue). Empty cells indicate 0 events. (B) For all the amino acids in the human proteome, the longest repeat present and the maximum number of SAAR events present within a protein were calculated. A scatter plot shows the longest repeat length (x-axis) vs. the maximum number of events tolerated in a single protein (y-axis); the grey shaded region indicates 95% confidence in a linear regression analysis.

1.3 Functional and structural association of SAAR in the protein.

We next asked if we could classify the proteins containing SAARs based on their function. To study this, we annotated all human proteins containing SAARs using the molecular function gene ontology terms from the Panther database and further grouped these proteins based on the hydrophobicity of their SAARs (S1 Fig). Using statistical over- and underrepresentation tests (see Methods), we observed a preference of SAARs for specific functions; proteins containing hydrophobic SAARs showed enrichment for receptor activity whereas those containing hydrophilic SAARs were underrepresented for the same. In general, proteins containing SAARs were overrepresented in various binding activities, particularly chromatin binding, mRNA binding, transcription factor binding, cytoskeletal protein binding, and DNA and RNA binding.

A closer look at the molecular functions of proteins containing SAARs of specific amino acids revealed high enrichment and clustering for certain functions (Fig 3). Proteins with repeats of amino acids such as alanine, proline, glutamine and glycine are ~1.5-fold enriched for various binding activities. Proteins associated with transcription activity are enriched for glutamine and glycine SAARs. We also see a high enrichment for catalytic activity in proteins with lysine and threonine SAARs.

Fig 3. Distribution of various molecular activities associated with SAAR containing proteins in the human proteome.

Molecular function class for proteins was annotated using the Panther database. Proteins with any amino acid repeat and proteins with a particular abundant amino acid repeat such as proline, alanine, leucine, glutamine, threonine, serine, glycine and glutamic acid are reported. Each cell in the heat map shows the fold enrichment between expected and observed frequency in reference to the human proteome. The colours in the heatmap scale from red to green where the fold enrichment is from 0 to 5 respectively.

Most proteins have known motifs related to distinct functions. To understand the contribution of SAARs towards functional motifs in proteins, we annotated our proteome datasets for functional domains using the Pfam-A database. We grouped the SAARs present within or outside the functional domains. Very few SAARs (1666 out of 11842, ~15%) were present in the functionally annotated domains while most of the SAARs mapped outside any known functional domain (S1 File). This suggests that SAARs are generally not part of any known functional domains and do not have any distinct functional features on their own.

Proteins have a complex structure determined by the three-dimensional folding of different structural domains, while some regions within the proteins remain unfolded and disordered. Disordered domains offer an advantage to the protein by making it more structurally flexible and facilitating optional folding [38]. To infer any association of SAARs with such regions, we used the disorder annotation database [30] (see Methods) to annotate disordered regions in our SAAR-containing protein datasets. We could annotate 8689 out of the total 11852 SAARs found. Interestingly, a large number of annotated events (6406 out of 8689, 73.7%) were part of disordered regions. We further classified the events for each individual amino acid to assess the preference of various amino acids to be part of disordered regions. Repeats of many amino acids displayed a preference for occurrence within disordered domains. In particular, more than 90% of the SAARs involving the amino acids proline, serine, glycine, glutamic acid and aspartic acid were in disordered regions (Fig 4A). This was further confirmed using linear regression analysis, which showed that the fit for most amino acids is inclined towards disordered regions (Fig 4C). In contrast, leucine and alanine showed the opposite preference, with a majority of their SAARs falling in non-disordered regions. When we grouped the SAARs based on the hydrophobicity of amino acids, we observed that hydrophilic amino acid repeats are more inclined to be part of disordered domains (p < 0.0001, Fisher’s exact test, Fig 4B). This observation points towards the role of SAARs in providing flexibility to the proteins rather than being part of functional domains.

Fig 4. Disordered and regular domains in the human proteome.

(A) SAARs were categorized based on their presence in the regular or disordered region of the protein, annotated using the disorder annotation database [30]. The X-axis denotes each of the 20 different amino acids and the Y-axis shows the number of SAAR events in the disordered and regular part of proteins in the human proteome. (B) A stacked bar plot shows the number of events in the disordered region of the protein (red) and the regular region (blue) categorized based on the amino acids’ physical properties. (C) Scatter plot with linear regression analysis for each amino acid showing the number of SAAR events in the disordered regions (X-axis) vs. SAAR events in the other regions (Y-axis). Grey shaded area indicates 95% confidence interval.

1.4 Codon structure of Single amino acid repeats in the Human genome.

SAARs can be a direct consequence of simple sequence repeats (SSRs) present in coding sequences or a combination of codons coding for the same amino acid. Using coding sequences corresponding to SAARs in the human genome hg19, we set out to identify the fraction of SAARs contributed by SSRs. For all the amino acids that are present as SAARs, we calculated the number of times they were present as SSRs (a tandem triplet repeat of at least 12 nucleotides, i.e. 4 codons) (S2 Fig). We observed that 32% of SAAR events were present as SSRs at the genomic level and the rest consisted of a combination of codons. Interestingly, just 46 SAAR events were split across exons at the genomic level (data not shown).

We further asked if there was a bias in codon usage at sequences contributing to SAARs. To check this, we gathered the codon usage for the entire human CDS using the GenScript Codon Usage Frequency Table and compared it with the codon fraction for SAARs. For most amino acids, SAARs comprising mixed codons showed no significant bias compared to the codon usage of the human CDS (p > 0.01, Chi-squared test, Table 2). However, SAARs arising from SSRs at the genomic level showed a strong preference for certain codons. The most drastic example is that of threonine repeats: >99% (1393 out of 1404) of the codons encoding threonine repeats were ACC, whereas ACC accounts for only 36% of all threonine codons in the human CDS. Other amino acids that display this bias are glutamine (99% CAG), histidine (96% CAC), leucine (93% CTG), lysine (92% AAG), glutamic acid (88% GAG), aspartic acid (88% GAT) and glycine (84% GGC). The bias shown by these amino acids was statistically significant (p < 0.0001, Chi-squared test, Table 2).
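To make the comparison concrete, the sketch below computes a Chi-squared goodness-of-fit statistic for the threonine figures quoted above. It is a simplification under stated assumptions: the four threonine codons are collapsed into two categories (ACC vs. all others) because only the ACC fraction is given in the text, and the function name `chi_square` is hypothetical. The authors ran their tests in GraphPad, presumably over full codon tables, so the statistic here is illustrative only.

```python
def chi_square(observed, expected_fractions):
    """Pearson Chi-squared statistic for observed counts against the
    counts expected from `expected_fractions` (which should sum to 1)."""
    total = sum(observed)
    statistic = 0.0
    for count, fraction in zip(observed, expected_fractions):
        expected = total * fraction
        statistic += (count - expected) ** 2 / expected
    return statistic


# Threonine codons in SSR-derived repeats: 1393 ACC out of 1404 (from the text).
# Expected ACC fraction in the human CDS: 0.36 (from the text); the remaining
# threonine codons are pooled into one category (an assumption of this sketch).
thr_statistic = chi_square([1393, 1404 - 1393], [0.36, 0.64])

# 10.828 is the standard critical value for 1 degree of freedom at p = 0.001.
CRITICAL_DF1_P001 = 10.828
print(thr_statistic > CRITICAL_DF1_P001)
```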

Table 2. Codon fraction for SAAR coding regions against codon fraction of CDS in the human genome.

2. Comparative analysis of SAARs among vertebrates and invertebrates

2.1 Distribution of SAARs in various species.

To study the changes in the abundance of SAARs during evolution, we compared the SAAR density per million residues (normalized to proteome size) of all the amino acids in the 21 proteomes studied (Fig 5A). We performed a two-way clustering to group amino acids and species that show similar SAAR densities. The SAAR density is found to be relatively similar among all the species. Alanine, serine, aspartic acid, glutamic acid and glutamine SAARs show a high density across most of the proteomes. Interestingly, certain amino acids show high density in invertebrates compared to vertebrates. Notable examples of this trend include serine SAARs in spider, glutamine SAARs in Drosophila, and alanine SAARs in Leishmania. Leech showed a very high density of asparagine SAARs, which was not observed in any other species studied. Among the vertebrates studied, the king cobra proteome showed the highest density for glutamic acid SAARs. We did a similar analysis for the number of SAAR events (normalized to the total number of SAAR events in the proteome), where a pattern similar to SAAR density was observed (S3 Fig). The amino acids valine, isoleucine, phenylalanine, cysteine, tryptophan, tyrosine, and methionine are consistently low in all proteomes including the human proteome, where these amino acids are clustered in the third group (Fig 1A, green bars). The high density and number of events of certain SAARs across all the proteomes suggest that they have been selected for and retained during evolution. Since the SAAR density and number of events were equally abundant in all the proteomes studied, we also wanted to look for the longest SAAR tolerated for all the amino acids in these proteomes (Fig 5B). For most amino acids, the longest SAARs were observed in invertebrate proteomes. Sea urchin has the longest SAAR for valine, glycine, methionine, and glutamic acid. Similarly, sea anemone and leech have the longest glutamine and asparagine repeats, respectively. SAARs longer than 100 residues were also mostly seen in invertebrate proteomes: the sea anemone proteome has a glutamine SAAR of 163 residues, the leech proteome has an SAAR of 117 asparagine residues, and the sponge proteome has an aspartic acid SAAR of 120 residues. Zebrafish was the only vertebrate in the study with an SAAR longer than 100 amino acid residues, with an SAAR of 144 aspartic acid residues.

Fig 5. SAAR density and longest SAAR among all proteomes.

(A) SAAR density was calculated and normalized to one million residues for the indicated proteomes and plotted as a heatmap where the X-axes show individual amino acid associated repeats and Y-axes have all the organisms under study. The plot is two way clustered to group amino acids and species with similar densities. (B) A heat map was generated for the longest repeat length for all the amino acids in each of the vertebrate and invertebrate proteomes under study. The plot is two-way clustered between the longest SAARs (X-axis) and the proteomes (Y-axis).

2.2 Conservation of SAARs of human proteins among orthologs.

We have shown that the SAARs are equally abundant in terms of their density and frequency among all the 22 proteomes studied. To understand the conservation of repeats among various species, we looked at ortholog pairs of human proteins. We considered two parameters to define orthologous pairs of proteins: a minimum of 35% sequence identity and matching domains/domain families as annotated by the Pfam-A database. For all the orthologous pairs, we looked for the conservation of the repeats and categorized them into two groups based on the degree of repeat conservation. The first group required a stringent match of SAAR length and of the order of events (in cases where more than one event was present within a protein). Most of the conserved SAAR events in the first group were of short lengths (~5–9 aa residues) (S4 Fig). The second group included those protein pairs where the SAAR events were conserved in the same order but their lengths were different. For both the groups, we ranked each amino acid based on the number of conserved human ortholog pairs. These ranks were then plotted as a heatmap (Fig 6). Good conservation was observed for SAARs containing alanine, leucine, proline, glycine, serine, lysine and glutamic acid in both the groups. Glutamine SAARs were moderately conserved in both groups. The second group exhibits good conservation of alanine, serine, and glycine SAARs. Threonine, cysteine, and asparagine SAARs were seen to be moderately conserved, predominantly among vertebrates.

Fig 6. Group 1 and group 2 orthologs.

The data shows the SAAR conservation between orthologous proteins of several vertebrate and invertebrate proteomes. Amino acids are ranked by the frequency of conservation (lower value indicates better conservation). Group 1 (top) contains human ortholog pairs in which the SAAR events are conserved in terms of events and repeat lengths. Group 2 (bottom) contains human ortholog pairs that are conserved by the number of events but not their repeat length. Group 1 is two way clustered to group amino acids and species with similar conservation. Group 2 follows the order of amino acids and species of group 1 to allow an easier comparison between plots.

2.3 Gain of SAAR events among orthologous proteins along the evolutionary scale.

In our analysis, we identified several proteins which showed a large number of SAAR events. For example, the human Formin2 protein, which is responsible for actin cytoskeleton organisation and cell polarity, has 28 events of proline SAAR of length 5–6 residues along with one glycine and two glutamine repeat events [39]. To understand if the high number of SAAR events were acquired progressively during evolution or were gained by a common ancestor early on, we compared several human proteins with high SAAR events with their orthologs across species. Orthologs of Formin2 show a clear decline of proline SAAR events as we go down the taxonomic hierarchy. We observed this trend in a few other candidate proteins as well, irrespective of the amino acid repeat they contain (Table 3). These examples suggest that the acquisition of many SAAR events by a protein is not spontaneous, and instead is a consequence of several events of positive selection for amino acid repeats.

Table 3. Protein orthologs for SAAR containing proteins among animals (one event is a single amino acid stretch of ≥5 residues).

3. Secondary structures of SAARs in proteins with solved structures

To study the structural conformation of SAARs, we annotated the SAAR events with secondary structure information from the UniProt database, which was available for only 58 SAAR events in the human proteome. As the number of SAAR events that could be annotated using UniProt was too low to infer meaningful trends, we instead used the data of all proteins with solved structure information from the PDB database of all organisms. 16731 proteins with solved structures contained at least one SAAR event, out of which less than 1% had secondary structure annotation for SAAR regions. This is in concordance with our observation that most of the SAARs map to disordered regions of the protein, and hence may not be contributing to the secondary structure of the protein directly.

4. Repeat expansion diseases

Many studies show that certain SAARs, upon abnormal expansion, cause various neurodegenerative diseases in humans [9, 14, 20, 34]. We looked at the orthologs of such proteins and their SAARs. In cases where the severity of the disease is linked to the extent of repeat elongation (for example, CAG expansion diseases), orthologous proteins also had the repeats but at variable lengths, irrespective of the phylogenetic lineage (Fig 7, S1 Table). However, in proteins responsible for PolyA expansion diseases like OPMD, where even a single codon expansion of the GCG repeat (PABPN1 gene) would cause the diseased state, all the orthologous proteins had perfect conservation of the repeat lengths. These observations indicate that not only did repeats get selected during evolution, but they are also maintained to prevent abnormal expansion that can cause diseases.

Fig 7. Number of repeats in genes associated with repeat expansion diseases.

The genes PABPN1, ATXN3 and HTT are associated with repeat expansion diseases. PABPN1 is associated with PolyA (GCG) expansion that causes OPMD; ATXN3 and HTT cause SCA3 and Huntington's disease, respectively, upon abnormal expansion of PolyQ (CAG). (A) shows the length of the SSR present in the CDS of the genes. (B) shows the length of SAARs in the proteins of the respective genes.


Repeat elements are the most abundant class of DNA elements in genomes, particularly in higher eukaryotes. However, repeat regions are less common in proteins. SSRs contribute to amino acid repeats if they occur in exons. Our analysis shows that only 32% of SAARs arise due to SSRs, indicating that SAARs are not merely a consequence of SSRs in the genome but have evolved independently of SSRs. This is further corroborated by the observation that a few amino acids (glutamic acid, proline, and alanine) show much higher SAAR density across most proteomes compared to other amino acids. We also noted that some amino acids, such as glutamic acid, serine and glutamine, are tolerated at much higher repeat lengths than others. In many cases, the SAAR frequency of an amino acid is proportional to the repeat length that can be tolerated. However, amino acids such as threonine, lysine, arginine and aspartic acid are never found as large repeats despite having frequencies comparable to glutamine, which has the longest repeat in the human proteome (Fig 2A). These observations indicate that these amino acid repeats are under constant selective pressure to be retained by the proteins but not expanded.
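The distinction between SAARs that arise from an SSR and those encoded by a mixture of synonymous codons can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method used in this study: the minimum repeat length of 5 residues, the function names and the toy sequences are all assumptions, and a repeat is called SSR-encoded here only if every codon in it is identical (so a hexanucleotide repeat made of two synonymous codons would be counted as mixed).

```python
import re

MIN_LEN = 5  # assumed minimum run length for calling a SAAR


def find_saars(protein, min_len=MIN_LEN):
    """Return (start, end, residue) for each run of >= min_len identical residues.

    Coordinates are 0-based residue indices with an exclusive end.
    """
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(protein)]


def codons_for(cds, start, end):
    """Codons of the coding sequence that encode residues start..end-1."""
    return [cds[3 * i:3 * i + 3] for i in range(start, end)]


def is_ssr_encoded(cds, start, end):
    """Simplified test: True if the whole repeat uses one single codon."""
    return len(set(codons_for(cds, start, end))) == 1


# Toy example 1: a glutamine run encoded by a pure CAG repeat.
pure_protein = "MKQQQQQQAL"
pure_cds = "ATG" + "AAA" + "CAG" * 6 + "GCT" + "CTG"

# Toy example 2: an alanine run encoded by four different alanine codons.
mixed_protein = "MAAAAAK"
mixed_cds = "ATG" + "GCT" + "GCC" + "GCA" + "GCG" + "GCT" + "AAA"

for protein, cds in [(pure_protein, pure_cds), (mixed_protein, mixed_cds)]:
    for start, end, residue in find_saars(protein):
        print(residue, end - start, is_ssr_encoded(cds, start, end))
```

Under these assumptions the first repeat is reported as SSR-encoded and the second as arising from mixed codons, which is the category that the text reports as the majority.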

Our study has shown that most SAARs are not associated with known functional domains/motifs within the proteins (only ~15% fall within an annotated domain). Instead, they mostly overlap disordered regions of proteins. This is in agreement with the idea that SAARs could act as spacer elements separating two or more functional domains within a protein [17]. Disordered regions are advantageous to proteins as they offer structural flexibility and facilitate optimal folding. We hypothesize that proteins acquire SAARs based on the structural flexibility needed for their function. Our molecular function gene ontology analysis shows that proteins containing SAARs are overrepresented in various binding activities such as chromatin binding, DNA and RNA binding, transcription factor activity, etc. By being structurally malleable, these proteins can adapt to bind their targets efficiently. For example, a DNA-binding protein could tolerate imperfect binding sequences, and a transcription factor could adapt itself depending on whether its target promoter is in a highly euchromatic region or not. This hypothesis is further strengthened by amino acid specific categorization of molecular functions: proteins containing proline and glutamic acid repeats, which show the highest tendency to be part of disordered regions, are enriched for chromatin and nucleic acid binding related functions, whereas proteins containing leucine repeats, which never occur in disordered regions, are underrepresented for the same. On the other hand, proteins with leucine repeats are overrepresented for functions such as antigen binding and receptor activities, which require precise binding to targets. SAARs of various amino acids, therefore, may function as components that facilitate folding, domain structuring and stability of multi-domain proteins, as also hinted by a few previous studies [40–42].
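Assigning a SAAR to a disordered region or a functional domain amounts to an interval overlap test, which the following Python sketch illustrates. The coordinates, the function names and the 50% overlap cut-off are assumptions made for illustration; the criterion actually applied to the annotations in this study may differ.

```python
def overlap_fraction(saar, regions):
    """Fraction of a SAAR interval covered by annotated regions.

    `saar` and each entry of `regions` are (start, end) residue intervals,
    0-based with an exclusive end. Regions are assumed not to overlap
    one another.
    """
    start, end = saar
    covered = sum(
        max(0, min(end, r_end) - max(start, r_start))
        for r_start, r_end in regions
    )
    return covered / (end - start)


def is_within(saar, regions, min_fraction=0.5):
    """Call a SAAR 'within' the annotation if enough of it is covered.

    The 0.5 threshold is an arbitrary, illustrative choice.
    """
    return overlap_fraction(saar, regions) >= min_fraction


# Hypothetical coordinates for one protein.
disordered = [(0, 15), (18, 30)]
saar = (10, 20)
print(overlap_fraction(saar, disordered), is_within(saar, disordered))
```

The same function can be applied to domain annotations instead of disorder predictions, so that each SAAR is scored against both annotation types.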

Comparison of SAARs across many proteomes showed that repeats of certain amino acids are equally abundant along the evolutionary tree. Our analysis of repeats in orthologous proteins indicated that SAAR conservation is limited to events of smaller lengths, mostly in the range of 5–9 residues. We also observed in the human proteome that codons encoding repeats are very rarely split across exons (46 out of 11852 events). Furthermore, there are cases, such as the Formin2 and MUC2 genes, where we observe a steady increase in the number of SAAR events in the human proteins as compared to the corresponding homologues in lower organisms. Taken together, these observations suggest that many proteins acquired small SAARs early on during evolution, when exon-intron complexity was minimal. As complexity evolved, proteins acquired more events to augment their adaptability. However, long amino acid repeats are likely to be recent acquisitions. In fact, repeats longer than 15 aa are unique to each species, with the exception of three events shared by human and mouse (S4 Fig). However, we did observe that in some cases (~13%), such long SAAR events are replaced in orthologous proteins by another amino acid repeat of similar polarity (data not shown). Hence, we think these are recent and parallel acquisitions, driven by the need of a protein to enhance its functionality.
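The exon-split observation can be made concrete with a small Python sketch. The counts 46 and 11852 are taken from the text; the function name, the coordinate convention and the toy exon structure are assumptions for illustration and do not reproduce the annotation workflow of this study.

```python
from itertools import accumulate


def spans_exon_junction(saar_start, saar_end, coding_exon_lengths):
    """True if the codons of a SAAR are split across an exon-exon junction.

    saar_start/saar_end are 0-based residue indices (end exclusive);
    coding_exon_lengths lists the coding length of each exon in nucleotides,
    in transcript order.
    """
    nt_start, nt_end = 3 * saar_start, 3 * saar_end
    # Cumulative coding lengths give junction positions in CDS coordinates;
    # the last value is the end of the CDS, not a junction.
    junctions = list(accumulate(coding_exon_lengths))[:-1]
    return any(nt_start < junction < nt_end for junction in junctions)


# Toy gene with two coding exons of 12 and 18 nucleotides (junction at 12).
exons = [12, 18]
print(spans_exon_junction(2, 8, exons))  # repeat codons lie on both sides
print(spans_exon_junction(4, 9, exons))  # repeat starts right at the junction

# Share of human SAAR events split across exons, from the counts in the text.
split_events, total_events = 46, 11852
split_percent = 100 * split_events / total_events
print(f"{split_percent:.2f}% of events are split across exons")
```

With the reported counts, the split events amount to roughly 0.39% of all human SAAR events.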

The functional relevance of SAARs is further substantiated by their roles in causing disease phenotypes. We looked at genes that cause neurological disorders upon expansion of CAG repeats. These genes exhibit a gain of function upon expansion of the repeat, and the disease severity is proportional to the length of the repeat [34]. Orthologs of these human genes in plant genomes lack the repeat region, whereas orthologs in animal genomes carry it at varied repeat lengths. This suggests that animal genomes acquired SAARs for improved functionality, with the trade-off that the protein is now vulnerable to a diseased state upon abnormal expansion of the repeat.


Our analysis shows that SAARs may have accumulated under selective pressure and, therefore, have functional relevance. Since a great majority of SAARs are part of disordered regions, their function seems indirect, providing flexibility and stability to proteins in the context of other functional domains. The widespread occurrence of SAARs across the evolutionary landscape, as well as their occurrence in core exonic regions and not at the intron-exon boundaries, indicates that many SAARs are not a late addition to the proteins but have been an essential part of proteins from early on. There is, however, evidence of recent acquisitions/expansions, in particular the increasing SAAR content in more complex organisms. These findings help us understand the evolutionary determinants that enhance functional features of proteins and can be used in better design of proteins with novel properties.

Supporting Information

S1 Fig. Functional classification grouped by the physical properties of SAARs.

Distribution of various functional and molecular activity classes (X-axis) for SAAR associated proteins categorized based on their physical properties is shown. The functional classes were defined from Gene ontology annotations. The plot shows the fold enrichment (Y-axis) between expected and observed frequency in reference to the human proteome. The blue bars and red bars indicate hydrophobic and hydrophilic SAARs, respectively.


S2 Fig. SAARs present as SSRs at the genomic level in human.

For all the coding regions corresponding to SAARs in the Human proteome, we calculated the number of times they were present as simple sequence repeats (SSRs). The X-axis shows the different amino acids and the Y-axis shows the number of repeat events in the proteome (red bars) and genome (blue bars).


S3 Fig. SAAR events in all proteomes.

SAAR events were calculated and normalized to one million residues for the indicated proteomes and plotted as a heatmap where the X-axes show individual amino acid associated repeats and Y-axes have all the organisms under study. The order of amino acids and species was kept consistent with that of SAAR density (Fig 5) to allow an easier comparison between plots.


S4 Fig. Number of SAAR events and length distribution among Human orthologous proteins.

For all the organisms under study, Human ortholog pairs with conserved SAARs were identified. For these proteins, the distribution of repeat length (X-axis) and number of events (Y-axis) are plotted.


S1 File. Functional domain annotation for SAAR associated regions in human proteins.


S1 Table. Codon and amino acid repeat length among orthologous genes associated with repeat expansion (CDS) diseases.



The authors thank RKM laboratory members for comments and discussions that helped in shaping this study.

Author Contributions

  1. Conceptualization: RKM.
  2. Data curation: ASK DTS.
  3. Formal analysis: ASK DTS RKM.
  4. Funding acquisition: RKM.
  5. Investigation: ASK DTS RKM.
  6. Methodology: ASK DTS RKM.
  7. Project administration: RKM.
  8. Resources: RKM.
  9. Software: ASK DTS.
  10. Supervision: DTS RKM.
  11. Validation: DTS.
  12. Visualization: ASK DTS RKM.
  13. Writing – original draft: ASK DTS RKM.
  14. Writing – review & editing: ASK DTS RKM.


  1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921. pmid:11237011
  2. Green H, Wang N. Codon reiteration and the evolution of proteins. Proceedings of the National Academy of Sciences. 1994;91(10):4298–302.
  3. Nithianantharajah J, Hannan AJ. Dynamic mutations as digital genetic modulators of brain development, function and dysfunction. Bioessays. 2007;29(6):525–35. pmid:17508392
  4. Hannan AJ. Tandem repeat polymorphisms: modulators of disease susceptibility and candidates for ‘missing heritability’. Trends in Genetics. 2010;26(2):59–65. pmid:20036436
  5. King DG. Evolution of simple sequence repeats as mutable sites. Tandem Repeat Polymorphisms: Springer; 2012. p. 10–25.
  6. Sawaya SM, Bagshaw AT, Buschiazzo E, Gemmell NJ. Promoter microsatellites as modulators of human gene expression. Tandem Repeat Polymorphisms: Springer; 2012. p. 41–54.
  7. Kashi Y, King D, Soller M. Simple sequence repeats as a source of quantitative genetic variation. Trends in genetics. 1997;13(2):74–8. pmid:9055609
  8. Kashi Y, King DG. Simple sequence repeats as advantageous mutators in evolution. TRENDS in Genetics. 2006;22(5):253–9. pmid:16567018
  9. Richards RI. Dynamic mutations: a decade of unstable expanded repeats in human genetic disease. Human molecular genetics. 2001;10(20):2187–94. pmid:11673400
  10. Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5(6):435–45. pmid:15153996
  11. Ramamoorthy S, Garapati HS, Mishra RK. Length and sequence dependent accumulation of simple sequence repeats in vertebrates: Potential role in genome organization and regulation. Gene. 2014. Epub 2014/08/31. pmid:25172211.
  12. Kumar RP, Senthilkumar R, Singh V, Mishra RK. Repeat performance: how do genome packaging and regulation depend on simple sequence repeats? Bioessays. 2010;32(2):165–74. pmid:20091758.
  13. Subramanian S, Mishra RK, Singh L. Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome biol. 2003;4(2):R13. pmid:12620123
  14. Usdin K. The biological effects of simple tandem repeats: Lessons from the repeat expansion diseases. Genome Research. 2008;18(7):1011–9. pmid:18593815
  15. Perutz MF, Johnson T, Suzuki M, Finch JT. Glutamine repeats as polar zippers: their possible role in inherited neurodegenerative diseases. Proceedings of the National Academy of Sciences. 1994;91(12):5355–8.
  16. Huntley MA, Golding GB. Neurological proteins are not enriched for repetitive sequences. Genetics. 2004;166(3):1141–54. pmid:15082536; PubMed Central PMCID: PMC1470788.
  17. Karlin S, Burge C. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proceedings of the National Academy of Sciences. 1996;93(4):1560–5.
  18. Galant R, Carroll SB. Evolution of a transcriptional repression domain in an insect Hox protein. Nature. 2002;415(6874):910–3. pmid:11859369
  19. Gerber H-P, Seipel K, Georgiev O, Hofferer M, Hug M, Rusconi S, et al. Transcriptional activation modulated by homopolymeric glutamine and proline stretches. Science. 1994;263(5148):808–11. pmid:8303297
  20. Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ. Amino acid runs in eukaryotic proteomes and disease associations. Proceedings of the National Academy of Sciences. 2002;99(1):333–8. pmid:11782551
  21. Kazemi-Esfarjani P, Trifiro MA, Pinsky L. Evidence for a repressive function of the long polyglutamine tract in the human androgen receptor: possible pathogenetic relevance for the (CAG) n-expanded neuronopathies. Human Molecular Genetics. 1995;4(4):523–7. pmid:7633399
  22. Veitia RA. Amino acids runs and genomic compositional biases in vertebrates. Genomics. 2004;83(3):502–7. pmid:14962676
  23. Inoue K, Keegstra K. A polyglycine stretch is necessary for proper targeting of the protein translocation channel precursor to the outer envelope membrane of chloroplasts. The Plant Journal. 2003;34(5):661–9. pmid:12787247
  24. Alba MM. Comparative Analysis of Amino Acid Repeats in Rodents and Humans. Genome Research. 2004;14(4):549–54. pmid:15059995
  25. Hancock JM, Simon M. Simple sequence repeats in proteins and their significance for network evolution. Gene. 2005;345(1):113–8. pmid:15716087
  26. Luo H, Nijveen H. Understanding and identifying amino acid repeats. Briefings in Bioinformatics. 2013;15(4):582–91. pmid:23418055
  27. Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, et al. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome research. 2005;15(4):537–51. pmid:15805494
  28. UniProt: a hub for protein information. Nucleic Acids Research. 2014;43(D1):D204–D12. pmid:25348405
  29. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Research. 2013;42(D1):D222–D30. pmid:24288371
  30. Oates ME, Romero P, Ishida T, Ghalwash M, Mizianty MJ, Xue B, et al. D2P2: database of disordered protein predictions. Nucleic Acids Research. 2013;41(D1):D508–D16. pmid:23203878
  31. Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, Rodgers JR, et al. The protein data bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology. 1977;112(3):535–42. pmid:875032
  32. Thomas PD. PANTHER: A Library of Protein Families and Subfamilies Indexed by Function. Genome Research. 2003;13(9):2129–41. pmid:12952881
  33. Karolchik D. The UCSC Table Browser data retrieval tool. Nucleic Acids Research. 2004;32(90001):493D–6. pmid:14681465
  34. La Spada AR, Taylor JP. Repeat expansion disease: progress and puzzles in disease pathogenesis. Nat Rev Genet. 2010;11(4):247–58. pmid:20177426
  35. Wickham H. ggplot2: elegant graphics for data analysis: Springer Science & Business Media; 2009.
  36. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH. Hydrophobicity of amino acid residues in globular proteins. Science. 1985;229(4716):834–8. pmid:4023714
  37. Johansson ME, Thomsson KA, Hansson GC. Proteomic analyses of the two mucus layers of the colon barrier reveal that their main component, the Muc2 mucin, is strongly bound to the Fcgbp protein. Journal of proteome research. 2009;8(7):3549–57. pmid:19432394
  38. Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, et al. Intrinsically disordered protein. Journal of Molecular Graphics and Modelling. 2001;19(1):26–59. pmid:11381529
  39. Mandal A, Mandal S, Park MH. Genome-wide analyses and functional classification of proline repeat-rich proteins: potential role of eIF5A in eukaryotic evolution. PloS one. 2014;9(11):e111800. pmid:25364902
  40. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology. 2005;6(3):197–208. pmid:15738986
  41. Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK. Sequence complexity of disordered protein. Proteins: Structure, Function, and Genetics. 2000;42(1):38–48.
  42. Simon M, Hancock JM. Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins. Genome Biol. 2009;10(6):R59. pmid:19486509