Proline: The Distribution, Frequency, Positioning, and Common Functional Roles of Proline and Polyproline Sequences in the Human Proteome

Proline is an anomalous amino acid. Its nitrogen atom is covalently locked within a ring; thus it is the only proteinogenic amino acid with a constrained phi angle. Sequences of three consecutive prolines can fold into polyproline helices, structures that join alpha helices and beta pleats as architectural motifs in protein configuration. Triproline helices are participants in protein-protein signaling interactions. Longer spans of repeat prolines also occur, containing as many as 27 consecutive proline residues. Little is known about the frequency, positioning, and functional significance of these proline sequences. Therefore we have undertaken a systematic bioinformatics study of proline residues in proteins. We analyzed the distribution and frequency of 687,434 proline residues among 18,666 human proteins, identifying single residues, dimers, trimers, and longer repeats. Proline accounts for 6.3% of the 10,882,808 protein amino acids. Of all proline residues, 4.4% are in trimers or longer spans. We detected patterns that influence function based on proline location, spacing, and concentration. We propose a classification based on proline-rich, polyproline-rich, and proline-poor status. Whereas singlet proline residues are often found in proteins that display recurring architectural patterns, trimers or longer proline sequences tend to be associated with the absence of repetitive structural motifs. Spans of 6 or more are associated with DNA/RNA processing, actin, and developmental processes. We also suggest a role for proline in Kruppel-type zinc finger protein control of DNA expression, and in the nucleation and translocation of actin by the formin complex.


Introduction
There are about 900 naturally occurring amino acids. Evolution has selected only 21 for inclusion in the group (not distinguishing between cysteine and selenocysteine) that act as subunits in the assemblage of human proteins [1,2]. One of these amino acids, proline, is highly anomalous in many respects; its unique features contribute to the special roles proline plays in protein structure and function. Williamson and MacArthur have previously reviewed aspects of this subject, including an analysis of proline-rich regions in some proteins [3,4].
Proline's nitrogen atom is covalently bound within the molecule's five-membered ring, a feature that markedly restricts the phi (φ) angular range in peptide bond formation at this locus in a peptide or protein (Figure 1a). Furthermore, proline can readily adopt a cis configuration as well as a trans configuration in response to subtle influences, presumably differences in local charge distribution. This exceptional behavior accounts for the tendency of prolyls to bend the regional amino acid alignment and therefore to fold the protein [5].
Having only one hydrogen atom attached to its nitrogen, proline cannot donate protons but it can serve as a proton acceptor. Prolyls tend to be excluded from alpha helices and beta sheets. They can, however, be situated at the ends of these motifs. In one simplified view, proline disrupts protein secondary structure by preventing the backbone from adopting an alpha-helix or beta-sheet conformation. The alternative interpretation is that proline imposes its own kind of secondary structure, with a confined phi angle that overrides other forms of secondary structure. Because of their hydrophobicity, prolyls tend to adopt positions within the interior of a protein.
Tri-proline sequences may fold into right-handed or left-handed helices, referred to as polyproline I (PPI) or polyproline II (PPII), respectively [6][7][8]. The left-handed version is far more common. It forms when the consecutive residues assume approximate dihedral angles of -75° at the phi (φ) position and 150° at the psi (ψ) position and are isomerized to the trans position of their peptide bonds. Its left-handed helix contains three residues per turn and the rise is about 3.1 Å. On the other hand, the rarer polyproline I helix forms when the consecutive residues assume approximate dihedral angles of -75° at the φ position and 160° at the ψ position and are isomerized to the cis position at the peptide bond. The right-handed helix contains about 3.3 residues per turn, and the rise is only about 1.9 Å.
Because of proline's unique structural properties, we were interested in determining its distribution across the proteome and identifying the shared functional properties of proteins with high levels of proline and long polyproline stretches. Herein we report information about the presence, distribution, and effects of proline residues and repetitive sequences in 18,666 unique human proteins. These proteins were taken from the HUGO Gene Nomenclature Committee's list of accepted protein coding genes [23]. Although the number of open reading frames transcribed into mRNA may be greater than 18,666, we believe this curated set of protein-coding genes is a reasonable representation of the members of the uniquely coded human proteome. We present detailed information about the occurrence of polyproline sequences of three or more residues and their association with structure and function.
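The helix parameters quoted above imply a pitch (axial advance per full turn) equal to residues per turn multiplied by rise per residue. The following Python sketch works through that arithmetic using only the approximate values given in the text; the function name is illustrative and is not taken from the study.

```python
def helix_pitch(residues_per_turn: float, rise_per_residue: float) -> float:
    """Axial advance per full turn, in the same units as the rise (angstroms)."""
    return residues_per_turn * rise_per_residue


# Approximate values quoted in the text.
ppii_pitch = helix_pitch(3.0, 3.1)  # left-handed polyproline II (trans peptide bonds)
ppi_pitch = helix_pitch(3.3, 1.9)   # right-handed polyproline I (cis peptide bonds)

print(f"PPII pitch: {ppii_pitch:.1f} angstroms per turn")
print(f"PPI pitch:  {ppi_pitch:.1f} angstroms per turn")
```

On these figures the PPII helix is the more extended of the two, at roughly 9.3 Å per turn against roughly 6.3 Å for PPI.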

Results
The total number of human proteins studied was 18,666; of these, 99.8% contain proline. The median length of all the proteins is 436 residues. The proteins encompass a total of 10,882,808 amino acids, among which there are 687,434 prolyls, which account for 6.3% of all amino acids in the human proteome. The relative abundance of proline in the proteins of the human proteome is shown in the histogram in Figure 2a. The distributions of all 20 amino acids as a function of relative length are shown in Figure 3 (Table S1). Of the proline residues, 564,316 are singlets (82.1%), 93,080 lie in 46,540 pairs (13.5%), and 30,038 lie in spans of 3 or more (4.4%).
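The singlet, pair, and span categories can be illustrated with a short run-length count. The sketch below is written in Python for illustration only (the study itself used R), the toy sequence is invented, and the function name is an assumption. It also checks that the reported counts are consistent with counting residues: 564,316 + 2 × 46,540 + 30,038 = 687,434.

```python
from itertools import groupby


def proline_run_residue_counts(sequence: str) -> dict:
    """Count proline residues according to the length of the run they sit in."""
    counts = {"singlet": 0, "pair": 0, "span3plus": 0}
    for residue, run in groupby(sequence.upper()):
        if residue != "P":
            continue
        n = len(list(run))
        if n == 1:
            counts["singlet"] += 1
        elif n == 2:
            counts["pair"] += 2        # both residues of the dimer
        else:
            counts["span3plus"] += n   # every residue in the span
    return counts


# Invented toy sequence: one singlet, one pair, and spans of 3 and 5.
toy_counts = proline_run_residue_counts("MPAPPGPPPKPPPPPL")
print(toy_counts)

# Consistency check on the counts reported in the text (residue-level).
reported = {"singlet": 564_316, "pair": 2 * 46_540, "span3plus": 30_038}
total = sum(reported.values())
for label, n in reported.items():
    print(f"{label}: {n} residues ({100 * n / total:.1f}% of prolines)")
print(f"total prolines: {total} ({100 * total / 10_882_808:.1f}% of all residues)")
```

The percentages printed for the reported counts round to 82.1%, 13.5%, and 4.4%, and the total matches the 687,434 prolyls given above.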

Proline-Rich, Polyproline-Rich, and Proline-Poor Proteins
There are 46 proteins that contain no proline (Table S2). On the other hand, the long keratinocyte envelope protein SPRR2G (small proline rich protein 2G), some 73 amino acid residues in length, is composed of 39.7% proline. We examined the enrichment of functional annotations in the top and bottom 1% of molecules in the human proteome, as ordered by their percentage of proline (Tables S3 and S4).
Both the proline-rich and proline-poor proteins are heavily involved with the formation of the dermis and with keratinization, but their functional roles are very different. Proline-rich proteins include the collagens (e.g. COL3A1, COL5A1, COL1A2) and proteins in the cornified envelope (e.g. LCE2C, SPRR2G, LCE2B, SPRR2F), which rely on the properties of proline and hydroxyproline to form their helices. By contrast, many of the proline-poor proteins exhibit a different coiled-coil motif that includes the intermediate filament proteins that make up the keratins (e.g. KRT9, KRT19, KRT77, KRT14). This distinction highlights the important role of proline and polyproline in determining helical structure. The two kinds of helical structures lie on opposite ends of the proline abundance scale. Proline-poor proteins and domains form one class of helices (e.g. keratins), which can assemble only by excluding proline. Proline-rich molecules, which exhibit a contrasting triple-helical conformation, such as the collagens, constitute another large group of proteins.
In addition to the collagens, proline-rich proteins include some of the homeobox (HOXA3, HOXA4, ESX1, CDX1, TPRX1) and forkhead box (FOXE3, FOXN1) proteins, as well as the zinc finger proteins. The proline-rich proteins also tend to be highly enriched for consecutive sequences of prolines, so-called polyproline sequences.
Figure 1. [...] β-lactam (c), nicotianamine (d), and mugineic acid (e). Aze is the lower homologue of proline. Its ring has only four members instead of five. Plants synthesize Aze as an essential constituent of the two metal chelating molecules, nicotianamine and mugineic acid. These compounds trap metal ions in the soil and transport them to various plant parts. Aze is in particularly high concentrations in the bulbous roots of many plants, making its way into the human and livestock food supply. Aze exerts its toxic effects by eluding the proof-reading function of the prolyl tRNA synthetases, allowing it to be misincorporated into nascent peptides or proteins in which it replaces proline. When Aze replaces proline, it can change protein structure, function, and antigenicity. This molecular mimicry is analogous to the other 4-member nitrogen containing ring of β-lactam, which exerts its bactericidal effects by mimicking a D-Ala-D-Ala sequence of a transpeptidase, irreversibly blocking its role in bacterial cell wall synthesis. The role of Aze in human health is yet to be established. doi:10.1371/journal.pone.0053785.g001
We examined proteins containing abundant proline, as well as proteins with stretches of contiguous prolines forming a "polyproline motif," which we define as three or more prolines in consecutive sequence. The distribution of polyproline motifs among proteins in the human proteome is shown in Figures 2b and 2c. Table S2 provides detailed information about the number of prolines in the longest spans and the number of separate polyproline sequences. Table S5 shows the start and end positions of each polyproline span.
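The polyproline motif as defined here (three or more consecutive prolines) lends itself to a simple pattern search. The Python sketch below reports the start and end position of each span together with the per-protein summary values described for Tables S2 and S5. It is an illustration rather than the code used in the study; the toy sequence, the function names, and the choice of 1-based inclusive coordinates are assumptions.

```python
import re

# Three or more consecutive prolines, per the definition in the text.
POLYPROLINE = re.compile(r"P{3,}")


def polyproline_spans(sequence: str) -> list:
    """Return (start, end, length) for each span, 1-based and inclusive."""
    return [
        (m.start() + 1, m.end(), m.end() - m.start())
        for m in POLYPROLINE.finditer(sequence.upper())
    ]


def summarize(sequence: str) -> dict:
    """Number of separate spans and length of the longest span."""
    spans = polyproline_spans(sequence)
    return {
        "n_spans": len(spans),
        "longest": max((length for _, _, length in spans), default=0),
        "spans": spans,
    }


# Invented toy sequence: a trimer, a dimer (ignored), and a hexamer.
summary = summarize("MPPPAGPPKPPPPPPL")
print(summary)
```

Note that reported positions depend on the numbering convention; for example, removing the initial methionine (as described in Methods) shifts every coordinate by one.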
We examined 11 proteins that start with a tri-proline repeat (ZBTB4, DHX34, HINFP, CD19, IRGQ, ELMO1, ELMO2, MAP1LC3C, IGLON5, IDS, and CPZ), along with 4 proteins that end with a tri-proline repeat (PYDC2, ARHGEF15, RHBDL1, and OR10S1). There are no sequential patterns of amino acids in any of these and there are no apparent functional commonalities among them. Protein OR10S1, which ends with a tri-proline, is an olfactory receptor protein that interacts with odorants and triggers a neuronal response. This pattern was not found in the other proteins that are believed to be odorant detectors: 52 NP olfactory receptor protein, MOR256-8, MOR256-17, MOR256-22, OMP olfactory marker protein.
Overall, we could find no association between polyproline at the initial or terminal ends and protein functions.
Consecutive sequences of six or more prolines are associated with DNA/RNA processing, including zinc fingers, actin, and developmental processes. There are 27 proteins that contain from  Table 1. Of the total 91 proteins in the above groups, there are 42 proteins (45%) associated with DNA/RNA processing, including 14 zinc finger proteins (15%), and 11 proteins associated with actin (12%).

Zinc Finger Proteins
Because of the apparent overrepresentation of zinc finger proteins, we focused on the structure and function of these molecules to gain further insight into the role of polyproline, and found that none of the first 10 display regular recurring motifs: PCLO, ZIF268, ZFP746, ZNF827, Zinc Family Member 5, Zinc Finger CCCH domain, ZFHX4, Zinc Finger Protein 318, Zinc Finger Homeobox protein 3, ZNF367, ZFP579, ZFPM1 (Figure S1). We suspect that the complex configurations introduced by polyproline helices disrupt long continuous motifs. On the other hand, acute angular changes in conformation could subserve the geometric requirements of highly articulated intra- and intermolecular interactions.
In some zinc finger proteins (Kruppel type) that contain only singlet or doublet prolyls (no triplets and their helices, and no longer runs of prolines), there is an amino acid motif in which a prolyl recurs every 28 residues (Figure S2). A 28-residue conserved motif is a well-known feature of some zinc finger structures [24,25]; we call it TWEAZR (Twenty-Eight Amino acid Zinc finger Repeat). It includes a linker sequence TGEH. The proline is followed by a YKCEEC sequence, and later an HXXXH sequence (Figure 4). The two cysteines and the two histidines coordinate a zinc atom. By contrast, in zinc finger proteins that contain prolyl triplets and their miniature helices, as well as longer consecutive repeats that may encompass such helices, this pattern breaks down, possibly because the small polyproline helices insert irregularities into the larger spiral contours of this class of zinc finger molecules. For instance, among the first ten proteins free of polyproline sequences (ZNF100, ZFP726, ZFP729, ZFP732, ZFP733, ZFP736, ZFP737, ZFP739, ZNF741), we found the TWEAZR motif in each, with proline recurring every 28 residues. Figure S3 shows the pattern repeated in ZNF729.
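One way to screen a sequence for the 28-residue proline periodicity described above is to measure the spacing between successive prolines. The Python sketch below is a minimal illustration under stated assumptions: the 28-residue unit is an invented, zinc-finger-like string chosen only because it contains a single proline, and the scoring function is hypothetical rather than the procedure used in the study.

```python
def proline_positions(sequence: str) -> list:
    """1-based positions of every proline in the sequence."""
    return [i + 1 for i, aa in enumerate(sequence.upper()) if aa == "P"]


def fraction_at_period(sequence: str, period: int = 28) -> float:
    """Fraction of gaps between successive prolines that equal `period`."""
    pos = proline_positions(sequence)
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    if not gaps:
        return 0.0
    return sum(gap == period for gap in gaps) / len(gaps)


# Invented 28-residue unit with exactly one proline (illustration only).
UNIT = "TGEKPYKCEECGKAFSQSSHLTRHKRIH"
assert len(UNIT) == 28 and UNIT.count("P") == 1

regular = UNIT * 5                             # proline recurs every 28 residues
disrupted = UNIT * 2 + "APPPG" + UNIT * 3      # a proline trimer breaks the register

print(fraction_at_period(regular))
print(fraction_at_period(disrupted))
```

In the regular toy sequence every proline-to-proline gap equals 28, whereas inserting a proline trimer leaves only three of seven gaps at that spacing, which mirrors the breakdown of the pattern described for polyproline-containing zinc finger proteins.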
We conducted a detailed analysis of proline in zinc finger proteins according to their number of consecutive prolyl repeats, from 2 to 27. Among the 95 members containing 9 to 27 consecutive repeats, there are 13 zinc finger proteins (13.4%) (Table 2). In molecules that contain consecutive prolyl spans of three or more (highest 22), there are 4,245 proteins, of which 83 are zinc finger proteins (1.95%). Among the proteins lacking repeats, there are 14,102 proteins, of which 425 are zinc finger proteins (3.0%). Of the first 9 zinc finger proteins that show a disorderly amino acid arrangement, 8 (89%) contain prolyl dimers in the pattern of ppx. In these 9 proteins there is a total of 63 dimers, of which 35 contain "guest" amino acids in the third position (Table S6). Common guests include glycine, asparagine, alanine, glutamine, valine, aspartic acid, histidine and lysine.
By contrast, in the low-proline zinc finger protein group, among the first 10 proteins, there are only 5 which contain proline dimers, all with guests in the third position (Table S7). In total there are 50 zinc finger proteins in this subgroup, containing a total of 659,569 amino acids and 28 prolyl dimers. This corresponds to roughly one prolyl dimer per 23,600 residues (a proportion of about 0.00004, or 0.004%). Thus prolyl dimers are rare in such zinc finger proteins.
The repetitive pattern, TWEAZR, with prolyls recurring every 28 residues, was not found in any of the 10 zinc finger proteins containing the longest consecutive spans or in proteins with the highest percentage of prolyl residues (20%). To determine whether one or more proline trimers in a molecule is associated with the presence or absence of TWEAZR, we compared the frequency of such patterns among zinc finger proteins in which there are one or more trimers, with the frequency of such patterns in molecules in which there are no trimers. There are 43 zinc finger protein molecules in the first group. Of these, only two (ZNF189 and ZNF283) display the repetitive motif, a frequency of 4.6%. Noteworthy is the fact that in both cases the ppp triplet is located near the beginning of the lead sequence. In ZNF189 (612 amino acids) the triplet prolyls occur in positions 6, 7, and 8. In ZNF283 (679 amino acids) the prolyl triplet occupies positions 22, 23, and 24. On the other hand, of the first 43 zinc finger proteins in the "0" category, in which the molecules contain no consecutive prolyl spans beyond 2 (that is, dimers), there are 32 molecules that display the recurring TWEAZR motif (74%), and 11 molecules (26%) that lack it.
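The comparison above (2 of 43 trimer-containing zinc finger proteins with the motif, versus 32 of 43 without trimers) can be laid out as a 2x2 table. The Python sketch below computes the two frequencies and, as one possible way to gauge the contrast, a two-sided Fisher exact test implemented with the standard library. The test is an illustration added here and is not part of the reported analysis; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x "motif" proteins in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: with trimers, without trimers. Columns: motif present, motif absent.
with_trimer = (2, 41)
without_trimer = (32, 11)

freq_with = with_trimer[0] / sum(with_trimer)
freq_without = without_trimer[0] / sum(without_trimer)
p_value = fisher_exact_two_sided(*with_trimer, *without_trimer)

print(f"motif frequency with trimers:    {100 * freq_with:.1f}%")
print(f"motif frequency without trimers: {100 * freq_without:.1f}%")
print(f"two-sided Fisher exact p-value:  {p_value:.2e}")
```

Because the two groups were chosen as the first 43 proteins in each category rather than by random sampling, any p-value obtained this way should be read as descriptive only.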
These results indicate that a polyproline sequence of three is unlikely to be associated with the presence of the recurring TWEAZR motif within a zinc finger molecule. In the unusual instances in which tri-prolines are present, they appear to be limited to the lead sequence and do not occur within the repetitive domains.
We suggest that tri-proline helices disrupt the repetitive amino acid zinc finger protein sequences that we have noted and that have been previously described [26,27]. This conclusion is supported by the fact that of the 144 proteins in which there are 8 to 27 polyproline repeats, only formin-2 (NP_064450.3), a 1,722 amino acid actin-associated protein, displays a repetitive motif. This consists of 22 consecutive sequences, each composed of a quintet of prolines followed by lpgagi, commencing at residue number 976 and ending at residue number 1211 (Figure S4). Formins are multidomain proteins that are involved in actin nucleation [28,29].
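A tandem repeat of the kind described for formin-2 (five prolines followed by lpgagi) can be located with a regular expression that matches one or more back-to-back copies of the unit. The Python sketch below uses an invented toy sequence, not the formin-2 sequence itself, and the function name and 1-based inclusive coordinates are assumptions.

```python
import re

# Repeat unit as described in the text: a proline quintet followed by LPGAGI.
UNIT = "PPPPPLPGAGI"
TANDEM = re.compile(f"(?:{UNIT})+")


def tandem_repeat_blocks(sequence: str) -> list:
    """Return (start, end, n_copies) for each block of back-to-back units."""
    blocks = []
    for m in TANDEM.finditer(sequence.upper()):
        n_copies = (m.end() - m.start()) // len(UNIT)
        blocks.append((m.start() + 1, m.end(), n_copies))
    return blocks


# Invented toy sequence: four copies, a short spacer, then two more copies.
toy = "MKT" + UNIT * 4 + "AAGS" + UNIT * 2 + "LE"
blocks = tandem_repeat_blocks(toy)
print(blocks)
```

An exact-match pattern such as this one reports only perfect copies; repeats carrying substitutions would split a block and would need a more tolerant search.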

Genetic and Acquired Disorders Related to Proline
There is a voluminous literature about hereditary disease caused by mutations involving proline. PubMed lists 6,068 citations, as of 26 May 2012. Little is known about acquired disease in humans caused by the ingestion of azetidine-2-carboxylic acid (Aze), the lower homologue of proline, containing four members in its ring instead of five (Figure 1). It is a constituent of the diet. Aze eludes the gatekeeping function of prolyl aminoacyl tRNA synthetases, and is misincorporated into proteins in place of proline [1,30,31].

Discussion
Gene processing proteins, such as those proline-rich proteins that catalyze splicing of a primary gene transcript (pre-mRNA), lead to translation of a single message into a large number of different protein isoforms. Thus the misassembly of one alternative-splicing protein can result in the malformation of a very large number of downstream protein products. The result is the marked amplification of the effects of a single protein misconstruction. Such events during early embryogenesis could have damaging consequences. They may also contribute to carcinogenesis.
Zinc finger proteins are intimately involved in DNA expression, RNA assembly, transcription, and apoptosis [32,33]. Our data indicate that in one class of zinc finger proteins that contain only singlet or doublet prolyls (no triplets and their helices, and no longer runs of prolines), there is an amino acid motif (TWEAZR) in which a prolyl recurs every 28 residues and very rarely anywhere else. It follows the sequence TGEH, and precedes a YKCEEC sequence (Figure 4). Why evolution has selected proline for this specific position in these molecules is yet to be determined. Obviously the size, shape, charge distribution, and peptide angles are important. Beyond these properties is the fact that proline readily isomerizes between the trans and cis positions. Its flexing action shifts the location of variable residues downstream, as in positions 13 and 16. These residues in ZNF729 are shown in Table 3, in which we arbitrarily assigned proline to residue position 1. In the 30 sequence repeats, position 14 is occupied by a serine 18 times and by a phenylalanine 12 times. Position 15 is always occupied by a serine. In zinc finger proteins 100, 141, 208, 726, and 737, position 13 is occupied by 15 different amino acids, and position 16 by 12 different amino acids. On the other hand, position 14 is occupied by serine and phenylalanine 56% and 33% of the time, respectively. Position 15 is occupied by serine 97% of the time. These molecular locations may be relevant to zinc finger amino acid/DNA contact [34]. In ZNF729, of the 30 residues in position 13 and in position 16, 24 are polar and 6 are nonpolar. Of the 30 residues in position 14, 18 are polar and 12 are nonpolar, and all 30 residues in position 15 are polar.
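The position-by-position tallies discussed above amount to counting residues in one column of a set of aligned repeats, with proline assigned to position 1. The Python sketch below shows that bookkeeping on three invented repeat fragments; it does not reproduce Table 3, and the function names are hypothetical. The final lines simply convert the counts reported in the text for position 14 of ZNF729 (18 serines and 12 phenylalanines in 30 repeats) into percentages.

```python
from collections import Counter


def column_composition(repeats: list, position: int) -> Counter:
    """Residue counts at a 1-based position across aligned repeats."""
    return Counter(r[position - 1] for r in repeats if len(r) >= position)


def as_percent(counts: Counter) -> dict:
    """Convert residue counts to percentages of the column total."""
    total = sum(counts.values())
    return {aa: 100 * n / total for aa, n in counts.items()}


# Invented repeat fragments, each starting at the proline (position 1).
toy_repeats = [
    "PYKCEECGKAFNQSSH",
    "PYKCEECGKAFKRFSN",
    "PYKCEECGKAFSWSST",
]
print(column_composition(toy_repeats, 14))
print(column_composition(toy_repeats, 15))

# Counts reported in the text for position 14 in the 30 repeats of ZNF729.
reported_pos14 = Counter({"S": 18, "F": 12})
print(as_percent(reported_pos14))
```

Applied to the reported counts, this gives 60% serine and 40% phenylalanine at position 14 in ZNF729.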
Perhaps some of these variable amino acids make on-off contact with bases in DNA, depending on the cis/trans state of the preceding proline. Thus prolyl isomerization may be a highly conserved mechanism that determines zinc finger protein control of DNA expression. This pivotal role of proline is supported by the fact that other prolyls are virtually excluded within the recurring motif domain. We found only 3 prolines in 2,084 outlying positions, a frequency of 1:695. The isomerization of an outlier could corrupt the logic of the molecule's structure. This hypothesis about the uniqueness of proline in this setting is strengthened by the fact that single residues of each of the other amino acids that form the repetitive sequences (t, g, a, e, k, y, c, and h) are found in random locations elsewhere within the molecule.
In addition, the critical role of proline isomerization in zinc finger function is supported by the close association between zinc finger proteins and proteins that possess peptidyl-prolyl isomerase activity, such as cyclophilin, FKBPs, and parvulin [35]. Cyclophilin A, a peptidyl-prolyl isomerase, has been shown to be necessary for the proper function of the zinc finger protein Zpr1p [36]. We suggest that the control of prolyl isomerization may be a critical link in zinc finger protein regulation of gene expression. Furthermore, in zinc finger proteins that contain long consecutive prolyl repeats, including prolyl triplets and their miniature helices, this cyclic pattern breaks down. Presumably the polyproline helices introduce abrupt changes in the direction of downstream residues, and interfere with the formation of long continuous spiral domains. Curiously, zinc finger proteins are markedly overrepresented among these proline-laden molecules. It is apparent that zinc finger protein structure and function may be corrupted by the substitution of Aze for proline.
Overall, we have cataloged the presence, position, and general functional role of proline in the human proteome. We have suggested a possible role for proline in the regulation of zinc finger protein binding to nucleic acids based on cis-trans isomerization. Also, by listing the dominant functional roles of proline-rich proteins, we suggest likely future directions for the investigation of the impact of Aze misincorporation on human molecular pathophysiology. Such studies have been pioneered by Schimmel and his colleagues [37].

Methods
To define the human proteome, we started with the human genome. Using the HUGO Gene Nomenclature Committee's list of accepted protein coding genes [23], we obtained the corresponding protein sequences for these genes from Ensembl [38]. For each gene, we used the longest corresponding peptide sequence in Ensembl and removed the initial methionine, giving us 18,666 different amino acid sequences making up the human proteome.
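The selection step described above (longest peptide per gene, initial methionine removed) can be sketched as follows. This is an illustration in Python, not the authors' code; the gene identifiers and sequences are invented, the function names are hypothetical, and keeping the first-encountered sequence when two peptides tie in length is an assumption, since the text does not say how ties were resolved.

```python
def longest_peptide_per_gene(peptides) -> dict:
    """Keep the longest sequence per gene from (gene_id, sequence) pairs.

    Ties are resolved in favor of the first sequence seen (an assumption).
    """
    best = {}
    for gene_id, seq in peptides:
        if gene_id not in best or len(seq) > len(best[gene_id]):
            best[gene_id] = seq
    return best


def strip_initial_met(seq: str) -> str:
    """Remove a leading methionine, if present."""
    return seq[1:] if seq.startswith("M") else seq


# Invented records standing in for peptide sequences retrieved per gene.
records = [
    ("GENE_A", "MPPPAG"),
    ("GENE_A", "MPPPAGKLPPE"),
    ("GENE_B", "MKTAY"),
]

proteome = {
    gene_id: strip_initial_met(seq)
    for gene_id, seq in longest_peptide_per_gene(records).items()
}
print(proteome)
```

Removing the first residue shifts all downstream coordinates by one relative to the full-length sequence, which matters when comparing span positions with other numbering schemes.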
Amino acid statistics were computed using the R statistical programming language [39]. We examined functional enrichment of lists of proteins rich or poor in proline or polyproline using the web-based DAVID tool from NIAID [40].
Figure S1. Five Examples of High Proline-Containing Zinc Finger Proteins Lacking a Repetitive Motif (TWEAZR). The amino sequences of five proteins high in proline [...]
Table 3. Conservation and variation in the amino acids in the repeating motif in ZNF729.
Table S7. Zinc finger proteins with low proline occurrence. The proline dimers in zinc finger proteins lacking three or more consecutive prolyl residues. All contained guests in the third position, implying that they may take on a polyproline helical configuration. In the above group most of the prolyl dimers are located in the lead sequence, at a considerable distance from the repeat motifs.