High GC Content Causes De Novo Created Proteins to be Intrinsically Disordered

De novo creation of protein-coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. De novo created proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should, for instance, not cause aggregation. Therefore, although the creation of the short ORFs could be truly random, their fixation should be subject to some selective pressure. The selective forces acting on de novo created proteins have been elusive, and contradictory results have been reported. In Drosophila they are more disordered, i.e. enriched in polar residues, than ancient proteins, while the opposite trend is present in yeast. To the best of our knowledge no valid explanation for this difference has been proposed. To solve this riddle we studied the structural properties and age of all proteins in 187 eukaryotic species. We find that, on average, there are only small differences between proteins of different ages, with the exception that younger proteins are shorter. However, when we take the GC content into account, we find that it can explain the opposite trends observed in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids, and inversely correlated with codons for transmembrane-, helix- and sheet-promoting residues. We find that for the youngest proteins, i.e. the ones that are most likely to be de novo created, there exists a strong correlation between GC content and structural properties. In contrast, this strong relationship is not seen for ancient proteins. This leads us to propose that structural features are not a strong determining factor for fixation of de novo created genes. Instead, these proteins resemble random proteins given a particular GC level. The dependency on GC content is then gradually weakened during evolution.
Author Summary
We show that the GC content of a genomic area is of great importance for the properties of a protein-coding de novo created gene. The GC content affects the frequency of the codons, and this affects the probability for each amino acid to be included in a de novo created protein. The codons encoding Ala, Pro and Gly contain more than 80% GC, while codons for Lys, Phe, Asn, Tyr and Ile contain 20% or less. Pro and Gly are disorder-promoting, while Phe, Tyr and Ile are order-promoting. Therefore random protein sequences at a high GC will be more disordered than the ones created at a low GC. The structural properties of the youngest (orphan) proteins match to a large degree the properties of random proteins when the GC content is taken into account. In contrast, structural properties of ancient proteins only show a weak correlation with GC content. This suggests that, even after fixation, de novo created proteins largely resemble random proteins given a certain GC content. Thereafter, during evolution, the correlation between structural properties and GC weakens.

Proteins without any detectable homologs outside one genome are often referred to as orphans. Orphan protein-coding genes can be created by gene duplication, lateral transfer of genetic material and de novo gene creation. De novo created genes are of particular interest, as they are the only source of completely novel protein-coding material and present a rare chance for full-frontal functional novelty. Further, studies of the properties of these genes might provide unique insights into the fundamental processes in the formation and selective pressure of all genes since clearly, in the strict sense, all protein superfamilies were once created by a de novo mechanism.

Before the genomic era, the scientific consensus held that de novo creation of new genes was rare; instead it was believed that the vast majority of all genes were originally generated in an ancient "big bang". However, when the first complete genomic sequences were published, this hypothesis was not supported [1]. In fact, to this day, when analysing complete genomes from closely related species, a surprisingly high number of orphan proteins persist [2][3][4]. It has later been shown that some of the initially assigned orphan proteins are not de novo created but rather a result of limited phylogenetic coverage of the genomes [5].

Today, supported by the vast amount of complete genome sequences available and improved search methods [6], many of the orphan proteins detected, at least in yeast, appear to be created through de novo formation [7,8]. Some studies indicate that, in yeast, there is a large set of proto-genes: ORFs that remain on the verge of becoming fixed as bona fide protein-coding genes in the population [7]. This gives a possible background in explaining how novel proteins can be generated from non-coding genetic material. In other species the genomic coverage has been more limited and therefore the studies have been less detailed.
It is clear that not all identified orphan proteins are de novo created. There are several reasons for this. Some orphans might be classified as such primarily because their relationships with other proteins are missed. This problem is exacerbated when only a limited number of closely related genomes is available and for fast-evolving proteins. In addition, gene duplication, lateral transfer, gene losses and domain rearrangements also make it difficult to detect the true relationship between all proteins. To accurately detect de novo created genes, the availability of several completely sequenced genomes, not only from closely related species but also from a set of numerous and evenly spaced taxa, is essential. Even when this is present, the best that can be obtained is a set of orphans strongly enriched in de novo created proteins.

The availability of complete genomes separated at different evolutionary distances also enables studies at different ages [3,5,7]. Here, a gene can be unique to a specific species, or even to a strain; alternatively it can be present pervasively across a taxonomic group. Even more ancient orphans may be defined as superfamilies that are unique to a kingdom of life [9,10]. Using methods such as ProteinHistorian it is possible to assign an age to each protein [11].

more often essential and (iii) obtain lower β-strand content and higher stability [14]. Some aspects of these, such as the fact that orphans on average are short, are likely related to a de novo creation mechanism. However, other features, including intrinsic disorder [4,15], are not obviously related to the bona fide gene genesis and could instead be the result of the selective pressure acting during fixation.

In yeast, we have earlier reported that the most recent orphans, i.e. the ones unique to S. cerevisiae, are less disordered than the average yeast protein [3]. Studies enabled by the sequencing of Drosophila pseudoobscura provide the opposite picture, i.e. the youngest proteins are more disordered than ancient ones [4]. To the best of our knowledge the origin of this difference has not been explained. Could the selective forces for de novo creation be that disparate between two different eukaryotes, or could the de novo genetic mechanisms be different, or is it an artefact caused by evolutionary rates or evolutionary distances between the related genomes?

Alternatively, there might exist some genomic feature that is different between Drosophila and yeast that could explain the difference in the intrinsic disorder of their orphan proteins. In addition to hugely different sizes and different gene structures, the GC content differs significantly between the genomes of different taxa. The GC content of Saccharomyces genomes is roughly 40%, while in Drosophila the GC content is 55%.

To obtain a better understanding of the structural properties affecting the de novo creation of proteins, we studied the age of proteins in 187 eukaryotic genomes, significantly more than used in earlier studies. Due to the frequency of lateral transfer in prokaryotic organisms, age estimation of prokaryotic genes is more troublesome than for eukaryotic genes. Therefore, we focus on eukaryotic organisms in this study.

We find that the most striking difference between young and old proteins is their difference in length. Surprisingly, all other properties show a large overlap between ancient and orphan proteins. However, we find that structural features in orphan proteins differ significantly between low-GC and high-GC genes. Orphans in high-GC genes are more disordered and have less secondary structure than those in low-GC genes. In older proteins this relationship is much weaker, supporting a model where de novo creation starts from random non-coding ORFs and then gradually adopts the features of ancient proteins.

To start, protein data for 400 eukaryotic species were obtained from OrthoDB, release 8 [16]. These species are divided into 173 Metazoans and 227 Fungi, for a total of 4,562,743 protein sequences. For each species, a complete proteome was also downloaded from the UniProt Knowledge Base [17].

Age estimate
The ProteinHistorian software pipeline [11] is aimed at annotating proteins with phylogenetic ages. This method requires a phylogenetic tree relating a set of species, and a protein family file containing the orthology relationships between the proteins of the species in the tree. The pipeline will then assign each protein to an age group, depending on the species tree and the ancestral family reconstruction algorithm used to identify protein families. For our application, we used ProteinHistorian with default parameters, the NCBI phylogenetic tree [18], and protein orthology data obtained from OrthoDB. The OrthoDB method is based on all-against-all protein sequence comparisons using the Smith-Waterman algorithm and requires a sequence alignment overlap of at least 30 amino acids across all members of an orthologous group. Therefore, the age group can be thought of as the level in the species tree at which a shared sequence of at least 30 amino acids first appeared, i.e. a multi-domain protein is assigned the age of its oldest domain.
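To make concrete what assigning a protein family to a level in the species tree means, the sketch below places a family at the most recent common ancestor of the species in which it occurs. This is a simplified rule chosen for illustration, not ProteinHistorian's actual reconstruction algorithm; the toy tree and the function names `lineage` and `family_origin` are hypothetical.

```python
def lineage(node, parent):
    """Return the path from a node up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def family_origin(species, parent):
    """Most recent common ancestor of all species carrying the family."""
    species = list(species)
    common = lineage(species[0], parent)
    for sp in species[1:]:
        ancestors = set(lineage(sp, parent))
        # Keep only nodes shared with this species, preserving leaf-to-root order.
        common = [node for node in common if node in ancestors]
    return common[0]


# Toy species tree (child -> parent); illustrative only, not the NCBI tree.
PARENT = {
    "S. cerevisiae": "Saccharomyces",
    "S. paradoxus": "Saccharomyces",
    "Saccharomyces": "Fungi",
    "C. albicans": "Fungi",
}

print(family_origin({"S. cerevisiae"}, PARENT))                  # S. cerevisiae
print(family_origin({"S. cerevisiae", "S. paradoxus"}, PARENT))  # Saccharomyces
print(family_origin({"S. cerevisiae", "C. albicans"}, PARENT))   # Fungi
```

Under this toy rule, a family found in a single species would be a candidate orphan, one restricted to a genus a candidate genus orphan, and one reaching the root a candidate ancient protein.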

One problem with using the NCBI phylogenetic tree is the presence of many polytomic branches, especially at the genus level. The cases where more than one species is present in a multifurcated branch are problematic, because ProteinHistorian cannot distinguish between proteins being specific to one of those species and proteins shared among the entire group. To solve this, we converted the NCBI tree to a fully binary tree by forcing no polytomy on the terminal branches.

from OrthoDB for other reasons, including that they were not present when the database was created or that they have undergone large domain rearrangements. We would assume that truly de novo created orphans do not contain domains found in other proteins. Therefore, to ensure that we have a unique set of orphan proteins, we filtered out proteins with hits in the Pfam-A database, by using hmmscan. We believe that, due to the very stringent criteria used here, the majority of this remaining set is constituted of de novo created proteins, and we refer to them as orphans throughout the rest of this paper. These proteins are specific to the species taxonomic level, i.e. we expect not to find them in other species in the dataset, even in the same genus. For Saccharomyces cerevisiae, which has several strains in the dataset, we also included the strain-specific proteins in the orphan group.

Among the OrthoDB proteins, we defined as genus orphans those that were assigned Orphans; for this reason, we selected for our final dataset only those species that have at least one other species within the same genus.

Proteins having the maximum age according to ProteinHistorian were defined as ancient: these proteins are thought to be present in the common ancestor of all Fungi (taxon id = 4751) or all Metazoa (taxon id = 33208). Finally, proteins whose calculated age is between genus orphans and ancient were defined as intermediate.

For instance, in Saccharomyces cerevisiae (reference strain s288c), we identified 16 orphans and 5 genus orphans, out of 6466 total proteins. As a comparison, in our earlier study we reported 157 species-specific and 125 genus-specific orphans [19], and Vidal and co-workers reported 143 species-specific (ORFs 1) and 609 genus-specific (ORFs 2−4) proteins [20]. In a more detailed view, 50-70% of the proteins earlier previously [4]. Four species were found to have more than 5% of orphans: Ciona intestinalis (5.8%), Colletotrichum gloeosporioides (6.4%), Botryotinia fuckeliana (6.5%) and Apis mellifera (7.2%).

In conclusion, we believe that the conservative estimate of orphans used here is suitable for this study, as our primary aim is not to estimate the exact number of orphans but to examine properties of proteins of different ages.

Generally, the GC% of a coding region is higher than that of a non-coding region of DNA [22]; therefore, we expect that, for any given species, the GC of coding segments would be higher than the taxonomic GC. To examine this, the genome-wide GC content was downloaded, for each species, from NCBI Genome Reports (ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/eukaryotes.txt). Indeed, for 94% of the species the GC of the CDS sequences is higher than the taxonomic GC. Therefore we find it more relevant to define the genomic GC content as the average, for each species, of the GC of its CDS. Anyhow, results computed for predicted structural properties

Intrinsic disorder content was predicted for all the proteins by using IUPred in its long disorder mode [23]. A single amino acid residue was then labelled as disordered if its predicted intrinsic disorder score was > 0.5. The disorder content of a protein is shown as the percentage of its disordered amino acid residues.

We used SCAMPI [24] to predict the percentage of transmembrane residues of each protein. Low-complexity regions were predicted using the software SEG [25]. For each protein, we indicate as SEG the percentage of residues in low-complexity regions.
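The two per-sequence quantities defined above, the GC content of a coding sequence and the percentage of disordered residues, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ambiguous nucleotides are ignored, and the per-residue disorder scores are assumed to have been produced beforehand by a predictor such as IUPred (no predictor is called here).

```python
def gc_percent(cds):
    """GC content of one coding sequence, in percent (ignores non-ACGT symbols)."""
    cds = cds.upper()
    gc = cds.count("G") + cds.count("C")
    at = cds.count("A") + cds.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


def species_gc(cds_sequences):
    """Genomic GC of a species, taken as the average GC over its CDS."""
    values = [gc_percent(cds) for cds in cds_sequences]
    return sum(values) / len(values)


def disorder_content(scores, threshold=0.5):
    """Percentage of residues whose disorder score is strictly above the threshold."""
    disordered = sum(1 for score in scores if score > threshold)
    return 100.0 * disordered / len(scores)


print(gc_percent("ATGGCCGGCTAA"))                # GC% of a toy CDS
print(disorder_content([0.1, 0.6, 0.5, 0.9]))    # 50.0 (0.5 itself is not counted)
```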

PSIPRED [26] was used to predict the secondary structure of all the proteins in the dataset. Here, the secondary structure was predicted using only a single sequence and not a profile. This reduces the accuracy, but the overall frequencies should still be predicted rather accurately. We annotated each protein with the percentage of its residues predicted to be in each type of structure (alpha helix, beta strand, coil).

Propensity scales
TOP-IDP [27] is a measure of the disorder-promoting vs. disorder-avoiding propensity of single amino acids. For each protein, a single propensity was calculated by averaging the TOP-IDP values of all its residues.

We express the hydrophobicity of each protein as the average score of all its residues using the Hessa hydrophobicity scale [28].

For each protein, we computed the propensity of each amino acid to be in one of the four possible secondary structures (helix, sheet, coil, turn) by using the energy function-based propensity scales proposed earlier [29]. The average propensity for each secondary structure was then calculated for each protein.
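All the propensity measures above share the same operation: a per-residue scale is averaged over the sequence. A minimal sketch of that averaging is given below. The function name `mean_propensity` is hypothetical, residues missing from the scale are simply skipped (an assumption), and the values in `TOY_SCALE` are placeholders for illustration only; they are not the TOP-IDP, Hessa or secondary-structure scale values.

```python
def mean_propensity(sequence, scale):
    """Average a per-residue propensity scale over a protein sequence."""
    values = [scale[aa] for aa in sequence if aa in scale]
    if not values:
        raise ValueError("no residue of the sequence is covered by the scale")
    return sum(values) / len(values)


# Placeholder scale: positive = disorder-promoting, negative = order-promoting.
# These numbers are invented for the example and carry no biological meaning.
TOY_SCALE = {"P": 1.0, "G": 0.5, "I": -0.5, "F": -1.0}

print(mean_propensity("PPGG", TOY_SCALE))  # 0.75
print(mean_propensity("PF", TOY_SCALE))    # 0.0
```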

Random proteins at different GC contents
To test whether the studied intrinsic properties (disorder, transmembrane, TOP-IDP and hydrophobicity), as well as the frequency of any given amino acid, were solely dependent on GC content, we used a set of 21,000 random ORFs, generated as follows: at each GC content ranging from 20 to 90%, in steps of 1%, a set of 400 ORFs (equally divided into 300, 900, 1,500 and 2,100 bp long) was generated so that its GC content, $GC_{freq}$, is set accordingly:

$$P(\mathrm{codon}) = \prod_{i=1}^{3} \left[ \delta(N_i|GC)\,\frac{GC_{freq}}{2} + \delta(N_i|AT)\,\frac{1 - GC_{freq}}{2} \right]$$

where $N_i$ is the nucleotide of the codon in position $i$, $\delta(N|GC)$ is equal to 1 if the nucleotide $N$ is guanine or cytosine and zero otherwise, and $\delta(N|AT)$ is defined analogously for adenine and thymine. Finally, start and stop codons are added. These ORFs were then translated to polypeptides, and all their intrinsic properties, as well as the frequencies of their amino acids, were computed as described above.
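The generation of random ORFs at a given GC content can be sketched as below. This is one plausible reading of the procedure, not necessarily the exact implementation used: it assumes that each nucleotide is drawn independently, with G and C each having probability GC/2 and A and T each having probability (1 − GC)/2, that in-frame stop codons are rejected and redrawn, and that the ORF is closed with a fixed `TAA` stop codon. The names `random_orf` and `translate` are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOP_CODONS = {c for c, aa in CODON_TABLE.items() if aa == "*"}


def random_orf(gc, length_bp, rng):
    """Random ORF with `length_bp` internal nucleotides at GC fraction `gc`."""
    weights = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]
    codons = []
    while len(codons) < length_bp // 3:
        codon = "".join(rng.choices("GCAT", weights=weights, k=3))
        if codon not in STOP_CODONS:  # assumption: no premature stop codons
            codons.append(codon)
    return "ATG" + "".join(codons) + "TAA"  # add start and stop codons


def translate(orf):
    """Translate an ORF up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


rng = random.Random(0)
# One GC level: 400 ORFs, 100 for each of the four lengths.
orfs = [random_orf(0.60, n, rng) for n in (300, 900, 1500, 2100) for _ in range(100)]
proteins = [translate(orf) for orf in orfs]
print(len(proteins), len(proteins[0]))  # 400 ORFs; the shortest give 101 residues
```

Rejecting stop codons slightly raises the realised GC content above the target, because the three stop codons are AT-rich; the effect is small at the GC levels considered here.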

The assignment of age to all proteins is based on the ProteinHistorian pipeline [11]. In the youngest, orphan group, only proteins that are (a) not present in any other of the 400 eukaryotic genomes in OrthoDB [16]  proteins in the dataset are classified as orphans, see Fig. 1a.

In the next group, genus orphans, only proteins that are unique to a genus are included; this group also makes up less than 1% of the proteomes. Given that these estimates are significantly more conservative than earlier methods it can be assumed that a large fraction of both orphans and genus orphans.

domain-fusions [30], additional secondary structure elements [31] and expansion within intrinsically disordered regions [12].

As coding regions on average have higher GC content than non-coding regions [22], it could therefore be expected that GC content would increase with length [32] and what is expected for a set of shorter proteins, but certainly it could also indicate there is a preference for a subset of orphans to be disordered.

The fraction of transmembrane residues is on average ∼30% in orphan proteins,

In general, it is apparent that the variation among the species is quite large, as in some organisms orphans are more disordered than ancient proteins, while in others the opposite appears to be the case. What could possibly explain this difference?

One possibility is that the more complex regulation in animals requires more disordered residues in comparison with yeast. But the average disorder content is similar in all eukaryotic species, contradicting this idea. We also noted that yeast has one of the genomes with the lowest GC content (∼40%). Therefore, we decided to examine the properties of proteins from different age groups with respect to their GC content.

Orphans are more disordered in high-GC genomes

The GC content is not constant over a genome. In general, coding regions have higher GC than non-coding regions [33]. Further, there is also variation in GC between different regions of a genome, so when a non-coding region is turned into a gene the local GC will decide the amino acid content of the protein. Therefore, it might be more relevant to study the GC of each gene individually.

A strong relationship between GC and structural properties of orphan genes

In Fig. 4 we show the dependency of structural properties on GC content for individual genes. In addition to the variation for proteins of the four age groups, we have examined the structural properties for a set of random proteins generated from codons at a given GC frequency; for details see Methods. It can be seen that the structural properties of these random genes are clearly GC dependent.

Orphans, and genus orphans, show a definite dependency of all studied properties on GC, thus indicating that, broadly, orphan proteins appear to be similar to random proteins in their nature, given a certain GC level, see Fig. 4. In contrast ancient and they appear to contain less sheet and more helical residues than expected by random.

When studying Fig. 4 in more detail, a few notable differences between the random proteins and the orphans can be observed: orphans are more disordered and contain more low-complexity regions but fewer sheets, independently of the GC level.

It should be recalled that what we describe above is based on predicted structural features, which are a reflection of the sequence of a protein. If a certain group of proteins is predicted to be more disordered, or to contain more sheets, it is quite likely a consequence of changes in amino acid frequencies, in such a way that the frequency of order-/disorder-promoting amino acids changes.

[27], for hydrophobicity we used the biological hydrophobicity scale [28], while sheet, turn, coil and helix propensities were analysed using structure-based conformational preference scales [29].

Interestingly, the propensities of the two groups of older proteins also change with GC; however, this dependency is much less pronounced than in younger or random proteins. We should remember that amino acid preferences and GC content are coupled both ways: changes of amino acid composition will not only affect the properties but also the codons used and thereby the GC; so it is possible that the relationship between properties and GC for ancient proteins is an indirect consequence of the amino acid preferences and not that the disorder is caused by high GC. The big difference seen between orphan and ancient proteins indicates that, given evolutionary time, the selective pressure to change the GC level is weaker than the selective pressure to change the protein properties.

The GC content affects the codon usage between different genomes [34] and it has been argued that the GC content might be solely responsible for the codon bias [35]. The difference in codon usage causes differences in amino acid frequencies, in such a way that some amino acids are more frequent at higher GC content levels. Obviously, the reverse could also be true, i.e. that high disorder content increases the GC content of a gene. But if this was the case the correlation should be stronger for ancient

increase their frequency to a higher level than the 3.3% expected by chance.

Finally, Cys and His are less frequent, independently of GC content, in real genes than in random ones, indicating their special roles in protein function and folding as well as their rareness.

In Fig. 7 the GC content of the codons of each amino acid is compared with the propensity of that amino acid to be in a certain structural region. Three amino acids, Ala, Gly and Pro, are "high GC" amino acids, i.e. they have more than 80% GC in their codons, while five amino acids, Lys, Phe, Asn, Tyr and Ile, are "low GC" amino acids, i.e. they have less than 20% GC in their codons. The other twelve amino acids show a weaker dependency on GC content, see Fig. 7.
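The grouping into "high GC" and "low GC" amino acids follows directly from the standard genetic code and can be checked with a short script. The sketch below assumes that the GC content of an amino acid is the unweighted mean GC fraction over its synonymous codons; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_gc_by_amino_acid():
    """Mean GC fraction of the synonymous codons of each amino acid."""
    per_codon = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # skip stop codons
            per_codon[aa].append(sum(base in "GC" for base in codon) / 3)
    return {aa: sum(v) / len(v) for aa, v in per_codon.items()}


mean_gc = codon_gc_by_amino_acid()
high_gc = sorted(aa for aa, gc in mean_gc.items() if gc > 0.8)
low_gc = sorted(aa for aa, gc in mean_gc.items() if gc < 0.2)
print(high_gc)  # ['A', 'G', 'P']            (Ala, Gly, Pro)
print(low_gc)   # ['F', 'I', 'K', 'N', 'Y']  (Phe, Ile, Lys, Asn, Tyr)
```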

All three "high GC" amino acids are intrinsic disorder-promoting (high TOP-IDP), while four out of five "low GC" amino acids are order-promoting (low TOP-IDP) residues. Therefore at high GC content, codons coding for hydrophilic, disorder-promoting amino acids are prevalent in any given protein, by simple statistics, while DNA sequences low in GC tend to contain codons for hydrophobic amino acids, associated with low intrinsic disorder.

A comparison between the GC level and structural preferences is shown in Fig. 7.

In a random sequence, the most disorder-promoting amino acid, Pro, represents less than 5% of the amino acids at 40% GC, but 10% at 60% GC. This actually agrees well with what is observed in the youngest proteins: 5% at 40% GC vs. 9% at 60% GC, see Fig. 6. Interestingly, even ancient proteins show a similar but significantly weaker trend. Here, the fraction of Pro increases from 4.5% to 6%.
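The expected frequency of an amino acid in a random sequence can also be computed analytically rather than by simulation. The sketch below is an illustration under stated assumptions: nucleotides are independent, G and C each occur with probability GC/2 and A and T each with probability (1 − GC)/2, and stop codons are excluded with the remaining probabilities renormalised. The function name `expected_aa_freq` is hypothetical. Under these assumptions Pro comes out at roughly 4% at 40% GC and 9% at 60% GC, in line with the trend described above.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_aa_freq(gc):
    """Expected amino acid frequencies of random codons at GC fraction `gc`."""
    p_base = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freq = defaultdict(float)
    for codon, aa in CODON_TABLE.items():
        p = 1.0
        for base in codon:
            p *= p_base[base]
        freq[aa] += p
    p_stop = freq.pop("*")  # assumption: stop codons are not allowed in-frame
    return {aa: p / (1 - p_stop) for aa, p in freq.items()}


for gc in (0.4, 0.6):
    print(gc, round(expected_aa_freq(gc)["P"], 3))
# 0.4 0.043
# 0.6 0.093
```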

Similar changes in frequencies can be observed for several amino acids.

On average, young proteins are more disordered than ancient proteins, but this property is strongly related to the GC content. In a low-GC genome the disorder content of an orphan protein is ∼30% while in a high-GC genome it is over 50%, see

Figure 1. Overview of the proteins assigned to the four age groups in this study. Orphan proteins are proteins unique to one strain/species; genus orphans are found at the immediately superior level (species/genus); intermediate proteins are found at more general taxonomic levels, but are not assigned to be present in the ancestor of all fungi/metazoans; ancient proteins are supposed to be present in the ancestral genomes. The plots show (a) the fraction of proteins belonging to each age group, (b) the average length, in amino acids, (c) the average GC content of the genes, (d) intrinsic disorder (long) predicted by IUPred (% of disordered residues), (e) percentage of transmembrane residues, (f) fraction of residues in low-complexity regions, (g) fraction of residues predicted to be coil, (h) fraction of residues predicted to be in a beta sheet and (i) fraction of residues predicted to be in a helix.