High GC content causes orphan proteins to be intrinsically disordered

De novo creation of protein-coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. These orphan proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should, for instance, not aggregate. Therefore, although the creation of short ORFs could be truly random, their fixation should be subject to some selective pressure. The selective forces acting on orphan proteins have been elusive, and contradictory results have been reported. In Drosophila, young proteins are more disordered than ancient ones, while the opposite trend is present in yeast. To the best of our knowledge, no valid explanation for this difference has been proposed. To solve this riddle, we studied the structural properties and age of proteins in 187 eukaryotic organisms. We find that, with the exception of length, there are only small differences in properties between proteins of different ages. However, when we take the GC content into account, we find that it can explain the opposite trends observed for orphans in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids. This leads us to propose that intrinsic disorder is not a strong determining factor for fixation of orphan proteins. Instead, these proteins largely resemble random proteins given a particular GC level. During evolution, the properties of a protein change faster than the GC level, causing the relationship between disorder and GC to gradually weaken.

Proteins without any detectable homology are often referred to as orphans. The presence of orphans can be attributed to several causes: rapid sequence divergence beyond the point of homology recognition [1,2], lateral transfer of genetic material [3], and de novo gene creation [4]. The last is of particular interest, as it is a source of completely novel coding material. Studies of the properties of these proteins might provide unique insights into the fundamental processes in the formation of all proteins, since, in the strict sense, all proteins were once created by a de novo mechanism.

Before the genomic era, the scientific consensus held that de novo creation of new genes was rare; instead, it was believed that the vast majority of all genes were generated in an ancient "big bang". However, when the first complete genomic sequences were published, this hypothesis was not supported [5]. In fact, to this day, when analyzing complete genomes from closely related species, a surprisingly high number of orphan proteins is still found [6][7][8]. It has later been shown that some of these proteins are not de novo created but were rather assigned as orphans as a result of limited phylogenetic coverage in earlier studies [9].

Today, supported by the vast number of complete genome sequences available and improved search methods [10], many of the initially identified orphans have been shown to have distant homologs in other genomes. Still, at least in yeast, a large set of genes appears to have been created through recent de novo formation [11,12]. These studies indicate that in yeast there is a large set of proto-genes: ORFs that remain on the verge of becoming fixed as protein-coding genes in the population [11]. This provides a possible model of how novel proteins can be generated from noncoding genetic material.
In other species than yeast the genomic coverage is more limited, and therefore studies have been less detailed.

The availability of many complete, evenly spaced genomes allows classifying proteins by evolutionary age [7,9,11], using methods such as ProteinHistorian [13]. Here, a protein can be unique to a specific species, or even to a strain; alternatively, it can be present pervasively across a taxonomic group [14,15]. After de novo creation, a gene needs to become fixed in the population. The selective forces governing this process have been studied by examining the properties of orphan proteins. Intrinsic disorder, low complexity, subtelomeric location, high β-sheet preference as well as other features have been associated with orphan proteins [16,17]. It has also been proposed that with age proteins (i) accumulate interactions, (ii) become essential more often and (iii) obtain lower β-strand content and higher stability [18]. Some aspects of these observations, such as the fact that orphans on average are short, are likely connected to a de novo creation mechanism. However, other features, including intrinsic disorder, are not obviously related to gene genesis and could instead be the result of the selective pressure acting during fixation.

In yeast, we have earlier reported that the youngest proteins, i.e. the ones unique to S. cerevisiae, are less disordered than older proteins [7], while in Drosophila the opposite can be seen: the youngest proteins are more disordered than the ancient ones [19]. To

Predicted properties of proteins

Intrinsic disorder content was predicted for all the proteins using several disorder predictors: short and long disorder predictions by IUPred [24], three types of predictions (REM-465, Hotloops and Coils) by DisEMBL [25], and GlobPlot [26].
In the main figures we only report the prediction by IUPred long; the others are found in the supplementary material (Fig. S3 to S8). It is worth mentioning that these predictors operate with different definitions of disorder, so a consensus should not be expected.

We used SCAMPI [27] to predict the fraction of transmembrane residues in a protein. The fraction of low-complexity residues was predicted using SEG [28]. PSIPRED [29] was used to predict the secondary structure of all the proteins in the dataset, using only a single sequence and not a profile. This reduces the accuracy, but the overall frequencies should not change significantly. We annotated each protein with the fraction of residues predicted to be in each type of secondary structure (α-helix, β-strand, coil).

Propensity scales

TOP-IDP [30] is a measure of the disorder-promoting propensity of a single amino acid. For each protein, the average propensity was calculated by averaging the TOP-IDP values of all its residues. Similarly, the hydrophobicity of each protein was expressed as the average hydrophobicity using the biological hydrophobicity scale [31]. Finally, we computed the average propensity of each protein for each type of secondary structure (helix, sheet, coil, turn) in the same manner, using secondary structure propensity scales [32].

Statistical significance of the results

To test the statistical significance of the results, a number of tests were performed. Rank-sum tests between all possible pairs of age groups were performed for the entire dataset and for each studied property. Due to the large number of samples, the p-values from these tests are always smaller than 10^-141, even when the absolute difference in numbers is minuscule.

To study the difference between young and old proteins on a global level, we performed a rank-sum test for orphan versus ancient proteins within each species. To exclude small variations, we only considered the species where the p-value of this test was <0.01.
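The per-species comparison described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it implements a two-sided rank-sum test with the normal approximation (average ranks for ties, tie-corrected variance) using only the Python standard library, and the function names and the input layout are assumptions made for this example.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). A positive z means values in x tend to be larger than in y.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering the group of each value (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i                     # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # all values identical: no evidence of a difference
        return 0.0, 1.0
    z = (rank_sum_x - mean) / math.sqrt(var)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


def significant_species(data, alpha=0.01):
    """Keep species where orphans differ from ancient proteins at p < alpha.

    data maps a species name to (orphan_values, ancient_values); this layout
    is hypothetical and chosen only for the example.
    """
    result = {}
    for species, (orphans, ancient) in data.items():
        z, p = rank_sum_test(orphans, ancient)
        if p < alpha:
            result[species] = "higher" if z > 0 else "lower"
    return result
```

With very large samples the double-precision p-value of this sketch underflows to 0.0, so extremely small p-values cannot be reported exactly this way.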

To determine the relationship between a property and GC, we studied the slopes for proteins of different age. If the p-value of a linear regression test is <0.01, the corresponding property is considered significantly correlated with GC.
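As an illustration of the slope test described above, the sketch below fits an ordinary least-squares line of a property against GC and tests whether the slope differs from zero. It is an assumption-laden example rather than the study's code: the function name is hypothetical, and the p-value uses a normal approximation to Student's t distribution, which is only reasonable when many proteins are included.

```python
import math
from statistics import NormalDist


def regress_on_gc(gc, values):
    """Least-squares fit values = intercept + slope * gc.

    Returns (slope, intercept, p), where p is a two-sided p-value for the
    null hypothesis slope = 0 (normal approximation to the t distribution).
    """
    n = len(gc)
    mean_x = sum(gc) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in gc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(gc, values))
    if rss == 0:  # perfect fit: the slope is determined without error
        return slope, intercept, 0.0
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t = slope / std_err
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return slope, intercept, p
```

A property would then be called significantly correlated with GC for an age group when the returned p is below 0.01.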

To test whether the studied intrinsic properties, as well as the frequency of any given amino acid, were solely dependent on GC content, we used a set of 21,000 random ORFs, generated as follows: at each GC content ranging from 20 to 90%, in steps of 1%, a set of 400 ORFs (equally divided into 300, 900, 1,500 and 2,100 bp long) was generated so that its GC content was fixed. The ORFs were generated by randomly selecting codons among the 61 non-stop codons. The probability of selecting a codon given a GC content of GC_freq is set accordingly: where N_i is the nucleotide of the codon in position i and δ(N|GC) is equal to 1 if the nucleotide N is guanine or cytosine and zero otherwise, etc. Finally, start and stop codons were added. These ORFs were then translated to polypeptides, and all their

Figure 1. Overview of the proteins assigned to the four age groups: (a) the fraction of proteins belonging to each age group, (b) the average length, in amino acids, (c) the average GC content of the genes, (d) intrinsic disorder predicted by IUPred (long), (e) percentage of transmembrane residues, (f) percentage of residues in low-complexity regions, percentage of residues predicted to be in (g) a coil, (h) a β-sheet and (i) a helix. The difference between orphans and ancient proteins is statistically significant for all the considered properties: the p-value of a rank-sum test is always < 10^-141.
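The random-ORF generation described in the methods above can be sketched as below. The codon weighting is an assumption made for this example: each codon position contributes GC_freq/2 if its nucleotide is G or C and (1 - GC_freq)/2 if it is A or T, and the weights are renormalised over the 61 non-stop codons. The names `codon_weights`, `random_orf` and `gc_fraction` are hypothetical, as is the choice to count the start and stop codons towards the ORF length. Note that this sketch matches the target GC only in expectation, whereas the text states that the GC content of each set was fixed.

```python
import random
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 non-stop codons of the standard genetic code.
SENSE_CODONS = [
    "".join(c) for c in product("ACGT", repeat=3) if "".join(c) not in STOP_CODONS
]


def codon_weights(gc_freq):
    """Unnormalised selection weight of each sense codon (assumed model)."""
    weights = []
    for codon in SENSE_CODONS:
        w = 1.0
        for nt in codon:
            w *= gc_freq / 2 if nt in "GC" else (1 - gc_freq) / 2
        weights.append(w)
    return weights


def random_orf(gc_freq, length_bp, rng):
    """Random ORF of length_bp nucleotides, including start and stop codons."""
    n_codons = length_bp // 3 - 2  # leave room for the start and stop codons
    body = rng.choices(SENSE_CODONS, weights=codon_weights(gc_freq), k=n_codons)
    return "ATG" + "".join(body) + rng.choice(sorted(STOP_CODONS))


def gc_fraction(seq):
    """Fraction of G and C nucleotides in a sequence."""
    return sum(nt in "GC" for nt in seq) / len(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    for gc in (0.2, 0.5, 0.8):
        orf = random_orf(gc, 2100, rng)
        print(gc, round(gc_fraction(orf), 3))
```

Because the three stop codons are AT-rich, excluding them shifts the realised GC slightly above the nominal value at low GC; a production version would need to correct for this to hold GC exactly fixed.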
The assignment of age to all proteins is based on the ProteinHistorian pipeline [13].

proteins. As a comparison, in our earlier study we reported 157 species-specific and 125 genus-specific orphans [7], while Vidal and co-workers reported 143 species-specific (ORFs_1) and 609 genus-specific (ORFs_2-4) proteins [33]. Similarly, in Drosophila pseudoobscura we could identify only 6 orphan proteins, in comparison to the 228 reported previously [8]. This shows that the exact identification of which proteins are de novo created remains difficult and depends on the genomes included in the study.

However, our primary aim in this study is not to estimate the exact number of orphans, but to examine properties of proteins of different ages. Therefore, we believe that our conservative estimate is useful for enriching the fraction of de novo created proteins in the youngest groups.

Next, we set out to estimate the functional evidence for our set of proteins; for this we explored their Gene Ontology (GO) annotation. For each main GO category (process, function and component), we computed the fraction of proteins annotated with at least one GO term in UniProt. In addition, we calculated the fraction of proteins having at least one experimentally verified GO annotation (Table 1).

The fraction of annotated proteins increases steadily with age, from ∼3-9% in orphans to ∼25% in ancient proteins (Table 1). This is expected, as older proteins have more regulatory, protein-protein, and genetic interactions [18]. However, the fraction of proteins with experimental functional evidence is small (<1% of proteins) irrespective of age. This shows that at least a fraction of proteins of any age is functionally characterized, but it is difficult to determine exactly how substantial it is.

In most genomes orphan proteins are more disordered.

Table 2. For the 187 considered species, the number of species in which a property is significantly higher (increasing) or significantly lower (decreasing) in orphans compared to ancient proteins is shown.
Next, we compared predicted structural properties of all proteins, see Fig. 1d-i, S2 and Table 2.

The fraction of residues predicted to be disordered ranges between 20% and 40%, depending on the prediction method. For most disorder predictors, the fraction of disordered residues is higher in orphans than in ancient proteins. However, there are a handful of genomes where the opposite trend is observed: supporting earlier observations, orphans are significantly more ordered in Candida albicans according to 5 out of 6 methods, in Saccharomyces cerevisiae s288c according to 4 methods, and in Fusarium pseudograminearum according to 3. An interesting case is that of Drosophila pseudoobscura, which appears to have more ordered orphans according to IUPred long, contrary to all other Drosophila species.

The fraction of transmembrane residues is on average ∼2% in orphan proteins, with an increasing trend towards ancient proteins (4%). Similarly, the fraction of helical residues increases slightly with age, while the fraction of low-complexity residues decreases with age. For all these structural predictions the changes are quite small and there are genomes

Above, we noted that on average orphan proteins are more disordered. However, we also noted that in a handful of genomes a statistically significant opposite trend could be observed. To investigate this further, we studied the amount of predicted disorder in each genome separately. When studying intrinsic disorder, orphans and genus orphans of S. cerevisiae appear remarkably ordered (∼3% of the amino acids), as shown before [7], see Fig. 2a

In contrast, but also consistent with earlier studies [36], orphans and genus orphans in most Drosophila genomes are more disordered than ancient proteins, see Fig. 2d and e. In the worm C. elegans (Fig. 2f) orphan proteins appear to be consistently more

In general, it is apparent that in most organisms orphans are more disordered than ancient proteins, while in yeast the opposite appears to be the case. What could possibly explain this difference? One possibility is that the more complex regulation in animals requires more disordered residues in comparison with yeast. But the average disorder content is similar in all eukaryotic species, contradicting this idea.

We noted that yeasts are among the genomes with the lowest GC content (∼40% in S. cerevisiae, 35% in C. glabrata). Therefore, we decided to examine the properties of proteins from different age groups with respect to their GC content.

With the exception of β-sheet frequency, the difference between orphans and ancient proteins for all the considered properties is statistically significant: the p-value of a rank-sum test (a non-parametric equivalent of the t-test) is always < 10^-11.

For proteins of all ages, low complexity (SEG) and predicted coil frequency increase with GC, while transmembrane, helix and sheet frequency decrease.

Notably, intrinsic disorder shows a clear positive dependency on GC: higher GC corresponds to more disorder. At the extreme (over 60% GC), more than 50% of the residues in orphan proteins are predicted to be disordered, while for ancient proteins the disorder fraction is about 30%. At low GC (below 40%) the fraction of disordered residues is lower and similar in ancient and orphan proteins (15-20%).

Further, the dependency on GC is clearly stronger for younger proteins, indicating that it is related to the creation of the protein and then gradually lost during evolution. To assess the significance of this dependency, we performed a linear regression test for each age group. The p-values of these tests are presented for orphans and ancient proteins in the boxes of Fig. 3. All the properties, with the exception of low complexity, show a p-value <0.01, indicating that they are significantly correlated with GC in both orphan and ancient proteins.

The GC content is not constant over a genome. In complex eukaryotic organisms, the global GC content is heavily determined by the GC composition of isochores: these regions of uniform GC form a mosaic in the genomes of many complex eukaryotes, and their maintenance is likely the result of natural selection [37]. In general, coding regions have higher GC than noncoding regions [38,39]. Further, there is also variation in GC between different regions of a genome, so when a noncoding region is turned into a gene, the local GC will determine the amino acid content of the protein.

Therefore, it might be more relevant to study the GC of each protein individually.
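Computing GC for each gene individually is straightforward; the sketch below shows one way to do it. The function name is hypothetical, and the decision to ignore ambiguous bases such as N is an assumption of this example rather than something stated in the text.

```python
def gc_content(cds):
    """Fraction of G/C among the unambiguous bases of a nucleotide sequence."""
    seq = cds.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "ATU")
    total = gc + at  # ambiguous symbols (e.g. N) are not counted
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return gc / total


if __name__ == "__main__":
    print(gc_content("ATGGCC"))  # 4 of the 6 bases are G or C
```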

A strong relationship between GC and structural properties of orphans.

In Fig. 4 we show the dependency of structural properties on GC content for individual proteins. In addition, structural properties of a set of proteins generated randomly at all GC levels are shown. Orphans and genus orphans, as well as random proteins, show a definite dependency on GC. In contrast, ancient and intermediate proteins are only loosely dependent on GC.

In general, there is a resemblance between orphans and randomly generated proteins. However, when studying Fig. 4 in more detail, a few notable differences between them can be observed: orphans are more disordered and contain more low-complexity regions but fewer sheets, independently of the GC level.

It should be recalled that what we describe above is based on predicted structural features that are an indirect reflection of the protein sequence. If a certain group of proteins is predicted to be more disordered, or to contain more sheets, it is quite likely a consequence of changes in amino acid frequencies.

Property scales

Next, we studied the proteins using six different amino acid propensity scales. The difference between the scales and predicted features is that scales describe general properties and are directly calculated from amino acid frequencies, while predicted properties can also include other features. For disorder we use the TOP-IDP scale [30], for hydrophobicity we use the biological hydrophobicity scale [31], while sheet, turn, coil and helix propensities are analyzed using the structure-based conformational preference scales [32].

Interestingly, the propensities of the two groups of older proteins also change with GC; however, this dependency is less pronounced than for orphan or random proteins. The difference seen between orphan and ancient proteins indicates that, given evolutionary time, the selective pressure to change the GC level is weaker than the selective pressure to change amino acid frequencies.
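A per-protein scale average of the kind used in this section can be sketched as below. The scale values shown are placeholders invented for the example; they are not the published TOP-IDP, hydrophobicity or conformational preference values, and the function name is hypothetical. Residues missing from the scale (for instance X) are skipped, which is an assumption of this sketch.

```python
def mean_propensity(protein, scale):
    """Average a per-residue propensity scale over a protein sequence."""
    values = [scale[aa] for aa in protein.upper() if aa in scale]
    if not values:
        raise ValueError("no residue of the sequence is covered by the scale")
    return sum(values) / len(values)


# Placeholder values for illustration only, NOT a published scale.
TOY_SCALE = {"P": 1.0, "G": 0.5, "A": 0.25, "I": -0.5, "F": -1.0}

if __name__ == "__main__":
    print(mean_propensity("PPGG", TOY_SCALE))
    print(mean_propensity("IFXX", TOY_SCALE))
```

Because the average depends only on amino acid composition, two proteins with the same composition receive the same score regardless of residue order, which is the sense in which scales are "directly calculated from amino acid frequencies".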

GC content affects the codon usage in the genome [40], and it has been argued that GC might be solely responsible for codon bias [41]. The difference in codon usage causes differences in amino acid frequencies, such that some amino acids are more frequent at higher GC in random protein sequences. Also, a preference for some amino acids might cause a change in GC simply through a higher frequency of certain codons. In younger proteins the correlation between amino acid frequency and GC is stronger than in older proteins, see Fig. 5. This indicates that the selective pressure to change amino acids in a protein is stronger than the one to change the GC content. At low GC, ancient proteins are more disordered than expected for random sequences, while at high GC they are less disordered.

The influence of GC on amino acid preferences and folding as well as their rareness.

In Fig. 7 and Table S2 the GC content of the codons of each amino acid is compared with the propensity of that amino acid to be in a certain structural region. Three amino acids, Ala, Gly and Pro, are "high GC" amino acids, i.e. they have more than 80% GC in their codons, while five amino acids, Lys, Phe, Asn, Tyr and Ile, are "low GC" amino acids, with less than 20% GC in their codons.
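The "high GC" and "low GC" classes above follow directly from the standard genetic code. As a check, the following sketch computes the fraction of G and C nucleotides across all codons of each amino acid; the variable and function names are chosen for this example, and only the standard code is considered.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_gc_by_amino_acid():
    """Fraction of G/C nucleotides over all codons of each amino acid."""
    totals = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        gc, n = totals.get(aa, (0, 0))
        totals[aa] = (gc + sum(nt in "GC" for nt in codon), n + 3)
    return {aa: gc / n for aa, (gc, n) in totals.items()}


gc_by_aa = codon_gc_by_amino_acid()
high_gc = sorted(aa for aa, frac in gc_by_aa.items() if frac > 0.8)
low_gc = sorted(aa for aa, frac in gc_by_aa.items() if frac < 0.2)

if __name__ == "__main__":
    print("high GC:", high_gc)  # A, G, P (Ala, Gly, Pro)
    print("low GC:", low_gc)    # F, I, K, N, Y (Phe, Ile, Lys, Asn, Tyr)
```

Ala, Gly and Pro each have 10 of 12 codon nucleotides as G or C (about 83%), whereas Lys, Phe, Asn and Tyr have 1 of 6 (about 17%) and Ile has 1 of 9 (about 11%).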

All three "high GC" amino acids are disorder-promoting (high TOP-IDP), and four out of five "low GC" amino acids are order-promoting (low TOP-IDP) residues.

Therefore, at high GC content, codons coding for hydrophilic, disorder-promoting amino acids are prevalent. Genes low in GC tend to contain codons for hydrophobic amino acids, associated with order.

We have studied the properties of proteins and their age in a large set of eukaryotic genomes. As shown before, orphan proteins are shorter than ancient proteins, but, surprisingly, we find that on average the young and old proteins are rather similar in other structural features. However, we also observe that the properties of the youngest proteins vary significantly with the GC content. At high GC the youngest proteins become more disordered and contain fewer secondary structure elements, while at low GC the reverse is observed. We show that these properties can be explained by changes in amino acid frequencies caused by the different amount of GC in different codons. The influence of this can be seen in the frequency of the amino acids that have a high or low fraction of GC in their codons, such as Pro.

In a random sequence, Pro represents less than 5% of the amino acids at 40% GC, but 10% at 60% GC. This agrees well with what is observed in orphan proteins: 5% at 40% GC vs. 9% at 60% GC, see Fig. 6. Similar changes in frequencies can be observed for several amino acids.
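The expected amino acid frequencies of random sequences can also be derived analytically rather than by sampling. The sketch below does so under an assumed codon model, in which each codon position is G or C with probability GC_freq/2 each and A or T with probability (1 - GC_freq)/2 each, renormalised over the 61 non-stop codons. This model and the function name are assumptions of the example and may differ in detail from the one used to generate the random ORFs.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_aa_frequencies(gc_freq):
    """Expected amino acid frequencies of random ORFs at a given GC level."""
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # stop codons are never selected
            continue
        w = 1.0
        for nt in codon:
            w *= gc_freq / 2 if nt in "GC" else (1 - gc_freq) / 2
        weights[aa] = weights.get(aa, 0.0) + w
    total = sum(weights.values())  # renormalise over the 61 sense codons
    return {aa: w / total for aa, w in weights.items()}


if __name__ == "__main__":
    for gc in (0.4, 0.6):
        print(gc, round(expected_aa_frequencies(gc)["P"], 4))
```

Under this assumed model, Pro comes out at roughly 4.3% at 40% GC and 9.3% at 60% GC, in line with the values quoted above for random sequences.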

On average, young proteins are more disordered than ancient proteins, but this property is strongly related to the GC content. In a low-GC genome the disorder content of an orphan protein is ∼30% while in a high-GC genome it is over 50%, see Fig. 3.

Here we show that GC content of a genome strongly affects the amino acid

Figure S4. For six selected species (two strains of S. cerevisiae, C. albicans, D. melanogaster, D. sechellia and C. elegans), intrinsic disorder (% of amino acids predicted as disordered by IUPred short) is shown as violin plots for proteins in the different age groups.

Figure S8. For six selected species (two strains of S. cerevisiae, C. albicans, D. melanogaster, D. sechellia and C. elegans), intrinsic disorder (% of amino acids predicted as disordered by GlobPlot) is shown as violin plots for proteins in the different age groups.