Table 1.
For the four age groups, the fraction of proteins annotated in Gene Ontology (GO) is shown.
In parentheses is the fraction of proteins with an experimentally validated annotation (i.e. a GO evidence code equal to ‘EXP’, ‘IDA’, ‘IPI’, ‘IMP’, ‘IGI’ or ‘IEP’).
Fig 1.
Overview of the proteins assigned to the four age groups: (a) the fraction of proteins belonging to each age group, (b) the average length in amino acids, (c) the average GC content of the genes, (d) intrinsic disorder predicted by IUpred (long), (e) percentage of transmembrane residues, (f) percentage of residues in low-complexity regions, and percentage of residues predicted to be in (g) a coil, (h) a β-sheet and (i) a helix.
The difference between orphans and ancient proteins is statistically significant for all the considered properties: the p-value of a rank-sum test is always below 10⁻¹⁴¹.
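A two-sided rank-sum comparison of the kind referred to here can be sketched as follows. This is a minimal illustration using the normal approximation without tie correction, not necessarily the exact procedure used for the figure; the function name `rank_sum_test` and its inputs are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    x, y: sequences of numbers, e.g. a property measured in orphan and in
    ancient proteins. Returns (z, p_value).
    """
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value
```

A negative `z` means the first sample tends to have lower values than the second. For real analyses a statistics library would normally be preferred over this hand-written version.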
Table 2.
For the 187 species considered, the number of species in which a property is significantly higher (increasing) or significantly lower (decreasing) in orphans than in ancient proteins is shown.
Fig 2.
For six selected species ((a,b) two strains of S. cerevisiae, (c) C. albicans, (d) D. melanogaster, (e) D. sechellia and (f) C. elegans), intrinsic disorder (% of amino acids predicted as disordered by IUpred long) is shown as violin plots for proteins in the different age groups.
Fig 3.
Structural properties of proteins of different ages plotted against the GC content of the genome (coding regions).
For clarity, only the ancient (blue) and orphan (red) proteins are shown individually, but the fitted linear regression lines for genus orphans (pink) and intermediate proteins (light blue) are also shown. The text box presents three values: the p-value of a rank-sum test of orphans versus ancient proteins (considering only the property on the y axis), and the p-values of the linear regression tests for orphan and for ancient proteins.
Fig 4.
Running averages of predicted structural properties against GC content: (a) disorder, predicted by IUpred (long); (b) low complexity, predicted by SEG; (c) percentage of transmembrane residues, predicted by Scampi; (d,e,f) percentage of residues in coil, beta sheet and alpha helix secondary structure, respectively.
For each property, colored lines represent proteins of different age: orphans (red), genus orphans (pink), intermediate (light blue) and ancient (blue). The black lines represent randomly generated proteins at different GC frequencies.
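A running average of a property against GC content can be sketched as below. The window width, step size and the function name `running_average` are assumptions made for illustration; the caption does not state the smoothing parameters actually used.

```python
def running_average(gc, values, window=0.05, step=0.01):
    """Mean of `values` in a sliding GC window.

    gc:     GC content of each protein's gene, as fractions between 0 and 1
    values: the structural property of each protein (same order as `gc`)
    window: full width of the GC window (assumed value)
    step:   distance between successive window centres (assumed value)
    Returns a list of (window_centre, mean_value); empty windows are skipped.
    """
    pairs = sorted(zip(gc, values))
    lowest, highest = pairs[0][0], pairs[-1][0]
    n_steps = int(round((highest - lowest) / step))
    curve = []
    for k in range(n_steps + 1):
        centre = lowest + k * step
        in_window = [v for g, v in pairs if abs(g - centre) <= window / 2]
        if in_window:
            curve.append((centre, sum(in_window) / len(in_window)))
    return curve
```

Calling the function once per age group (orphans, genus orphans, intermediate, ancient) would give one curve per colored line.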
Fig 5.
Running averages of structural properties computed from amino acid scales against GC content: (a) Intrinsic Disorder Propensity (TOP-IDP); (b) hydrophobicity (Hessa scale); (c,d,e,f) average propensity for turn, coil, beta sheet and alpha helix secondary structure, respectively.
For each property, colored lines represent proteins of different age: orphans (red), genus orphans (pink), intermediate (light blue) and ancient (blue). The black lines represent randomly generated proteins at different GC frequencies.
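One way to obtain a random baseline of the kind drawn as black lines is sketched below: codons are built from bases drawn independently at a chosen GC frequency, translated with the standard genetic code, and the resulting sequence is averaged over an amino acid scale. The sampling scheme (independent bases, stop codons skipped) and the names `random_protein` and `mean_propensity` are assumptions, not the documented procedure; any scale passed in is supplied by the caller.

```python
import random

# Standard genetic code, bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def random_protein(length, gc, rng):
    """Translate random codons whose bases are G or C with probability `gc`."""
    weights = [(1 - gc) / 2, gc / 2, (1 - gc) / 2, gc / 2]  # T, C, A, G
    residues = []
    while len(residues) < length:
        codon = "".join(rng.choices(BASES, weights=weights, k=3))
        amino_acid = CODON_TABLE[codon]
        if amino_acid != "*":  # assumption: stop codons are simply skipped
            residues.append(amino_acid)
    return "".join(residues)


def mean_propensity(sequence, scale):
    """Average scale value over the residues of `sequence` found in `scale`."""
    values = [scale[aa] for aa in sequence if aa in scale]
    return sum(values) / len(values)
```

Generating many such sequences at each GC frequency and averaging `mean_propensity` over them would give one baseline point per GC value.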
Fig 6.
The relationship of each amino acid frequency with the GC content and age of the protein.
A black line represents the expected values. The amino acids are sorted by the GC content in their codons.
Fig 7.
The fraction of GC in all codons encoding an amino acid is plotted as a dotted line and the values for the different propensity scales as filled bars: (a) TOP-IDP, (b) Hessa transmembrane scale, and (c-f) Koehl secondary structure preference scale.
For each scale the Pearson (R) correlation with GC is also shown.
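The GC fraction of the codons encoding each amino acid, and its Pearson correlation with a propensity scale, can be computed along these lines. The sketch assumes the standard genetic code with every codon weighted equally (no codon-usage weighting); the function names are hypothetical and the scale values must be supplied by the user.

```python
import math

# Standard genetic code, bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_gc_fraction():
    """Fraction of G or C bases over all codons encoding each amino acid."""
    totals = {}
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            for k, c in enumerate(BASES):
                amino_acid = AMINO[16 * i + 4 * j + k]
                if amino_acid == "*":  # ignore stop codons
                    continue
                gc_count = sum(base in "GC" for base in a + b + c)
                g, n = totals.get(amino_acid, (0, 0))
                totals[amino_acid] = (g + gc_count, n + 3)
    return {aa: g / n for aa, (g, n) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient R of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

With `fractions = codon_gc_fraction()`, `sorted(fractions, key=fractions.get)` orders the amino acids by codon GC content, and `pearson([fractions[aa] for aa in order], [scale[aa] for aa in order])` gives R for a user-provided `scale` dictionary.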