Protein Under-Wrapping Causes Dosage Sensitivity and Decreases Gene Duplicability

A fundamental issue in molecular evolution is how to identify the evolutionary forces that determine the fate of duplicated genes. The dosage balance hypothesis has been invoked to explain gene duplication patterns at the genomic level under the premise that a dosage imbalance among protein-complex subunits or interacting partners is often deleterious. Here we examine this hypothesis by investigating the molecular basis of dosage sensitivity. We focus on the extent of protein wrapping, which indicates how strongly the structural integrity of a protein relies on its interactive context. From this perspective, we predict that the duplicates of a highly under-wrapped protein or protein subunit should (1) be more sensitive to dosage imbalance and be less likely to be retained and (2) be more likely to survive from a whole-genome duplication (WGD) than from a non-WGD because a WGD causes little or no dosage imbalance. Our under-wrapping analysis of more than 12,000 protein structures strongly supports these predictions and further reveals that the effect of dosage sensitivity on gene duplicability decreases with increasing organismal complexity.


Introduction
Gene duplication is a primary source of new genes and increases genome complexity [1,2]. In recent years, the evolutionary forces influencing gene duplicability have been under intense study. In particular, the gene dosage balance hypothesis [3] has often been invoked to explain gene duplication patterns at the genomic level [4]. The dosage balance hypothesis states that an imbalance in the concentrations of the subcomponents of macromolecular complexes can be deleterious [3]. Although this notion was originally proposed in the context of protein complexes, it can be extended to other protein interaction partnerships [5]. If dosage imbalance is indeed deleterious, the outcome of a gene duplication event would largely depend on the immediate dosage sensitivity effect. While significant progress has been made in the last several years [4,6–9], the influence of dosage imbalance on the retention of gene duplicates remains poorly understood. So far, the most relevant studies on this topic have mainly focused on protein complex data or protein-protein interaction data, which have inherent limitations. First, such data represent the interacting context of a protein in an abstract way. For example, the potential dosage imbalance effect of protein subunits in a complex may crucially depend on their topological positions within the complex and on the complex-assembly pathway [5]. Second, and more importantly, there is a conceptual distinction between a priori plausible protein associations and obligatory associations required to preserve the structural integrity and functionality of the protein. Thus, even if the interacting context of a protein could be characterized by some measurements (e.g., protein connectivity or interacting surface), the potential imbalance effect would still be hard to assess. Lastly, most current protein interaction data are known to be noisy, plagued with both false positives and false negatives [10,11].
Recent advances in structural genomics and biophysics enable us to examine the dosage balance hypothesis in the light of the three-dimensional structure of proteins. In this regard, we focus on a specific attribute of protein structure, the so-called under-wrapping [12–17]. This attribute quantifies the extent to which the protein structure is reliant on the interactive context to maintain its integrity. In particular, overexpressing a highly under-wrapped protein can increase the propensity for aberrant misfolding and aggregation [16], promoting dosage sensitivity.
The under-wrapping parameter describes the solvent accessibility of the major determinants of protein structure: the backbone hydrogen bonds (Figure 1). Thus, in order for the structure to prevail and remain functionally competent, backbone hydrogen bonds must be "wrapped" by clusters of non-polar amino acid residues that exclude the surrounding water, thereby preventing the competing hydration of the paired polar groups. Since backbone hydration competes with structure retention, the intramolecular hydrogen bonds that are water-accessible, termed dehydrons [13], represent structural vulnerabilities. As a consequence, dehydrons promote binding partnerships with the concurrent exclusion of surrounding water, as needed to maintain the structural integrity of the protein [13,15,17]. The hydrogen-bond protection requirement poses a strong constraint on protein architecture and dictates that highly under-wrapped proteins, i.e., those with a large number of dehydrons, should be highly interactive [15] to maintain their structural integrity.
As shown in Figure 1, the wrapping extent can be accurately determined from reported structures deposited in the Protein Data Bank (PDB) [12]. As protein structures become more under-wrapped, they become more reliant on binding partnerships [15]. Thus, protein under-wrapping quantifies how strongly the structural integrity of a protein depends on its binding partners [13], thereby providing a vantage point from which to study the dosage imbalance effect.

Results/Discussion
From the above reasoning we predict that the probability of retention of gene duplicates in evolution (i.e., gene duplicability) should decrease with the extent of hydrogen-bond under-wrapping of the polypeptide encoded by the gene. To test this prediction, we compiled non-redundant proteins with PDB-reported structures, calculated the under-wrapping extent for each protein (subunit), and determined the duplicability (m, the gene family size) for the corresponding gene. Interestingly, in all six organisms studied (Escherichia coli, yeast, worm, fly, human and thale cress), we found a negative correlation between protein under-wrapping extent and gene duplicability (Figures 2A-2C and S1).
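A negative relationship of this kind can be illustrated with an ordinary least-squares fit of under-wrapping on family size. The sketch below is only an illustration: the function name `ols_slope_and_r`, the choice of m as the independent variable, and the toy numbers are all assumptions, not the data or code used in the study.

```python
from statistics import mean


def ols_slope_and_r(xs, ys):
    """Return (slope, intercept, Pearson r) for the fit ys ~ intercept + slope * xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Invented toy values: gene family size m and under-wrapping extent (%).
family_size = [1, 1, 2, 2, 3, 4, 6, 8]
under_wrapping = [58.0, 52.0, 47.0, 45.0, 41.0, 38.0, 37.0, 35.0]

slope, intercept, r = ols_slope_and_r(family_size, under_wrapping)
print(f"slope = {slope:.2f} % per family member, r = {r:.2f}")
```

A negative slope corresponds to the predicted trend; the magnitude of the slope is one rough way to compare the strength of the dependence across species.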
Since it has been shown that genes with particular biological functions tend to duplicate in evolution [18–20], we examined the potential influence of functional bias on our results. We compared the under-wrapping extent of yeast singletons with that of duplicates in different functional categories and found that singletons are consistently more under-wrapped than duplicates in each functional category (Figure S2). This result indicates that the effect of protein wrapping on gene duplicability is independent of the previously known functional bias of gene duplication.
Our study reveals a universal negative effect of protein under-wrapping on gene duplicability in a variety of species, strongly supporting the dosage balance hypothesis. The decreasing tendency is most significant from m = 1 to 4 and becomes less obvious at higher duplicability. However, the strength of the dependence between the two variables varies considerably among species: the negative correlation is quite strong in simple organisms such as E. coli and yeast, but becomes weak in complex organisms such as humans. To perform a more rigorous comparison, we used linear regression to roughly capture the dependence between protein under-wrapping and gene duplicability. As shown in Figure 2D, as organismal complexity increases, the effect of protein under-wrapping on gene duplicability decreases, that is, E. coli > yeast > worm > fly ≈ human ≈ thale cress, suggesting a less important role of the dosage imbalance effect in complex organisms. To further understand this intriguing trend, we examined the per-gene-family protein under-wrapping distributions in different species. As shown in Figure 3, E. coli and yeast proteins have relatively broad under-wrapping distributions, while human proteins show a narrow distribution mainly from 35% to 55%. There are fewer well-wrapped proteins (<35%) in humans, implying that most human proteins need binding partners to maintain the integrity of their functional structure. On the other hand, unicellular species appear to

Figure 1. The extent of wrapping of a single intramolecular hydrogen bond. This parameter defines the solvent-exposure extent of the bond. The hydrogen bond is mainly an electrostatic interaction between opposite partial charges in the amide and carbonyl groups of the paired residues. A desolvation domain defines the local microenvironment of the hydrogen bond and is depicted as the union of two spheres centered at the α-carbons of the paired residues. The outer boundaries of the desolvation balls are indicated by magenta circles. The solid black disks represent non-polar carbonaceous groups on the residue side chains. These non-polar groups "wrap" the bond by excluding surrounding water, thereby protecting the structure from the competing hydration of the polar amide and carbonyl groups. The solid blue dots represent the α-carbons on the protein backbone, which in turn is depicted by curved blue lines. The extent of wrapping (q) is defined as the number of non-polar groups in the desolvation domain. Thus, an under-wrapped hydrogen bond, or dehydron, is one whose wrapping is insufficient, as statistically defined in Methods.

Author Summary
A gene duplication provides an extra gene copy that is free to accumulate mutations and gain a new function. Gene duplication therefore plays a very important role in evolution. However, the presence of an additional gene copy can sometimes be deleterious because it can lead to an excessive dosage relative to those of its interacting partners. This dosage imbalance effect in turn influences the fate of duplicated genes in evolution. Our study gives, to our knowledge, the first description of the molecular/structural basis for the dosage imbalance effect. We study the relationship between gene family size and the extent of protein under-wrapping, a molecular quantifier of the reliance of the protein on binding partnerships to maintain structural integrity, indicative of the extent of structure protection from disruptive hydration. Using more than 12,000 protein three-dimensional structures from six organisms that range from bacteria to human, we show an inverse relationship between the extent of protein under-wrapping and family size. That is, a duplication is unlikely to be tolerated if the protein is highly under-wrapped (i.e., its structure requires substantial stabilizing interactions with other proteins). We also show that the effect of dosage imbalance is more apparent in unicellular organisms but is buffered to some extent in higher eukaryotes.

[17]. However, the contrasting distributions between complex and simple organisms are hard to interpret, due to the staggering difference at the proteome level.

Duplicated genes can arise from either whole-genome duplication (WGD) or non-WGD (including individual or segmental duplication) [21]. In a WGD, every gene in the genome is duplicated at the same time, so that binding partnerships are also duplicated, leaving less chance of dosage imbalance than in a non-WGD.
Thus, an interesting prediction stemming from the dosage balance hypothesis is that duplicates of highly under-wrapped proteins would be more likely to survive from a WGD than from a non-WGD event. Since the duplication history of yeast genes has been largely elucidated [22], we decided to test this prediction using yeast duplicates with m = 2. We classified the yeast duplicates into two groups: one group from WGD and the other from non-WGD. By performing the analysis conditioned on the same m, the under-wrapping difference between the two groups should mainly be determined by the underlying duplication mechanisms. We found that the under-wrapping extent in WGD duplicates is significantly higher than that in non-WGD duplicates (Figure 4A; N_WGD = 51, N_non-WGD = 56; two-tailed Wilcoxon rank test, p < 8 × 10⁻¹⁰), implying that the dosage imbalance effect was indeed relaxed in the WGD. Again, we examined this trend in different functional categories and found that the WGD duplicates are consistently more under-wrapped than the non-WGD duplicates in each category (Figure 4B).
In higher eukaryotes, a considerable number of highly under-wrapped proteins are associated with highly duplicated genes, suggesting that complex organisms are less sensitive to the dosage imbalance effect. This can possibly be attributed to several factors. First, complex organisms may have more efficient systems to adjust gene expression levels (e.g., chaperones, proteases and non-coding RNAs). It has been shown that in cultured cells more than 60% of human promoter polymorphisms cause more than two-fold differences in gene-expression level [23]. Second, widespread alternative splicing in higher eukaryotes may play an important role in correcting the imbalance effect, since different splicing variants might represent an "escape route" to avoid dosage imbalance. Third, it has been suggested that proteins tend to physically interact with similar partners, especially with their own duplicates [24]. Complex organisms may have higher allostery (i.e., dimerization or oligomerization), which can partly alleviate dosage imbalance. Fourth, complex organisms generally have a smaller effective population size than do simple organisms [25], so that a duplicate bearing a slightly deleterious dosage imbalance effect would have a better chance of being fixed in the population, thereby allowing a longer time for functional innovation. Last but not least, adaptation (positive selection) due to functional diversification may have played an important role in determining the retention of duplicated genes in complex organisms [26,27] (e.g., MHC genes in mammals [28]).
In summary, we have identified protein under-wrapping as a molecular basis of dosage sensitivity. An imbalance-generating duplication becomes less tolerable if the protein is severely under-wrapped and therefore requires substantial stabilizing interactions with other proteins. Indeed, the extent of under-wrapping in a protein can be used as an approximate predictor of the strength of the effect of dosage imbalance on gene duplicability. The prediction can be made more broadly and precisely in the future when more data on protein structures, especially on protein complexes, become available.

Methods
Computing the extent of protein under-wrapping. For each of the six organisms under study, we constructed a set of non-redundant genes with at least one PDB representative structure. From the reported structure we calculated the extent of protein under-wrapping by determining the ratio of the number of insufficiently wrapped hydrogen bonds (dehydrons) to the total number of backbone hydrogen bonds in the structure. The dehydron identification from reported protein structure follows the protocol detailed in Chen et al. [12]. Together, our dataset includes 822 E. coli genes, 476 yeast genes, 29 worm genes, 94 fly genes, 2,275 human genes and 168 thale cress genes, for which we have both gene duplicability and protein structural data.
The extent of hydrogen-bond wrapping, q, measures the number of non-polar groups contained within a desolvation domain defined as two intersecting balls of fixed radius (≈ the thickness of three water layers) centered at the α-carbons of the residues paired by the amide-carbonyl hydrogen bond (Figure 1). In this study we adopted r = 5.7 Å; although the wrapping statistics on hydrogen bonds vary with this radius, the tails of the distribution remain invariant, thus enabling a unique identification of dehydrons. An across-PDB analysis reveals that hydrogen bonds are wrapped on average by q = 24.3 ± 4.8 non-polar groups for a desolvation radius of 5.7 Å. Being insufficiently wrapped, dehydrons lie in the tails of the distribution, i.e., their desolvation microenvironment contains 19 or fewer non-polar groups, so that their q value is below the mean minus one Gaussian dispersion [12,15]. Thus, the overall under-wrapping of a protein is computed by determining the percentage of intramolecular hydrogen bonds with q ≤ 19. This criterion for identifying a dehydron fits the well-defined ansatz used to assess the wrapping statistics, which places dehydrons within the 8% of most under-wrapped hydrogen bonds irrespective of the desolvation radius adopted [13–17]. Hence, the criterion is justified by the robustness of the results to variations in the assessment of the bond microenvironment.
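The counting rule described above can be sketched in a few lines. This is a minimal illustration, not the protocol of Chen et al. [12]: it assumes that backbone hydrogen bonds and non-polar groups have already been identified and reduced to 3D coordinates, that points on the sphere boundary count as inside, and the function names (`wrapping`, `under_wrapping_percent`) and toy coordinates are hypothetical.

```python
import math

DESOLVATION_RADIUS = 5.7  # angstroms, the radius adopted in the text
DEHYDRON_MAX_Q = 19       # a bond with q <= 19 is treated as a dehydron


def wrapping(ca_i, ca_j, nonpolar_groups, radius=DESOLVATION_RADIUS):
    """Count non-polar groups inside the union of two balls centered
    at the alpha-carbons ca_i and ca_j of the paired residues."""
    return sum(
        1
        for g in nonpolar_groups
        if math.dist(g, ca_i) <= radius or math.dist(g, ca_j) <= radius
    )


def under_wrapping_percent(hbonds, nonpolar_groups):
    """hbonds: list of (ca_i, ca_j) coordinate pairs, one per backbone
    hydrogen bond. Returns the percentage of bonds that are dehydrons."""
    if not hbonds:
        raise ValueError("structure has no backbone hydrogen bonds")
    qs = [wrapping(a, b, nonpolar_groups) for a, b in hbonds]
    dehydrons = sum(1 for q in qs if q <= DEHYDRON_MAX_Q)
    return 100.0 * dehydrons / len(qs)


# Invented toy geometry: one poorly wrapped bond and one well-wrapped bond.
sparse = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
dense = [(100.0 + 0.1 * k, 0.0, 0.0) for k in range(20)]
groups = sparse + dense
bonds = [
    ((0.0, 0.0, 0.0), (6.0, 0.0, 0.0)),      # only three groups nearby
    ((100.0, 0.0, 0.0), (104.0, 0.0, 0.0)),  # twenty groups nearby
]
print(under_wrapping_percent(bonds, groups))
```

A group lying inside both balls is counted once, because the desolvation domain is the union of the two balls rather than their sum.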
The under-wrapping variation of a protein generated by structural differences in reported PDB entries is less than 8.8%. This variability arises from the different structural adaptations (induced fits) adopted by the protein in different crystallized complexes or from differences between uncomplexed protein structure in solution (often determined by NMR) and crystal structure. To account for such differences, the under-wrapping extent for each gene is typically averaged over all its PDB representations (Text S1). We obtained per-gene-family under-wrapping distributions by averaging the under-wrapping values among members within a gene family whenever available.
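The two averaging steps can be sketched as follows. The gene and family identifiers, the input layout, and the function names are hypothetical, and a plain arithmetic mean is assumed at both levels; the exact procedure is described in Text S1.

```python
from collections import defaultdict
from statistics import mean


def per_gene_average(entries):
    """entries: iterable of (gene, under_wrapping_percent), one per PDB entry.
    Returns the mean under-wrapping per gene over all its PDB representations."""
    by_gene = defaultdict(list)
    for gene, value in entries:
        by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


def per_family_average(gene_values, gene_to_family):
    """Average per-gene values over the family members that have structural data."""
    by_family = defaultdict(list)
    for gene, value in gene_values.items():
        by_family[gene_to_family[gene]].append(value)
    return {family: mean(values) for family, values in by_family.items()}


# Invented toy input: geneA has two PDB representations.
pdb_entries = [("geneA", 40.0), ("geneA", 44.0), ("geneB", 50.0), ("geneC", 30.0)]
families = {"geneA": "fam1", "geneB": "fam1", "geneC": "fam2"}

gene_values = per_gene_average(pdb_entries)
family_values = per_family_average(gene_values, families)
print(gene_values)
print(family_values)
```

Averaging per gene first keeps a heavily crystallized protein from dominating its family simply because it has more PDB entries.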
In this study, the wrapping computations involved more than 12,000 protein structures because a large fraction of the non-redundant proteins examined had various PDB representations with differences arising from the following sources: complexation diversity, level of structure resolution, NMR conformational diversity and high B-factors in the crystal (Text S1). The under-wrapping data obtained in our study are given in Tables S1-S6.
Yeast WGD versus non-WGD duplicates analysis. We obtained WGD gene duplicate pairs from Kellis et al. [22]. Because the underlying distributions are not normal, we used the two-tailed Wilcoxon rank test to determine whether the distributions of protein under-wrapping differ between WGD and non-WGD duplicates. We used the GO term analysis tools [31] to map yeast genes onto the GO terms in the default GO slim file.
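For readers unfamiliar with the test, a two-tailed Wilcoxon rank-sum comparison of two groups can be sketched as below. This is an assumption-laden illustration rather than the software used in the study: it uses the normal approximation with a tie correction and no continuity correction, the function name `rank_sum_test` is hypothetical, and the sample values are invented.

```python
import math


def rank_sum_test(a, b):
    """Two-tailed Wilcoxon rank-sum test (normal approximation, tie-corrected).
    Returns (U statistic for sample a, z score, two-tailed p-value)."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the values, remembering which sample (0 = a, 1 = b) each came from.
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        midrank = (i + 1 + j) / 2.0    # average of ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        raise ValueError("all pooled values are identical")
    z = (u - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u, z, p


# Invented toy under-wrapping values (%) for the two groups of duplicates.
wgd = [52.0, 49.5, 55.1, 47.8, 51.3, 53.9]
non_wgd = [41.2, 44.0, 39.5, 46.1, 42.7, 40.3]
u, z, p = rank_sum_test(wgd, non_wgd)
print(f"U = {u}, z = {z:.2f}, p = {p:.4f}")
```

For samples as small as this toy example an exact test would be preferable; the normal approximation is more appropriate at group sizes of about fifty, as in the yeast comparison.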