Role of Duplicate Genes in Robustness against Deleterious Human Mutations

It is now widely recognized that robustness is an inherent property of biological systems [1],[2],[3]. The contribution of close sequence homologs to genetic robustness against null mutations has been previously demonstrated in simple organisms [4],[5]. In this paper we investigate in detail the contribution of gene duplicates to back-up against deleterious human mutations. Our analysis demonstrates that the functional compensation by close homologs may play an important role in human genetic disease. Genes with a 90% sequence identity homolog are about 3 times less likely to harbor known disease mutations compared to genes with remote homologs. Moreover, close duplicates affect the phenotypic consequences of deleterious mutations by making a decrease in life expectancy significantly less likely. We also demonstrate that similarity of expression profiles across tissues significantly increases the likelihood of functional compensation by homologs.


Introduction
The ability of an organism to survive in various environmental conditions indicates robustness to external perturbations. On the other hand, relative insensitivity to harmful genetic mutations represents genetic robustness. Several large scale gene deletion studies demonstrated that organisms exhibit a significant degree of genetic robustness against null mutations [6]. Although these studies have an important caveat that genes without a detectable phenotype may be essential under different growth conditions [7,8], it is clear that genetic robustness is widespread in biological systems [3].
Two distinct mechanisms of genetic robustness have been extensively discussed. Alternative signaling or parallel metabolic pathways illustrate network contributions to genetic robustness [9]. In contrast, a partial functional overlap between sequence paralogs represents the contribution of gene duplicates. The study by Gu et al. [4] demonstrated a significant contribution to functional compensation by duplicate yeast genes. A similar pattern of the functional compensation was also observed in C. elegans [5]. The mechanism of genetic robustness by duplicates was recently investigated by Kafri et al. [10], who showed that null deletions in yeast are often compensated by over-expression of sequence homologs.
The role and magnitude of the paralog contribution to robustness against deleterious human mutations are not currently well understood. While the study by Lopez-Bigas et al. [11] suggested a contribution by highly conserved paralogs, Yue et al. [12] showed recently that disease and all genes have an equal fraction of paralogs. In the present work, we demonstrate the importance of considering the sequence similarity between paralogs for understanding the likelihood and magnitude of functional compensation. We also explore the effects of mRNA coexpression between duplicates on the observed functional back-up. Understanding the mechanisms of genetic robustness will be important for identification and prioritization of medically important human mutations.

Disease and all gene sets
We investigated the functional compensation by duplicates using three curated collections of human disease genes. Although we currently do not know the total number of disease genes, more than a thousand genes with known mutations affecting human health have been identified [13]. First, we used the collection of 1003 Swiss-Prot [14] human genes with non-synonymous disease mutations annotated in the OMIM database [13]. Second, we investigated the collection of 1609 human genes from the OMIM Morbid Map annotated to be involved in disease, but not as susceptibility or non-disease. Our third disease gene set, obtained from the study by Jimenez-Sanchez et al. [15], included a curated collection of 881 human genes and the associated disease phenotypes such as the age of onset and reduction in life expectancy. The considered disease gene sets significantly overlap, i.e. 636 genes are present in all three sets (see Figure S1, Supporting Information).
Without a collection of human genes which are certainly nondisease, we used several large collections of all human genes (all gene sets). We primarily used the comprehensive collection of 20,262 human genes from the Ensembl build 35 [16]. As a representative set of well-characterized human genes, we also considered the collection of 12211 human genes from the Swiss-Prot database [14].

The effects of duplicate sequence homology
To understand the role of gene duplicates in robustness against deleterious human mutations we searched for homologs of the disease and all human genes using protein BLASTP [17] (see Methods). Briefly, for each query sequence its closest human paralog was identified as the non-self hit which can be aligned over more than 80% of the length of both sequences. The sequence hits with an E-value larger than 0.001 were not considered (results are qualitatively insensitive to the gene set used or the cutoffs and parameters applied in the similarity searches, see Table S1-S3, Supporting Information). For the human genes with identified paralogs (475 in the disease gene set and 8257 in the all-gene set), the distributions of amino acid sequence identities of the closest homologs are significantly different for disease and all-gene sets (see Figure S2, Supporting Information). The average identity of the closest homolog is 52.9% for disease genes and 58.3% for all genes (non-parametric Wilcoxon's test P = 1.6*10 27 ). The observed difference cannot be explained simply by the existence of several large protein families with a small number of known disease genes; after removing sequences with more than one paralog in the human genome, the average identity of the closest homolog is 50.0% for disease genes and 54.3% for all genes (P = 2*10 22 ). Neither can the difference arise due to difficulties in disambiguating allelic variants from close sequence differences in copy number variable genes [18,19]. After excluding genes with highly similar paralogs of sequence identity greater than 90%, the average identity of the closest paralog is 51.4% for disease genes and 54.4% for all genes (P = 7*10 24 ).
In Figure 1 we show the conditional probability that a human gene will harbor a disease mutation given the amino acid sequence identity of its closest homolog. To calculate the conditional probability (see Methods) we assume that, although the total number of human disease genes is not known, the currently available collection of disease genes is unbiased towards sequence identities of the closest homologs. Figure 1 demonstrates that genes with at least 90% sequence identity to their closest homologs are three times less likely to harbor disease mutations compared to genes with remote paralogs. No correlation was observed between the number of disease mutations in a gene (Spearman's rank correlation r S = 20.025, P = 0.6) or gene density of disease mutations (r S = 20.036, P = 0.4) and the sequence identity of the closest homolog. This suggests that the number of disease mutations identified in genes may be determined primarily by experimental, mutational, or gene history biases [20], and not affected by the possibility of functional compensation. Similarly, no correlation between deleterious variability and evolutionary distance to murine orthologs was observed in the study by Sunyaev et al. [21].
If close sequence homologs provide functional back-up against medically damaging mutations, it is likely that they also contribute to relaxation of constraints against deleterious human polymorphisms. As was demonstrated by Lynch et al. [22], most duplicated genes experience a brief period of relaxed selection after duplication. The functional constraints on human genes can be estimated through the normalized ratio of non-synonymous to synonymous single nucleotide polymorphisms (SNPs) per site (Ka/ Ks) [17,23]. A small value of the Ka/Ks ratio suggests a higher constraint on a gene, i.e. a smaller fraction of observed nonsynonymous polymorphisms. Figure 2 shows the relationship between the average Ka/Ks ratio and sequence identity to the closest homolog (shown separately for all and validated SNPs from the dbSNP database [24]). The Ka/Ks ratio of the validated SNPs is about two times higher for genes with a 90% sequence identity homolog compared to genes with remote homologs.
While there are many examples of homologous iso-enzymes providing functional compensation [7,25], this mechanism is less established for other functional classes. To understand the significance of the duplicate compensation among various functional categories we (applied the approach described in the previous section and) compared sequence identities of closest paralogs for disease genes and all human genes in the 53 ''GO slim'' functional classes. Using a false discovery rate of 5%, we found that, in additional to metabolism, the functional category ''response to stimulus'' showed evidence of statistically significant compensation by duplicates (see Table 1 and Table S4, Supporting Information); the ''response to stimulus'' category contains cytokines, receptors, protein kinases and other proteins involved in signal transduction. Consequently, functional compensation by duplicates is not limited to metabolism and is also significant among other important functional classes.
The observed paucity of close homologs for known disease genes could be a consequence of their faster evolution in comparison with all human genes. To investigate this possibility we analyzed Ka and Ka/Ks values calculated using PAML [26] for all 13055 one-to-one human-mouse orthologous pairs from the Ensembl database [27]. Both Ka and Ka/Ks measures for known disease genes are significantly lower than those of all-gene set (mean/median Ka: disease 0.0729/0.0833, all 0.0851/0.0971, P = 4*10 22 ; mean/median Ka/Ks: disease 0.119/0.105, all 0.137/0.113, P = 1*10 22 .). These findings are in agreement with the study by Kondrashov et al. [28] who considered 1273 disease genes and 16580 other human genes. Although the earlier study by Smith and Eyre-Walker [29] reported the opposite pattern (a higher Ka/Ks ratio for disease genes), their results were based on significantly smaller gene sets (387 disease and 2024 non-disease genes). Consequently, it is unlikely that the elevated sequence similarity between paralogs of non-disease genes is related to their slower rate of evolution.
Recently, He et al. demonstrated a lower duplicability of ''important'' yeast genes (essential genes and genes with knockout phenotypes) [30]. To explore the possibility that lower duplicability of disease genes affects our results we followed the approach by He et al. [30]. Based on the Ensembl database [27] we identified singleton human genes (genes without duplicates in the human genome, see Methods) with mouse, chicken, and zebrafish orthologs. We then looked at whether the orthologs of singleton

Author Summary
Genetic robustness is the ability of an organism to buffer deleterious genetic mutations. It has been previously demonstrated that the functional compensation by duplicates plays an important role in protection against gene deletions in model organisms. Close duplicates often share similar functions, and loss of one paralog may be buffered by others. In the present work we specifically investigate the contribution of gene duplicates to backup against deleterious human mutations. We find that genes with close homologs are significantly less likely to harbor known disease mutations compared to genes with remote homologs. In addition, close duplicates affect the phenotypic consequences of deleterious mutations by making a decrease in life expectancy less likely. Similarity of expression profiles across tissues increases the likelihood of functional compensation by homologs. Taken together, our analysis demonstrates that functional compensation by close duplicates plays an important role in human genetic disease.
human genes have duplicated in the mouse, chicken, and zebrafish genomes (see Text S1, Supporting Information). The analysis showed that singleton disease genes are as likely to have duplicate orthologs as all human singleton genes (9.2% of 338 disease singletons and 8.5% of 5657 all human singletons, x 2 -test P = 0.5. See Figure S3, Supporting Information). Therefore, human disease genes are as likely to retain duplicates in evolution as all human genes.

Phenotypic consequences of mutations
The sequence identity between duplicates influences the phenotypic consequences of gene deletions in yeast [4]. As the sequence identity decreases, null mutations with weak growth phenotypes become less likely and mutations with strong growth phenotypes become more likely. Inspired by this analysis, we decided to investigate if duplicates also affect phenotypic consequences of human disease mutations. For that purpose we used the collection of human disease genes with manually curated phenotypes [15]. While we did not detect a significant correlation between the presence of close duplicates and the age of onset, the population frequency, or the mode of inheritance, we found a significant correlation between the sequence identity to the closest duplicate and the reduction in life expectancy (Spearman's rank correlation r S = 20.21, P = 2*10 26 , x 2 -test, P = 2*10 24 see Figure 3 and Methods). Consequently, the functional compensation by close duplicates may protect against ''mild'', ''moderate'', and ''severe'' decline in life expectancy.
Several known examples illustrate this interesting result. Mutations in red-sensitive opsin gene cause partial colorblindness (OMIM#303900). Nevertheless, the life expectancy is not seriously affected due to the presence of the green-sensitive opsin gene (close homolog of the red-sensitive gene). Another example involves several homologous iso-enzymes of the human glycogen phosphorylase; the three iso-enzymes are primarily active in muscle, liver, and brain. Although defects in the muscle and liver forms cause glycogen storage disease V (MIM#232600) and VI (MIM#232700) respectively, neither of the defects reduces life expectancy.

The effect of expression profile similarities
Because gene duplicates often have different patterns of expression [25,31,32], it is likely that the functional compensation depends not only on the sequence similarity, but also on the similarity of their expression profiles across human tissues. We decided to test this hypothesis using the comprehensive expression dataset by Su et al. [33], which includes expression of 44775 human transcripts in 79 tissues.
Initially, we used the absolute values of gene expression in different tissues to calculate the relative expression difference between every gene and its closest sequence homolog. The relative expression difference was defined as (Exp(Gene)2Exp(Paralog)/ (1/2*((Exp(Gene)+Exp(Paralog)). Using this measure we did not find any significant differences between disease and all genes (P = 0.1). It is likely that each gene is expressed primarily in a small number of tissues and the simple averaging of expression values across all tissues will not be informative. Therefore, in order to better reflect the observed expression patterns, we considered a gene to be expressed in a tissue if at least one of the gene  . The relationship between the sequence identity of the closest homolog and the ratio of non-synonymous to synonymous human SNPs per site (Ka/Ks). The Ka/Ks ratio was averaged for genes within each sequence identity bin. The ratio is shown for all (black) and only for validated (red) SNPs from the dbSNP database [24]. Above 60% sequence identity, the Ka/Ks ratio increases monotonically as the homolog sequence identity increases. doi:10.1371/journal.pgen.1000014.g002 transcripts was found to be significantly expressed (''present call'') in the tissue by Su et al. [33]. We defined Similarity of Tissue Expression (STE) for a gene pair as the ratio of the number of tissues where the two genes are both expressed to the number of tissues where at least one of the genes is expressed; STE is essentially the Jaccard's coefficient of similarity for binary expression patterns. The STE value of one would indicate complete overlap between expression profiles, while values close to zero would indicate poor overlap. Since expression profile similarity and sequence similarity of duplicates tend to be correlated [25,31,32], we demonstrated (see Figure 4 and Table  S5, Supporting Information) that the STE values are consistently lower for disease gene pairs in different sequence bins; the differences are significant for sequence identity bins from 30% to 80%. We also performed the likelihood ratio test to show that the similarity in tissue expression influences the probability of being a disease gene independently of the sequence identity to the closest homolog (likelihood ratio test x 2 = 4.0, P,0.05, see Methods).

Conclusions
Our analysis clearly demonstrates that gene duplicates affect the phenotypic consequences of deleterious human mutations. Several studies suggested possible mechanisms of functional back-up by duplicates [4,9,10,34]. It is likely that similar mechanisms also play a role in human genetic diseases. In some cases duplicates might    actively compensate for the mutated homolog, for example by partially carrying the metabolic flux of the mutated gene [25]. In other cases, genes with close duplicates may have smaller functional loads compared to singletons, i.e. genes with duplicates may be essential in a smaller number of environmental conditions [7]. As a result, a disease phenotype is less likely to be observed. We take the view that both of these cases represent functional compensation, although it may be called active compensation in the first case and passive compensation in the second. In our view, the probabilistic approach used in our paper to investigate the likelihood of disease mutations given the sequence identity of the closest homolog can be applied for identification and prioritization of medically relevant mutations. Such prioritization approaches are necessary as large collections of human genetic variation, such as mutations associated with various cancers [35,36] and common human polymorphisms [37], are being generated at an accelerated rate. A probabilistic scheme, similar to the one used in our paper, can be directly applied as a prior in search for causative mutations; the information about homolog expression profiles can be also considered. The development of such probabilistic prioritization schemes is beyond the scope of this paper. Nevertheless, the fact that genes with 70-100% sequence identity homologs are about 2-3 times less likely to harbor disease mutations, and a significant fraction of such genes in the human genome, suggest that duplicate homology information may be important for the prioritization of medically relevant mutations.
The collections of disease genes used in our work are incomplete and significantly biased towards Mendelian diseases [15]. When large and reliable datasets of genes responsible for complex diseases become available it will be interesting to investigate whether fundamental differences exist between functional compensation for Mendelian and multi-factorial diseases. In future studies, it will be also important to investigate robustness to deleterious human mutations achieved through various network effects [3,9]. Such studies will bring the important biological concept of robustness into the realm of human genetics.

Methods
Three sets of human disease genes were used in our study. We obtained a list of 1003 human genes (1006 Swiss-Prot entries) with disease non-synonymous mutations from the Swiss-Prot database [14] (July 2005; http://expasy.org/cgi-bin/lists?humsavar.txt). The list of 881 human disease genes (923 OMIM entries) with annotated phenotypes was taken from the study by Jimenez-Sanchez et al. [15]. We also considered another disease set consisting of genes annotated as ''disease'', but neither as ''susceptibility'' nor as ''non-disease'' in the OMIM Morbid Map [13]. This set included 1609 genes (2239 MIM entries). Two sets of all human genes were used based on the Ensembl [16] and Swiss-Prot databases. The longest protein isoform of every human gene was obtained from the Ensembl human genome build 35. We only retained genes annotated as ''pep:known'' or ''pep:CCDS'' (representing genes mapped to human-specific entries of Swiss-Prot, RefSeq, SPTrEMBL or CCDS). In total 20,262 genes were included. The other all-human gene set consisted of 12,211 protein sequences from the Swiss-Prot database. All-against-all BLASTP searches were performed using standard parameters [17]. Sequence homologs were identified as non-self hits with Evalue , = 0.001 that could be aligned over more than 80% of both the query length and the length of identified sequence. Throughout the manuscript the term ''singleton human genes'' is used to describe the genes without any sequence homologs which can be identified the BLASTP searches.
We obtained H. sapiens to D. rerio, H. sapiens to G. gallus, and H. sapiens to M. musculus orthology information as well as paralogous relationships within D. rerio, G. gallus, and M. musculus from the Ensembl database [27]. Ka and Ka/Ks values of all 1:1 humanmouse orthologous pairs were calculated using the PAML package and obtained directly from the Ensembl database [27].
The sets of synonymous and non-synonymous human SNPs were obtained from the dbSNP database [24]. These included 87920 SNPs corresponding to 14825 human genes. For each bin of homolog sequence identity, the Ka/Ks ratio was calculated. The proportion of non-synonymous sites (0.717) was calculated from simulation; for each nucleotide in the protein coding region a random transition or transversion mutation was performed at the ratio of 0.6/0.4, according to the published estimates in mammals [38,39,40,41].
We used manually curated phenotypes from the study by Jimenez-Sanchez et al. [15] to calculated Spearman's rank correlation between reduction in life expectancy (ordinal data: none, mild, moderate, and severe) and sequence identity to the closest homolog.
The functional categories of human genes used in our study were based on the annotation by GOA [42]; 53 of GO slims for GOA (http://www.geneontology.org/GO_slims/goslim_goa.obo) were considered and Benjamin-Hochberg's algorithm was applied for multiple hypothesis correction.
The gene expression profiles in 79 human tissues were obtained from the study by Su et al. [33]. We eliminated probe sets with cross hybridization effects (as identified by Su et al.). In total, we considered expression profiles for 15097 human genes. The expression value of gene G at tissue T was set to 1 if at least one of gene G's transcripts was detected as ''Present call'' in tissue T based on the Affymetrix detection algorithm (provided by Su et al. [33]). Similarity of Tissue Expression (STE) of a gene pair was defined as the Jaccard's coefficient of the binary expression profiles of the two genes, that is, the ratio of the number of tissues where the two genes are both expressed to the number of tissues where at least one of the genes is expressed. We performed the likelihood ratio test to investigate whether the similarity in tissue expression influences the probability of being a disease gene independently of the sequence identity to the closest homolog. The logistic regression was used to model the probability of being a disease gene using the expression and sequence similarities. In the null hypothesis the disease gene probability is determined only by sequence identity of the closest homolog; in the alternative hypothesis the probability is determined by sequence identity and tissue expression similarity of the closest homolog.
The probabilities shown in Figure 1 represent conditional probabilities. Specifically, the conditional probability P(disease| seq_id_homolog) that a gene is associated with a genetic disease given that it has a closest homolog with a certain sequence identity, was calculated according to the equation: where P(seq_id_homolog | disease) is the probability that the closest homolog of a disease gene has a certain sequence identity, P(seq_id_homolog) is the probability that a randomly selected human gene (disease or non-disease) has a closest homolog with a certain sequence identity, and P(disease) is the probability that a random human gene is associated with a genetic disease. Importantly, because P(disease) is currently unknown (as we know only a fraction of all disease genes), we estimate P(disease | seq_id_homolog) up to a constant by assuming certain P(disease) value. For display purposes, we assumed P(disease) = 0.2 in Figure 1.