The authors have declared that no competing interests exist.
Conceived and designed the experiments: OM YYW ER RS. Performed the experiments: OM YYW. Analyzed the data: OM. Contributed reagents/materials/analysis tools: OM YYW. Wrote the paper: OM YYW ER RS.
The prioritization of candidate disease-causing genes is a fundamental challenge in the post-genomic era. Current state-of-the-art methods exploit a protein-protein interaction (PPI) network for this task. They are based on the observation that genes causing phenotypically similar diseases tend to lie close to one another in a PPI network. However, to date, these methods have used a static picture of human PPIs, while diseases impact specific tissues in which the PPI networks may be dramatically different. Here, for the first time, we perform a large-scale assessment of the contribution of tissue-specific information to gene prioritization. By integrating tissue-specific gene expression data with PPI information, we construct tissue-specific PPI networks for 60 tissues and investigate their prioritization power. We find that tissue-specific PPI networks considerably improve the prioritization results compared to those obtained using a generic PPI network. Furthermore, they allow predicting novel disease-tissue associations, pointing to sub-clinical tissue effects that may escape early detection.
Identifying the genes causing genetic disease is a key challenge in human health, and a crucial step toward developing novel diagnostics and treatments. Modern discovery methods involve genome-wide association studies that reveal regions of the genome where the causal gene is likely to reside, followed by prioritizing the candidate genes within these regions and experimentally examining the most promising candidates' potential influence on the disease. Many computational methods have been developed to automatically prioritize candidate genes. Some of the most successful methods use a biological network of interacting genes or proteins as an input. However, these networks, and consequently these methods, do not take into account the differences between tissues. In other words, a heart disease is analyzed using the same network as a skin disease. We constructed tissue-specific protein interaction networks and explored their effect on an existing prioritization algorithm by comparing the algorithm's performance on the tissue-specific networks and the generic network. We find that integrating tissue-specific data indeed leads to better prioritization. We also used the prioritization results of different tissues to suggest new disease-tissue associations.
A fundamental challenge in human health is elucidating the molecular basis of hereditary diseases. Contemporary methods for discovering disease-causing genes usually consist of two steps: first, genome-wide association studies identify genomic intervals that are linked to a disease of interest. Second, the genes within these intervals are examined for their causal relation to the disease
Many state-of-the-art algorithms for the gene prioritization problem use protein interaction or functional linkage networks
In this work, we incorporate tissue-specific gene expression data into the prioritization process and demonstrate its impact on the prioritization results. The integration is achieved by constructing tissue-specific protein-protein interaction (PPI) networks and employing them in the prioritization. The rationale behind this approach is that many disorders involve a disruption of the ‘molecular fabric’ of different, healthy tissues. From a protein interaction network point of view, this disruption can be often characterized as a perturbation of a gene, corresponding to node removal, or the perturbation of an interaction between two gene products, corresponding to an edge removal
The concept of tissue-specific protein interactions is relatively unexplored. Bossi and Lehner
Of note, the lack of tissue-specific PPI networks stands in marked contrast to the existence of many tissue- and cell-specific variants of other types of biological networks, such as regulatory networks
The current study is the first large-scale study that aims to enhance the accuracy of existing network-based gene prioritization algorithms by taking into account tissue-specific information. This is achieved by constructing tissue-specific PPI networks and utilizing them for gene prioritization instead of the standard, generic PPI network. First, we examine the hypothesis that a gene is likely to be expressed in a healthy tissue if its mutation manifests clinically in that tissue. Indeed, a large majority (71–83%) of the known disease-causing genes are significantly expressed in the corresponding disease-associated tissue. However, not all disease-associated genes are significantly expressed in the tissues where the disease is manifested. Interestingly, as shown below, we find that most of the remaining genes either have a low expression level across all tissues, or are involved in mediating a response to external stimuli or in multicellular developmental processes, and as such are not expected to have high expression under steady-state conditions in the adult tissue.
Focusing on the cases where the disease-related gene is expressed in the associated tissue, we show that integrating tissue specific expression information into a gene prioritization scheme markedly improves its prediction accuracy. Specifically, we generate tissue-specific PPI networks for 60 healthy human tissues using gene expression data from those tissues
We constructed literature-based gene-disease and disease-tissue association sets. To this end, we retrieved a set of known gene-disease associations from GeneCards
For each disease, we assigned the tissue that had the
Next, we constructed binary tissue-specific gene expression profiles for 60 healthy tissues based on the Novartis Research Foundation Gene Expression Database (GNF)
For each gene-disease association, we checked whether the causal gene is expressed in the tissue assigned to the disease. Interestingly, we found that a considerable fraction of the causal genes were not expressed in their assigned tissue, ranging between 29% and 17% from MAS>8% to MAS>60%, respectively (
The fraction of disease-causing genes expressed in the tissue of their pertaining disease, compared to the random expectation (obtained through a permutation test;
To better understand why disease-causing genes might be lowly expressed in their associated tissues, we studied in detail the 76 lowly-expressed disease-causing genes under a MAS threshold of 40%. First, we analyzed the functional annotations of those genes. Notably, 44 (58%) of the genes were found to be involved in multicellular development processes (GO:0007275, FDR E-value: 1.8E−11), and 36 of those were directly involved in organ development (GO:0048513, FDR E-value: 7.1E−12). Hence, mutations in these genes might disrupt their early embryonic activity, leading to pathologies in adult tissues regardless of their expression in these mature tissues. In addition, 17 (22%) of the genes were involved in cellular response to stimulus (GO:0051716, FDR E-value: 1.8E−4) and, therefore, may not be expressed under normal conditions.
We also found that disease-causing genes that were lowly expressed in the tissue associated with the disease tended to be expressed in fewer tissues than expected (12.1 tissues on average compared to 17.5 at random, p<1E−5;
We considered two methods for converting the generic PPI network into a tissue-specific network using a given tissue-specific expression profile. These methods are summarized in
First, we determine the set of expressed genes in a given tissue based on an expression cutoff of 200 Affymetrix AD units. The set of expressed genes is then superimposed on the general PPI using one of two strategies: (a) Node Removal – removing genes which are considered unexpressed from the network. (b) Edge Reweight - Reducing the weight of an edge connecting one or two unexpressed genes. This results in a tissue specific PPI network.
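The two strategies in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: the function names, the toy network, and in particular the multiplicative `penalty` applied once per unexpressed endpoint are assumptions made for the example. Only the 200 AD unit cutoff and the two strategy names come from the text.

```python
# Illustrative sketch only: names and the penalty scheme are assumptions.
EXPRESSION_CUTOFF = 200.0  # AD units, the cutoff stated in the text


def expressed_genes(expression, cutoff=EXPRESSION_CUTOFF):
    """Return the genes whose expression in one tissue is above the cutoff."""
    return {gene for gene, level in expression.items() if level > cutoff}


def node_removal(edges, expressed):
    """NR: keep only interactions whose two endpoints are both expressed."""
    return {(u, v): w for (u, v), w in edges.items()
            if u in expressed and v in expressed}


def edge_reweight(edges, expressed, penalty=0.5):
    """ERW: keep every edge, but reduce its weight for unexpressed endpoints.

    Assumed scheme: multiply the weight by `penalty` once per unexpressed
    endpoint (so twice when both endpoints are unexpressed).
    """
    reweighted = {}
    for (u, v), w in edges.items():
        missing = (u not in expressed) + (v not in expressed)
        reweighted[(u, v)] = w * penalty ** missing
    return reweighted


# Toy generic network (edge -> confidence weight) and one tissue profile.
generic = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.6}
tissue_expression = {"A": 850.0, "B": 320.0, "C": 40.0, "D": 10.0}

expressed = expressed_genes(tissue_expression)
nr_network = node_removal(generic, expressed)
erw_network = edge_reweight(generic, expressed, penalty=0.5)
```

In this toy case NR keeps a single interaction, whereas ERW keeps all three with reduced weights on the edges that touch unexpressed genes, so the reweighted network stays connected.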
A naïve method, titled “
The number of interactions also drops, from 41049 in the generic network to 14257.21 on average (range: 2026 [4.9%] to 27571 [67.1%], standard deviation: 6195.4). The number of expressed proteins and the number of retained interactions in the network are strongly positively correlated (Pearson correlation coefficient: 0.9939). Moreover, there is a similarly strong positive correlation between the number of expressed proteins and the average number of interactions per expressed protein in the tissue (Pearson correlation coefficient: 0.9803), suggesting that the power-law distribution of interactions is retained. See Supp.
The second tissue-specific network reconstruction method, novel to this work, is titled ‘
The NR and ERW (with
In order to prioritize candidate disease genes, we used the PRINCE prioritization algorithm, which we have previously shown to compare favorably to other state-of-the-art algorithms
PRINCE receives a weighted PPI network, a disease-disease phenotypic similarity network and a disease-gene association set as inputs. Given a query disease, PRINCE assigns a prior score to genes associated with known diseases that are phenotypically similar to the query. This score is then propagated through a PPI network in an iterative process, culminating in a smooth scoring function where the score of a node tends to be similar to the scores of its neighboring nodes.
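The iterative propagation described above can be illustrated with a short Python sketch. It assumes an update of the form F ← αW′F + (1−α)Y, where Y holds the prior scores and W′ is the edge-weight matrix normalized by the square roots of the weighted degrees of both endpoints. The function name, the value of `alpha`, the iteration count, and the toy network are assumptions for illustration and are not taken from the text.

```python
import math


def propagate(edges, prior, alpha=0.9, iterations=200):
    """Smooth prior scores over an undirected network with positive weights.

    Assumed update rule: F <- alpha * W' F + (1 - alpha) * Y, where
    W'[u][v] = w(u, v) / sqrt(degree(u) * degree(v)).
    """
    nodes = sorted({n for edge in edges for n in edge})
    degree = {n: 0.0 for n in nodes}
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w

    # Adjacency lists holding the normalized weights.
    neighbors = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        norm_w = w / math.sqrt(degree[u] * degree[v])
        neighbors[u].append((v, norm_w))
        neighbors[v].append((u, norm_w))

    y = {n: prior.get(n, 0.0) for n in nodes}
    f = dict(y)
    for _ in range(iterations):
        f = {n: alpha * sum(w * f[m] for m, w in neighbors[n])
                + (1 - alpha) * y[n]
             for n in nodes}
    return f


# Toy path network A - B - C - D with a prior on gene A only.
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0}
scores = propagate(edges, prior={"A": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Genes without a prior still receive a positive score that decays with their distance from the prior-carrying gene, which is the smoothness property the paragraph refers to.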
In more detail, let
We applied PRINCE to score disease-causing genes using both the original PPI network and the tissue-specific networks built with the
For MAS>40%, the AUC of the original, generic PPI network (0.825) was lower than that of each representative tissue-specific network (0.85–0.88). The results, summarized in
Performance comparison between the generic and different variants of tissue-specific PRINCE, according to the ROC Area under curve (AUC) of causal gene prediction in a leave-one-out cross validation test. Error bars represent the standard deviation of AUC values obtained when replacing leave-one-out with 25-fold cross validation of ten random partitions. Results are for a disease-tissue MAS threshold of 40%.
We further inspected the cross-validation results by comparing the ranking of true causal genes in the generic network to the tissue-specific networks on a case-to-case basis, in order to estimate how often the tissue-specific data improves the prioritization. Instead of bundling all of the cross-validation results together, we regarded every test case (disease-gene association) in the data set separately, and compared the ranking given to the actual causal gene by PRINCE using the generic and the tissue-specific PPI networks. We found that for every tested MAS threshold, both ERW and NR tissue-specific PRINCE ranked true associations higher than the generic PRINCE in a majority of the cases (
MAS threshold | Tissue-specific network type | #cases ranked better by tissue-specific | #cases tied | #cases ranked better by generic | Wilcoxon signed-rank test p-value |
 | NR | 295 | 203 | 103 | 2.09e−15 |
 | ERW, | 291 | 233 | 88 | 9.12e−26 |
 | ERW, | 288 | 266 | 58 | 1.88e−37 |
 | ERW, | 248 | 334 | 30 | 8.85e−37 |
 | NR | 125 | 91 | 40 | 7.68e−9 |
 | ERW, | 124 | 102 | 30 | 1.24e−14 |
 | ERW, | 122 | 117 | 17 | 2.84e−17 |
 | ERW, | 103 | 145 | 8 | 6.34e−17 |
The table presents a case-to-case comparison of the ranking provided by generic and tissue-specific PRINCE, as well as the statistical significance of this comparison using Wilcoxon signed-rank test.
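The counting behind the three "#cases" columns can be sketched as follows. This is a hypothetical helper with made-up toy ranks, not the authors' code. It assumes that a smaller rank is better, with rank 1 being the top candidate.

```python
def compare_ranks(generic_ranks, tissue_ranks):
    """Count trials where the tissue-specific rank is better, tied, or worse.

    Both lists hold the rank of the true causal gene in the same trials,
    in the same order. A smaller rank is better.
    """
    better = tie = worse = 0
    for generic_rank, tissue_rank in zip(generic_ranks, tissue_ranks):
        if tissue_rank < generic_rank:
            better += 1
        elif tissue_rank == generic_rank:
            tie += 1
        else:
            worse += 1
    return better, tie, worse


# Toy ranks of the causal gene in five leave-one-out trials.
generic_ranks = [12, 3, 1, 40, 7]
tissue_ranks = [5, 3, 1, 22, 9]
counts = compare_ranks(generic_ranks, tissue_ranks)
```

The paired ranks could then be passed to a library routine such as `scipy.stats.wilcoxon` to obtain a signed-rank p-value. That step is left out here to keep the sketch dependency-free.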
The ability to predict the effects of disease genes on specific tissues naturally gives rise to the question: given a disease (a collection of disease-causing genes), which tissues are most likely to be affected? This is of particular interest because, while the overt clinical manifestations of a disease are usually well known, in many cases the disease may have more subtle, sub-clinical tissue effects that escape early detection. Such alterations may manifest themselves at later stages of the disease and may be wrongly attributed to other potential complications and confounding factors instead of the original disease, which can serve at least as an important predisposing factor.
To investigate this potential scenario in depth, we developed a method to computationally infer disease-tissue associations using the framework presented in the previous section. For a given query disease, we applied TS-ERW PRINCE (
We compared our predicted disease-tissue associations to the data collected by Lage et al. For every disease with MAS>40%, we checked what ranking was given to the tissue which was assigned the highest association score by Lage et al (
The histogram shows the distribution of our disease-tissue ranking for the tissues assigned by Lage et al. in every test case (disease-gene association). As can be seen, in more than half of the cases the associated tissue was predicted first among all other tissues.
In the current study we aimed to infer disease-causing genes using tissue-specific PPI networks. Most previous studies that inferred causal genes from PPI data relied on generic PPI networks and ignored differences between tissues
We used the PRINCE algorithm for gene prioritization and contrasted generic and tissue-specific PPI networks. We found that the tissue-specific approach enhances the performance of the algorithm. In our analysis we used two different methods for constructing tissue-specific PPI networks, which yield different gene prioritization performance. We observed that better results were obtained when modifying the weights of the network's edges (using the ERW method) than when following the more drastic approach of removing lowly-expressed proteins from the network (using the NR method). There may be several explanations for these differences. First, the difference may be related to the PRINCE algorithm itself. A global network-based algorithm such as PRINCE is expected to be less successful when applied to a more disconnected network, such as those generated by the NR approach. Moreover, even for other algorithms that are based on local inference without propagation, ERW may prove more appropriate. NR is a very strict method, eliminating every unexpressed protein, while ERW assigns a continuous value to the interaction based on the expression of the two interacting proteins. Thus, the former is likely to be less robust to noisy data such as gene expression
One might suggest that there is no need to generate tissue specific PPI networks for tissue specific prioritization. Rather, one might use the generic PRINCE and then, in a post-processing manner, assign the lowest possible ranks to the lowly-expressed genes in the tissue being investigated. While such an attenuation approach performs poorly when applied to the entire gene-disease association set (AUC = 0.755,
Interestingly, as a preprocessing step for the tissue specific PRINCE algorithm, we found that a considerable fraction of disease genes are not expressed in the tissue associated with the disease. There may be several explanations for this observation. First, it may reflect an error in measurements, either of the expression microarray or the computational inference of disease-tissue association. Nevertheless, such a substantial fraction of genes is more likely to reflect a true biological observation. For example, a protein may be active although having lower mRNA levels. Posttranscriptional modifications or higher translational efficiency may also result in higher protein levels or longer protein half-lives
Another possibility is that the damage to the tissue was caused by a disruption of the protein function within the tissue at earlier developmental stages. Supporting this hypothesis, we found that lowly-expressed disease-causing genes are enriched with developmental annotations such as multicellular development processes (GO:0007275) and organ development (GO:0048513), and with stimulus response annotations (GO:0051716). Hence, the protein may not be active in the adult tissues (as manifested by its expression pattern), but a mutation in the gene may alter normal development of the tissue or may prevent the normal response of the tissue to stress or other stimuli, resulting in a disease. Finally, due to the complexity of, and the dependencies between, tissues in a multi-tissue organism, a mutation in a protein active in one tissue may result in clinical pathology in another tissue. For example, Vitamin D-dependent rickets 1A (MIM: 264700) is primarily a bone disorder, but it is caused by a mutation in the gene
Some limitations of the current analysis should be mentioned. First, a direct tissue-specific measure of protein abundance would be more adequate than mRNA levels as a measure of the presence, and hence the activation and functionality, of a protein in a tissue. However, despite the best efforts of the scientific community, compendia of human tissue-specific protein abundance levels across multiple tissues are not nearly as comprehensive as the mRNA expression dataset we use in tissue scope, gene coverage, and quantitative resolution
In recent years, PPI networks were shown to be a powerful tool in many fields of molecular biology, such as predicting protein annotation and more
We downloaded the Novartis Research Foundation Gene Expression Database (GNF) tissue-specific gene expression data set
The disease-tissue association matrix was contributed by Kasper Lage
For 76 disease genes that were lowly-expressed or not expressed (i.e., expression below 200 AD units) in the tissue associated with the relevant disease, we conducted functional enrichment analysis using the DAVID web server
To estimate the number of disease genes that are expected to be lowly-expressed at an assigned tissue at random, we computed this quantity for 10,000 permutations of the tissue assignment vector taken for a given MAS threshold. We permuted the vector instead of picking a tissue from a uniform distribution for every disease, in order to maintain the bias of tissues that tend to be associated with many diseases (e.g. skin diseases, cardiac diseases). Since the fraction measured experimentally was lower than those resulting from the 10,000 permutations, the estimated p-value is p<1E−5.
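The permutation scheme in this paragraph can be sketched as below. The data structures, function names, and toy values are hypothetical. The point of the sketch is that shuffling the tissue assignment vector, rather than drawing each tissue uniformly, keeps the number of diseases per tissue fixed. The `(count + 1) / (n + 1)` estimate of the empirical p-value is one common convention and is an assumption here, not necessarily the estimate used in the study.

```python
import random


def lowly_expressed_fraction(associations, tissue_of, expressed_in):
    """Fraction of (gene, disease) pairs whose gene is not expressed
    in the tissue assigned to the disease."""
    low = sum(1 for gene, disease in associations
              if gene not in expressed_in[tissue_of[disease]])
    return low / len(associations)


def permutation_fractions(associations, tissue_of, expressed_in,
                          n_perm=10000, seed=0):
    """Recompute the fraction after shuffling the tissue assignment vector."""
    rng = random.Random(seed)
    diseases = sorted(tissue_of)
    tissues = [tissue_of[d] for d in diseases]
    fractions = []
    for _ in range(n_perm):
        shuffled = tissues[:]
        rng.shuffle(shuffled)  # same tissues, reassigned to other diseases
        permuted = dict(zip(diseases, shuffled))
        fractions.append(
            lowly_expressed_fraction(associations, permuted, expressed_in))
    return fractions


# Toy data: three gene-disease associations and two tissues.
associations = [("g1", "d1"), ("g2", "d2"), ("g3", "d3")]
tissue_of = {"d1": "heart", "d2": "skin", "d3": "skin"}
expressed_in = {"heart": {"g1"}, "skin": {"g2", "g3"}}

observed = lowly_expressed_fraction(associations, tissue_of, expressed_in)
null = permutation_fractions(associations, tissue_of, expressed_in, n_perm=200)
# Assumed convention for the empirical one-sided p-value.
p_value = (sum(f <= observed for f in null) + 1) / (len(null) + 1)
```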
We constructed a weighted human PPI network with 9,998 proteins and 41,702 interactions. The network is based on three high throughput experiments
We now describe in detail the reweighting scheme that we used. Our underlying assumption was that an interaction between proteins P_1 and P_2 occurs in a specific tissue t if and only if P_1 and P_2 interact in the general network and are both expressed in tissue t. Denote the event that proteins P_i and P_j interact in the generic network as I_{i,j}, and the event that protein P_i is expressed in tissue t as X(i,t). Now, a gene is considered expressed in a given tissue if its measured expression level in that tissue is above 200 AD units. However, expression data is often noisy
We extracted from GeneCards
To evaluate the performance of the different variants of PRINCE, we used a leave-one-out cross validation procedure. In each cross-validation trial, a single disease-gene association, <
To generate the ROC curve, we bundled together all of the scores from all of the cross validation trials, sorting them from highest to lowest and recording true- and false- positive rates at various score cutoffs. The actual causal genes were considered positive, and the rest of the genes were considered negative.
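A minimal Python sketch of this pooled ROC computation is given below. The function name and the toy scores are assumptions. The trapezoidal handling of tied scores is one reasonable choice and is not stated in the text.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for pooled scores.

    Scores are swept from highest to lowest. Items with equal scores are
    processed as one group, which adds a diagonal segment to the curve.
    `labels` holds True for actual causal genes and False otherwise.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, is_pos in pairs if is_pos)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    auc = 0.0
    true_pos = 0
    i = 0
    while i < len(pairs):
        j = i
        new_tp = new_fp = 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                new_tp += 1
            else:
                new_fp += 1
            j += 1
        # Trapezoid between the previous ROC point and the new one.
        auc += (new_fp / n_neg) * (true_pos + new_tp / 2) / n_pos
        true_pos += new_tp
        i = j
    return auc


# Toy pooled scores from several trials; True marks the held-out causal gene.
pooled_scores = [0.9, 0.8, 0.7, 0.6]
pooled_labels = [True, False, True, False]
auc = roc_auc(pooled_scores, pooled_labels)
```

In the toy data, three of the four positive-negative pairs are ordered correctly, so the area is 0.75.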
For case-to-case rank comparison, we considered each trial separately and counted in how many trials the tissue-specific PRINCE gave the actual causal gene a better rank than the generic-network PRINCE, in how many trials it gave a worse rank, and in how many trials both input networks yielded the same rank (
To assess the significance of the difference between the different AUCs, we employed 25-fold cross validation. We performed ten random partitions and used the standard deviation of the resulting AUC values as error bars in
To fine-tune the
We filtered the disease-gene association set with a MAS>40% disease-tissue association threshold. The 40% threshold was picked in order to retain only high-confidence associations (∼90% estimated accuracy). We considered only disease-tissue associations where the causal gene is known to be expressed in the tissue assigned by Lage et al.
For each disease-gene pair, we removed the association and ran PRINCE with the same definitions and parameters as the previous section. We repeated the procedure once per tissue, using that tissue's TS-ERW PPI with
We evaluated the correlation of our tissue ranking with the tissues given the highest association score by Lage et al. for each disease (denoted ‘
To provide an estimated p-value for the high number of highly-ranked assigned tissues, we performed a permutation test as follows: For every disease-gene association, we assigned at random a tissue to the disease, selecting from the tissues where the causal gene is expressed (to counter the bias caused by focusing on disease-gene associations where the gene is expressed in the assigned tissue), and recorded the ranking we gave the randomly assigned tissue. When using the ‘ranking by PRINCE rank’ scheme, we counted how many times the random tissue was ranked first. We repeated this procedure 1000 times.
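The tissue ranking and its random baseline can be sketched for a single disease-gene association as follows. This is a simplified illustration under stated assumptions: tissues are ordered by the rank that each tissue-specific run gives the held-out causal gene, ties are broken alphabetically, and all names and toy scores are hypothetical.

```python
import random


def causal_gene_rank(scores, causal_gene):
    """1-based rank of the causal gene among all scored genes
    (a higher score gives a better rank)."""
    target = scores[causal_gene]
    return 1 + sum(1 for s in scores.values() if s > target)


def rank_tissues(scores_by_tissue, causal_gene):
    """Order tissues so that the tissue ranking the causal gene best is first."""
    ranks = {tissue: causal_gene_rank(scores, causal_gene)
             for tissue, scores in scores_by_tissue.items()}
    return sorted(ranks, key=lambda tissue: (ranks[tissue], tissue))


def random_first_rate(tissue_order, candidate_tissues, n_perm=1000, seed=0):
    """Fraction of random tissue draws that hit the top-ranked tissue.

    `candidate_tissues` should hold only tissues where the causal gene
    is expressed, mirroring the restriction described in the text.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_perm)
               if rng.choice(candidate_tissues) == tissue_order[0])
    return hits / n_perm


# Toy prioritization scores from three tissue-specific runs.
scores_by_tissue = {
    "heart": {"g1": 0.9, "g2": 0.5, "g3": 0.1},
    "liver": {"g1": 0.4, "g2": 0.5, "g3": 0.1},
    "skin": {"g1": 0.2, "g2": 0.5, "g3": 0.6},
}
tissue_order = rank_tissues(scores_by_tissue, causal_gene="g1")
baseline = random_first_rate(tissue_order, ["heart", "liver"])
```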
We would like to thank Kasper Lage for providing us the disease-tissue association matrix used in