Abstract
Many human genetic disorders and diseases are known to be related to each other through frequently observed co-occurrences. Studying the correlations among multiple diseases provides an important avenue to better understand the common genetic background of diseases and to help develop new drugs that can treat multiple diseases. Meanwhile, network science has seen increasing applications on modeling complex biological systems, and can be a powerful tool to elucidate the correlations of multiple human diseases. In this article, known disease-gene associations were represented using a weighted bipartite network. We extracted a weighted human diseases network from such a bipartite network to show the correlations of diseases. Subsequently, we proposed a new centrality measurement for the weighted human disease network (WHDN) in order to quantify the importance of diseases. Using our centrality measurement to quantify the importance of vertices in WHDN, we were able to find a set of most central diseases. By investigating the 30 top diseases and their most correlated neighbors in the network, we identified disease linkages including known disease pairs and novel findings. Our research helps better understand the common genetic origin of human diseases and suggests top diseases that likely induce other related diseases.
Citation: Almasi SM, Hu T (2019) Measuring the importance of vertices in the weighted human disease network. PLoS ONE 14(3): e0205936. https://doi.org/10.1371/journal.pone.0205936
Editor: Kwang-Il Goh, Korea University, KOREA, REPUBLIC OF
Received: September 30, 2018; Accepted: February 26, 2019; Published: March 22, 2019
Copyright: © 2019 Almasi, Hu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code for computing the vertex centrality measure DIL-W is provided at: https://github.com/MIBlab-MUN/vertex-centrality-DILW.
Funding: TH acknowledges the Discovery Grant RGPIN-2016-04699 from the Natural Sciences and Engineering Research Council of Canada (NSERC) (http://www.nserc-crsng.gc.ca/ResearchPortal-PortailDeRecherche/Instructions-Instructions/DG-SD_eng.asp). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
During the past decades, significant progress has been made in our understanding of human diseases [1]. However, the genetic architectures of complex diseases are still largely unclear. Many common diseases tend to be related to each other, and it is speculated that they may share a common genetic origin. Thus, studying the correlations of human diseases has the potential to improve our understanding of the genotype-to-phenotype mapping [2, 3] and the prediction of disease association genes [4, 5, 6, 7, 8]. Moreover, learning which diseases are correlated can help in using existing drugs to treat multiple similar diseases [9, 10, 11, 12, 13].
Meanwhile, network science is a rising field where entities and their complex relationships are studied on a global scale [14, 15, 16], and it is increasingly applied to advanced analyses of biomedical data [17, 18, 19, 20, 21, 22, 23, 24]. There are various cellular components in the human body that interact with each other within the same cell or across different cells [15]. A network called the human interactome can be constructed according to the interactions of those different cellular components. Each component can be represented as a vertex in the network and interactions among them can be captured as links (or edges) connecting pairs of the cellular components. Those cellular components can be proteins or metabolites, in which case the network is referred to as a protein-protein interaction (PPI) network [25, 26, 27] or a metabolic network [28, 29, 30], respectively.
Some studies aimed at identifying the correlations among diseases through network analysis [15, 31, 32]. Goh et al. [33] constructed a human disease network (HDN) by connecting pairs of diseases when they share common association genes. Of 1,284 diseases in the HDN, 867 have at least one link to other diseases, and 516 form a giant component, suggesting that the genetic origins of most diseases, to some extent, are shared with other diseases. Moreover, the HDN naturally and visibly clustered according to major disease classes such as cancer cluster and neurological disease cluster. Zhou et al. [34] extracted over twenty million bibliographic records from PubMed [35] in order to obtain 147,978 connections between 322 symptoms and 4,219 diseases. A human symptoms-disease network (HSDN) was then constructed and was able to show the symptom similarity between all pairs of diseases (7,488,851 links) in the network. The weight of links represented the similarity of symptoms between two diseases. They showed that the correlations among diseases were significantly related to the genetic associations that each pair of diseases had in common as well as the interactions between their related proteins. Lee et al. [36] built a disease metabolism network in order to study disease comorbidity for better disease prediction and prevention. Two diseases are connected if enzymes associated with them catalyze adjacent metabolic reactions. Their results show that diseases with higher degrees, i.e., connecting with many other diseases, have a higher rate of prevalence and mortality.
Measuring the centrality of vertices helps identify important vertices in the network in terms of connecting to all other vertices. Centrality measures have been used frequently to analyze biological networks over the past decades [37, 38, 39]. The most common centrality measures include degree (the total number of neighbors), closeness (based on the total distance to all other vertices), and betweenness (the fraction of shortest paths between all pairs of vertices that pass through the vertex) [40]. Despite wide applications in biological networks, these centrality measures are rather general and may not be able to capture all the properties of vertices in the context of biological networks. Furthermore, closeness and betweenness have high computational complexity because pairwise shortest paths in a network need to be enumerated in order to compute the centralities. Therefore, carefully tailored and more efficient centrality measures are needed for the specific network of interest, which in this study is the human disease network.
Köhler et al. [41] proposed a vertex importance measure for disease genes in the context of PPI networks. They used a random walk strategy to assess the distance between vertices in the network, and reported improved performance compared with conventional distance-based centrality measures. Wu et al. [42] integrated PPI networks with gene expression data in order to rank disease genes associated with various cancers. They showed that their method was able to find replicable high-rank genes using different datasets. Martinez et al. [43] proposed a generic vertex prioritization method using the idea of propagating information across data networks and measuring the correlation between the propagated values for a query and a target set of entities. The authors tested their method by ranking disease genes associated with Alzheimer’s disease, diabetes mellitus type 2 and breast cancer. They reported some new high-rank association genes that could bring new insights into the diseases.
In this article, we propose a new method for the construction of a weighted human disease network (WHDN) and a new centrality measure to identify the most important diseases. First, we use a large database of disease-gene associations to build a weighted bipartite disease-gene network, and then construct a weighted disease network where link weights capture the strengths of the pairwise disease correlations. After the backbone extraction of the WHDN, we design a centrality measure for the context of the WHDN that considers not only the degree of a vertex but also the importance of its incident edges. Then we compare our new centrality measure with degree, closeness and betweenness by evaluating the network efficiency decline rate with the removal of top-ranked vertices by each centrality measurement. Finally, we present the top 30 diseases ranked by our centrality measure in our WHDN and discuss their biological implications.
Methods and results
Given the multiple-step pipeline structure of this study, we show the result of each step after the description of the corresponding method. The source code of our analysis and the network files are accessible through the GitHub link: https://github.com/MIBlab-MUN/vertex-centrality-DILW.
Disease-gene associations (DGAs)
The data used in this project describe disease-gene associations (DGAs) from multiple curated databases including UNIPROT [44], CTD (human subset) [45], PsyGeNET [46], Orphanet [47], and HPO [48]. The disease-gene association data are collected by the DisGeNET group and are available in DisGeNET v4.0 [49]. The current version of the data set contains 130,821 DGAs between 13,075 diseases and 8,949 genes. Each DGA between a disease i and a gene j is assigned a score within the range [0, 1] based on its level of evidence, the number and the type of database sources supporting the DGA, and the number of publications verifying the association between the gene and the disease [49]. We first clean up the data in order to ensure that all diseases and genes in the dataset are unique and that there is no replication of disease-gene associations. Next, since we would like to consider the correlation among all diseases, we keep diseases and syndromes in the dataset for our analysis and remove injuries or poisonings, anatomical abnormalities, acquired abnormalities, mental or behavioral dysfunctions, signs or symptoms, findings, congenital abnormalities, neoplastic processes, and pathologic functions. We use the DisGeNET web-based application [49] for this filtering.
Network construction
Bipartite disease-gene association network.
The best representation for depicting the associations among genes and diseases is a bipartite graph, which is called the disease-gene association network in this research. The bipartite graph contains two different sets of vertices. One set includes diseases and the other one contains genes. By definition, no edge is allowed to connect a pair of vertices in the same set of vertices in a bipartite graph. That is, there can be no link between a pair of diseases or between a pair of genes. There is an edge between a gene and a disease if there is an association between them. The weight of this edge is the DGA score for disease i and gene j described in the previous section. A sample subgraph of the bipartite network is shown in Fig 1.
The bipartite network has two sets of vertices, i.e., genes and diseases, represented by rectangles and gray ellipses, respectively. An edge connects a disease and a gene if there is a known association between them. The weight of an edge reflects the strength of the DGA between disease i and gene j.
Fig 2 depicts the degree distributions of diseases and genes in the bipartite disease-gene association network. For the set of diseases, the maximum degree is 564 (epilepsy), and the average degree is 5.43. In Fig 2a), the degree distribution of the diseases is right-skewed and heavy-tailed, as indicated by the approximately linear fit on a log-log scale. For the set of genes, the maximum degree is 111 (the gene LMNA), and the average degree is 5.81.
Degree distribution of a) diseases and b) genes in the bipartite disease-gene association network. The distributions are shown on a log-log scale.
The bipartite network is comprised of multiple connected components with a single giant component. Fig 3 shows its distribution of the size of connected components. The giant component has 10,212 vertices consisting of 5,278 diseases and 4,934 genes. Apart from the giant component, all other connected components are small with a size varying from two to nine, and most of them are only single pairs of one disease and one gene. Fig 3 shows that there is a considerable number of components with two vertices, i.e., 844 isolated disease-gene pairs. Since we are interested in investigating the large-scale genetic correlations of human diseases, we focus on the giant component of the disease-gene bipartite network in the downstream analyses.
The network has a single giant component with 10,212 vertices, and the majority of other connected components are of size two, i.e., consisting of only one disease and one gene.
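The component analysis above can be sketched in a few lines of Python. This is only an illustrative sketch: the edge list, the disease and gene names, and the scores are invented toy data rather than DisGeNET records, and the function name `connected_components` is hypothetical.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return a list of vertex sets, one per connected component.

    `edges` is an iterable of (disease, gene, score) triples. Vertices are
    tagged with their type so the two vertex sets can never collide.
    """
    adj = defaultdict(set)
    for disease, gene, _score in edges:
        d, g = ("disease", disease), ("gene", gene)
        adj[d].add(g)
        adj[g].add(d)

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    component.add(u)
                    queue.append(u)
        components.append(component)
    return components

# Toy data (hypothetical scores): one larger component and one isolated pair.
toy_edges = [
    ("d1", "g1", 0.4), ("d2", "g1", 0.2), ("d2", "g2", 0.6),
    ("d3", "g2", 0.3), ("d4", "g3", 0.5),
]
components = connected_components(toy_edges)
giant = max(components, key=len)
giant_diseases = sorted(name for kind, name in giant if kind == "disease")
giant_genes = sorted(name for kind, name in giant if kind == "gene")
```

In this toy graph the pair (d4, g3) plays the role of one of the isolated disease-gene pairs, and only the vertices in `giant` would be carried into the downstream analysis.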
Weighted human disease network (WHDN).
We construct the WHDN using the giant connected component of the bipartite disease-gene network. We use D and G to denote sets of 5,278 diseases and 4,934 genes respectively in the giant connected component. In the WHDN, an edge links two diseases i and j if they have at least one association gene in common, and the weight of the edge, wij, is computed based on the number of shared association genes, as well as the strengths of those associations.
Such a weight definition is inspired by Newman’s study on scientific collaboration networks [14], where vertices are scientists and two scientists are connected by an unweighted edge if they have coauthored one or more scientific papers together. To define the strength of the tie between two connected scientists, two factors are considered. First, two scientists whose names appear on a paper together with many other coauthors know one another less well on average than two who are the sole authors of a paper. Thus, the collaborative ties are weighted inversely according to the number of coauthors of a paper. Second, authors who have written many papers together will know one another better on average than those who have written few papers together. Thus, all coauthored papers are added up to account for the tie strength of two scientists.
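To make the two factors concrete, the following minimal sketch computes Newman's collaboration weights, in which a paper with n authors adds 1/(n − 1) to the tie between each pair of its coauthors and contributions are summed over papers. The paper list and the function name `newman_weights` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def newman_weights(papers):
    """Newman's collaboration weights.

    Each paper with n authors adds 1 / (n - 1) to the tie between every
    pair of its coauthors; contributions are summed over all papers.
    """
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n = len(authors)
        if n < 2:
            continue  # single-author papers create no ties
        for a, b in combinations(authors, 2):
            weights[(a, b)] += 1.0 / (n - 1)
    return dict(weights)

# Toy data: A and B coauthor three papers with 2, 3 and 4 authors.
toy_papers = [["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]]
w = newman_weights(toy_papers)
```

Here the tie between A and B accumulates 1 + 1/2 + 1/3, while C and D, who share only the four-author paper, receive 1/3.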
Here, similarly, we first consider that the correlation of two diseases through a gene is stronger when they are the sole diseases associated with this gene than when many other diseases are associated with the same gene. Second, the correlation of two diseases is considered stronger when they share more genes through stronger associations than when they share fewer genes or weaker associations. Thus, we extend Newman’s method to weighted graphs and define the weight of edge wij between two diseases i and j as
(1)
where $\delta_i^g$ is one if disease i and gene g have a DGA, and zero otherwise, $w_{ig}$ is the score of their DGA assessed by DisGeNET as discussed in the previous section, and $s_g$ is the strength of gene g as a vertex in the bipartite disease-gene network, defined as the sum of the scores of the DGAs between gene g and its directly linked diseases,
$$s_g = \sum_{i \in D} \delta_i^g \, w_{ig}. \tag{2}$$
Such a weight definition indicates that the correlation strength of two diseases is weighted inversely according to the strengths of the genes they share, and is proportional to the total number of genes they share and the strengths of their DGAs.
For example, in Fig 1, the weight between diseases contact dermatitis (CD) and white sponge nevus 1 (WSN1) is calculated as follows,
Note that the weight of two diseases can be greater than one when they share multiple genes. For example, the weight between diseases WSN1 and hereditary mucosal leukokeratosis (HML) is calculated as follows,
Since the WHDN is constructed using vertices from the giant component of the bipartite disease-gene association network, it only has a single connected component with all 5,278 vertices in the disease set D. Two vertices have an edge connecting them if the two diseases they represent have at least one shared gene, and the edge weight is assessed as described above. The WHDN has 112,324 edges and an average vertex degree of 42.56. That is, a disease correlates with on average 42.56 other diseases with varying strengths. Fig 4 depicts the distribution of all the edge weights in the WHDN. As we can see, a large number of edge weights are small and may not be particularly interesting for the subsequent analysis. Those weak edges not only add computational overhead to the network analysis, but also render the network difficult to interpret. Therefore, we next perform an edge reduction and extract only the most meaningful structure of the network.
The weight of an edge quantifies the shared genetic background of two connected diseases. There are 112,324 edges in the graph with weights ranging from 0.0152 to 22.4506.
The multi-scale backbone of WHDN.
The most straightforward strategy for network reduction may be to use a global weight threshold and remove all links that have weights lower than the threshold. However, such a global thresholding strategy is somewhat arbitrary and may overlook the network information present below the cutoff scale. Here, to preserve the multi-scale backbone of the WHDN while removing less relevant and meaningful edges, we use a multi-scale filtering method proposed by Serrano et al. [50]. Such a multi-scale backbone extraction algorithm has been used to reduce the network size while preserving the meaningful structure of biological networks in multiple studies [34, 51, 52, 53].
First, the weight of the edge linking vertex i with its neighbor j can be normalized as
$$N_{ij} = \frac{w_{ij}}{s_i}, \tag{3}$$
where si is the vertex strength, i.e., the sum of weights incident to vertex i, similar to Eq (2) and defined as
$$s_i = \sum_{j \in \Gamma_i} w_{ij}, \tag{4}$$
where Γi is the set of vertex i’s neighbors. Therefore, there are two different normalized values for a link eij using the strengths of its two end vertices si and sj as the denominator.
Second, a null model is used to assess the expectation if the weights of links connecting to a particular vertex were distributed randomly. That is, the normalized weight Nij that corresponds to a link connecting to a certain vertex of degree k is produced by a random assignment from a uniform distribution. Thus, the probability density function for the variable taking a particular value x is
$$\rho(x)\,dx = (k - 1)(1 - x)^{k-2}\,dx. \tag{5}$$
Then, the probability βij that the link weight Nij is compatible with the null model is computed and compared with a threshold β,
$$\beta_{ij} = 1 - (k - 1)\int_0^{N_{ij}} (1 - x)^{k-2}\,dx < \beta. \tag{6}$$
All links with computed βij lower than a given threshold β are preserved in the network. Note that each edge has two different values, βij and βji. To resolve this, either an OR rule or an AND rule can be used. Under the OR rule, the link is preserved if either βij or βji is lower than β. Under the AND rule, an edge is preserved only if both βij and βji are lower than β. Darabos et al. [51] empirically found that the AND rule preserves the network features better than the OR rule in the context of human phenotype networks. In this article, the AND rule is adopted to reduce the size of the network by removing the links that are less relevant.
To find the best cutoff for β, we calculate the clustering coefficient, the percentages of remaining vertices and links, and the total weight of the network as a function of β in the range [0, 1]. Fig 5 shows these network metrics as a function of the β cutoff. We choose the β cutoff at which the clustering coefficient and the remaining vertices and weights are maximally preserved while as many links as possible are removed. Accordingly, the cutoff β = 0.501 is chosen, shown as the vertical dashed line in the figure.
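The filtering rule can be sketched as follows, assuming the null model of Serrano et al. [50], for which the integral has the closed form βij = (1 − Nij)^(k − 1) for a vertex of degree k. The toy weights, the threshold 0.5, and the function name `disparity_backbone` are illustrative assumptions, not values or code from this study.

```python
from collections import defaultdict

def disparity_backbone(edges, beta, rule="AND"):
    """Filter weighted undirected edges with the disparity filter.

    `edges` maps frozenset({i, j}) -> weight. An edge passes at endpoint i
    when beta_ij = (1 - N_ij) ** (k_i - 1) is below `beta`, which is the
    closed form of 1 - (k - 1) * integral_0^{N_ij} (1 - x)^(k - 2) dx.
    """
    strength = defaultdict(float)
    degree = defaultdict(int)
    for pair, w in edges.items():
        for v in pair:
            strength[v] += w
            degree[v] += 1

    def beta_at(v, w):
        n_ij = w / strength[v]  # normalized weight as seen from vertex v
        # For a degree-1 vertex this evaluates to 1, so it never passes.
        return (1.0 - n_ij) ** (degree[v] - 1)

    kept = {}
    for pair, w in edges.items():
        i, j = tuple(pair)
        passes = (beta_at(i, w) < beta, beta_at(j, w) < beta)
        keep = all(passes) if rule == "AND" else any(passes)
        if keep:
            kept[pair] = w
    return kept

# Toy graph (hypothetical weights): hub h, one strong and three weak links.
toy = {
    frozenset({"h", "a"}): 10.0,
    frozenset({"h", "b"}): 1.0,
    frozenset({"h", "c"}): 1.0,
    frozenset({"h", "d"}): 1.0,
    frozenset({"a", "b"}): 8.0,
}
and_backbone = disparity_backbone(toy, beta=0.5, rule="AND")
or_backbone = disparity_backbone(toy, beta=0.5, rule="OR")
```

In this toy case the edge between a and b is significant from the viewpoint of b but not of a, so it survives the OR rule and is removed by the stricter AND rule.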
CC represents clustering coefficient, %Vertices is the percentage of remaining vertices, %Weights is the percentage of weights left after removing links, and %Links is the percentage of remaining links.
After the backbone extraction, the WHDN has 4,898 vertices and 38,275 edges. Those vertices are no longer connected in a single component. Fig 6 shows the size distribution of its connected components. There is a giant component with 4,810 vertices and its degree distribution is shown in Fig 7. Again the degree distribution is heavy tailed and resembles a power-law relationship. The vertex epilepsy has the highest degree of 576. This giant component will be the focus for our next step analysis, i.e., measuring vertex importance in order to find the most central diseases in terms of correlating with other diseases.
The network has a single giant component with 4,810 vertices.
The distribution is shown on a log-log scale.
Measuring vertex importance in WHDN
Although various vertex centrality measures have been proposed in the literature [37, 38, 40, 41, 54], the quantification of the importance of a vertex in a network is often context-specific. For some networks, measuring degree may suffice, since a vertex can be considered important when its number of neighbors is the sole criterion. For other networks, e.g., information communication networks, a vertex may be considered more important if its distances to all other vertices are short, in which case closeness centrality serves this purpose well. For our WHDN, a disease is considered important if it correlates with many other diseases (degree) and if those correlations are themselves important (edge importance).
We propose a vertex importance measure for WHDN by extending a centrality measure for unweighted networks proposed by Liu et al. [54]. This measure assesses the centrality of a vertex based on both its degree and the importance of its incident links (DIL centrality). For its extension on weighted graphs, we name it the DIL-W centrality.
First, in the context of unweighted graphs, the importance of a link eij that connects vertices vi and vj can be calculated as follows:
$$I_{e_{ij}} = \frac{U}{\lambda}, \tag{7}$$
where $U = (k_i - t - 1)(k_j - t - 1)$ and $\lambda = t/2 + 1$. Following the convention, ki and kj are the degrees of vertices vi and vj, respectively, and t is the number of triangles with one edge being eij.
Subsequently, the contribution that vertex vi makes to the importance of eij is computed as
$$W_{e_{ij}}^{i} = I_{e_{ij}} \cdot \frac{k_i - 1}{k_i + k_j - 2}, \tag{8}$$
where j ∈ Γi, and Γi is the neighborhood of vertex i.
Then, the DIL centrality of vertex vi is calculated by combining both its degree and the importance of its incident links,
$$\mathrm{DIL}_i = k_i + \sum_{j \in \Gamma_i} W_{e_{ij}}^{i}. \tag{9}$$
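As an illustration of the unweighted DIL centrality, the sketch below assumes the following reading of Liu et al. [54]: the importance of edge eij is U/λ with U = (ki − t − 1)(kj − t − 1) and λ = t/2 + 1, vertex vi receives the share (ki − 1)/(ki + kj − 2) of that importance, and the DIL score is the degree plus the sum of these shares. The toy graph and the function name `dil_centrality` are hypothetical.

```python
def dil_centrality(adj):
    """DIL centrality for an unweighted, undirected graph.

    `adj` maps each vertex to the set of its neighbors. Assumed definitions:
    I_ij = U / lam, U = (k_i - t - 1) * (k_j - t - 1), lam = t / 2 + 1,
    where t is the number of triangles containing edge (i, j); vertex i
    receives the share (k_i - 1) / (k_i + k_j - 2) of I_ij.
    """
    scores = {}
    for i, neighbors in adj.items():
        k_i = len(neighbors)
        total = float(k_i)
        for j in neighbors:
            k_j = len(adj[j])
            t = len(neighbors & adj[j])  # common neighbors = triangles on (i, j)
            importance = (k_i - t - 1) * (k_j - t - 1) / (t / 2 + 1)
            denom = k_i + k_j - 2
            if denom > 0:  # an isolated edge contributes nothing
                total += importance * (k_i - 1) / denom
        scores[i] = total
    return scores

# Toy graph: two small stars centered at a and x, joined by the bridge a-x.
toy = {
    "a": {"b", "c", "x"}, "x": {"a", "y", "z"},
    "b": {"a"}, "c": {"a"}, "y": {"x"}, "z": {"x"},
}
scores = dil_centrality(toy)
```

In this toy graph a and x both have degree 3, but the bridge between them has importance 4 and adds 2 to each endpoint, whereas the pendant edges add nothing, so the leaves keep a score equal to their degree.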
For weighted networks, we modify the computation of U in Eq (7) as
(10)
where si is the strength of vertex vi, calculated as in Eq (4), and ti is the weight sum of links incident to vertex vi that form triangles with eij. This follows the intuition that first an edge is considered more important when its two end vertices have higher strengths. Second, the importance of an edge is reduced when it has alternative two-hop paths connecting the same set of end vertices. Therefore, we subtract ti from si in Eq (10).
We define λ for weighted graphs as
(11)
Finally, the importance of a vertex can be measured by
(12)
where
is defined as
(13)
Note that, if we remove the second component in the definition of DIL-W, the centrality measure simply becomes vertex strength, i.e., weighted degree.
In the weighted graph given in Fig 8, vertex a has a higher strength but a lower degree than vertex b. We compute their DIL-W centralities and investigate which one is more central when both factors are considered.
First we have their strength values sa = 0.9 + 0.3 + 0.5 + 0.6 = 2.3, and sb = 0.2 + 0.11 + 0.2 + 0.7 + 0.5 = 1.71. Their neighborhoods are Γa = {b, c, d, g} and Γb = {a, c, e, f, g}. For vertex a,
where
and
We have
and
So
We can also have
Then
Similarly, we can compute the DIL-W centrality of vertex b DIL-Wb = 2.8916. Therefore, based on both the degree and importance of incident edges, vertex a is considered more important than vertex b.
We apply the DIL-W centrality measurement to the giant component of the backbone of WHDN; the distribution of the resulting scores is shown in Fig 9. The DIL-W scores have a high dynamic range, from 0.0610 to 80688.1129. The majority of the vertices have low scores, while a small number of vertices have scores that are greater by orders of magnitude.
Comparison and evaluation
We compare our DIL-W measurement with the three most commonly used centralities, i.e., degree, closeness, and betweenness, when applied to the giant component of the backbone of WHDN. For weighted graphs, degree centrality is calculated as vertex strength given by Eq (4). Closeness and betweenness are shortest-path-based centralities. Shortest path computation can be extended to weighted graphs as follows,
$$d^{w}_{ij} = \min\left(\frac{1}{w_{ih}} + \cdots + \frac{1}{w_{hj}}\right). \tag{14}$$
Here, $d^{w}_{ij}$ denotes the weighted distance between vertices i and j, and wih is the weight of the edge linking vertices i and h, where h is an intermediate vertex on a path between vertices i and j. Since in our WHDN edge weight suggests strength, the distance between two vertices is the minimum sum of the inverse edge weights along a path connecting them. Once the weighted distance is defined, closeness and betweenness can be calculated by their original definitions.
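As a small worked illustration of Eq (14), with invented weights that are not taken from the WHDN, suppose vertices i and j are joined by a direct edge of weight 0.1 and also by a two-hop path through h with weights 0.5 and 0.25. The inverse-weight lengths of the two routes are

```latex
\begin{align*}
\text{direct edge:} \quad & \frac{1}{0.1} = 10, \\
\text{path via } h\text{:} \quad & \frac{1}{0.5} + \frac{1}{0.25} = 2 + 4 = 6,
\end{align*}
```

so the weighted distance between i and j is 6: the route through two strong correlations is shorter than the single weak link.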
Fig 10 shows the correlation of DIL-W scores with a) degree, b) closeness, and c) betweenness centralities. As we can see, there is a positive correlation between DIL-W measure and all other three vertex centrality measures. The Spearman’s rank correlation coefficient is 0.672 comparing DIL-W with closeness, is 0.71 comparing DIL-W with betweenness, and is 0.947 comparing DIL-W with degree.
Correlation of DIL-W scores with a) degree centrality, b) closeness centrality, and c) betweenness centrality in the WHDN.
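For readers who wish to reproduce this kind of comparison, Spearman's rank correlation can be computed as the Pearson correlation of the rank vectors. The sketch below is a plain-Python illustration with invented scores; the function names are hypothetical and it is not the code used to produce the coefficients reported above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy scores for five vertices under two hypothetical centrality measures.
measure_a = [1.0, 2.0, 3.0, 4.0, 5.0]
measure_b = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman(measure_a, measure_b)
```

With these toy values two adjacent pairs of ranks are swapped, which gives a coefficient of 0.8.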
To evaluate our new vertex importance quantification method, DIL-W, we measure the network efficiency before and after we remove the most important vertices in the WHDN. In the context of the WHDN, the network efficiency indicates the extent to which the original connectivity of the network is maintained. We calculate the decline rate of network efficiency after removing m top-rank vertices. The network efficiency [55] is computed based on the connectivity of a network. A higher connectivity suggests a higher network efficiency. The network efficiency is defined by
$$\eta = \frac{1}{n(n-1)} \sum_{i \neq j \in V} \frac{1}{d_{ij}}, \tag{15}$$
where n is the total number of vertices in the network, V is the vertex set, and dij is the weighted distance between vertices vi and vj. Thus, the decline rate of the network efficiency is calculated as
$$\mu = 1 - \frac{\eta}{\eta_0}, \tag{16}$$
where η0 is the efficiency of the original network, and η is the network efficiency after some vertices are removed.
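The evaluation procedure can be sketched as follows. The sketch assumes that the efficiency is the average of 1/dij over all ordered vertex pairs, that unreachable pairs contribute zero, that n is recounted after a vertex is removed, and that distances are inverse-weight shortest paths; the star graph and all function names are invented for illustration.

```python
import heapq

def weighted_distances(adj, source):
    """Dijkstra distances from `source`, using 1 / weight as edge length."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        for u, w in adj[v].items():
            nd = d + 1.0 / w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def efficiency(adj):
    """Average of 1 / d_ij over ordered pairs; unreachable pairs add 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for v in adj:
        for u, d in weighted_distances(adj, v).items():
            if u != v:
                total += 1.0 / d
    return total / (n * (n - 1))

def remove_vertex(adj, v):
    """Return a copy of the graph without vertex `v` and its edges."""
    return {a: {b: w for b, w in nbrs.items() if b != v}
            for a, nbrs in adj.items() if a != v}

def decline_rate(adj, v):
    """mu = 1 - eta / eta_0 after removing vertex `v`."""
    return 1.0 - efficiency(remove_vertex(adj, v)) / efficiency(adj)

# Toy star graph with unit weights (hypothetical data).
star = {
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "a": {"c": 1.0}, "b": {"c": 1.0}, "d": {"c": 1.0},
}
eta_0 = efficiency(star)
mu_center = decline_rate(star, "c")
mu_leaf = decline_rate(star, "a")
```

Removing the center disconnects every remaining pair and gives the maximal decline rate of 1, whereas removing a leaf leaves a smaller but relatively better connected star, so the decline rate under these assumptions is slightly negative.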
When a more important vertex is removed, we expect to see a greater decline rate of the network efficiency. Thus we can use μ as an indicator for the actual impact of removing a vertex in the network. Fig 11 shows the decline rate of the network efficiency when we remove each of the top 40 vertices ranked by a) degree (DC), b) closeness (CC), c) betweenness (BC), and d) DIL-W. Further removal of top-ranked vertices could be investigated but was not included in the current study given the high computational demand. As shown in the figure, we do not observe a monotonic relationship for any of the four centrality methods. However, the correlation analysis shows that our method, DIL-W, has a slightly stronger negative correlation between the decline rate and the rank of the removed vertex than the other three. The Spearman’s rank correlation coefficient, ρ, for degree, closeness, and betweenness is −0.1801, −0.0017, and −0.0679, respectively. In comparison, DIL-W has a negative correlation coefficient of −0.2698.
Decline rate of network efficiency after removing a single vertex ranked by a) degree centrality (DC), b) closeness centrality (CC), c) betweenness centrality (BC), and d) DIL-W.
We also consider removing all m top-rank vertices at once and see how this cumulative removal affects the efficiency of the network. Fig 12 shows the decline rate of the network efficiency after removing all top m vertices ranked by different centrality measures. The graph shows that the proposed method, DIL-W, has the highest decline rate of network efficiency for 57.5% of the data points, while betweenness, closeness, and degree have the highest decline rate for 27.5%, 10%, and 5%, respectively. This suggests that DIL-W is able to select a set of more important vertices compared with the other three centrality measures. As seen in Fig 12, the four methods are very comparable until the top 11 diseases are removed from the network. Then DIL-W has a significantly higher network efficiency decline rate than the rest. Betweenness centrality catches up around point 30 and becomes very comparable afterwards.
Since one main contribution of our study is to add edge weights to the HDN, we collect another set of results by computing vertex centralities without the consideration of edge weights. That is, the network structure remains the same but edges now do not carry weights, then the weighted DC, CC, and BC simply become their original definitions for un-weighted graphs, and DIL-W is replaced by the original DIL. The comparison is depicted in S1 Fig, which shows that excluding edge weights results in very similar vertex rankings by various centrality measures and essentially no significant difference in evaluation.
Table 1 shows the top 30 diseases ranked by our DIL-W method, their degrees, and their neighbors that have the strongest correlations (i.e., edge weights). References that support the known comorbidity of the disease pairs are also given.
The table shows the diseases, along with their rankings, their degrees in the WHDN, their direct neighbors with the strongest edge weight, and literature references that have discussed the correlations of the disease pairs.
In addition, we compare the top 30 diseases ranked by different centrality measures (see Fig 13). The figure shows the top 30 diseases ranked by our proposed DIL-W (x-axis), as well as their rankings by the other three centrality measures. If a disease is not among the top 30 ranks by a centrality measure, the data point is placed at zero, i.e., on the x-axis. We see that 18 out of the 30 top diseases ranked by DIL-W are not picked up by at least one other centrality measure. This comparison result further justifies the utility of our proposed centrality measure for finding central vertices (diseases) undetected by other conventional centrality measures.
The diseases on the x-axis are ordered based on their ranks by DIL-W. A data point landing on the x-axis indicates that the corresponding disease was not among the top 30 ranked by a centrality measure.
Discussion
In this article, we use a network-based analysis to identify important human diseases that share genetic background with many other diseases through strong associations. We collect a large number of known disease-gene associations (DGAs) using DisGeNET in order to construct a bipartite disease-gene network. Subsequently, a weighted human disease network (WHDN) is built by connecting pairs of diseases that share associated genes, and the edge weights reflect the number of genes they share as well as the strength of the DGAs. Then we develop a new vertex centrality measure for the WHDN, degree and importance of link centrality (DIL-W), which considers both the degree of a vertex and the importance of its incident edges in weighted graphs. Our network-based analysis methods are shown to be able to identify more important diseases compared with degree, closeness and betweenness centralities. The identified disease-disease correlations include previous knowledge supported by published literature as well as less known and novel correlations that can be valuable for further studies.
The contribution of our study is twofold: the construction of the WHDN and the importance measurement of a vertex considering both its degree and its incident edge(s). First, compared with the HDN (an unweighted graph) proposed by Goh et al. [33], the mechanism of including vertices and edges is the same, but we add the consideration of the confidence and strength of disease-disease correlations and add weights to edges of the HDN. Such a WHDN allows us to prune the network using a vertex disparity filter [50], which considerably reduces the complexity of the network by removing less-significant edges (from 112,324 to 38,275 edges before and after the backbone extraction), while preserving most of the vertices (from 5,278 to 4,898, respectively).
Second, we propose a new vertex centrality measure, DIL-W, for the WHDN, which quantifies the importance of a vertex by considering its degree and the aggregate importance of its incident edges. The motivation is that a disease should be considered important if it is correlated with many other diseases (i.e., its degree) and these correlations are themselves strong and significant (i.e., edge importance).
DIL-W only uses local information of a vertex for its importance assessment, and its computational complexity is $O(|V|\langle k \rangle^2)$, where $|V|$ is the total number of vertices and $\langle k \rangle$ is the average degree of vertices in a network. Thus, DIL-W can be efficient to compute for large and sparse networks.
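To illustrate what "local information" means here, the following sketch computes an unweighted, DIL-style score in which each vertex inspects only its neighbours and their neighbour sets. It assumes the unweighted formulation in the spirit of Liu et al. [54]: edge importance is derived from the endpoint degrees and the number of triangles through the edge, and a share of it is added to the vertex degree. This is an assumption made for illustration; the weighted definition of DIL-W is the one given in the Methods and is not reproduced here. The function name and the adjacency layout are hypothetical.

```python
def dil_style_scores(adj):
    """Degree plus a share of the importance of each incident edge.

    adj: dict mapping vertex -> set of neighbours (undirected, unweighted).
    """
    scores = {}
    for i, nbrs in adj.items():
        k_i = len(nbrs)
        total = float(k_i)
        for j in nbrs:
            k_j = len(adj[j])
            # Number of triangles that contain the edge (i, j).
            p = len(nbrs & adj[j])
            # ASSUMED edge importance: many outside connections and few
            # shared neighbours make an edge harder to replace.
            importance = (k_i - p - 1) * (k_j - p - 1) / (p / 2 + 1)
            denom = k_i + k_j - 2
            if denom > 0:
                # Share of the edge importance attributed to vertex i.
                total += importance * (k_i - 1) / denom
        scores[i] = total
    return scores


# Toy path graph a - b - c - d.
toy_adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
toy_scores = dil_style_scores(toy_adj)
```

Each vertex performs one neighbour-set intersection per incident edge, so the work per vertex grows with its degree rather than with the size of the network. This is why measures of this kind remain tractable on large, sparse graphs.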
Upon application to the WHDN, DIL-W is shown to outperform three commonly used centrality measures (degree, closeness, and betweenness) and identifies top diseases including epilepsy, anemia, and obesity. Table 1 shows, for each of the 30 top-ranked diseases, its degree in the WHDN and its most correlated disease. We are also able to find previous publications that verify almost all of the correlations between those pairs of diseases, shown as references in the table. Besides some very well-known correlations, such as those between heart failure and obesity and between diabetes and obesity, the table also reports some lesser-known but interesting correlations. For instance, Savin [58] showed that atypical retinitis pigmentosa is correlated with obesity. Moreover, the correlation between anemia and pediatric failure to thrive had not been reported in the literature until Dimmock et al. [57] recently suggested anemia as one of the novel causes of failure to thrive in children. Zimmerman [61] studied the causes of different types of cirrhosis resulting from different drug-induced injuries, which supports our finding of a correlation between cirrhosis and chemical and drug induced liver injury.
The disease-gene associations in this study come from DisGeNET [49] only. While this is a valuable resource, it is only one of many databases that hold disease-gene information (others include Jensen Lab’s DISEASES [80] and DiseaseConnect [81]), each of which has its own convention for scoring disease associations. These alternative databases will be explored in our future work.
Another future direction we would like to explore is to apply our proposed centrality measure DIL-W to other networks and to test its utility. Centrality measures essentially tell us how important a vertex is in the context of a network structure, and this “importance” can take different meanings in different types of networks. For instance, in the Internet, vertices are physical routers, servers, and computers that are responsible for transporting information; vertex importance should therefore reflect how much a vertex controls the traffic flow and how much its removal would disrupt it. We expect DIL-W to find useful applications in weighted networks where vertices are considered important when they are connected to many others through strong relationships.
Our understanding of human diseases is still limited, and the known disease-gene associations are far from complete. Future studies could explore the use of multiple types of data and more powerful computational tools to better cluster and categorize human diseases and to predict new genes and other factors that can explain diseases.
Supporting information
S1 Fig. The decline rate of the network efficiency as a function of removing the m (x-axis) top-ranked vertices using centrality measures degree (DC), closeness (CC), betweenness (BC), and DIL without the consideration of edge weight.
https://doi.org/10.1371/journal.pone.0205936.s001
(EPS)
Acknowledgments
This research was supported by the Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) RGPIN-2016-04699 to TH. The computation was feasible with the help from the IBM HPC cluster at the Center for Health Informatics & Analytics (CHIA), Faculty of Medicine, Memorial University.
References
- 1. Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nature Genetics. 2003;33(3s):228. pmid:12610532
- 2. Lage K, Karlberg EO, Størling ZM, Olason PI, Pedersen AG, Rigina O, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature Biotechnology. 2007;25(3):309–316. pmid:17344885
- 3. Wu X, Liu Q, Jiang R. Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics. 2008;25(1):98–104. pmid:19010805
- 4. Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Molecular Systems Biology. 2008;4(1):189. pmid:18463613
- 5. Barrenas F, Chavali S, Holme P, Mobini R, Benson M. Network properties of complex human disease genes identified through genome-wide association studies. PloS One. 2009;4(11):e8090. pmid:19956617
- 6. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology. 2010;6(1):e1000641. pmid:20090828
- 7. Wang X, Gulbahce N, Yu H. Network-based methods for human disease gene prediction. Briefings in Functional Genomics. 2011;10(5):280–293. pmid:21764832
- 8. Moreau Y, Tranchevent LC. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nature Reviews Genetics. 2012;13(8):523–536. pmid:22751426
- 9. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Computational Biology. 2010;6(2):e1000662. pmid:20140234
- 10. Luo H, Wang J, Li M, Luo J, Peng X, Wu FX, et al. Drug repositioning based on comprehensive similarity measures and bi-random walk algorithm. Bioinformatics. 2016;32(17):2664–2671. pmid:27153662
- 11. Chiang AP, Butte AJ. Systematic evaluation of drug-disease relationships to identify leads for novel drug uses. Clinical Pharmacology & Therapeutics. 2009;86(5):507–510.
- 12. Gottlieb A, Stein GY, Ruppin E, Sharan R. PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular Systems Biology. 2011;7(1):496. pmid:21654673
- 13. Chen H, Zhang H, Zhang Z, Cao Y, Tang W. Network-based inference methods for drug repositioning. Computational and Mathematical Methods in Medicine. 2015;2015.
- 14. Newman ME. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E. 2001;64(1):016132.
- 15. Barabási AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nature Reviews Genetics. 2011;12(1):56–68. pmid:21164525
- 16. Vidal M, Cusick ME, Barabási AL. Interactome networks and human disease. Cell. 2011;144(6):986–998. pmid:21414488
- 17. Hu T, Sinnott-Armstrong NA, Kiralis JW, Andrew AS, Karagas MR, Moore JH. Characterizing genetic interactions in human disease association studies using statistical epistasis networks. BMC Bioinformatics. 2011;12:364. pmid:21910885
- 18. Hu T, Chen Y, Kiralis JW, Moore JH. ViSEN: Methodology and software for visualization of statistical epistasis networks. Genetic Epidemiology. 2013;37:283–285. pmid:23468157
- 19. Yin T, Chen S, Wu X, Tian W. GenePANDA—a novel network-based gene prioritizing tool for complex diseases. Scientific Reports. 2017;7.
- 20. Junker BH, Koschützki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006;7(1):219. pmid:16630347
- 21. Kacprowski T, Doncheva NT, Albrecht M. NetworkPrioritizer: a versatile tool for network-based prioritization of candidate disease genes or other molecules. Bioinformatics. 2013;29(11):1471–1473. pmid:23595661
- 22. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4(1):2. pmid:12525261
- 23. Hu T, Zhang W, Fan Z, Sun G, Likhodi S, Randell E, et al. Metabolomics differential correlation network analysis of osteoarthritis. Pacific Symposium on Biocomputing. 2016;21:120–131. pmid:26776179
- 24. Hu T, Oksanen K, Zhang W, Randell E, Furey A, Sun G, et al. An evolutionary learning and network approach to identifying key metabolites for osteoarthritis. PLoS Computational Biology. 2018;14(3):e1005986. pmid:29494586
- 25. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature. 2005;437(7062):1173–1178. pmid:16189514
- 26. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, et al. A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005;122(6):957–968. pmid:16169070
- 27. Oti M, Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein–protein interactions. Journal of Medical Genetics. 2006;43(8):691–698. pmid:16611749
- 28. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. The large-scale organization of metabolic networks. Nature. 2000;407(6804):651–654. pmid:11034217
- 29. Fell DA, Wagner A. The small world of metabolism. Nature Biotechnology. 2000;18(11):1121–1122. pmid:11062388
- 30. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, et al. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences. 2007;104(6):1777–1782.
- 31. Rzhetsky A, Wajngurt D, Park N, Zheng T. Probing genetic overlap among complex human phenotypes. Proceedings of the National Academy of Sciences. 2007;104(28):11694–11699.
- 32. Hidalgo CA, Blumm N, Barabási AL, Christakis NA. A dynamic network approach for the study of human phenotypes. PLoS Computational Biology. 2009;5(4):e1000353. pmid:19360091
- 33. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL. The human disease network. Proceedings of the National Academy of Sciences. 2007;104(21):8685–8690.
- 34. Zhou X, Menche J, Barabási AL, Sharma A. Human symptoms–disease network. Nature Communications. 2014;5:4212. pmid:24967666
- 35. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. 2007;36(suppl_1):D13–D21. pmid:18045790
- 36. Lee DS, Park J, Kay K, Christakis N, Oltvai Z, Barabási AL. The implications of human metabolic network topology for disease comorbidity. Proceedings of the National Academy of Sciences. 2008;105(29):9880–9885.
- 37. Koschützki D, Schreiber F. Centrality analysis methods for biological networks and their application to gene regulatory networks. Gene regulation and systems biology. 2008;2:GRSB–S702. pmid:19787083
- 38. Özgür A, Vu T, Erkan G, Radev DR. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics. 2008;24(13):i277–i285. pmid:18586725
- 39. Chavali S, Barrenas F, Kanduri K, Benson M. Network properties of human disease genes with pleiotropic effects. BMC systems biology. 2010;4(1):78. pmid:20525321
- 40. Newman M. Networks: an Introduction. Oxford University Press; 2010.
- 41. Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics. 2008;82(4):949–958. pmid:18371930
- 42. Wu C, Zhu J, Zhang X. Integrating gene expression and protein-protein interaction network to prioritize cancer-associated genes. BMC Bioinformatics. 2012;13(1):182. pmid:22838965
- 43. Martínez V, Cano C, Blanco A. ProphNet: A generic prioritization method through propagation of information. BMC Bioinformatics. 2014;15(1):S5. pmid:24564336
- 44. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, et al. UniProt: the universal protein knowledgebase. Nucleic acids research. 2004;32(suppl_1):D115–D119. pmid:14681372
- 45. Davis A, Murphy C, Johnson R, Lay J, Lennon-Hopkins K, Saraceni-Richards C, et al. CTD-Comparative Toxicogenomics Database.
- 46. Gutiérrez-Sacristán A, Bravo À, Portero-Tresserra M, Valverde O, Armario A, Blanco-Gandía M, et al. Text mining and expert curation to develop a database on psychiatric diseases and their genes. Database. 2017;2017. pmid:29220439
- 47. Pavan S, Rommel K, Marquina MEM, Höhn S, Lanneau V, Rath A. Clinical practice guidelines for rare diseases: the orphanet database. PloS one. 2017;12(1):e0170365. pmid:28099516
- 48. Köhler S, Carmody L, Vasilevsky N, Jacobsen JOB, Danis D, Gourdine JP, et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018.
- 49. Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research. 2017;45(D1):D833–D839. pmid:27924018
- 50. Serrano MÁ, Boguná M, Vespignani A. Extracting the multiscale backbone of complex weighted networks. Proceedings of the National Academy of Sciences. 2009;106(16):6483–6488.
- 51. Darabos C, White MJ, Graham BE, Leung DN, Williams SM, Moore JH. The multiscale backbone of the human phenotype network based on biological pathways. BioData Mining. 2014;7(1):1. pmid:24460644
- 52. Serrano MÁ, Boguná M, Sagués F. Uncovering the hidden geometry behind metabolic networks. Molecular biosystems. 2012;8(3):843–850. pmid:22228307
- 53. Cantini L, Medico E, Fortunato S, Caselle M. Detection of gene communities in multi-networks reveals cancer drivers. Scientific reports. 2015;5:17386. pmid:26639632
- 54. Liu J, Xiong Q, Shi W, Shi X, Wang K. Evaluating the importance of nodes in complex networks. Physica A: Statistical Mechanics and its Applications. 2016;452:209–219.
- 55. Ren ZM, Shao F, Liu JG, Guo Q, Wang BH. Node importance measurement based on the degree and clustering coefficient information. Acta Phys Sin. 2013;6:128901.
- 56. Mansergh FC, Millington-Ward S, Kennan A, Kiang AS, Humphries M, Farrar GJ, et al. Retinitis pigmentosa and progressive sensorineural hearing loss caused by a C12258A mutation in the mitochondrial MTTS2 gene. The American Journal of Human Genetics. 1999;64(4):971–985. pmid:10090882
- 57. Dimmock D, Kobayashi K, Iijima M, Tabata A, Wong LJ, Saheki T, et al. Citrin deficiency: a novel cause of failure to thrive that responds to a high-protein, low-carbohydrate diet. Pediatrics. 2007;119(3):e773–e777. pmid:17332192
- 58. Savin L. Atypical retinitis pigmentosa associated with obesity, polydactyly, hypogenitalism, and mental retardation (the Laurence-Moon-Biedl Syndrome)(clinical and genealogical notes on a case). The British Journal of Ophthalmology. 1935;19(11):597. pmid:18169322
- 59. Silva DR, Coelho AC, Dumke A, Valentini JD, de Nunes JN, Stefani CL, et al. Osteoporosis prevalence and associated factors in patients with COPD: a cross-sectional study. Respiratory Care. 2011;56(7):961–968. pmid:21352667
- 60. Stolz SE, Chatrian GE, Spence AM. Epileptic nystagmus. Epilepsia. 1991;32(6):910–918. pmid:1743165
- 61. Zimmerman HJ. Drug-induced liver disease. Clinics in Liver Disease. 2000;4(1):73–96. pmid:11232192
- 62. American Optometric Association. https://www.aoa.org/; 2017. Available from: https://www.aoa.org/patients-and-public/eye-and-vision-problems/glossary-of-eye-and-vision-conditions/nystagmus.
- 63. Kenchaiah S, Evans JC, Levy D, Wilson PW, Benjamin EJ, Larson MG, et al. Obesity and the risk of heart failure. New England Journal of Medicine. 2002;347(5):305–313. pmid:12151467
- 64. Rowland LP. Diagnosis of amyotrophic lateral sclerosis. Journal of the Neurological Sciences. 1998;160:S6–S24. pmid:9851643
- 65. Rodger W. Non-insulin-dependent (type II) diabetes mellitus. CMAJ: Canadian Medical Association Journal. 1991;145(12):1571. pmid:1742694
- 66. Millar J. Epilepsy and strabismus. Epilepsia. 1965;6(1):43–46. pmid:14302047
- 67. Czerwinski SL, Plummer CE, Greenberg SM, Craft WF, Conway JA, Perez ML, et al. Dynamic exophthalmos and lateral strabismus in a dog caused by masticatory muscle myositis. Veterinary Ophthalmology. 2015;18(6):515–520. pmid:25728848
- 68. Brookhouser PE. Sensorineural hearing loss in children. Pediatric Clinics of North America. 1996;43(6):1195–1216. pmid:8973508
- 69. Nørgaard F. Earliest roentgenological changes in polyarthritis of the rheumatoid type: rheumatoid arthritis. Radiology. 1965;85(2):325–329. pmid:14323910
- 70. Botez M, Attig E, Vézina JL. Cerebellar atrophy in epileptic patients. Canadian Journal of Neurological Sciences. 1988;15(3):299–303.
- 71. Weissmann G. Rheumatoid arthritis and systemic lupus erythematosus as immune complex diseases. Bulletin of the NYU Hospital for Joint Diseases. 2009;67(3):251. pmid:19852746
- 72. Sato O, Yamguchi T, Kittaka M, Toyama H. Hydrocephalus and epilepsy. Child’s Nervous System. 2001;17(1):76–86. pmid:11219629
- 73. Nabel EG, Braunwald E. A tale of coronary artery disease and myocardial infarction. New England Journal of Medicine. 2012;366(1):54–63. pmid:22216842
- 74. Kaplowitz N. Drug-induced liver injury. Clinical Infectious Diseases. 2004;38(Supplement_2):S44–S48. pmid:14986274
- 75. Galli E, Gianni S, Auricchio G, Brunetti E, Mancino G, Rossi P. Atopic dermatitis and asthma. In: Allergy and Asthma Proceedings. vol. 28. OceanSide Publications, Inc; 2007. p. 540–543.
- 76. Arumugam K. Endometriosis and obesity. Journal of Obstetrics and Gynaecology. 1992;12(4):266–268.
- 77. Gajarski R, Naftel DC, Pahl E, Alejos J, Pearce FB, Kirklin JK, et al. Outcomes of pediatric patients with hypertrophic cardiomyopathy listed for transplant. The Journal of Heart and Lung Transplantation. 2009;28(12):1329–1334. pmid:19782603
- 78. Tucci DL, Born DE, Rubel EW. Changes in spontaneous activity and CNS morphology associated with conductive and sensorineural hearing loss in chickens. Annals of Otology, Rhinology & Laryngology. 1987;96(3):343–350.
- 79. Fiorentino E, Pantuso G, Cusimano A, Latteri S, Mastrosimone A, Cipolla C. Gastro-oesophageal reflux and “epileptic” attacks: casually associated or related efficiency of antireflux surgery. Chirurgia Italiana. 2008;58(6):689–696.
- 80. Pletscher-Frankild S, Pallejà A, Tsafou K, Binder JX, Jensen LJ. DISEASES: Text mining and data integration of disease–gene associations. Methods. 2015;74:83–89. pmid:25484339
- 81. Liu CC, Tseng YT, Li W, Wu CY, Mayzus I, Rzhetsky A, et al. DiseaseConnect: a comprehensive web server for mechanism-based disease–disease connections. Nucleic Acids Research. 2014;42(W1):W137–W146. pmid:24895436