Measuring the importance of vertices in the weighted human disease network

Many human genetic disorders and diseases are known to be related to each other through frequently observed co-occurrences. Studying the correlations among multiple diseases provides an important avenue to better understand the common genetic background of diseases and to help develop new drugs that can treat multiple diseases. Meanwhile, network science has seen increasing application in modeling complex biological systems, and can be a powerful tool to elucidate the correlations of multiple human diseases. In this article, known disease-gene associations were represented using a weighted bipartite network. We extracted a weighted human disease network from this bipartite network to show the correlations of diseases. Subsequently, we proposed a new centrality measurement for the weighted human disease network (WHDN) in order to quantify the importance of diseases. Using our centrality measurement to quantify the importance of vertices in the WHDN, we were able to find a set of most central diseases. By investigating the 30 top diseases and their most correlated neighbors in the network, we identified disease linkages including known disease pairs and novel findings. Our research helps better understand the common genetic origin of human diseases and suggests top diseases that likely induce other related diseases.


Introduction
During the past decades, significant progress has been made in our understanding of human diseases [1]. However, the genetic architectures of complex diseases are still largely unclear. Many common diseases tend to be related to each other, and it is speculated that they may share a common genetic origin. Thus, studying the correlations of human diseases has the potential to improve our understanding of the genotype-to-phenotype mapping [2,3] and the prediction of disease-associated genes [4,5,6,7,8]. Moreover, learning which diseases are correlated can help use existing drugs to treat multiple similar diseases [9,10,11,12,13].
Meanwhile, network science is a rising field where entities and their complex relationships are studied on a global scale [14,15,16], and it has seen increasing application to advanced analyses of biomedical data [17,18,19,20,21,22,23,24]. There are various cellular components in the human body that interact with each other within the same cell or across different cells [15]. A network called the human interactome can be constructed according to the [...] strengths of the pairwise disease correlations. After the backbone extraction of the WHDN, we design a centrality measure for the WHDN that considers not only the degree of a vertex but also the importance of its incident edges. Then we compare our new centrality measure with degree, closeness and betweenness by evaluating the decline rate of network efficiency when the top-ranked vertices of each centrality measure are removed. Finally, we present the top 30 diseases ranked by our centrality measure in our WHDN and discuss their biological implications.

Methods and results
Given the multi-step pipeline structure of this study, we show the result of each step after the description of the corresponding method. The source code of our analysis and the network files are accessible through the GitHub link: https://github.com/MIBlab-MUN/vertexcentrality-DILW.

Disease-gene associations (DGAs)
The data used in this project describe disease-gene associations (DGAs) from multiple curated databases including UNIPROT [44], CTD (human subset) [45], PsyGeNET [46], Orphanet [47], and HPO [48]. The disease-gene association data are collected by the DisGeNET group and are available in DisGeNET v4.0 [49]. The current version of the data set contains 130,821 DGAs between 13,075 diseases and 8,949 genes. Each DGA is assigned a score $a_i^j$, for disease i and gene j, in the range [0, 1] based on its level of evidence, the number and the type of database sources supporting the DGA, and the number of publications verifying the association between the gene and the disease [49]. We first clean up the data to ensure that all diseases and genes in the dataset are unique and that no disease-gene association is duplicated. Next, since we would like to consider the correlations among all diseases, we keep diseases and syndromes in the dataset for our analysis and remove injuries or poisonings, anatomical abnormalities, acquired abnormalities, mental or behavioral dysfunctions, signs or symptoms, findings, congenital abnormalities, neoplastic processes, and pathologic functions. We use the DisGeNET web-based application [49] for this filtering.
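The clean-up step can be illustrated with a minimal sketch. It assumes the DGAs are available as (disease, gene, score) records; the function name `clean_dgas`, the identifiers in the usage note, and the rule of keeping the highest score for a repeated pair are assumptions made for illustration and are not taken from the published pipeline.

```python
from typing import Dict, Iterable, Tuple


def clean_dgas(records: Iterable[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Collapse repeated (disease, gene) rows into a single DGA each.

    Assumption: when a pair appears more than once, the highest score is kept.
    """
    cleaned: Dict[Tuple[str, str], float] = {}
    for disease, gene, score in records:
        # DGA scores are reported in the range [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {disease}-{gene}: {score}")
        # Strip stray whitespace so that identical names map to the same key.
        key = (disease.strip(), gene.strip())
        cleaned[key] = max(score, cleaned.get(key, 0.0))
    return cleaned
```

For instance, two rows for the same hypothetical pair ("epilepsy", "GENE1") with scores 0.2 and 0.5 would be reduced to one association with score 0.5.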

Network construction
Bipartite disease-gene association network. The best representation for depicting the associations among genes and diseases is a bipartite graph, which is called the disease-gene association network in this research. The bipartite graph contains two different sets of vertices. One set includes diseases and the other contains genes. By definition, no edge is allowed to connect a pair of vertices in the same set in a bipartite graph. That is, there can be no link between a pair of diseases or between a pair of genes. There is an edge between a gene and a disease if there is an association between them. The link weight is assigned as the score $a_i^j$, for disease i and gene j, computed in the DGA database described in the previous section. A sample subgraph of the bipartite network is shown in Fig 1. Fig 2 depicts the degree distributions of diseases and genes in the bipartite disease-gene association network. For the set of diseases, the maximum degree is 564, for the disease epilepsy, and the average degree is 5.43. In Fig 2a), the degree distribution of the diseases is right-skewed and heavy-tailed, as indicated by the linear fit on a log-log scale. For the set of genes, the maximum degree is 111, for the gene LMNA, and the average degree is 5.81.
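A minimal sketch of this construction is given below, assuming the DGAs are stored as a mapping from (disease, gene) pairs to scores. The function names `build_bipartite` and `degree_summary` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_bipartite(dgas):
    """Build the two sides of the bipartite graph from {(disease, gene): score}.

    Edges only ever join a disease to a gene, so each side is stored as its
    own adjacency dictionary: disease -> {gene: score} and gene -> {disease: score}.
    """
    disease_adj = defaultdict(dict)
    gene_adj = defaultdict(dict)
    for (disease, gene), score in dgas.items():
        disease_adj[disease][gene] = score
        gene_adj[gene][disease] = score
    return dict(disease_adj), dict(gene_adj)


def degree_summary(adj):
    """Return (vertex of maximum degree, maximum degree, average degree)."""
    degrees = {v: len(nbrs) for v, nbrs in adj.items()}
    top = max(degrees, key=degrees.get)
    return top, degrees[top], sum(degrees.values()) / len(degrees)
```

Calling `degree_summary` on each side of the real network would yield the kind of summary reported above (maximum and average degree per vertex set).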
The bipartite network is composed of multiple connected components, with a single giant component.
Weighted human disease network (WHDN). We construct the WHDN using the giant connected component of the bipartite disease-gene network. We use D and G to denote the sets of diseases and genes, respectively [...]. The weight definition we adopt for the edges of the WHDN is inspired by Newman's study on scientific collaboration networks [14], where vertices are scientists and two scientists are connected by an unweighted edge if they have coauthored one or more scientific papers together. To define the strength of the tie between two connected scientists, two factors are considered. First, two scientists whose names appear on a paper together with many other coauthors know one another less well on average than two who are the sole authors of a paper. Thus, the collaborative ties are weighted inversely according to the number of coauthors of a paper. Second, authors who have written many papers together will know one another better on average than those who have written few papers together. Thus, all coauthored papers are added up to account for the tie strength of two scientists.
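Newman's two factors can be sketched in a few lines. The sketch assumes the commonly cited form of his weighting, in which a paper with n authors contributes 1/(n - 1) to the tie between each pair of its authors; this exact form is not spelled out in the text above, and the function name `newman_weights` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def newman_weights(papers):
    """Tie strengths for a collaboration network.

    papers: iterable of author lists, one list per paper.
    Returns {(a, b): weight} with a < b.
    Assumption: a paper with n authors adds 1 / (n - 1) to each author pair.
    """
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n = len(authors)
        if n < 2:
            continue  # single-author papers create no ties
        for a, b in combinations(authors, 2):
            # Factor 1: many coauthors dilute the tie (inverse weighting).
            # Factor 2: contributions add up over all shared papers.
            weights[(a, b)] += 1.0 / (n - 1)
    return dict(weights)
```

Two scientists who wrote two papers alone together and one paper with a third coauthor would receive a tie strength of 1 + 1 + 0.5 = 2.5 under this assumption.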
Here, similarly, we first consider that the correlation of two diseases through a gene is stronger when they are the sole diseases associated with this gene than when many other diseases are associated with the same gene. Second, the correlation of two diseases is considered stronger when they share more genes through stronger associations than when they share fewer genes or weaker associations. Thus, we extend Newman's method to weighted graphs and define the weight $w_{ij}$ of the edge between two diseases i and j as in Eq (1), where $\delta_i^g$ is one if disease i and gene g have a DGA, and zero otherwise, $a_i^g$ is the score of their DGA assessed by DisGeNET as discussed in the previous section, and $s_g$ is the strength of gene g as a vertex in the bipartite disease-gene network, defined as the sum of the scores of the DGAs between gene g and its directly linked diseases,

$$s_g = \sum_{i \in D} \delta_i^g \, a_i^g. \tag{2}$$

Such a weight definition indicates that the correlation strength of two diseases is weighted inversely according to the strengths of the genes they share, and is proportional to the total number of genes they share and the strengths of their DGAs. For example, in Fig 1, the weight between the diseases contact dermatitis (CD) and white sponge nevus 1 (WSN1) is calculated as follows: [...]. Note that the weight between two diseases can be greater than one when they share multiple genes. For example, the weight between WSN1 and hereditary mucosal leukokeratosis (HML) is calculated as follows: [...].
Since the WHDN is constructed using vertices from the giant component of the bipartite disease-gene association network, it has only a single connected component containing all 5,278 vertices in the disease set D. Two vertices have an edge connecting them if the two diseases they represent have at least one shared gene, and the edge weight is assessed as described above. The WHDN has 112,324 edges and an average vertex degree of 42.56. That is, a disease correlates with on average 42.56 other diseases with varying strengths.
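As a quick consistency check on the reported figures, the average degree of an undirected graph is twice the number of edges divided by the number of vertices. Using the edge count of 112,324 given in the Discussion and the 5,278 vertices of the WHDN:

```latex
\[
  \bar{k} \;=\; \frac{2\,|E|}{|V|} \;=\; \frac{2 \times 112{,}324}{5{,}278}
  \;=\; \frac{224{,}648}{5{,}278} \;\approx\; 42.56 .
\]
```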
Fig 4 depicts the distribution of all the edge weights in the WHDN. As we can see, a large number of edge weights are small and may not be particularly interesting for the subsequent analysis. Those weak edges not only add computational overhead to the network analysis, but also render the network difficult to interpret. Therefore, we next perform an edge reduction and extract only the most meaningful structure of the network.
The multi-scale backbone of WHDN. The most straightforward strategy for network reduction may be to use a global weight threshold and remove all links that have weights lower than the threshold. However, such a global thresholding strategy is somewhat arbitrary and may overlook the network information present below the cutoff scale. Here, to preserve the multi-scale backbone of the WHDN while removing less relevant and meaningful edges, we use a multi-scale filtering method proposed by Serrano et al. [50]. This multi-scale backbone extraction algorithm has been used to reduce the network size while preserving the meaningful structure of biological networks in multiple studies [34,51,52,53].
First, the weight of the edge linking vertex i with its neighbor j is normalized as

$$N_{ij} = \frac{w_{ij}}{s_i}, \tag{3}$$

where $s_i$ is the vertex strength, i.e., the sum of the weights of the edges incident to vertex i, similar to Eq (2) and defined as

$$s_i = \sum_{j \in \Gamma_i} w_{ij}, \tag{4}$$

where $\Gamma_i$ is the set of vertex i's neighbors. Therefore, there are two different normalized values for a link $e_{ij}$, namely $N_{ij}$ and $N_{ji}$, obtained by using the strengths of its two end vertices, $s_i$ and $s_j$, as the denominator.
Second, a null model is used to assess what would be expected if the weights of the links connecting to a particular vertex were distributed randomly. That is, the normalized weights $N_{ij}$ of the links connecting to a vertex of degree k are produced by a random assignment from a uniform distribution. Thus the probability density function for the variable taking a particular value x is

$$\rho(x)\,dx = (k-1)(1-x)^{k-2}\,dx. \tag{5}$$

Then, the probability $\beta_{ij}$ that the link weight $N_{ij}$ is compatible with the null model is compared against a threshold $\beta$,

$$\beta_{ij} = 1 - (k-1)\int_0^{N_{ij}} (1-x)^{k-2}\,dx < \beta. \tag{6}$$

All links with a computed $\beta_{ij}$ lower than the given threshold $\beta$ are preserved in the network. Note that each edge has two different values, $\beta_{ij}$ and $\beta_{ji}$. To resolve this, an OR rule or an AND rule can be used. Under the OR rule, the link is preserved if either $\beta_{ij}$ or $\beta_{ji}$ is lower than $\beta$. Under the AND rule, an edge is preserved only if both $\beta_{ij}$ and $\beta_{ji}$ are lower than $\beta$. Darabos et al. [51] empirically found that the AND rule preserves the network features better than the OR rule in the context of human phenotype networks. In this article, the AND rule is adopted to reduce the size of the network by removing the less relevant links.
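The filtering procedure can be sketched as follows. The sketch uses the closed form of the integral in the null model of Serrano et al., which evaluates to (1 - N)^(k - 1) for a normalized weight N at a vertex of degree k. The function name `disparity_backbone` and the handling of degree-one end vertices (they are not tested, so the decision rests on the other end) are assumptions made for illustration and may differ from the implementation used in this study.

```python
def disparity_backbone(adj, beta, rule="and"):
    """Return the set of edges (u, v), u < v, kept by the disparity filter.

    adj: symmetric weighted adjacency {u: {v: weight}} with orderable labels.
    beta: significance threshold; rule: "and" or "or".
    """
    # Vertex strength: sum of the weights of the incident edges.
    strength = {u: sum(nbrs.values()) for u, nbrs in adj.items()}

    def beta_value(u, v):
        k = len(adj[u])
        if k < 2:
            return None  # assumption: a degree-one end vertex is not tested
        n_uv = adj[u][v] / strength[u]      # normalized weight seen from u
        return (1.0 - n_uv) ** (k - 1)      # closed form of the null-model integral

    kept = set()
    for u, nbrs in adj.items():
        for v in nbrs:
            if not u < v:
                continue  # visit each undirected edge once
            values = [b for b in (beta_value(u, v), beta_value(v, u)) if b is not None]
            if not values:
                continue  # assumption: an edge with two degree-one ends is dropped
            passed = [b < beta for b in values]
            keep = all(passed) if rule == "and" else any(passed)
            if keep:
                kept.add((u, v))
    return kept
```

On a small graph the AND rule is visibly stricter than the OR rule: an edge that dominates the strength of both of its end vertices survives both, while an edge that is significant for only one end survives the OR rule alone.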
To find the best cutoff for β, we calculate the clustering coefficient, the percentage of remaining vertices and links, and the total weight of the network as a function of β in the range [0, 1]. Fig 5 shows these network metrics as a function of the β cutoff. We choose the β cutoff at which the clustering coefficient and the remaining vertices and weights are maximally preserved while as many links as possible are removed. Accordingly, the cutoff β = 0.501 is chosen, shown as the vertical dashed line in the figure.
After the backbone extraction, the WHDN has 4,898 vertices and 38,275 edges. Those vertices are no longer connected in a single component. Fig 6 shows the size distribution of its connected components. There is a giant component with 4,810 vertices, and its degree distribution is shown in Fig 7. Again, the degree distribution is heavy-tailed and resembles a power-law relationship. The vertex epilepsy has the highest degree of 576. This giant component is the focus of the next step of our analysis, i.e., measuring vertex importance in order to find the most central diseases in terms of correlating with other diseases.
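Extracting the giant component after filtering amounts to a standard connected-component search. A minimal sketch using breadth-first search is shown below; the function name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(vertices, edges):
    """Return the connected components as sets, largest first."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            u = queue.popleft()
            component.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

The first element of the returned list is the giant component, and the lengths of all elements give the component size distribution.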

Measuring vertex importance in WHDN
Although various vertex centrality measures have been proposed in the literature [37, 38, 40, 41, 54], the quantification of the importance of a vertex in a network is often context-specific. For some networks, measuring degree may suffice, since the number of neighbors of a vertex is the sole criterion for its importance. For other networks, e.g., information communication networks, a vertex may be considered more important if its distances to all other vertices are short, in which case closeness centrality serves the purpose well. For our WHDN, a disease is considered important if it correlates with many other diseases (degree) and if those correlations are themselves important (edge importance).
We propose a vertex importance measure for the WHDN by extending a centrality measure for unweighted networks proposed by Liu et al. [54]. This measure assesses the centrality of a vertex based on both its degree and the importance of its incident links (DIL centrality). We name its extension to weighted graphs the DIL-W centrality. First, in the context of an unweighted graph, the importance of a link $e_{ij}$ that connects vertices $v_i$ and $v_j$ can be calculated as follows:

$$I_{e_{ij}} = \frac{U_{e_{ij}}}{\lambda_{e_{ij}}}, \tag{7}$$

where $U_{e_{ij}} = (k_i - t - 1)(k_j - t - 1)$ and $\lambda_{e_{ij}} = \frac{t}{2} + 1$. Following convention, $k_i$ and $k_j$ are the degrees of vertices $v_i$ and $v_j$, respectively, and t is the number of triangles with one edge being $e_{ij}$.
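The unweighted link importance can be sketched directly from the definitions of U and λ, assuming that the importance is their ratio U/λ as in the DIL measure of Liu et al. The function name `edge_importance` is hypothetical.

```python
def edge_importance(adj, i, j):
    """Importance of the link between i and j in an unweighted graph.

    adj: {vertex: set of neighbours}, symmetric.
    Assumes importance = U / lambda with
    U = (k_i - t - 1)(k_j - t - 1) and lambda = t / 2 + 1.
    """
    if j not in adj[i]:
        raise KeyError(f"no edge between {i!r} and {j!r}")
    # Every common neighbour of i and j closes one triangle on the edge (i, j).
    t = len(adj[i] & adj[j])
    k_i, k_j = len(adj[i]), len(adj[j])
    u = (k_i - t - 1) * (k_j - t - 1)
    lam = t / 2 + 1
    return u / lam
```

A link that lies on many triangles has alternative two-hop routes between its end vertices, so its numerator shrinks and its denominator grows, whereas a bridge between two well-connected vertices scores highly.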
Subsequently, the contribution that vertex $v_i$ makes to the importance of $e_{ij}$ is computed as in Eq (8), where $j \in \Gamma_i$ and $\Gamma_i$ is the neighborhood of vertex i. Then, the DIL centrality of vertex $v_i$ is calculated by combining both its degree and the importance of its incident links, i.e., by adding its contributions over all of its incident links to its degree $k_i$ (Eq (9)). For weighted networks, we modify the computation of U in Eq (7) as in Eq (10), where $s_i$ is the strength of vertex $v_i$, calculated as in Eq (4), and $t_i$ is the weight sum of the links incident to vertex $v_i$ that form triangles with $e_{ij}$. This follows the intuition that, first, an edge is considered more important when its two end vertices have higher strengths. Second, the importance of an edge is reduced when there are alternative two-hop paths connecting the same pair of end vertices. Therefore, we subtract $t_i$ from $s_i$ in Eq (10). We define λ for weighted graphs in Eq (11). Finally, the importance of a vertex is measured by

$$\text{DIL-W}_i = s_i + \sum_{j \in \Gamma_i} C_{v_i v_j}, \tag{12}$$

where $C_{v_i v_j}$, the contribution of vertex $v_i$ to the importance of $e_{ij}$ in the weighted graph, is defined in Eq (13). Note that, if we remove the second component in the definition of DIL-W, the centrality measure simply becomes vertex strength, i.e., weighted degree.
In the weighted graph given in Fig 8, vertex a has [...]. We have $t_a = w_{ac} + w_{ag} = 0.3 + 0.6 = 0.9$ [...]. Similarly, we can compute the DIL-W centrality of vertex b, $\text{DIL-W}_b = 2.8916$. Therefore, based on both the degree and the importance of incident edges, vertex a is considered more important than vertex b.
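The triangle weight sum used in this example can be sketched as below. The sketch assumes that the link under consideration joins vertices a and b and that c and g are their common neighbours; only the weights 0.3 and 0.6 come from the example, and every other weight in the usage note, as well as the function name `triangle_weight_sum`, is hypothetical.

```python
def triangle_weight_sum(adj, i, j):
    """Weight sum of the links incident to i that form triangles with edge (i, j).

    adj: symmetric weighted adjacency {u: {v: weight}}.
    """
    if j not in adj[i]:
        raise KeyError(f"no edge between {i!r} and {j!r}")
    # A link (i, h) closes a triangle with (i, j) when h is also adjacent to j.
    common = set(adj[i]) & set(adj[j])
    return sum(adj[i][h] for h in common)
```

With a hypothetical graph in which a is linked to b, c, g and d, and b is linked to a, c and g, the call `triangle_weight_sum(adj, "a", "b")` adds the weights of the links a-c and a-g only, because d is not a neighbour of b.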
We apply the DIL-W centrality measurement to the giant component of the backbone of the WHDN; the distribution of the scores is shown in Fig 9. The DIL-W scores have a high dynamic range, from 0.0610 to 80688.1129. The majority of the vertices have low scores, and a small number of vertices have scores that are greater by orders of magnitude.

Comparison and evaluation
We compare our DIL-W measurement with the three most commonly used centralities, i.e., degree, closeness, and betweenness, when applied to the giant component of the backbone of the WHDN. For weighted graphs, degree centrality is calculated as the vertex strength given by Eq (4). Closeness and betweenness are shortest-path-based centralities. Shortest path computation can be extended to weighted graphs as follows,

$$d^w_{ij} = \min\left(\frac{1}{w_{ih}} + \cdots + \frac{1}{w_{hj}}\right). \tag{14}$$

Here $d^w_{ij}$ denotes the weighted distance between vertices i and j, and $w_{ih}$ is the weight of the edge linking vertices i and h, where h is an intermediate vertex on a path between vertices i and j. Since edge weight suggests strength in our WHDN, the distance between two vertices is the minimum sum of the inverse edge weights along a path connecting them. Once the weighted distance is defined, closeness and betweenness can be calculated by their original definitions. Fig 10 shows the correlation of DIL-W scores with a) degree, b) closeness, and c) betweenness centralities. As we can see, there is a positive correlation between the DIL-W measure and each of the other three vertex centrality measures. Spearman's rank correlation coefficient is 0.672 between DIL-W and closeness, 0.71 between DIL-W and betweenness, and 0.947 between DIL-W and degree.
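The weighted distance described above can be computed with Dijkstra's algorithm by using the inverse of each edge weight as the step cost. The following is a minimal sketch; the function name `weighted_distances` is hypothetical, and all weights are assumed to be strictly positive.

```python
import heapq


def weighted_distances(adj, source):
    """Distances from source, where crossing an edge of weight w costs 1 / w.

    adj: symmetric weighted adjacency {u: {v: weight}} with positive weights.
    Returns {vertex: distance} for every vertex reachable from source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for v, w in adj[u].items():
            nd = d + 1.0 / w  # strong correlations translate to short steps
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Under this convention, a two-step route over strong links can be shorter than a direct but weak link: with weights 0.5 and 1.0 on the route and 0.25 on the direct link, the route costs 2 + 1 = 3 while the direct link costs 4.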
To evaluate our new vertex importance quantification method, DIL-W, we measure the network efficiency before and after we remove the most important vertices in the WHDN. In the context of the WHDN, the network efficiency indicates the extent to which the original connectivity of the network is maintained. We calculate the decline rate of network efficiency after removing the m top-ranked vertices. The network efficiency [55] is computed based on the connectivity of a network. A higher connectivity suggests a higher network efficiency. The network efficiency is defined by

$$\eta = \frac{1}{n(n-1)} \sum_{i \neq j \in V} \frac{1}{d_{ij}}, \tag{15}$$

where n is the total number of vertices in the network, V is the vertex set, and $d_{ij}$ is the weighted distance between vertices $v_i$ and $v_j$. Thus, the decline rate of the network efficiency is calculated as

$$\mu = 1 - \frac{\eta}{\eta_0}, \tag{16}$$

where $\eta_0$ is the efficiency of the original network, and $\eta$ is the network efficiency after some vertices are removed. When a more important vertex is removed, we expect to see a greater decline rate of the network efficiency. Thus we can use μ as an indicator for the actual impact of removing a vertex [...]. Further removal of top-ranked vertices could be investigated but was not included in the current study given the high computational demand. As shown in the figure, we do not observe a monotonic relationship for any of the four centrality methods. However, the correlation analysis shows that our method, DIL-W, has a slightly stronger negative correlation between the decline rate and the rank of the removed vertex than the other three. Spearman's rank correlation coefficient, ρ, for degree, closeness, and betweenness is −0.1801, −0.0017, and −0.0679, respectively. In comparison, DIL-W has a negative correlation coefficient of −0.2698. We also consider removing all m top-ranked vertices at once to see how this cumulative removal affects the efficiency of the network. Fig 12 shows the decline rate of the network efficiency after removing all top m vertices ranked by the different centrality measures.
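The evaluation procedure can be sketched as follows. The sketch assumes that efficiency is the average of the inverse weighted distances over all ordered vertex pairs, that unreachable pairs contribute zero, and that removing a vertex means deleting its edges while keeping the vertex count n unchanged; these choices and the function names are assumptions for illustration and may differ from the implementation used in this study.

```python
import heapq


def _distances(adj, source):
    """Dijkstra with step cost 1 / weight; returns reachable vertices only."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + 1.0 / w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def efficiency(adj):
    """Average inverse distance over ordered pairs; unreachable pairs add 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for s in adj:
        for t, d in _distances(adj, s).items():
            if t != s:
                total += 1.0 / d
    return total / (n * (n - 1))


def isolate(adj, removed):
    """Delete all edges of the removed vertices (assumption: n is unchanged)."""
    removed = set(removed)
    return {
        u: {} if u in removed else {v: w for v, w in nbrs.items() if v not in removed}
        for u, nbrs in adj.items()
    }


def decline_rate(adj, removed):
    """Relative loss of efficiency after removing the given vertices."""
    eta0 = efficiency(adj)
    eta = efficiency(isolate(adj, removed))
    return 1.0 - eta / eta0
```

On a three-vertex path with unit weights, removing the middle vertex disconnects everything and gives a decline rate of 1, whereas removing an end vertex gives 0.6, which matches the expectation that a more important vertex causes a greater decline.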
The graph shows that the proposed method, DIL-W, has the highest decline rate of network efficiency for 57.5% of the data points, while betweenness, closeness, and degree have the highest decline rate for 27.5%, 10%, and 5% of the data points, respectively. This suggests that DIL-W is able to select a set of more important vertices compared with the other three centrality measures. As seen in Fig 12, the four methods are very comparable until the top 11 diseases are removed from the network. Then DIL-W has a significantly higher network efficiency decline rate than the rest. Betweenness centrality catches up around point 30 and becomes very comparable afterwards.
Since one main contribution of our study is to add edge weights to the HDN, we collect another set of results by computing vertex centralities without considering edge weights. That is, the network structure remains the same but the edges no longer carry weights, so the weighted DC, CC, and BC reduce to their original definitions for unweighted graphs, and DIL-W is replaced by the original DIL. The comparison is depicted in S1 Fig, which shows that excluding edge weights results in very similar vertex rankings by the various centrality measures and essentially no significant difference in the evaluation.
Table 1 shows the top 30 diseases ranked by our DIL-W method, their degrees, and their neighbors that have the strongest correlations (i.e., edge weights). References that support the known comorbidity of the disease pairs are also given.
In addition, we compare the top 30 diseases ranked by the different centrality measures (see Fig 13). The figure shows the top 30 diseases ranked by our proposed DIL-W (x-axis), as well as their rankings by the other three centrality measures. If a disease is not among the top 30 ranks of a centrality measure, the data point is shown as a zero on the x-axis. We see [...]

Discussion
In this article, we use a network-based analysis to identify important human diseases that share genetic background with many other diseases through strong associations. We collect a large number of known disease-gene associations (DGAs) using DisGeNET in order to construct a bipartite disease-gene network. Subsequently, a weighted human disease network (WHDN) is built by connecting pairs of diseases that share associated genes, with edge weights that reflect the number of genes they share as well as the strengths of the DGAs. Then we develop a new vertex centrality measure for the WHDN, degree and importance of link centrality (DIL-W), which considers both the degree of a vertex and the importance of its incident edges in weighted graphs. Our network-based analysis methods are shown to be able to identify more important diseases compared to degree, closeness and betweenness centralities. The identified disease-disease correlations include previous knowledge supported by published literature as well as less known and novel correlations that can be valuable for further studies. The contribution of our study is twofold: the construction of the WHDN and the measurement of vertex importance considering both the degree of a vertex and its edges. First, compared to the HDN (an unweighted graph) proposed by Goh et al. [33], the mechanism of including vertices and edges is the same, but we add the consideration of the confidence and strength of disease-disease correlations and add weights to the edges of the HDN. Such a WHDN allows us to prune the network using a disparity filter [50], which considerably reduces the complexity of the network by removing less significant edges (from 112,324 before to 38,275 after the backbone extraction), while preserving most of the vertices (from 5,278 to 4,898, respectively).
Second, we develop a new vertex centrality measure, DIL-W, for the WHDN, which quantifies the importance of a vertex by considering its degree and the aggregate importance of its attached edges, motivated by the idea that a disease should be considered important if it is correlated with many other diseases (i.e., its degree) and these correlations are themselves strong and significant (i.e., edge importance). DIL-W uses only local information of a vertex for its importance assessment, and its computational complexity is $O(|V|\,\bar{k}^2)$, where |V| is the total number of vertices and $\bar{k}$ is the average degree of the vertices in a network. Thus, DIL-W can be efficient to compute for large and sparse networks.
Upon application to the WHDN, DIL-W is shown to outperform three commonly used centrality measures, degree, closeness and betweenness, and has identified top diseases including epilepsy, anemia, and obesity. For each of the 30 top-ranked diseases, Table 1 shows its degree in the WHDN and its most correlated disease. We are also able to find previous publications that verify almost all the correlations of those pairs of diseases, shown as references in the table. Besides some very well-known correlations such as heart failure-obesity and diabetes-obesity, the table also reports some less known but interesting correlations. For instance, Savin [58] showed that atypical retinitis pigmentosa is correlated with obesity. Moreover, the correlation between anemia and pediatric failure to thrive had not been reported in the literature until recently, when Dimmock et al. [57] suggested anemia as one of the novel causes of failure to thrive in children. Zimmerman [61] studied the causes of different types of cirrhosis resulting from different drug-induced injuries. This supports our finding on the correlation between cirrhosis and chemical and drug induced liver injury.
The disease-gene associations in this study come from DisGeNET [49] only. While this is a valuable resource, it is merely one of the many databases that have disease-gene information (including Jensen Lab's DISEASES [80] and DiseaseConnect [81] databases), all of which have their own disease association scoring conventions. The alternative databases will be explored in our future study.
Another future direction we would like to explore is to apply our proposed centrality measure DIL-W to other networks and to test its utility. Centrality measures essentially tell us how important a vertex is in the context of a network structure, and this "importance" can take different meanings in various types of networks. For instance, in the Internet, vertices are physical routers, servers, and computers that are responsible for information transportation. Therefore, vertex importance should reflect how much a vertex controls the traffic flow and how much its removal influences that flow. We expect DIL-W to be useful for weighted networks in which vertices are considered important when they are connected with many others through strong relationships.
Our understanding of human diseases is still limited and the known disease-gene associations are far from complete. Future studies could explore the use of multiple types of data and more powerful computational tools to better cluster and categorize human diseases and to predict new genes and other factors that can explain diseases.