Network Properties of Complex Human Disease Genes Identified through Genome-Wide Association Studies

Background
Previous studies of network properties of human disease genes have mainly focused on monogenic diseases or cancers and have suffered from discovery bias. Here we investigated the network properties of complex disease genes identified by genome-wide association studies (GWAs), thereby eliminating discovery bias.

Principal findings
We derived a network of complex diseases (n = 54) and complex disease genes (n = 349) to explore the shared genetic architecture of complex diseases. We evaluated the centrality measures of complex disease genes in comparison with essential and monogenic disease genes in the human interactome. The complex disease network showed that diseases belonging to the same disease class do not always share common disease genes. A possible explanation could be that the variants with higher minor allele frequency and larger effect size identified using GWAs constitute disjoint parts of the allelic spectra of similar complex diseases. The complex disease gene network showed high modularity with the size of the largest component being smaller than expected from a randomized null-model. This is consistent with limited sharing of genes between diseases. Complex disease genes are less central than the essential and monogenic disease genes in the human interactome. Genes associated with the same disease, compared to genes associated with different diseases, more often tend to share a protein-protein interaction and a Gene Ontology Biological Process.

Conclusions
This indicates that network neighbors of known disease genes form an important class of candidates for identifying novel genes for the same disease.

not far from it; and since it is commonly used, and developing a new null-model would be an ambitious research project in itself, we employ it.
The way to sample this null-model is straightforward: starting from the real network, one chooses a pair of edges (i,j) and (i',j') and, provided this does not introduce a self-edge or a multiple edge, replaces them with (i,j') and (i',j). Throughout this procedure all vertices keep their original degrees, but the information about how they were connected is lost. When every edge has been rewired at least once, we have a sample of the null-model. Continuing the rewiring process until all edges have been rewired again gives another instance of the null-model. In our paper we repeat this process 1000 times to obtain 1000 samples of the null-model. The average value of a network measure over these samples is then the value we use for reference.
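The rewiring procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the paper: the function names (`degrees`, `rewire_sweep`, `null_model_average`), the random orientation of the second edge, and the attempt budget that stops a sweep when some edge cannot be rewired are all assumptions of the sketch. The input is assumed to be a simple undirected graph given as a list of node pairs.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def rewire_sweep(edges, rng, max_tries_per_edge=100):
    """One null-model sample: swap edge endpoints until every edge has been
    rewired at least once, or until the attempt budget (an assumption of
    this sketch) runs out. Degrees are preserved and no self-edges or
    multiple edges are created."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    pending = set(range(len(edges)))            # edges not yet rewired
    budget = max_tries_per_edge * len(edges)
    while pending and budget > 0:
        budget -= 1
        a = rng.choice(sorted(pending))
        b = rng.randrange(len(edges))
        if a == b:
            continue
        i, j = edges[a]
        i2, j2 = edges[b]
        if rng.random() < 0.5:                  # undirected: random orientation
            i2, j2 = j2, i2
        if i == j2 or i2 == j:                  # would create a self-edge
            continue
        new_a, new_b = frozenset((i, j2)), frozenset((i2, j))
        if new_a in present or new_b in present:  # would create a multiple edge
            continue
        present -= {frozenset((i, j)), frozenset((i2, j2))}
        present |= {new_a, new_b}
        edges[a], edges[b] = (i, j2), (i2, j)
        pending -= {a, b}
    return edges


def null_model_average(edges, measure, n_samples=1000, seed=0):
    """Average of `measure` over successive null-model samples; each sample
    continues the rewiring from the previous one."""
    rng = random.Random(seed)
    current, total = edges, 0.0
    for _ in range(n_samples):
        current = rewire_sweep(current, rng)
        total += measure(current)
    return total / n_samples
```

Any function that maps an edge list to a number can be passed as `measure`; the returned average then plays the role of the reference value described above.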

Assortativity
Is there a tendency for nodes of similar degree to connect to each other, or are large-degree nodes primarily connected to low-degree nodes? This question is answered by the assortativity (or assortative mixing) coefficient r [7], essentially the Pearson correlation coefficient of the degrees of the nodes at either side of an edge. The complication comes from the fact that Pearson's coefficient has a built-in directionality: it measures the correlation of one variable as a function of another, whereas in our question "one" is the same as "another". The solution is to consider each undirected edge as two directed edges pointing in opposite directions. A practical formula for computer measurements of r is
$$ r = \frac{4\langle k_1 k_2\rangle - \langle k_1 + k_2\rangle^2}{2\langle k_1^2 + k_2^2\rangle - \langle k_1 + k_2\rangle^2}, $$
where $\langle\ldots\rangle$ denotes an average over all the edges, $k_1$ is the degree of the first argument of the edge in the internal representation of the edge, and $k_2$ is the degree of the second argument.
The assortativity ranges from -1 to +1. Positive values mean a tendency for nodes of similar degree to connect to each other; negative values mean that large-degree nodes tend to attach to low-degree nodes. One thing worth noting is that r is heavily influenced by the degree sequence of the graph [8]. Under the rewiring null-model described above, r of a completely random graph is negative. Zero represents neutrality of r with respect to a null-model of random multigraphs constrained only by their sizes (the numbers of vertices and edges). An illustration of assortativity can be seen in Fig. 1.
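As a minimal sketch of how r can be measured from an edge list, the following Python function assumes the edge-average form of Pearson's coefficient, r = (4⟨k1 k2⟩ - ⟨k1 + k2⟩²) / (2⟨k1² + k2²⟩ - ⟨k1 + k2⟩²), where the averages run over the edges and k1, k2 are the degrees of the two end nodes. The function name `assortativity` and the edge-list representation are choices made for this illustration.

```python
from collections import Counter


def assortativity(edges):
    """Degree assortativity r of an undirected simple graph (edge list)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    # Averages over edges; each expression is symmetric in k1 and k2, so the
    # stored orientation of an edge does not matter.
    avg_prod = sum(deg[u] * deg[v] for u, v in edges) / m
    avg_sum = sum(deg[u] + deg[v] for u, v in edges) / m
    avg_sq = sum(deg[u] ** 2 + deg[v] ** 2 for u, v in edges) / m
    denominator = 2 * avg_sq - avg_sum ** 2
    if denominator == 0:
        raise ValueError("r is undefined when all nodes have the same degree")
    return (4 * avg_prod - avg_sum ** 2) / denominator
```

For a star graph, where one hub connects only to degree-one nodes, the function returns -1, the most disassortative value possible.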

Centrality measures
While assortativity characterizes the network as a whole, one can also use network measures to study individual nodes. Among the most fundamental network measures for nodes are those that try to capture a node's centrality [9]. There are several perspectives one can take in assessing a node's location between the center and the periphery. The simplest measure (one can discuss whether it really is a centrality measure) is the degree. Every node is at the center of its own neighborhood; a node of high degree has a large neighborhood and is thus at the center, myopically, of a relatively large part of the graph. The degree does not, however, take the whole network into account. The simplest global centrality measure, or rather anti-centrality measure (the smaller the value, the more central the vertex), is the eccentricity
$$ E(i) = \max_{j} d(i,j), $$
where d(i,j) is the distance between i and j (the number of edges in the shortest path between i and j). The eccentricity focuses on the maximum of a property and, like maxima in general, is sensitive to fluctuations (only one node needs to be added to a network to change the eccentricity by one unit). A more stable measure, arguably more relevant for the many biological processes that relate to average rather than extremal performance, is the closeness centrality
$$ C(i) = \frac{N-1}{\sum_{j \neq i} d(i,j)}, $$
where N is the number of vertices. Closeness centrality, the reciprocal of the average distance from one vertex to the rest of the graph, captures exactly what its name suggests. The node with the highest closeness centrality is the node that, on average, can be reached in the fewest steps from the other vertices.

Fig. 2. Illustration of modularity. The graph is divided into two clusters (red and blue). The modularity of this partition is 0.48, which is also the maximal modularity for this graph.
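Both distance-based centrality measures follow directly from a breadth-first search. The sketch below is illustrative only: it assumes a connected, unweighted, undirected graph with at least two nodes, takes eccentricity as the largest distance from a node and closeness as the reciprocal of the average distance to all other nodes, and uses helper names (`adjacency`, `distances_from`) invented for this example.

```python
from collections import defaultdict, deque


def adjacency(edges):
    """Adjacency sets of an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def distances_from(adj, source):
    """Shortest-path distances (in edges) from `source`, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(adj, i):
    """Largest distance from i to any other node (smaller means more central)."""
    return max(distances_from(adj, i).values())


def closeness(adj, i):
    """Reciprocal of the average distance from i to the other nodes."""
    dist = distances_from(adj, i)
    return (len(dist) - 1) / sum(dist.values())
```

On a path of five nodes the middle node has eccentricity 2 and closeness 4/6, while an end node has eccentricity 4 and closeness 4/10, so both measures rank the middle node as more central.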

Network modularity
A network cluster is a region of the network that is strongly connected within and relatively sparsely connected to the rest of the network. Such clusters are interesting in biology because of their similarity to the notion of a biological module: a relatively independent subsystem performing some biologically well-defined function. By this analogy we also call the clusters network modules, while admitting that the biological concept has a stronger focus on dynamic processes than its network counterpart. The common way of measuring how well a subdivision of a network into clusters captures the modular structure of the network is the network modularity [10]
$$ Q = \sum_i \left[ e_{ii} - \left( \sum_j e_{ij} \right)^2 \right], $$
where $e_{ij}$ is the fraction of the edges going between modules i and j. The first term gives a positive contribution from edges within a cluster; the second term penalizes edges between different clusters (the form is chosen so that the expected Q-value in a random multigraph is zero). One method to divide a graph into clusters is to find the cluster division that maximizes Q. This turns out to be a very hard computational problem, and a large body of literature has been devoted to finding approximate solutions. We use the method proposed in Ref. 11.
Q', the maximal Q-value over all partitions of the graph, is a prototype measure of the modularity of a network. A clear cluster structure should give a large Q'-value. Just as assortativity takes a non-zero value for simple graph models, especially those with broad degree distributions, so does Q' [12]. It is thus important to interpret Q' in comparison with a null-model.
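To make the definition concrete, the following sketch evaluates Q for one given partition; it does not search for the maximizing partition and is not the method of Ref. 11. It assumes the common form Q = Σ_i [e_ii - (Σ_j e_ij)²], with an edge between two different modules counted half in each direction so that Σ_j e_ij is the fraction of edge ends attached to module i. The function name and the `cluster_of` mapping from node to cluster label are hypothetical.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity Q of the partition `cluster_of` (node -> cluster label)."""
    m = len(edges)
    within = defaultdict(float)   # e_ii: fraction of edges inside cluster i
    ends = defaultdict(float)     # sum_j e_ij: fraction of edge ends in cluster i
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            within[cu] += 1 / m
        ends[cu] += 0.5 / m
        ends[cv] += 0.5 / m
    return sum(within[c] - ends[c] ** 2 for c in ends)
```

For two triangles joined by a single edge, with each triangle as one cluster, this gives Q = 5/14; placing all nodes in one cluster gives Q = 0.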

Overlap between disease categories and network clusters
How are the disease genes clustered in the protein-protein interaction network? Are the disease genes of one category spread out randomly between the network clusters, or do the network clusters and disease categories divide the network in the same way? To answer these questions, we calculate an overlap score defined in Ref. 13. Let $\phi_{\Delta\Lambda}(\delta,\lambda)$ be the fraction of nodes associated with disease type $\delta$ ($\Delta$ is the set of disease classes) in network cluster $\lambda$ ($\Lambda$ is the set of network clusters). In a network where the genes of the same type of disease are grouped in the same network clusters, $\phi_{\Delta\Lambda}$ will be either zero or relatively large, i.e. it will deviate much from its expected value, $\phi_\Delta(\delta)\,\phi_\Lambda(\lambda)$. From this observation we get the following overlap measure [13]
$$ \nu = \sum_{\delta \in \Delta} \sum_{\lambda \in \Lambda} \left| \phi_{\Delta\Lambda}(\delta,\lambda) - \phi_\Delta(\delta)\,\phi_\Lambda(\lambda) \right|, $$
which increases if diseases of the same type are increasingly often located in the same network clusters. In an infinite random system without the overlap we are interested in, $\nu$ will be zero. In finite systems, however, due to the absolute values, fluctuations make the expectation value of $\nu$ positive even for systems with no correlation between network clusters and disease types. To get around this problem we instead measure the z-score (the deviation from the mean divided by the standard deviation) with respect to a randomized reference model whose only constraints are that the same disease type cannot be assigned to the same node twice and that every node belongs to one and only one network cluster. For our data set of human genes and their annotated diseases we measure a z-score of 3.9 ± 0.1.
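The overlap score and its z-score can be sketched as follows. Several details here are assumptions rather than the definitions of Ref. 13: the score is taken as the sum over disease types and clusters of |ϕ(δ,λ) - ϕ(δ)ϕ(λ)| with no further normalization, the fractions are computed over (gene, disease type) associations, and the reference model keeps the clusters fixed and assigns each disease type to the same number of distinct genes drawn uniformly from all clustered genes. The names `overlap`, `overlap_zscore`, `types_of` and `cluster_of` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev


def overlap(types_of, cluster_of):
    """Sum over (type, cluster) of |phi_joint - phi_type * phi_cluster|.

    types_of: gene -> set of disease types; cluster_of: gene -> cluster label.
    """
    pairs = [(t, cluster_of[g]) for g, ts in types_of.items() for t in ts]
    n = len(pairs)
    joint = Counter(pairs)
    by_type = Counter(t for t, _ in pairs)
    by_cluster = Counter(c for _, c in pairs)
    return sum(abs(joint[(t, c)] / n - by_type[t] * by_cluster[c] / n ** 2)
               for t in by_type for c in by_cluster)


def overlap_zscore(types_of, cluster_of, n_samples=1000, seed=0):
    """z-score of the observed overlap against the assumed reference model.

    Raises ZeroDivisionError if all reference samples give the same score.
    """
    rng = random.Random(seed)
    genes = sorted(cluster_of)
    counts = Counter(t for ts in types_of.values() for t in ts)
    null = []
    for _ in range(n_samples):
        shuffled = {g: set() for g in genes}
        for t, k in counts.items():
            # Sampling without replacement: a type is never assigned to the
            # same gene twice; every gene keeps its single cluster.
            for g in rng.sample(genes, k):
                shuffled[g].add(t)
        null.append(overlap(shuffled, cluster_of))
    return (overlap(types_of, cluster_of) - mean(null)) / stdev(null)
```

When disease types and clusters coincide perfectly the score is large and the z-score positive; when every type is spread evenly over the clusters the score is zero.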