HetIG-PreDiG: A Heterogeneous Integrated Graph Model for Predicting Human Disease Genes based on gene expression

Kathleen M. Jagodnik; Yael Shvili; Alon Bartal

doi:10.1371/journal.pone.0280839

Abstract

Graph analytical approaches permit identifying novel genes involved in complex diseases, but are limited by (i) inferring structural network similarity of connected gene nodes, ignoring potentially relevant unconnected nodes; (ii) using homogeneous graphs, missing gene-disease associations’ complexity; (iii) relying on disease/gene-phenotype associations’ similarities, involving highly incomplete data; (iv) using binary classification, with gene-disease edges as positive training samples, and non-associated gene and disease nodes as negative samples that may include currently unknown disease genes; or (v) reporting predicted novel associations without systematically evaluating their accuracy. Addressing these limitations, we develop the Heterogeneous Integrated Graph for Predicting Disease Genes (HetIG-PreDiG) model that includes gene-gene, gene-disease, and gene-tissue associations. We predict novel disease genes using low-dimensional representation of nodes accounting for network structure, and extending beyond network structure using the developed Gene-Disease Prioritization Score (GDPS) reflecting the degree of gene-disease association via gene co-expression data. For negative training samples, we select non-associated gene and disease nodes with lower GDPS that are less likely to be affiliated. We evaluate the developed model’s success in predicting novel disease genes by analyzing the prediction probabilities of gene-disease associations. HetIG-PreDiG successfully predicts (Micro-F1 = 0.95) gene-disease associations, outperforming baseline models, and is validated using published literature, thus advancing our understanding of complex genetic diseases.

Citation: Jagodnik KM, Shvili Y, Bartal A (2023) HetIG-PreDiG: A Heterogeneous Integrated Graph Model for Predicting Human Disease Genes based on gene expression. PLoS ONE 18(2): e0280839. https://doi.org/10.1371/journal.pone.0280839

Editor: Attila Gursoy, Koc Universitesi, TURKEY

Received: June 12, 2022; Accepted: January 10, 2023; Published: February 15, 2023

Copyright: © 2023 Jagodnik et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Source code and preprocessed datasets are available at: https://github.com/bartala/disease_gene.

Funding: K.M.J. was supported by a Mortimer B. Zuckerman STEM Leadership Program post-doctoral fellowship in the School of Business Administration at Bar-Ilan University and in the Departments of Psychiatry at Harvard Medical School and Massachusetts General Hospital. We thank Bar-Ilan University’s Data Science Institute (DSI) for partially supporting this research.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Understanding the complex biological phenomena involved in human diseases is essential for developing new preventive and therapeutic strategies [1]. Since authoritative sets of genetic associations for many diseases are unknown [1], and the experimentation necessary to validate these associations is costly and time consuming, researchers have developed computational methods, including machine learning (ML) models to discover gene-disease associations [2]. Analyzing biological data using graphs can identify complex interactions among entities (e.g., genes and diseases), and it facilitates the detection of variations (e.g., genetic mutations) via structural changes in the graph [3–5].

Some disease gene prediction models [6] assume that genes associated with biologically similar diseases have similar graph structures. Those models miss potentially relevant nodes beyond the local neighborhood of a node. Network diffusion models extend beyond the local neighborhood of known disease genes by walking over the edges of a biological graph [7–11]. For example, the MEXCOwalk algorithm [11] performs an edge-weighted random walk on a graph to identify cancer gene modules, and the HotNet2 algorithm [8, 9] employs a directed network diffusion model to assess the significance of mutations in genes, and the local topology of interactions among encoded proteins, to identify mutated subnetworks in a genome-scale interaction network. However, network diffusion models that apply walks over the graph miss potentially relevant nodes that are unconnected (e.g., isolated components or nodes) due to, e.g., missing or unknown data. For example, Del Sol et al. [12] reported that complete miRNA networks accurately represent healthy tissues, whereas cancer tissues are characterized by disjointed, disconnected sub-networks. Additionally, the lack of data about gene–gene associations is often the source of the “missing heritability” problem in which known interactions can explain only a small portion of a disease [13]. Graph-based models that assume that genes with high phenotypic similarity associate with the same disease [14] also rely on highly incomplete data that can lead to poor model performance. In addition, many models for predicting gene-disease associations construct a homogeneous graph based on a single type of data such as protein-protein interactions (PPI) [15, 16]. However, using a single type of data ignores the complexity inherent in gene-disease associations [17, 18]. For example, combining PPI with tissue-specific data is important for predicting disease genes [5].

In recent years, researchers have represented nodes as numeric vectors (embeddings) in a low-dimensional space while preserving node and graph topological similarity using neural networks [19]. These embedding vectors allow ML methods to predict disease genes in graphs [20], among other tasks. Automatic feature learning from graphs [19, 21–24] has been widely studied. Graph embeddings were successfully used in capturing the biological structures of proteins [25]; reducing data noise in graphs [26] by using tasks such as node classification, link prediction, and clustering [21]; and detecting drug-drug side-effects [27].

Node embedding models for identifying disease genes are limited by relying on gene-phenotype associations, which are highly incomplete in humans and other organisms [28]. In addition, data about gene-disease associations is often limited since these complex relationships are rare, and are usually not observed in small clinical trials, preventing ML models from learning these associations [18]. Moreover, ML models for predicting disease genes are typically approached using a binary classification of gene and disease association, by selecting a sample of (i) gene-disease edges as positive training samples, and (ii) non-associated gene and disease nodes (representing non-existing edges) as negative examples that might contain unknown disease genes. Training an ML model on those negative samples may result in poor model performance.

Another limitation of most gene-disease prediction models involves assuming that accurate performance on the test set leads to accurate predictions of novel gene-disease associations. However, predicting novel disease genes requires considering all potential associations between candidate genes and a disease, beyond selected gene-disease associations in the test set. The accuracy of past models when considering all candidate genes is typically supported by manually examined literature without systematic evaluation. Consequently, given a specific disease, we do not know whether the predictions of novel genes by past models are accurate enough to be validated in wet-lab experiments, even though those models perform well on a test set.

Few studies consider, in a single model, knowledge from both local network neighbors and non-neighbors; incorporate rich knowledge from several biological domains; and address the problem of sampling negative edges for link prediction. This gap raises the need for new models that enable better prediction of disease genes. We address these limitations by proposing a Heterogeneous Integrated Graph Model for Predicting Disease Genes (HetIG-PreDiG) with gene prioritization based on gene expression. HetIG-PreDiG (pronounced “HET-ih-jee PRED-ih-jee”) detects human gene-disease associations by integrating data about gene expression in different tissues, and gene-gene, gene-disease, and gene-tissue associations into a heterogeneous graph. Using the node2vec algorithm [21], HetIG-PreDiG accounts for graph structure by learning low-dimensional embeddings of nodes. To extend beyond graph structure, a Gene-Disease Prioritization Score (GDPS) is developed. This GDPS reflects the degree of association of a gene with a disease based on co-expression similarity across multiple tissues. Node embeddings and the GDPS are input as features to a classifier that predicts gene-disease edges. To train the classifier, we randomly select gene-disease edges as positive samples, and non-associated gene and disease nodes with lower GDPS as negative training samples, thus lowering the risk of including disease genes that exist biologically but have not yet been reported. The results show that a model that considers both network structure and GDPS outperforms the baseline models. Finally, we provide a method to systematically evaluate the developed model’s success in predicting novel disease genes by classifying a disease, based on its candidate genes’ prediction probabilities, into one of three success-level groups.

We make five novel contributions to improve the identification of human gene-disease associations: 1) developing our Gene-Disease Prioritization Score (GDPS) based on data of gene co-expression similarity. Whereas most models for predicting disease genes use gene expression to identify genes having expression most strongly associated with a disease, our model uses it together with node embeddings for learning the degree of association between any gene and a disease; 2) considering network structure using graph representation learning, and extending beyond network structure by accounting for biological similarity between unconnected gene and disease nodes using GDPS; 3) offering a solution to the problem of randomly selecting as negative training samples non-associated gene and disease nodes (reflecting non-existing edges) that might actually represent yet-unknown biological associations, by our method of favoring non-associating gene and disease pairs having lower GDPS values; 4) capturing rich biological knowledge in a single heterogeneous graph-based model, the utility of which for predicting gene-disease associations is validated by producing better results compared with baseline models, and via literature analysis; and 5) providing a new method to systematically evaluate the developed model’s success level in predicting novel genes for a given disease, as the accuracy of past models that consider all candidate genes is typically supported via literature analysis, but these results are often not systematically evaluated as in the current work.

1.1 Organization

Section 2 provides a detailed overview of existing methods for predicting disease genes, and the shortcomings of those methods. Section 3 describes the HetIG-PreDiG model for predicting human disease genes, and the datasets used for learning and predicting gene-disease associations. Section 4 details the analyses performed with the developed model. Section 5 discusses the results of comparing the proposed model with baseline models, empirically evaluating the proposed model, and demonstrating its usefulness via supporting literature analysis. Section 6 discusses the strengths and limitations of this work, and interprets aspects of our results. Finally, Section 7 summarizes the contributions of this study and describes potential additional applications of our model.

2 Related work

2.1 Biomedical data

Biomedical data is often high-dimensional, incomplete, and biased due to, e.g., physical measurement limitations and technological constraints [14, 26, 29]. To better understand complex biomedical phenomena such as diseases, an effective model must incorporate diverse biomedical datasets from different domains [27]. ARCHS4 [30] is a web resource that provides co-expression similarity matrices of human and mouse genes, based on RNA-seq data processed from Gene Expression Omnibus (GEO) [31]. This data can be used to detect biological functions such as gene-disease associations [30]. For example, Lachmann et al. [30] found that genes with highly correlated expression tend to share biological functions. Moreover, the authors were able to predict gene function using the extensive expression data available from ARCHS4. Other biological data sources with relevance for understanding human diseases and their treatment include the DisGeNET database [32], which provides data on Mendelian, complex, and environmental human diseases. Additionally, the Human Protein Atlas [33] serves as a map of the human proteome, providing tissue-specific gene expression data that can be used to elucidate the mechanisms of disease [34].

Building models using incomplete data can cause them to perform poorly given new data. Thus, predicting gene-disease associations for a genetic disease requires analyzing genes that are associated with the disease and genes that are likely to associate with the disease, as well as their interactions in diverse biological functions [35].

2.2 Predicting gene-disease associations

The genome-wide association study (GWAS) [36] is a widely used approach that analyzes single nucleotide polymorphisms (genetic variations) among humans for predicting new disease genes. However, predicting gene-disease associations via laboratory experiments and statistical analyses is time consuming, and often results in a large number of candidate genes with multiple false positives [37]. Moreover, GWAS mainly focuses on gene-phenotype associations, excluding the functions of biological molecules that act via complex pathways [37]. To address this gap, researchers have developed computational approaches such as networks for predicting gene-disease associations. Analyzing biological data using graphs can identify complex interactions among entities (e.g., genes and diseases), and it facilitates the detection of variations (e.g., genetic mutations) via structural changes in the graph [3–5].

Different types of graphs have been exploited for predicting disease genes [17], including homogeneous [3, 16, 38], heterogeneous [39], and multiplex graphs [40]. A homogeneous graph includes nodes and edges each of a single type, a heterogeneous graph has different types of nodes and edges, and a multiplex graph is a collection of graphs with the same set of nodes and different types of edges.

Two network-based approaches are commonly used for predicting disease genes. Node classification learns features of known disease genes to predict disease labels for genes whose disease associations are not yet known. Link prediction learns known gene-disease associations to predict novel gene-disease links. These two approaches can be implemented using three categories of methods: 1) network diffusion, 2) supervised ML methods in which features for diseases and genes are first extracted and then input to ML models such as Support Vector Machines (SVMs) for predicting gene-disease associations, and 3) graph representation learning.

The next subsections describe each of these methods.

2.3 Network diffusion methods

Most network-based methods assume that genes associated with biologically similar diseases have similar network structures [6, 41]. Some network methods for predicting disease genes consider only the local neighborhood of a node [3, 10, 38, 42], thus missing biological information at greater distances on the network. This limitation is partially resolved by network diffusion models that start from known disease genes and diffuse to other nodes via walks over the edges of the biological network. For example, the Random Walk with Restart (RWR) algorithm [43] performs a random walk on a graph with a restart probability r to return to any seed node at each iteration. It explores the neighborhood of seed nodes to study their functions, under the assumption that nodes related to similar functions are closer in the network. Adopting RWR, PRINCE [44] expands RWR to a weighted PPI network, and VAVIEN [7] prioritizes candidate disease genes based on the topological similarity of proteins that is calculated using RWR to perform random walks on a PPI network. RWR has been widely used in PPI networks to detect novel disease genes. For example, ORIENT [45] uses RWR to detect novel disease genes in a weighted PPI network such that genes closer to known disease genes receive a higher prioritization score. The DP-LCC model [46] also detects novel disease genes using RWR on a PPI network and a phenotype similarity network.
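The RWR iteration described above can be illustrated with a minimal sketch. It assumes an unweighted, undirected toy graph stored as an adjacency dictionary; the function name `random_walk_with_restart`, the value r = 0.3, and the toy gene names are illustrative assumptions and are not taken from [43] or from the cited tools.

```python
def random_walk_with_restart(adj, seeds, r=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the scores stop changing.

    adj   : dict mapping node -> list of neighbor nodes (undirected graph)
    seeds : set of seed nodes (e.g., known disease genes)
    r     : restart probability (0.3 is an arbitrary illustrative value)
    """
    nodes = sorted(adj)
    # The walker restarts uniformly among the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: r * p0[n] for n in nodes}
        for u in nodes:
            if not adj[u]:
                continue  # isolated node: no edges to walk over
            share = (1.0 - r) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        diff = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy graph (hypothetical): g1 is the only seed; g5 has no edges.
adj = {
    "g1": ["g2", "g3"],
    "g2": ["g1", "g3"],
    "g3": ["g1", "g2", "g4"],
    "g4": ["g3"],
    "g5": [],
}
scores = random_walk_with_restart(adj, seeds={"g1"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the isolated node g5 can never receive a nonzero score, because the walker only moves along edges.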

Heterogeneous networks of gene-disease associations, disease-disease similarities, and protein–protein interactions have also been employed for predicting disease genes. For example, the method in [39] detects disease genes in heterogeneous networks using diffusion and node classification. Other examples include the RWRH model [47] that extends RWR on a heterogeneous phenotype-gene network. RWPCN [48] predicts disease genes on a heterogeneous network of phenotypes, genes, and proteins. CIPHER [49] predicts unknown disease genes in a heterogeneous network using phenotype similarity and gene proximity. The CATAPULT model [50] predicts gene-phenotype associations by vectors generated using walks on the heterogeneous network of gene-gene and gene-phenotype associations. BiRW [51] performs RWR on a heterogeneous network of phenotypes and genes. RWRMH [40] conducts RWR on a multiplex heterogeneous network of PPIs and disease associations based on phenotype similarities. Zeng et al. [52] proposed a latent factor method with heterogeneous similarity regularization to predict unknown gene-disease associations.

While network diffusion models utilize network structure to capture biological information beyond local neighborhood, they miss global information beyond network structure, involving unconnected nodes. Networks represent existing datasets, reflecting only known (often incomplete, noisy, and biased) data collected so far [4]. Hence, the information of, e.g., two unconnected proteins in a PPI network that might share the same biological pathway [15] is ignored in diffusion models. On the other hand, gene expression data can be used to calculate similarities among nodes representing biological entities [30] and is not limited by node connectivity. However, it ignores the structure of the network.

Some studies utilized gene expression data to detect disease genes, such as Hu and Agarwal [53], who begin by identifying the top genes having expression most strongly associated with each disease. Then, they perform enrichment analyses to find significant overlaps between these top genes and diseases. Another example is the DiseaseConnect web server [54] that utilizes gene expression profiles, gene-disease associations, and GWAS data to detect novel gene-disease associations.

Recently, biological network-based models have represented nodes using feature vectors of structural network properties, such as average distance to disease genes, and structural similarity with disease genes [17, 55]. These vectors are used for training supervised ML models such as Logistic Regression (LR) and SVM to classify genes as associated with a disease or not [50, 56].

2.4 Supervised machine learning methods

Current ML methods that are applied to biological graphs typically represent genes and proteins using feature vectors of structural network properties (e.g., degree) [55]. ML models that use PPI networks to predict unknown disease genes include, e.g., the gene ranking model described in [57] that prioritizes candidate genes using network analysis of their differential expression. Relying on network structure, it assumes that candidate disease genes are neighbors of highly differentially expressed genes. More recently, BRIDGE [58] was developed to prioritize disease genes by applying Lasso Regression to a variety of biological resources, including PPI, protein sequence, gene expression, pathway, and gene ontology data. Similarly, the IMRF [59] algorithm also utilizes diverse biological data such as PPI networks to rank disease genes by improving the Markov Random Field method. Focusing on gene-disease association data, the Know-GENE algorithm [60] prioritizes candidate genes associated with a disease by calculating gene-gene similarity using gene co-occurrence. The authors recommend [60] considering gene expression data in future work to detect genes without known disease associations, as done in the current study. Representing diverse biological data as heterogeneous graphs, Metagraph+ [61] predicts disease genes by analyzing a heterogeneous graph of PPI and gene keywords. Using gene ontology similarities, the dgMDL algorithm [62] predicts gene-disease associations in a heterogeneous PPI and gene graph. The Disjunctive Graph Integration model [63] predicts novel disease genes by applying SVM to features of a heterogeneous graph of gene co-expression, pathways, functional links, phenotype similarity, and PPI.

Data sources other than PPI networks, such as disease-phenotype associations, gene ontology annotations, and tissue-specific networks, have also been utilized to predict disease genes [10, 17, 42]. The use of tissue-specific gene expression data is critical, because diseases are typically associated with a specific tissue [64]. For example, NetWAS [65] analyzes a network of genes and tissue expression data to identify disease associations.

Some of the reviewed ML models in the current section require the handcrafted generation of graph features (e.g., distance between a gene and a disease) for training a model to classify genes as likely to be associated with a disease or not. Handcrafted feature generation is time consuming and requires domain knowledge. In contrast, graph representation learning methods [19, 21, 22] automatically learn graph features, as discussed next.

2.5 Graph representation learning

Automatic feature learning from graphs [17, 19, 21–24] has been widely studied using methods such as matrix factorization and graph embeddings.

Matrix factorization methods are used for predicting previously unknown gene-disease edges. For example, the PCFM algorithm [24] uncovers hidden factors for genes and diseases from a gene-disease association matrix using a probability-based collaborative filtering model to predict disease genes. Manifold learning [66] utilizes a gene-disease association matrix to learn latent factors of genes and diseases, following the assumption that disease genes are closely located on the graph. Medusa [67] analyzes 16 heterogeneous graphs as matrices to establish connections between non-neighboring nodes in each graph. GeneHound [68] first integrates data including literature-based phenotype and gene information. Then, it performs Bayesian matrix factorization to uncover latent factors for genes and diseases to predict new gene-disease associations.

Graph embedding methods represent nodes as numerical vectors in a low-dimensional space while preserving node and graph topological similarity using neural networks [19]. The goal of graph embedding methods is to capture the topological information of nodes and edges. Graph embeddings were successfully used in capturing the biological structures of proteins [25] and reducing data noise in graphs [26] by using tasks such as node classification, link prediction, and clustering [21].

Examples of node embedding algorithms include the SkipGram algorithm [69] that constructs associations between a node and its neighbors via random walks. DeepWalk [23] expands SkipGram to perform random walks on a graph by treating nodes as words. It was used to learn node embeddings in biological graphs for tasks such as predicting drug-target associations [70] and protein function [71]. SmuDGE [28] expands SkipGram to predict novel disease genes by combining disease–phenotype and gene–phenotype associations to generate a corpus for SkipGram-based representation learning. Then, it predicts gene–disease associations using a neural network. Building upon SkipGram, HeteWalk [20] constructs a weighted heterogeneous network by joining six public data sources including PPI, miRNA similarity network, and disease phenotype similarity network, and then performs SkipGram-based network embedding. The HIN2Vec algorithm [72] generates node embeddings for heterogeneous networks based on random walks using a three-layer neural network model, but it samples only short paths, making it inefficient for large graphs [73].

The node2vec algorithm [21] finds an embedding function such that the conditional probability of observing the neighbors of a node is maximized. It extends DeepWalk, but employs more sophisticated random walks using four parameters to select the next visited nodes: 1) the number of random walks from each node, 2) the walk length, 3) P, the return parameter, which controls the likelihood of returning to the previously visited node, and 4) Q, the in-out parameter, which controls the likelihood of exploring undiscovered nodes. The node2vec algorithm is widely used for generating node embeddings, and it presents superior performance in node classification tasks on biological networks [74]. Leveraging node2vec, several biological studies combine node2vec embeddings with other features. For example, the N2VKO algorithm [75] integrates node2vec embeddings extracted from a PPI network with biological annotations for gene-disease association prediction.
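To make the roles of P and Q concrete, the following sketch generates a single biased walk. It follows the second-order search bias of node2vec [21] as we understand it: P and Q enter as unnormalized step weights (1/P for stepping back to the previous node, 1 for nodes adjacent to the previous node, 1/Q for nodes farther away). The function name `node2vec_walk`, the toy graph, and the parameter values are illustrative assumptions; this is not the reference implementation, which also precomputes transition tables for efficiency.

```python
import random


def node2vec_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one biased walk on an unweighted graph.

    adj : dict mapping node -> set of neighbor nodes
    p   : return parameter (small p makes stepping back more likely)
    q   : in-out parameter (small q favors moving away from the previous node)
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: the walk stops early
        if len(walk) == 1:
            # First step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # distance 0 from prev: return
            elif x in adj[prev]:
                weights.append(1.0)       # distance 1 from prev: stay local
            else:
                weights.append(1.0 / q)   # distance 2 from prev: explore
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
walk = node2vec_walk(adj, "a", walk_length=8, p=0.5, q=2.0)
```

Repeating such walks from every node and feeding them to a SkipGram model yields the node embeddings; that training step is omitted here.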

In network-based models, the structure of the network must accurately represent biological knowledge such as gene-disease associations; otherwise, feature learning will be harmed. ML models for predicting unknown disease genes might be biased because of missing gene-disease edges in the graph, due to, e.g., data that has not yet been collected (unknown). Missing edges also cause those models to ignore global information from unconnected nodes since most network-based gene-disease prediction models are limited to local node-to-node propagation. In addition, to train an ML classifier for the task of link prediction, positive (existing gene-disease edges) and negative (non-existing gene-disease edges) examples are needed. Whereas sampling positive edges from a graph is straightforward, sampling negative edges involves sampling a pair of an unconnected gene and disease. Such pairs might be biologically associated but not yet known, thus falsely used as a negative example, leading to poor model performance.

Few studies consider, in a single model, knowledge from both local network neighbors and non-neighbors; incorporate rich knowledge from several biological domains; and address the problem of sampling negative edges for link prediction. This gap raises the need for new models that enable better prediction of disease genes. We address these limitations by developing the Heterogeneous Integrated Graph Model for Predicting Disease Genes in humans (HetIG-PreDiG) to improve the prediction of disease genes, as detailed next.

3 Materials and Methods

3.1 HetIG-PreDiG model to predict gene-disease associations

The following five steps describe the construction of the developed Heterogeneous Integrated Graph Model for Predicting Disease Genes in humans (HetIG-PreDiG).

Step 1—Construct a heterogeneous graph using two types of data: (i) gene-disease associations, and (ii) gene-tissue interactions.

Using the first data type, gene-disease associations are represented as a graph, denoted by G_gd = (D_gd, V_gd, E_gd). Nodes represent the sets of diseases D_gd and genes V_gd. Edges E_gd represent gene-disease associations.

Using the second data type, gene-tissue interactions are transformed into a gene-gene graph G_gg = (V_gg, E_gg). Nodes V_gg represent genes that are connected by an edge e_ij ∈ E_gg if genes v_i, v_j ∈ V_gg were reported in the same tissue.

Finally, both graphs are integrated into an undirected gene-disease heterogeneous graph G(D, V, E) = G_gd ∪ G_gg. In G, nodes represent the sets of diseases D and genes V. Edges E represent gene-disease associations and gene-gene associations. When integrating the two graphs, a gene-gene edge e_ij ∈ E_gg is excluded if neither gene v_i nor gene v_j is in G_gd. More formally, we define: G(D, V, E) = G_gd ∪ G_gg = {(D_gd, V_gd ∪ V_gg, E_gd ∪ E_gg) ∣ ∀e_ij = (v_i, v_j) ∈ E_gg: v_i ∈ V_gd or v_j ∈ V_gd}. This allows us to include additional knowledge from gene-gene edges where: (i) one gene of the edge is in V_gd and the other gene is not, or (ii) both genes of the edge are in V_gd.
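The integration rule of Step 1 can be sketched as follows, under our reading that a gene-gene edge is kept when at least one of its endpoints appears in the gene-disease graph. The function name `integrate_graphs`, the edge-list representation, the edge-type labels, and the removal of self-loops are illustrative assumptions, not details taken from the released source code.

```python
def integrate_graphs(gene_disease_edges, gene_gene_edges):
    """Merge G_gd and G_gg into one heterogeneous edge set.

    gene_disease_edges : iterable of (gene, disease) pairs (E_gd)
    gene_gene_edges    : iterable of (gene, gene) pairs (E_gg)
    Returns (diseases D, genes V, typed edges E).
    """
    gene_disease_edges = list(gene_disease_edges)
    v_gd = {g for g, _ in gene_disease_edges}
    d_gd = {d for _, d in gene_disease_edges}

    # Keep a gene-gene edge only if at least one endpoint is in V_gd.
    # Self-loops are dropped here as a simplifying assumption.
    kept_gg = {
        frozenset((a, b))
        for a, b in gene_gene_edges
        if a != b and (a in v_gd or b in v_gd)
    }

    genes = v_gd | {g for edge in kept_gg for g in edge}
    edges = {("gene-disease", g, d) for g, d in gene_disease_edges}
    edges |= {("gene-gene", *sorted(edge)) for edge in kept_gg}
    return d_gd, genes, edges


# Hypothetical toy input: g3-g4 touches no gene of G_gd and is excluded.
gd = [("g1", "d1"), ("g2", "d1")]
gg = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
diseases, genes, edges = integrate_graphs(gd, gg)
```

Here g3 enters the integrated graph through its edge to g2, whereas g4 does not appear at all.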

Step 2—Generate a labeled set. Given a network with missing links (due to yet-undiscovered knowledge), we aim to predict the missing links between unconnected gene and disease nodes.

Data imbalance is a known challenge when designing machine learning models, as in the case of predicting disease genes. The abundant (majority) class contains more data than the minority class. In a gene-disease network, the number of non-associated gene and disease nodes (expressed as missing edges in the majority class) far exceeds that of disease-associated genes in the minority class. This imbalance presents a challenge for identifying gene-disease associations. Most traditional machine learning methods are biased towards the majority class, and hence lose predictive performance for the minority class. Sampling methods for dealing with imbalanced datasets are frequently used [76, 77]. We applied a sampling method of positive and negative examples that is similar to other studies that used the node2vec algorithm for representing biological entities in a graph [21, 78]. The described sampling method addresses the imbalanced learning problem by sampling an equal number of negative samples and positive samples for training [78], thus ensuring that the model is not biased toward any class.

We generate the labeled dataset of edges by following three sub-steps. First, obtain positive examples by randomly selecting 20% of gene-disease edges for each disease node in G, and removing them from G, thus generating G′ = (D′, V′, E′). This sub-step results in N positive examples of gene-disease edges. Second, obtain negative examples by sampling an equal number (N) of unconnected node pairs composed of N/2 gene-gene pairs, and N/2 gene and disease pairs. For each disease, we select non-associated genes with the lowest GDPS (described in detail in Step 4). This process is iteratively repeated until the desired number (N/2) of non-existing gene-disease associations is achieved. Finally, the labeled dataset is split into Train (70%) and Test (30%) sets.
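The three sub-steps can be sketched as below. Several details are assumptions made for illustration only: the function name `build_labeled_set`, the `gdps(gene, disease)` callable, rounding the 20% per disease, the round-robin order over diseases when picking lowest-GDPS genes, and the handling of odd N. Exhaustively listing unconnected gene pairs is practical only for toy graphs.

```python
import itertools
import random


def build_labeled_set(gd_edges, gg_edges, gdps, seed=0,
                      pos_frac=0.2, train_frac=0.7):
    """Return (train, test, remaining gene-disease edges of G')."""
    rng = random.Random(seed)
    gd_edges = set(gd_edges)
    gg = {frozenset(e) for e in gg_edges}
    genes = sorted({g for g, _ in gd_edges} | {g for e in gg for g in e})
    diseases = sorted({d for _, d in gd_edges})

    # Sub-step 1: hold out about 20% of each disease's edges as positives.
    positives = []
    for d in diseases:
        d_edges = sorted(e for e in gd_edges if e[1] == d)
        positives += rng.sample(d_edges, round(pos_frac * len(d_edges)))
    n = len(positives)

    # Sub-step 2a: N/2 unconnected gene-gene pairs, chosen at random.
    free_gg = [pair for pair in itertools.combinations(genes, 2)
               if frozenset(pair) not in gg]
    neg_gg = rng.sample(free_gg, min(n // 2, len(free_gg)))

    # Sub-step 2b: remaining negatives are non-associated gene-disease
    # pairs, taking each disease's lowest-GDPS gene in turn.
    ranked = {}
    for d in diseases:
        candidates = [g for g in genes if (g, d) not in gd_edges]
        ranked[d] = sorted(candidates, key=lambda g: (gdps(g, d), g))
    neg_gd, want = [], n - len(neg_gg)
    while len(neg_gd) < want and any(ranked.values()):
        for d in diseases:
            if ranked[d] and len(neg_gd) < want:
                neg_gd.append((ranked[d].pop(0), d))

    # Sub-step 3: label, shuffle, and split 70/30.
    labeled = [(pair, 1) for pair in positives]
    labeled += [(pair, 0) for pair in neg_gg + neg_gd]
    rng.shuffle(labeled)
    cut = int(train_frac * len(labeled))
    return labeled[:cut], labeled[cut:], gd_edges - set(positives)


# Hypothetical toy data; toy_gdps is a stand-in for the real score.
gd_edges = [(f"g{i}", "d1") for i in range(1, 6)]
gd_edges += [(f"g{i}", "d2") for i in range(4, 9)]
gg_edges = [("g1", "g2"), ("g2", "g3"), ("g6", "g7")]
toy_gdps = lambda g, d: int(g[1:]) / 10
train, test, remaining = build_labeled_set(gd_edges, gg_edges, toy_gdps)
```

Candidate negatives are checked against the original graph G rather than G′, so a held-out positive edge can never be drawn as a negative example.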

Step 3—Learn node embedding vectors. Previous studies (e.g., [21, 78]) found that deep learning techniques that use embedding vectors obtain better representation of biomedical entities (such as genes and diseases), and thus, improve prediction performance. We map nodes in G′ to a low-dimensional feature space using node2vec. The vectors of each pair of nodes u, v in the Train and Test sets are aggregated into a single vector (u + v). We chose the node2vec algorithm since it was reported to obtain richer topological representation of the network than traditional methods [78].
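The pair aggregation in Step 3 amounts to an element-wise sum of the two node vectors. The sketch below assumes that embeddings are stored in a dictionary of equal-length lists; the name `aggregate_pair` and the three-dimensional toy vectors are hypothetical.

```python
def aggregate_pair(embeddings, u, v):
    """Element-wise sum of the embedding vectors of nodes u and v."""
    vec_u, vec_v = embeddings[u], embeddings[v]
    if len(vec_u) != len(vec_v):
        raise ValueError("embedding vectors must have the same dimension")
    return [a + b for a, b in zip(vec_u, vec_v)]


# Hypothetical toy embeddings (real node2vec vectors are much longer).
embeddings = {"g1": [0.5, -1.0, 2.0], "d1": [0.25, 1.0, -0.5]}
pair_vector = aggregate_pair(embeddings, "g1", "d1")
```

Because addition is commutative, the aggregated vector does not depend on the order of the pair, which suits an undirected graph.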

Step 4—Compute a Gene-Disease Prioritization Score (GDPS) using a third type of data, gene-gene co-expression similarity. Compute GDPS using the developed Algorithm 1 (Fig 1) for each gene and disease pair, following [30]. GDPS uses a gene-gene co-expression similarity matrix to compute the average similarity of a gene to known disease genes as expressed by the structure of G′. The higher the GDPS value, the more likely a gene has similar functions to the disease genes, and the more likely that gene is to associate with the disease. We set GDPS for gene-gene edges to 0. When calculating the GDPS score, we use the graph G′, which is the original graph G with test edges removed. This means that only gene-disease associations that are observed in G′ are used to calculate GDPS. Fig 2 presents an example of Algorithm 1 (Fig 1).


Fig 1. The developed algorithm to calculate Gene-Disease Prioritization Score (GDPS).

https://doi.org/10.1371/journal.pone.0280839.g001


Fig 2. An illustrated example of Algorithm 1 (Fig 1).

Z is a gene-gene co-expression similarity matrix. Line 1 in Algorithm 1: given disease d that is known to associate with genes g₂ and g₃, columns 2 and 3 are selected in Z to create matrix Z′. Lines 4 to 6 in Algorithm 1: each row in Z′ is averaged into a Gene-Disease Prioritization Score. The prioritization score of g₁ for d is (0.1+0.4)/2 = 0.25, reflecting the average similarity of a non-associated gene g₁ to genes associated with d.

https://doi.org/10.1371/journal.pone.0280839.g002
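The averaging illustrated in Fig 2 can be reproduced with a short Python sketch. The function name and the list-of-lists matrix layout are assumptions made for illustration; only the two similarities of g₁ (0.1 and 0.4) come from the example above, and the remaining matrix entries are made up.

```python
def gdps(Z, gene_index, disease_gene_indices):
    """Average co-expression similarity of one gene to a disease's known genes."""
    if not disease_gene_indices:
        return 0.0  # assumption: no known disease genes in G' gives a score of 0
    row = Z[gene_index]
    return sum(row[j] for j in disease_gene_indices) / len(disease_gene_indices)

# Toy gene-gene co-expression similarity matrix for g1, g2, g3.
# Only Z[0][1] = 0.1 and Z[0][2] = 0.4 are taken from the Fig 2 example.
Z = [
    [1.0, 0.1, 0.4],
    [0.1, 1.0, 0.7],
    [0.4, 0.7, 1.0],
]

# Disease d is associated with g2 and g3 (column indices 1 and 2).
score_g1 = gdps(Z, 0, [1, 2])  # (0.1 + 0.4) / 2
```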

Step 5—Train ML classifier. Concatenate the aggregated vector u + v (Step 3) with the Gene-Disease Prioritization Score (Step 4) into a single feature vector for training and testing an ML model to classify the pairs of nodes {u, v} in the Train and Test sets into one of two groups: a link will/not form. Concatenating the GDPS score to the embedding vector allows the model to learn non-linear relationships between a gene and a disease that take gene expression information into account, which is innovative [78]. To evaluate the contribution of GDPS, we compare the performance of the developed HetIG-PreDiG model with and without GDPS using 10-fold cross-validation. The best model is trained using all training examples, and then its performance is evaluated on the Test set.
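As a rough sketch of how the feature vector in Step 5 might be assembled, the following Python code sums two embedding vectors, appends the GDPS value, and scores the result with a logistic function. The function names, weights, and bias are hypothetical placeholders; a trained classifier would supply the actual parameters.

```python
import math

def pair_features(emb_u, emb_v, gdps_score):
    """Aggregate two node embeddings by summation, then append GDPS."""
    if len(emb_u) != len(emb_v):
        raise ValueError("embeddings must have the same dimension")
    summed = [a + b for a, b in zip(emb_u, emb_v)]
    return summed + [gdps_score]

def link_probability(features, weights, bias):
    """Logistic score for the 'link will form' class (hypothetical parameters)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

With 64-dimensional embeddings, each pair would be represented by a 65-dimensional vector: 64 summed embedding components plus one GDPS value.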

3.2 Datasets used in model

Here we describe the datasets used for predicting gene-disease associations (Table 1).


Table 1. Source datasets.

https://doi.org/10.1371/journal.pone.0280839.t001

Dataset 1 (DS1) consists of DisGeNET V7.0 [32], one of the largest publicly available collections of genes associated with human diseases, curated from expert repositories, GWAS catalogs, and scientific literature. DS1 includes the following fields: DiseaseSemanticType, the semantic type of the disease (e.g., ‘Anatomical Abnormality’, ‘Pathologic Function’, or ‘Disease or Syndrome’); Gene symbol; Disease id; Disease name; DiseaseType (‘disease’, ‘phenotype’, or ‘group’); Disease Specificity Index (DSI), which ranges from 0.25 to 1 and reflects whether a gene is associated with few diseases (a gene that associates with multiple diseases has a lower DSI); Disease Pleiotropy Index (DPI), which ranges from 0 to 1 and reflects whether the diseases associated with a gene belong to the same Medical Subject Headings (MeSH) disease class (a gene that associates with diseases of different MeSH classes has a high DPI); and YearInitial ∈ [1940, 2020], the year in which the gene-disease association was first reported.

Dataset 2 (DS2) consists of the Human Protein Atlas version 20.1 and Ensembl version 92.38 with information on gene-tissue interactions. DS2 is a representative tissue-specific gene expression resource with a large and comprehensive distribution of protein-coding genes in human tissues and cells [33]. DS2 contains expression profiles for proteins in human tissues with Ensembl gene id, Tissue name, Expression level (‘High’, ‘Medium’, ‘Low’, and ‘Not Detected’), and the gene Reliability (‘Approved’, ‘Enhanced’, ‘Supported’, and ‘Uncertain’) of the expression value.

Dataset 3 (DS3), the ARCHS4 database [30], covers the majority of published RNA-seq data. It contains gene counts for humans and mice from the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) platforms. In this study, only human gene expression data is considered. Specifically, this study uses the available pairwise Pearson correlation data of human genes across expression samples to create a gene-gene co-expression similarity matrix.

4 Analysis

This section begins by describing the implementation of the developed HetIG-PreDiG model, following the five Method steps in Section 3.1.

In Step 1, a gene-disease graph G_gd based on DS1 is constructed. For DiseaseType, only ‘disease’ was selected, and for DiseaseSemanticType, only ‘Disease or Syndrome’ was selected. We removed diseases with fewer than two genes, resulting in 5,417 diseases, 13,011 genes, and 179,860 gene-disease associations.

Next, we constructed a gene-gene graph based on gene-tissue data (DS2). To consider highly validated information, only gene-tissue associations that have an ‘Approved’ Reliability with ‘High’ Expression level, and genes associating with a tissue in more than a single cell type were included. This resulted in 25 tissues, 1,661 genes, and 5,191 gene-tissue interactions. The distribution of the number of genes per tissue presents an exponential decay shape with an average of 206.64 genes per tissue, a median of 195, a maximum of 463 genes associated with the tonsil tissue, and a minimum of 2 genes associated with retina. Next, we converted the selected data into a gene-gene graph G_gg such that genes found in the same tissue are connected by an edge. Finally, both graphs (G_gd and G_gg) are combined into a heterogeneous graph G with 5,417 diseases, 13,637 genes, 179,860 gene-disease associations, and 444,094 gene-gene associations.
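The conversion from gene-tissue records to a gene-gene graph can be sketched as below. The record field names (`gene`, `tissue`, `level`, `reliability`) are hypothetical, and the additional criterion on the number of cell types is left out of this simplified version.

```python
from itertools import combinations

def select_records(records):
    """Keep gene-tissue records with 'Approved' reliability and 'High' expression."""
    return [r for r in records
            if r["reliability"] == "Approved" and r["level"] == "High"]

def gene_gene_edges(records):
    """Connect every pair of genes that are found in the same tissue."""
    tissue_to_genes = {}
    for r in records:
        tissue_to_genes.setdefault(r["tissue"], set()).add(r["gene"])
    edges = set()
    for genes in tissue_to_genes.values():
        # sorted() gives each undirected edge a single canonical orientation
        edges.update(combinations(sorted(genes), 2))
    return edges
```

Because every tissue contributes a clique over its genes, the number of gene-gene edges grows quadratically with the number of genes per tissue, which is in line with the much larger edge count than gene count reported above.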

In Step 2, given a graph G, we labeled a set of positive and negative examples as described in Section 3.1. This step results in the creation of graph G′ by deleting positive examples (edges) from G. Next, the examples are shuffled, and split into Train (70%) and Test (30%) sets.

In Step 3, node embeddings are learned by applying node2vec to G′ using the following parameters: embedding dimensions = 64, walk length = 5, number of random walks = 10, p = 1, and q = 1. With p = 1, returning to the previously visited node is neither favored nor penalized relative to moving to one of its neighbors, resulting in moderate exploration. With q = 1, the walk is not biased toward nodes that are closer to or farther from the previous node.
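To illustrate how p and q enter a node2vec walk, the sketch below computes the second-order transition probabilities for an unweighted graph, following the weighting rule of the node2vec algorithm (1/p for returning to the previous node, 1 for a neighbor shared with the previous node, 1/q otherwise). The function name and the adjacency-dictionary representation are assumptions for illustration only.

```python
def transition_probs(graph, prev, cur, p=1.0, q=1.0):
    """Probabilities of the next step, given the walk just moved prev -> cur."""
    weights = {}
    for nxt in graph[cur]:
        if nxt == prev:              # distance 0 from prev: return step
            weights[nxt] = 1.0 / p
        elif nxt in graph[prev]:     # distance 1 from prev: stay nearby
            weights[nxt] = 1.0
        else:                        # distance 2 from prev: move outward
            weights[nxt] = 1.0 / q
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

# Toy graph: the walk has just moved from "a" to "b".
graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
uniform = transition_probs(graph, "a", "b", p=1.0, q=1.0)
```

In this sketch, p = q = 1 makes all three weights equal, so the next node is chosen uniformly among the neighbors of the current node.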

In Step 4, GDPS is computed using DS3 for each pair of gene and disease in the Train and Test sets.

To summarize, DS1 and DS2 were used to create nodes and edges in the analyzed network. DS3 was used to create the GDPS score. Specifically, we used gene expression level in DS2 only for selecting data for further analysis. DS3 data of gene counts (expression) is used for generating GDPS and is not used for graph structure.

Finally, in Step 5, we trained a logistic regression classifier using the Train set to classify pairs of nodes into one of two classes: 1) link formation (i.e., a gene is associated with the disease), or 2) no link formation.

4.1 Comparison with baseline models

To evaluate the performance of the developed HetIG-PreDiG model (Section 3.1), the Train and Test sets were used to train and evaluate the following algorithms for predicting gene-disease associations as baseline models: RWRH, N2VKO, and HIN2Vec. These models were selected since they are frequently used and were reported to produce good performance on biological datasets [21, 47, 75]. The evaluated models were assessed using the implementations described by their authors, with their suggested parameters. In addition, we performed an ablation study by comparing the performance of HetIG-PreDiG with and without GDPS using 10-fold cross-validation.

4.2 Model evaluation: Predicting novel disease genes

Instead of predicting specific gene-disease edges in the Test set (as in Section 3.1, Step 5), we adopt a more realistic approach in which, for a given disease, all candidate (non-associating) genes are to be examined. Stated differently, given a disease and a set of candidate genes, we aim to predict gene-disease associations. We replicate a scenario at a point in time when cumulative knowledge exists and other knowledge is missing (e.g., has yet to be discovered). Cumulative knowledge is represented by G′ = (D′, V′, E′), reflecting gene-disease associations in the Train set. We define missing knowledge as gene-disease associations discovered in the year 2020 (DS₂₀₂₀) (the most recent gene-disease discovery year in DS1) that appear in the Test set. We predict those associations by considering all candidate genes in G′ that are not associated with a disease d ∈ D′ ∩ DS₂₀₂₀. More formally, we seek to predict the following gene-disease associations {(g, d)|g ∈ V′ ∧ d ∈ D′ ∩ DS₂₀₂₀ ∧ (g, d) ∉ E′}.

We assume that gene-disease associations in DS₂₀₂₀ are unknown during the prediction process only. When the prediction process is complete, we evaluate our model for predicting novel disease genes (candidate genes predicted to associate with d), using the developed Overlap measure (1):

$$\mathrm{Overlap}(P_{f\%}, I_d) = \frac{|P_{f\%} \cap I_d|}{|I_d|} \tag{1}$$

P_f%—The set of candidate genes predicted to associate with d ∈ DS₂₀₂₀, located in the top f_% of a ranking, based on the gene’s prediction probability.

I_d—The set of genes associated with d ∈ DS₂₀₂₀.

The higher the Overlap score, the more successful the developed model at identifying genes that associate with a disease. When a disease d ∈ DS₂₀₂₀ has a high Overlap score, the genes predicted to associate with d can be further validated in wet-lab experiments. An Overlap of 1 indicates that all genes that are known to associate with d were identified, and an Overlap of 0 indicates that no genes associating with d were identified.
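Reading the definitions above as "the fraction of genes in I_d that appear in P_f%", the Overlap score can be sketched as follows. The function name, the integer-percent argument, and the way the top of the ranking is truncated are assumptions made for illustration.

```python
def overlap(ranked_genes, known_genes, top_percent):
    """Fraction of known disease genes found in the top f% of the ranking.

    ranked_genes: candidate genes sorted by descending prediction probability.
    known_genes:  genes associated with the disease (the set I_d).
    top_percent:  integer cut-off f in [0, 100].
    """
    known = set(known_genes)
    if not known:
        raise ValueError("I_d must contain at least one gene")
    k = len(ranked_genes) * top_percent // 100   # size of the top-f% list
    top = set(ranked_genes[:k])                  # the set P_f%
    return len(top & known) / len(known)

ranking = ["g%d" % i for i in range(10)]         # toy ranking of 10 candidates
partial = overlap(ranking, {"g0", "g5"}, 30)     # only g0 is in the top 3
```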

Increasing f_% results in the inclusion of more genes (e.g., f_% = 100% considers all candidate genes), leading to a higher Overlap score, but also requires experimenting with more candidate genes in a wet lab. To optimize the set of candidate genes, we define the Ratio (2) between the number of diseases with Overlap = 1 and f_%. We aim to maximize the number of diseases with Overlap = 1 and minimize f_%, i.e., find the highest Ratio.

$$\mathrm{Ratio}(f_\%) = \frac{|\{d \in DS_{2020} : \mathrm{Overlap}(P_{f\%}, I_d) = 1\}|}{f_\%} \tag{2}$$
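The selection of the cut-off can be sketched as a small search over candidate f_% values. The function names and the input layout (a dictionary mapping each cut-off to the per-disease Overlap scores) are hypothetical; whether f_% is expressed as a percentage or a fraction rescales the Ratio but does not change which cut-off is best.

```python
def ratio(overlap_by_disease, top_percent):
    """Number of diseases with Overlap = 1, divided by the cut-off f%."""
    n_complete = sum(1 for v in overlap_by_disease.values() if v == 1.0)
    return n_complete / top_percent

def best_cutoff(overlaps_by_cutoff):
    """Return the cut-off f% with the highest Ratio."""
    return max(overlaps_by_cutoff,
               key=lambda f: ratio(overlaps_by_cutoff[f], f))
```

For example, 10 diseases with Overlap = 1 at a 30% cut-off give 10/30 ≈ 0.33, whereas 22 such diseases at a 100% cut-off give 22/100 = 0.22, so the smaller cut-off would be preferred.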

We construct the set DS₂₀₂₀ as follows: first, we select all gene-disease associations in DS1 with YearInitial = 2020; diseaseType = ‘disease’; and diseaseSemanticType = ‘Disease or Syndrome’. Second, we remove diseases with only a single gene association discovered in the year 2020. Third, we keep only gene-disease edges that are not in G′. Fourth, we discard disease associations with highly connected genes (which are more trivial to predict) by selecting genes in DS1 with a DSI above the average DSI of all genes, and with DPI below the average DPI of all genes.
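The four filtering steps can be sketched as below. The record keys mirror the field names mentioned in the text, but the exact data layout, the function name, and the use of per-gene `dsi` and `dpi` dictionaries are assumptions for illustration.

```python
from collections import Counter
from statistics import mean

def build_ds2020(records, train_edges, dsi, dpi):
    """Select 2020 gene-disease associations following the four steps above."""
    # 1) associations first reported in 2020, of the selected disease types
    sel = [r for r in records
           if r["YearInitial"] == 2020
           and r["diseaseType"] == "disease"
           and r["diseaseSemanticType"] == "Disease or Syndrome"]
    # 2) drop diseases with only a single gene association discovered in 2020
    counts = Counter(r["disease"] for r in sel)
    sel = [r for r in sel if counts[r["disease"]] > 1]
    # 3) keep only edges that are not already in G'
    sel = [r for r in sel if (r["gene"], r["disease"]) not in train_edges]
    # 4) keep genes with above-average DSI and below-average DPI
    mean_dsi, mean_dpi = mean(dsi.values()), mean(dpi.values())
    return {(r["gene"], r["disease"]) for r in sel
            if dsi[r["gene"]] > mean_dsi and dpi[r["gene"]] < mean_dpi}
```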

Next, we follow five steps to predict novel gene-disease associations. First, given G′ = (D′, V′, E′) and a disease d ∈ D′ ∩ DS₂₀₂₀, generate embedding vectors of each candidate gene g ∈ V′ not associated with d (i.e., (g, d) ∉ E′), and embeddings of d. Second, sum embedding vectors of each candidate gene and d. Third, for each candidate gene, compute its GDPS score with d and concatenate it to the aggregated embedding vector in the Second step. Fourth, apply HetIG-PreDiG for each candidate gene-disease pair to predict via classification whether a link will form between them (the ‘link formation’ class), or not. Fifth, keep only novel genes that are predicted to associate with d (i.e., classified into the ‘link formation’ class) with their probability of affiliating with that class (prediction probability is used later in Section 5.3). Fig 3 summarizes the steps of model evaluation.
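The final two steps, classification followed by keeping only the genes in the ‘link formation’ class, can be sketched as a ranking routine. Here `predict_proba` is a hypothetical callable standing in for the trained model applied to the summed embeddings and GDPS of a pair, and the 0.5 decision threshold is an assumption.

```python
def rank_candidates(disease, candidate_genes, known_edges, predict_proba,
                    threshold=0.5):
    """Return (gene, probability) pairs predicted to link with the disease,
    sorted by descending prediction probability."""
    scored = [(predict_proba(g, disease), g)
              for g in candidate_genes
              if (g, disease) not in known_edges]      # skip edges already in E'
    kept = [(p, g) for p, g in scored if p >= threshold]  # 'link formation' class
    kept.sort(key=lambda t: (-t[0], t[1]))
    return [(g, p) for p, g in kept]
```

The retained probabilities are what a later per-disease analysis would consume, for example by averaging them for each disease.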


Fig 3. An illustration of the developed model using an example: Predict if gene g₁ is associated with disease d₁.

(A) Steps 1,2 (Section 3.1): Heterogeneous graph (G′) with genes and diseases after removal of selected Train/Test gene-disease edges. (B) Step 3: Learn node embeddings. The embeddings of d₁, g₁ are aggregated into a single vector denoted by Σ. (C) Steps 4, 5: Compute Gene-Disease Prioritization Scores. Compute GDPS(g₁, d₁) in the illustrated example. Gene expression data involving gene function is integrated into the model via the GDPS prioritization score that provides expression similarity across multiple tissues. Next, the GDPS prioritization score is concatenated with the vector Σ and fed into a classifier that is trained to predict associations between pairs of genes and diseases. In the illustrated example, GDPS(g₁, d₁) (colored in orange) is concatenated to vector Σ. Finally, in the current example, the developed model predicts a new edge between g₁ and d₁. (D) Assess success in predicting novel genes: cluster a disease based on its gene prediction probabilities.

https://doi.org/10.1371/journal.pone.0280839.g003

Next, given a disease, we estimate the success of our ‘HetIG-PreDiG with GDPS’ model in predicting gene-disease associations when the Overlap score cannot be calculated.

4.3 Model evaluation: Real-world scenario

In a real-world scenario, there is no test set. We can predict novel gene-disease associations similar to Section 4.2, but we cannot calculate the Overlap score (I_d is unknown). Hence, we cannot evaluate our predictions’ correctness. Given a single disease, we aim to estimate the prediction performance of our model in identifying novel gene-disease associations. While model performance is typically evaluated based on the prediction success of the entire test set (e.g., F1 score), a single prediction might be wrong (e.g., false positive); thus, we do not know if our single prediction is successful.

To overcome this challenge, we identify, via classification, diseases that are more likely to have higher Overlap values. We learn the patterns of the prediction probabilities of genes that were predicted to associate with diseases in Section 4.2. Prediction probabilities have been found informative for biological problems such as estimating the plasma effect-site equilibration rate constant [79] and anesthetic depth [80]. Under the assumption that prediction probabilities contain meaningful information about the Overlap score, we learn their patterns by: (i) calculating the mean prediction probability of gene-disease associations for each disease d ∈ DS₂₀₂₀; (ii) analyzing the mean prediction probabilities for diseases d ∈ DS₂₀₂₀ to automatically detect the optimal number of clusters using the Ckmeans.1d.dp R library [81], a variant of K-means for one-dimensional data; and (iii) assigning each d ∈ DS₂₀₂₀ to one of the clusters, each corresponding to a different Overlap range, using the Ckmeans.1d.dp R library.

To summarize, using the ‘HetIG-PreDiG with GDPS’ model, we first predict the genes associated with a disease. Then, using the developed classifier in Step iii, we can classify the disease based on its prediction probabilities and estimate its Overlap score based on d’s cluster affiliation. For high Overlap (∼ 1), experiments in a wet lab may be warranted.
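The clustering of mean prediction probabilities can be approximated in Python with an optimal one-dimensional k-means solved by dynamic programming. This is a simplified stand-in, not the Ckmeans.1d.dp library itself: the number of clusters k is fixed by the caller rather than selected automatically, and the function name is hypothetical.

```python
def kmeans_1d(values, k):
    """Optimal k-means for 1-D data; returns one cluster label per input value.

    Labels are ordered: 0 is the cluster with the smallest values.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    order = sorted(range(n), key=lambda i: values[i])
    x = [values[i] for i in order]
    # Prefix sums give the within-cluster sum of squares in O(1).
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):  # cost of putting x[i:j] in one cluster (j > i)
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting the first j points into c clusters
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + sse(i, j)
                if cand < cost[c][j]:
                    cost[c][j], cut[c][j] = cand, i
    # Backtrack the cluster boundaries and map labels to the original order.
    labels, j = [0] * n, n
    for c in range(k, 0, -1):
        i = cut[c][j]
        for idx in range(i, j):
            labels[order[idx]] = c - 1
        j = i
    return labels
```

Because one-dimensional clusters are contiguous in sorted order, the dynamic program finds the globally optimal partition, which avoids the dependence on random initialization of standard K-means.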

We make predictions for each d ∈ DS₂₀₂₀ using all candidate genes in the dataset. Then, we determine whether the top-ranked genes predicted by the developed model are novel by conducting an automated literature search to find papers indexed by PubMed that support the predicted gene-disease associations. Using the PubMed API, we collect papers containing the disease’s name and its top-ranked predicted genes together within the title and/or abstract fields. We further validate the reported papers by manually examining the complete set of results for a subset of the diseases.
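An automated search of this kind could be issued through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and sends nothing over the network; the exact query syntax used in the study is not stated, so the term construction, the `retmax` value, and the function name are assumptions.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(disease_name, gene_symbol, retmax=100):
    """Build an esearch URL for papers naming both the disease and the gene
    in the title and/or abstract."""
    term = (f'"{disease_name}"[Title/Abstract] AND '
            f'"{gene_symbol}"[Title/Abstract]')
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

url = pubmed_query_url("Hepatitis B", "CASP1")
```

Co-occurrence in the title or abstract does not by itself establish an association, which is why a manual check of the returned papers remains necessary.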

4.4 Comparison of gene-disease association predictions with the literature

In this analysis, the predictions of HetIG-PreDiG are evaluated by demonstrating its capability to predict novel gene-disease associations that are not present in DS1. Studies are presented that support the top-ranked genes predicted to associate with a disease. We focus on the diseases in DS₂₀₂₀. For each of the Top 10 predicted genes for each disease, we used automated and manual searches to look for supporting literature and thereby determine whether those genes are novel.

Using an automated search, we surveyed the literature using PubMed’s API as described in Section 4.3 to find existing associations between each disease and its Top 10 predicted genes. To evaluate the developed model’s success in detecting novel disease genes, we calculated the Success Rate defined by the ratio between the number of genes with supporting literature evidence and the number of top-ranked genes (i.e., 10).
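The Success Rate reduces to a simple count, sketched below. The input is assumed to be a list of PubMed paper counts, one per predicted gene in ranked order; the function name and the toy counts are hypothetical.

```python
def success_rate(paper_counts, top_n=10):
    """Share of the top-ranked genes with at least one supporting paper."""
    supported = sum(1 for c in paper_counts[:top_n] if c > 0)
    return supported / top_n

# Toy counts for ten predicted genes (made up for illustration).
toy_counts = [5, 0, 2, 1, 0, 0, 3, 0, 1, 0]
toy_rate = success_rate(toy_counts)
```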

5 Results

5.1 Comparison with baseline models

The HetIG-PreDiG model outperformed the baseline models, as listed in Table 2, which reports the average and standard deviation of the F1 score across the 10-fold cross-validation.


Table 2. Comparison of the developed model (HetIG-PreDiG) with the baseline models using 10-fold cross-validation for prediction of the top 30% of predicted disease genes.

The source code of baseline models is listed. The Micro-F1 score column indicates average and standard deviation model performance of predicting gene-disease associations.

https://doi.org/10.1371/journal.pone.0280839.t002

Regarding the ablation study, HetIG-PreDiG with GDPS outperformed HetIG-PreDiG without GDPS (Table 2). When evaluated on the Test set, HetIG-PreDiG with GDPS outperformed (Recall 0.93, Precision 0.97, and Micro-F1 score 0.95) the HetIG-PreDiG model without GDPS (Recall 0.87, Precision 0.88, and Micro-F1 score 0.88). Fig 4 shows model performances evaluated using the Receiver Operating Characteristic (ROC) curve, and the Area Under the ROC Curve (AUC). The addition of the GDPS score improves the prediction performance of model ‘HetIG-PreDiG with GDPS’ compared with model ‘HetIG-PreDiG without GDPS’ (Fig 4). The HetIG-PreDiG model developed in this study considers information not only from the structure of the network using node embeddings but also from biological information not reflected in the network, using gene co-expression similarity.


Fig 4. An ablation study.

Demonstrating the effectiveness of using the developed Gene-Disease Prioritization Score (GDPS) in the developed HetIG-PreDiG model that incorporates a logistic regression (LR) classifier. (A) Receiver operating characteristic (ROC) curves of the developed model with (blue), and without (red) GDPS. The area under each ROC curve (AUC) is indicated. (B) A zoomed view of the top left corner of Fig 4A.

https://doi.org/10.1371/journal.pone.0280839.g004

The next section demonstrates the capability of the developed HetIG-PreDiG model for predicting novel gene-disease associations.

5.2 Model evaluation: Predicting novel disease genes

To evaluate the model’s success in predicting novel gene-disease associations, we followed the steps described in Section 4.2. This resulted in 30 diseases in DS₂₀₂₀ (Table 3) associating with 11,667 genes. For each disease d, we computed the Overlap score for different f_% ∈ [10, 100] rankings, with steps of 10. We focus on f_% = 30% since it presented the highest Ratio in Eq. (2), as illustrated in Fig 5. At f_% = 30%, the Overlap score was 1 for 10 diseases; it was in the range (0.5, 1) for 2 diseases; and it was in the range [0, 0.5] for 18 diseases, indicating the identification of all, most, and few associating genes, respectively. As expected, the Overlap score increases as f_% increases: at f_% = 100%, the model achieves Overlap = 1 for 22 diseases, Overlap ∈ (0.5, 1) for 4 diseases, and Overlap ∈ [0, 0.5] for 4 diseases.


Fig 5. Ratio between the number of diseases with Overlap = 1 and f_%.

The highest ratio is achieved at f_% = 30%.

https://doi.org/10.1371/journal.pone.0280839.g005

Table 3 provides a summary of the literature evidence for the Top 10 predicted genes not associated with a disease for each of the selected 30 diseases. The Success Rate column is the number of predicted genes with supporting PubMed evidence divided by 10. For example, for the disease Anemia, 9 of the Top 10 predicted genes had PubMed literature support, with a total of 453 PubMed entries supporting these 9 gene-disease predictions.


Table 3. A summary of literature evidence for the Top 10 predicted genes that are not associating with a disease for each of the selected 30 diseases.

The Success Rate column is the number of predicted genes with supporting PubMed evidence divided by 10.

https://doi.org/10.1371/journal.pone.0280839.t003

5.3 Model evaluation: Real-world scenario

Using the prediction probability results of Step 5 in Section 5.2, we learn the patterns of the prediction probabilities of candidate genes predicted to associate with diseases d. Following Step i (Section 4.3), we calculate the mean prediction probability of gene-disease associations for each d ∈ D′ ∩ DS₂₀₂₀. In Step ii, we analyze the mean prediction probabilities for diseases d and automatically detect the optimal number of clusters using the Ckmeans.1d.dp R library [81]. The search for the optimal number of clusters was performed in the range of 2 to 9. Three clusters were identified. Lastly, in Step iii, we assign each d to a cluster based on its mean prediction probability.

Based on the Overlap scores (calculated in Section 5.2) of diseases in each cluster, the clusters were assigned a respective range of Overlap: (a) 0 ≤ Overlap ≤ 0.5, (b) 0.5 < Overlap < 1, and (c) Overlap = 1 to capture (a) few, (b) most, and (c) all associating genes predicted in P_f%, respectively.

Fig 6A presents the distributions of prediction probabilities for the top f_% = 30% in each of the three Overlap clusters. Using a Kruskal-Wallis (KW) rank test [82], we find significant differences (p < 2.2 × 10⁻¹⁶) in the density for the three classes. To detect where those differences lie, we conducted three additional KW tests, one between each pair of classes, and again found significant differences. Fig 6B presents a breakdown of the diseases that affiliate with each cluster in Fig 6A.


Fig 6. Box plots of the top f_% = 30% prediction probabilities of genes predicted to associate with a disease.

(A) three levels of Overlap clusters, and (B) a breakdown showing the diseases affiliating with each cluster in (A).

https://doi.org/10.1371/journal.pone.0280839.g006

To summarize, given a disease, in a real-world scenario where the Overlap score cannot be calculated, we first predict disease genes, and then assign the disease to one of the three Overlap clusters based on the mean gene prediction probabilities. While all models make mistakes, our clustering method enables us to systematically estimate the model’s success in predicting novel disease genes.

While the developed model can successfully predict gene-disease associations, other predicted genes for a given disease d that were not yet validated might also associate with d, as demonstrated next.

5.4 Comparison of gene-disease association predictions with the literature

Table 3 presents a summary of the 30 diseases in DS₂₀₂₀, their corresponding Success Rates, and the number of supporting PubMed papers reported. Using a manual search, we reviewed a subset of the 30 diseases in Table 3 for which ≤100 publications were reported.

Representative of the developed HetIG-PreDiG model’s gene-disease association prediction quality, Table 4 provides reported literature support for the Top 10 predicted genes from the developed model, for three of the diseases listed in Table 3: Hepatitis B, Chagas Disease, and Diabetic Nephropathy. We selected those three diseases since they represent a wide range of organs and disease mechanisms involved; for organs, (i) Hepatitis B involves the liver; (ii) Chagas Disease involves multiple systems including cardiovascular and digestive; and (iii) Diabetic Nephropathy involves the kidney.


Table 4. Literature review for Top 10 predicted genes for 3 example diseases.

Genes are listed in descending ranked order. For example, for Hepatitis B, CASP1 and BAK1 are ranked 1 (best) and 10, respectively.

https://doi.org/10.1371/journal.pone.0280839.t004

Hepatitis B and Diabetic Nephropathy have literature support for 6 of the Top 10 predicted genes, and Chagas Disease is supported for 4 of its Top 10 predicted genes. All PubMed studies listed in Table 4 were manually vetted to confirm their relevance to associate each disease with its top predicted genes.

The predicted genes in Table 4 without associated literature evidence are especially interesting because these represent the HetIG-PreDiG model’s predicted novel gene-disease associations. We manually examined each such predicted gene-disease association, and nearly all of these predictions proved to be reasonable based on the known functions of the genes. For example, for Hepatitis B, predicted gene #2, F5; predicted gene #6, DECR1; predicted gene #7, SLC27A5; and predicted gene #9, DPP4, are all associated with liver function by the GeneCards platform [83].

Chagas Disease, caused by infection by the parasite Trypanosoma cruzi, involves immune response and inflammatory lesions, and can cause heart failure, arrhythmias, and dysfunction of the digestive system [84]. Predicted gene #1, VEGFA, is involved with vasculature, and this disease often involves impairment of cardiac and vascular function [84, 85]. Predicted gene #3, APOE, has an established role in cardiac function [86]. Predicted gene #4, ICAM1, has been reported to have both cardiac [87] and digestive [88] roles. Predicted gene #10, NFKB1, has reported functions in the cardiac [89, 90], vascular [91], immune [92], and digestive [93] systems.

For Diabetic Nephropathy, gene F2 was the top predicted gene; although PubMed does not report this association, the GeneCards platform [83] associates abnormality of the kidney with this gene. Similarly, GeneCards also reports an association of kidney function with other top predicted Diabetic Nephropathy genes having no PubMed support, including HAMP (predicted gene #2) and CHDH (predicted gene #8). Predicted gene #10, MMP14, is a member of the matrix metallopeptidase/metalloproteinase family that is well established to have a role in Diabetic Nephropathy.

6 Discussion

This study presents a model for predicting human gene-disease associations using automatic feature learning in a heterogeneous graph with Gene-Disease Prioritization Scores, based on gene co-expression data. The developed HetIG-PreDiG model outperforms baseline models, showing that a model that incorporates this study’s novel Gene-Disease Prioritization Score (GDPS) achieves better prediction performance than models without GDPS. Biological data contained within the GDPS, based on gene co-expression data, allow the model to better capture the association between a gene and a disease.

Although the performance of the HetIG-PreDiG model for predicting gene-disease associations is promising, we note the following limitations: Theoretically, it is possible that two or more genes might have identical GDPS scores, which could result in no additional knowledge being added to the model. In addition, when manually analyzing the literature, we found that, with some exceptions, the reported PubMed publications were generally relevant to support the predicted gene-disease associations. However, we identified several challenges to the automation of this analysis, including the ambiguity of some gene names (e.g., some genes with symbols identical to acronyms that refer to an unrelated concept); the occasional reporting of a lack of association among genes and diseases; and the gene symbol and disease name appearing within the article’s title and/or abstract, but not being associated to each other within the study. Accordingly, because most of the diseases in Table 3 that are not included in Table 4 were not manually vetted, the Supporting PubMed Papers values in Table 3 should be interpreted with the caveat that the relevant papers represent a subset of the reported counts.

The top predicted genes for each disease warrant further examination, and potential experimental vetting. Of note is that the developed model’s predicted gene-disease associations allow both positive and negative relationships; for example, a predicted gene might either cause or increase the probability of developing the disease, or alternatively might prevent or protect against that disease. Future analyses can separate positive from negative associations, and because these effects are often context-dependent, this will require careful analysis that may necessitate the collection of additional experimental data. Future work might also enhance existing gene-disease datasets by considering novel findings not covered within the datasets that were used in this study. Future work might also benefit from adding the co-expression data as another edge layer to the network.

Advancing beyond previous models intended to predict gene-disease associations, this study presents five main contributions. First, biomedical data is incorporated in the form of a Gene-Disease Prioritization Score (GDPS) based on gene co-expression similarity, allowing the model to better evaluate the degree of association between a gene and a disease. Second, network structure is considered by node embeddings using graph representation learning, and this analysis extends beyond network structure by accounting for biological similarity between unconnected nodes using GDPS. Third, in contrast to most existing studies that randomly select non-associating gene and disease nodes that have the potential to be gene-disease associations not yet reported in the literature, we select negative training samples by favoring non-existing gene-disease edges with lower GDPS. Fourth, we show that network data combined with gene co-expression similarity data can effectively predict gene-disease associations compared with baseline models, and demonstrate this via literature analysis. Fifth, we provide a method to evaluate the developed model’s success in predicting novel disease genes.

The developed HetIG-PreDiG model can be applied to similar tasks that can be represented using networks and that benefit from incorporating gene expression data. Such tasks might include the prediction of drug-disease associations and the prediction of drug-drug interactions.

7 Conclusion

We have presented the Heterogeneous Integrated Graph for Predicting Disease Genes (HetIG-PreDiG) model that uses gene-gene, gene-disease, and gene-tissue associations data to generate accurate gene-disease association predictions, improving upon existing baseline models. Our model addresses the limitations of previous models for disease-gene prediction. This model has potential utility for other tasks that can be represented as networks and that involve gene expression data, including predicting drug-disease associations and drug-drug interactions.

Acknowledgments

We thank Alexander Lachmann, Ph.D. and Vasileios Stathias, Ph.D. for their insightful suggestions on this manuscript.

References

1. Oprea TI, Bologa CG, Brunak S, Campbell A, Gan GN, Gaulton A, et al. Unexplored therapeutic opportunities in the human genome. Nature reviews Drug discovery. 2018;17(5):317. pmid:29472638
2. Kann MG. Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Briefings in bioinformatics. 2010;11(1):96–110. pmid:20007728
3. Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nature reviews genetics. 2004;5(2):101–113. pmid:14735121
4. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. Next-generation machine learning for biological networks. Cell. 2018;173(7):1581–1592. pmid:29887378
5. Yao V, Kaletsky R, Keyes W, Mor DE, Wong AK, Sohrabi S, et al. An integrative tissue-network approach to identify and test human disease genes. Nature biotechnology. 2018;36(11):1091–1099. pmid:30346941
6. Peng J, Hui W, Shang X. Measuring phenotype-phenotype similarity through the interactome. BMC bioinformatics. 2018;19(5):114. pmid:29671400
7. Erten S, Bebek G, Koyutürk M. Vavien: an algorithm for prioritizing candidate disease genes based on topological similarity of proteins in interaction networks. Journal of computational biology. 2011;18(11):1561–1574. pmid:22035267
8. Killock D. HotNet2—see the wood for the trees. Nature Reviews Clinical Oncology. 2015;12(2):66–66. pmid:25560530
9. Leiserson MD, Vandin F, Wu HT, Dobson JR, Eldridge JV, Thomas JL, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nature genetics. 2015;47(2):106–114. pmid:25501392
10. Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics. 2017;18(9):551. pmid:28607512
11. Ahmed R, Baali I, Erten C, Hoxha E, Kazan H. MEXCOwalk: mutual exclusion and coverage based random walk to identify cancer modules. Bioinformatics. 2020;36(3):872–879. pmid:31432076
12. Del Sol A, Balling R, Hood L, Galas D. Diseases as network perturbations. Current opinion in biotechnology. 2010;21(4):566–571. pmid:20709523
13. Chattopadhyay A, Lu TP. Gene-gene interaction: the curse of dimensionality. Annals of translational medicine. 2019;7(24). pmid:32042829
14. Hu Y, Shmygelska A, Tran D, Eriksson N, Tung JY, Hinds DA. GWAS of 89,283 individuals identifies genetic variants associated with self-reporting of being a morning person. Nature communications. 2016;7(1):1–9. pmid:26835600
15. Safari-Alighiarloo N, Taghizadeh M, Rezaei-Tavirani M, Goliaei B, Peyvandi AA. Protein-protein interaction networks (PPI) and complex diseases. Gastroenterology and Hepatology from bed to bench. 2014;7(1):17. pmid:25436094
16. Zhou X, Menche J, Barabási AL, Sharma A. Human symptoms–disease network. Nature communications. 2014;5(1):1–10. pmid:24967666
17. Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Briefings in bioinformatics. 2021;22(4):bbaa303. pmid:33276376
18. Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion. 2019;50:71–91. pmid:30467459
19. Cai H, Zheng VW, Chang KCC. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering. 2018;30(9):1616–1637.
20. Xiong Y, Guo M, Ruan L, Kong X, Tang C, Zhu Y, et al. Heterogeneous network embedding enabling accurate disease association predictions. BMC medical genomics. 2019;12(10):1–17. pmid:31865913
21. Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining; 2016. p. 855–864.
22. Cui P, Wang X, Pei J, Zhu W. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering. 2018;31(5):833–852.
23. Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining; 2014. p. 701–710.
24. Zeng X, Ding N, Rodríguez-Patón A, Zou Q. Probability-based collaborative filtering model for predicting gene–disease associations. BMC medical genomics. 2017;10(5):45–53. pmid:29297351
25. Alanis-Lobato G, Mier P, Andrade-Navarro M. The latent geometry of the human protein interaction network. Bioinformatics. 2018;34(16):2826–2834. pmid:29635317
26. Wang B, Pourshafeie A, Zitnik M, Zhu J, Bustamante CD, Batzoglou S, et al. Network enhancement as a general method to denoise weighted biological networks. Nature communications. 2018;9(1):1–8. pmid:30082777
27. Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–i466. pmid:29949996
28. Alshahrani M, Hoehndorf R. Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes. Bioinformatics. 2018;34(17):i901–i907. pmid:30423077
29. Ballouz S, Verleyen W, Gillis J. Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Bioinformatics. 2015;31(13):2123–2130. pmid:25717192
30. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, et al. Massive mining of publicly available RNA-seq data from human and mouse. Nature communications. 2018;9(1):1–10. pmid:29636450
31. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002;30(1):207–210. pmid:11752295
32. Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research. 2016; p. gkw943. pmid:27924018
33. Thul PJ, Lindskog C. The human protein atlas: a spatial map of the human proteome. Protein Science. 2018;27(1):233–244. pmid:28940711
34. Digre A, Lindskog C. The human protein atlas—spatial localization of the human proteome in health and disease. Protein Science. 2021;30(1):218–233. pmid:33146890
35. Bauer-Mehren A, Bundschus M, Rautschka M, Mayer MA, Sanz F, Furlong LI. Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PloS one. 2011;6(6). pmid:21695124
36. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic acids research. 2017;45(D1):D896–D901. pmid:27899670
37. Yoon S, Nguyen HCT, Yoo YJ, Kim J, Baik B, Kim S, et al. Efficient pathway enrichment and network analysis of GWAS summary data using GSA-SNP2. Nucleic acids research. 2018;46(10):e60–e60. pmid:29562348
38. Navlakha S, Kingsford C. The power of protein interaction networks for associating genes with diseases. Bioinformatics. 2010;26(8):1057–1063. pmid:20185403
39. Peng J, Bai K, Shang X, Wang G, Xue H, Jin S, et al. Predicting disease-related genes using integrated biomedical networks. BMC genomics. 2017;18(1):1–11. pmid:28198675
40. Valdeolivas A, Tichit L, Navarro C, Perrin S, Odelin G, Levy N, et al. Random walk with restart on multiplex and heterogeneous biological networks. Bioinformatics. 2019;35(3):497–505. pmid:30020411
41. Cheng L, Wang P, Tian R, Wang S, Guo Q, Luo M, et al. LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic acids research. 2019;47(D1):D140–D144. pmid:30380072
42. Shim JE, Hwang S, Lee I. Pathway-dependent effectiveness of network algorithms for gene prioritization. PLoS One. 2015;10(6):e0130589. pmid:26091506
43. Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics. 2008;82(4):949–958. pmid:18371930
44. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010;6(1):e1000641. pmid:20090828
45. Le DH, Kwon YK. Neighbor-favoring weight reinforcement to improve random walk-based disease gene prioritization. Computational biology and chemistry. 2013;44:1–8. pmid:23434623
46. Zhu J, Qin Y, Liu T, Wang J, Zheng X. Prioritization of candidate disease genes by topological similarity between disease and protein diffusion profiles. In: BMC bioinformatics. vol. 14. Springer; 2013. p. 1–11.
47. Li Y, Patra JC. Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010;26(9):1219–1224. pmid:20215462
48. Yang P, Li X, Wu M, Kwoh CK, Ng SK. Inferring gene-phenotype associations via global protein complex network propagation. PloS one. 2011;6(7):e21502. pmid:21799737
49. Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Molecular systems biology. 2008;4(1):189. pmid:18463613
50. Singh-Blom UM, Natarajan N, Tewari A, Woods JO, Dhillon IS, Marcotte EM. Prediction and validation of gene-disease associations using methods inspired by social network analyses. PloS one. 2013;8(5):e58977. pmid:23650495
51. Xie M, Xu Y, Zhang Y, Hwang T, Kuang R. Network-based phenome-genome association prediction by bi-random walk. PloS one. 2015;10(5):e0125138. pmid:25933025
52. Zeng X, Ding N, Zou Q. Latent factor model with heterogeneous similarity regularization for predicting gene-disease associations. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2016. p. 682–687.
53. Hu G, Agarwal P. Human disease-drug network based on genomic expression profiles. PloS one. 2009;4(8):e6536. pmid:19657382
54. Liu CC, Tseng YT, Li W, Wu CY, Mayzus I, Rzhetsky A, et al. DiseaseConnect: a comprehensive web server for mechanism-based disease–disease connections. Nucleic acids research. 2014;42(W1):W137–W146. pmid:24895436
55. Ideker T, Sharan R. Protein networks in disease. Genome research. 2008;18(4):644–652. pmid:18381899
56. Mordelet F, Vert JP. ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC bioinformatics. 2011;12(1):1–15. pmid:21977986
57. Nitsch D, Gonçalves JP, Ojeda F, De Moor B, Moreau Y. Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC bioinformatics. 2010;11(1):1–16. pmid:20840752
58. Chen Y, Wu X, Jiang R. Integrating human omics data to prioritize candidate genes. BMC medical genomics. 2013;6(1):1–12.
59. Chen B, Wang J, Li M, Wu FX. Identifying disease genes by integrating multiple data sources. BMC medical genomics. 2014;7(2):1–12. pmid:25350511
60. Zhou H, Skolnick J. A knowledge-based approach for predicting gene–disease associations. Bioinformatics. 2016;32(18):2831–2838. pmid:27283949
61. Ata SK, Fang Y, Wu M, Li XL, Xiao X. Disease gene classification with metagraph representations. Methods. 2017;131:83–92.
62. Luo P, Li Y, Tian LP, Wu FX. Enhancing the prediction of disease–gene associations with multimodal deep learning. Bioinformatics. 2019;35(19):3735–3742. pmid:30825303
63. Tran VD, Sperduti A, Backofen R, Costa F. Heterogeneous networks integration for disease–gene prioritization with node kernels. Bioinformatics. 2020;36(9):2649–2656. pmid:31990289
64. Kim P, Park A, Han G, Sun H, Jia P, Zhao Z. TissGDB: tissue-specific gene database in cancer. Nucleic acids research. 2018;46(D1):D1031–D1038. pmid:29036590
65. Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nature genetics. 2015;47(6):569. pmid:25915600
66. Luo P, Tian LP, Chen B, Xiao Q, Wu FX. Predicting gene-disease associations with manifold learning. In: International Symposium on Bioinformatics Research and Applications. Springer; 2018. p. 265–271.
67. Zitnik M, Zupan B. Jumping across biomedical contexts using compressive data fusion. Bioinformatics. 2016;32(12):i90–i100. pmid:27307649
68. Zakeri P, Simm J, Arany A, ElShal S, Moreau Y. Gene prioritization using Bayesian matrix factorization with genomic and phenotypic side information. Bioinformatics. 2018;34(13):i447–i456. pmid:29949967
69. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
70. Zong N, Kim H, Ngo V, Harismendy O. Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics. 2017;33(15):2337–2344. pmid:28430977
71. Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34(4):660–668. pmid:29028931
72. Fu Ty, Lee WC, Lei Z. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management; 2017. p. 1797–1806.
73. Hu B, Fang Y, Shi C. Adversarial learning on heterogeneous information networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2019. p. 120–129.
74. Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics. 2020;36(11):3457–3465. pmid:32129827
75. Ata SK, Ou-Yang L, Fang Y, Kwoh CK, Wu M, Li XL. Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC systems biology. 2018;12(9):31–44. pmid:30598097
76. Zeng M, Zou B, Wei F, Liu X, Wang L. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In: 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). IEEE; 2016. p. 225–228.
77. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research. 2002;16:321–357.
78. Zeng M, Li M, Wu FX, Li Y, Pan Y. DeepEP: a deep learning framework for identifying essential proteins. BMC bioinformatics. 2019;20(16):1–10. pmid:31787076
79. Ellerkmann RK, Bruhn J, Soehle M, Kehrer M, Hoeft A, Kreuer S. Maximizing prediction probability PK as an alternative semiparametric approach to estimate the plasma effect-site equilibration rate constant ke0. Anesthesia & Analgesia. 2009;109(5):1470–1478. pmid:19713250
80. Jordan D, Steiner M, Kochs EF, Schneider G. A program for computing the prediction probability and the related receiver operating characteristic graph. Anesthesia & Analgesia. 2010;111(6):1416–1421. pmid:21059744
81. Wang H, Song M. Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R journal. 2011;3(2):29. pmid:27942416
82. Vargha A, Delaney HD. The Kruskal-Wallis test and stochastic homogeneity. Journal of Educational and Behavioral Statistics. 1998;23(2):170–192.
83. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Current protocols in bioinformatics. 2016;54(1):1–30. pmid:27322403
84. Coura JR, Borges-Pereira J. Chagas disease: 100 years after its discovery. A systemic review. Acta tropica. 2010;115(1-2):5–13. pmid:20382097
85. Monterroso J. Chagas disease: A review. Journal of Alternative Medicine Research. 2021;13(2):117–125.
86. Ellulu MS, Patimah I, Khaza’ai H, Rahmat A, Abed Y, Ali F. Atherosclerotic cardiovascular disease: a review of initiators and protective factors. Inflammopharmacology. 2016;24(1):1–10. pmid:26750181
87. Lin QY, Lang PP, Zhang YL, Yang XL, Xia YL, Bai J, et al. Pharmacological blockage of ICAM-1 improves angiotensin II-induced cardiac remodeling by inhibiting adhesion of LFA-1+ monocytes. American Journal of Physiology-Heart and Circulatory Physiology. 2019;317(6):H1301–H1311. pmid:31729904
88. Sumagin R, Brazil J, Nava P, Nishio H, Alam A, Luissint A, et al. Neutrophil interactions with epithelial-expressed ICAM-1 enhances intestinal mucosal wound healing. Mucosal immunology. 2016;9(5):1151–1162. pmid:26732677
89. Coto E, Reguero JR, Avanzas P, Pascual I, Martín M, Hevia S, et al. Gene variants in the NF-KB pathway (NFKB1, NFKBIA, NFKBIZ) and risk for early-onset coronary artery disease. Immunology letters. 2019;208:39–43. pmid:30902734
90. Jin SY, Luo JY, Li XM, Liu F, Ma YT, Gao XM, et al. NFKB1 gene rs28362491 polymorphism is associated with the susceptibility of acute coronary syndrome. Bioscience reports. 2019;39(4). pmid:30910844
91. Yenmis G, Oner T, Cam C, Koc A, Kucuk O, Yakicier M, et al. Association of NFKB1 and NFKBIA Polymorphisms in Relation to Susceptibility of Behçet’s Disease. Scandinavian journal of immunology. 2015;81(1):81–86. pmid:25367031
92. Kaustio M, Haapaniemi E, Göös H, Hautala T, Park G, Syrjänen J, et al. Damaging heterozygous mutations in NFKB1 lead to diverse immunologic phenotypes. Journal of Allergy and Clinical Immunology. 2017;140(3):782–796. pmid:28115215
93. Borm M, Van Bodegraven A, Mulder C, Kraal G, Bouma G. A NFKB1 promoter polymorphism is involved in susceptibility to ulcerative colitis. International journal of immunogenetics. 2005;32(6):401–405. pmid:16313306

[ref1] 1. Oprea TI, Bologa CG, Brunak S, Campbell A, Gan GN, Gaulton A, et al. Unexplored therapeutic opportunities in the human genome. Nature reviews Drug discovery. 2018;17(5):317. pmid:29472638
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Kann MG. Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Briefings in bioinformatics. 2010;11(1):96–110. pmid:20007728
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nature reviews genetics. 2004;5(2):101–113. pmid:14735121
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. Next-generation machine learning for biological networks. Cell. 2018;173(7):1581–1592. pmid:29887378
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Yao V, Kaletsky R, Keyes W, Mor DE, Wong AK, Sohrabi S, et al. An integrative tissue-network approach to identify and test human disease genes. Nature biotechnology. 2018;36(11):1091–1099. pmid:30346941
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Peng J, Hui W, Shang X. Measuring phenotype-phenotype similarity through the interactome. BMC bioinformatics. 2018;19(5):114. pmid:29671400
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Erten S, Bebek G, Koyutürk M. Vavien: an algorithm for prioritizing candidate disease genes based on topological similarity of proteins in interaction networks. Journal of computational biology. 2011;18(11):1561–1574. pmid:22035267
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Killock D. HotNet2—see the wood for the trees. Nature Reviews Clinical Oncology. 2015;12(2):66–66. pmid:25560530
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Leiserson MD, Vandin F, Wu HT, Dobson JR, Eldridge JV, Thomas JL, et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nature genetics. 2015;47(2):106–114. pmid:25501392
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref10] 10. Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics. 2017;18(9):551. pmid:28607512
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref11] 11. Ahmed R, Baali I, Erten C, Hoxha E, Kazan H. MEXCOwalk: mutual exclusion and coverage based random walk to identify cancer modules. Bioinformatics. 2020;36(3):872–879. pmid:31432076
View Article
PubMed/NCBI
Google Scholar

[42] View Article

[43] PubMed/NCBI

[44] Google Scholar

[ref12] 12. Del Sol A, Balling R, Hood L, Galas D. Diseases as network perturbations. Current opinion in biotechnology. 2010;21(4):566–571. pmid:20709523
View Article
PubMed/NCBI
Google Scholar

[46] View Article

[47] PubMed/NCBI

[48] Google Scholar

[ref13] 13. Chattopadhyay A, Lu TP. Gene-gene interaction: the curse of dimensionality. Annals of translational medicine. 2019;7(24). pmid:32042829
View Article
PubMed/NCBI
Google Scholar

[50] View Article

[51] PubMed/NCBI

[52] Google Scholar

[ref14] 14. Hu Y, Shmygelska A, Tran D, Eriksson N, Tung JY, Hinds DA. GWAS of 89,283 individuals identifies genetic variants associated with self-reporting of being a morning person. Nature communications. 2016;7(1):1–9. pmid:26835600
View Article
PubMed/NCBI
Google Scholar

[54] View Article

[55] PubMed/NCBI

[56] Google Scholar

[ref15] 15. Safari-Alighiarloo N, Taghizadeh M, Rezaei-Tavirani M, Goliaei B, Peyvandi AA. Protein-protein interaction networks (PPI) and complex diseases. Gastroenterology and Hepatology from bed to bench. 2014;7(1):17. pmid:25436094
View Article
PubMed/NCBI
Google Scholar

[58] View Article

[59] PubMed/NCBI

[60] Google Scholar

[ref16] 16. Zhou X, Menche J, Barabási AL, Sharma A. Human symptoms–disease network. Nature communications. 2014;5(1):1–10. pmid:24967666
View Article
PubMed/NCBI
Google Scholar

[62] View Article

[63] PubMed/NCBI

[64] Google Scholar

[ref17] 17. Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Briefings in bioinformatics. 2021;22(4):bbaa303. pmid:33276376
View Article
PubMed/NCBI
Google Scholar

[66] View Article

[67] PubMed/NCBI

[68] Google Scholar

[ref18] 18. Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion. 2019;50:71–91. pmid:30467459
View Article
PubMed/NCBI
Google Scholar

[70] View Article

[71] PubMed/NCBI

[72] Google Scholar

[ref19] 19. Cai H, Zheng VW, Chang KCC. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering. 2018;30(9):1616–1637.
View Article
Google Scholar

[74] View Article

[75] Google Scholar

[ref20] 20. Xiong Y, Guo M, Ruan L, Kong X, Tang C, Zhu Y, et al. Heterogeneous network embedding enabling accurate disease association predictions. BMC medical genomics. 2019;12(10):1–17. pmid:31865913
View Article
PubMed/NCBI
Google Scholar

[77] View Article

[78] PubMed/NCBI

[79] Google Scholar

[ref21] 21. Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining; 2016. p. 855–864.

[ref22] 22. Cui P, Wang X, Pei J, Zhu W. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering. 2018;31(5):833–852.
View Article
Google Scholar

[82] View Article

[83] Google Scholar

[ref23] 23. Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining; 2014. p. 701–710.

[ref24] 24. Zeng X, Ding N, Rodríguez-Patón A, Zou Q. Probability-based collaborative filtering model for predicting gene–disease associations. BMC medical genomics. 2017;10(5):45–53. pmid:29297351
View Article
PubMed/NCBI
Google Scholar

[86] View Article

[87] PubMed/NCBI

[88] Google Scholar

[ref25] 25. Alanis-Lobato G, Mier P, Andrade-Navarro M. The latent geometry of the human protein interaction network. Bioinformatics. 2018;34(16):2826–2834. pmid:29635317
View Article
PubMed/NCBI
Google Scholar

[90] View Article

[91] PubMed/NCBI

[92] Google Scholar

[ref26] 26. Wang B, Pourshafeie A, Zitnik M, Zhu J, Bustamante CD, Batzoglou S, et al. Network enhancement as a general method to denoise weighted biological networks. Nature communications. 2018;9(1):1–8. pmid:30082777
View Article
PubMed/NCBI
Google Scholar

[94] View Article

[95] PubMed/NCBI

[96] Google Scholar

[ref27] 27. Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–i466. pmid:29949996
View Article
PubMed/NCBI
Google Scholar

[98] View Article

[99] PubMed/NCBI

[100] Google Scholar

[ref28] 28. Alshahrani M, Hoehndorf R. Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes. Bioinformatics. 2018;34(17):i901–i907. pmid:30423077
View Article
PubMed/NCBI
Google Scholar

[102] View Article

[103] PubMed/NCBI

[104] Google Scholar

[ref29] 29. Ballouz S, Verleyen W, Gillis J. Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Bioinformatics. 2015;31(13):2123–2130. pmid:25717192
View Article
PubMed/NCBI
Google Scholar

[106] View Article

[107] PubMed/NCBI

[108] Google Scholar

[ref30] 30. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, et al. Massive mining of publicly available RNA-seq data from human and mouse. Nature communications. 2018;9(1):1–10. pmid:29636450
View Article
PubMed/NCBI
Google Scholar

[110] View Article

[111] PubMed/NCBI

[112] Google Scholar

[ref31] 31. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002;30(1):207–210. pmid:11752295
View Article
PubMed/NCBI
Google Scholar

[114] View Article

[115] PubMed/NCBI

[116] Google Scholar

[ref32] 32. Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic acids research. 2016; p. gkw943. pmid:27924018
View Article
PubMed/NCBI
Google Scholar

[118] View Article

[119] PubMed/NCBI

[120] Google Scholar

[ref33] 33. Thul PJ, Lindskog C. The human protein atlas: a spatial map of the human proteome. Protein Science. 2018;27(1):233–244. pmid:28940711
View Article
PubMed/NCBI
Google Scholar

[122] View Article

[123] PubMed/NCBI

[124] Google Scholar

[ref34] 34. Digre A, Lindskog C. The human protein atlas—spatial localization of the human proteome in health and disease. Protein Science. 2021;30(1):218–233. pmid:33146890
View Article
PubMed/NCBI
Google Scholar

[126] View Article

[127] PubMed/NCBI

[128] Google Scholar

[ref35] 35. Bauer-Mehren A, Bundschus M, Rautschka M, Mayer MA, Sanz F, Furlong LI. Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PloS one. 2011;6(6). pmid:21695124
View Article
PubMed/NCBI
Google Scholar

[130] View Article

[131] PubMed/NCBI

[132] Google Scholar

[ref36] 36. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic acids research. 2017;45(D1):D896–D901. pmid:27899670
View Article
PubMed/NCBI
Google Scholar

[134] View Article

[135] PubMed/NCBI

[136] Google Scholar

[ref37] 37. Yoon S, Nguyen HCT, Yoo YJ, Kim J, Baik B, Kim S, et al. Efficient pathway enrichment and network analysis of GWAS summary data using GSA-SNP2. Nucleic acids research. 2018;46(10):e60–e60. pmid:29562348
View Article
PubMed/NCBI
Google Scholar

[138] View Article

[139] PubMed/NCBI

[140] Google Scholar

[ref38] 38. Navlakha S, Kingsford C. The power of protein interaction networks for associating genes with diseases. Bioinformatics. 2010;26(8):1057–1063. pmid:20185403
View Article
PubMed/NCBI
Google Scholar

[142] View Article

[143] PubMed/NCBI

[144] Google Scholar

[ref39] 39. Peng J, Bai K, Shang X, Wang G, Xue H, Jin S, et al. Predicting disease-related genes using integrated biomedical networks. BMC genomics. 2017;18(1):1–11. pmid:28198675
View Article
PubMed/NCBI
Google Scholar

[146] View Article

[147] PubMed/NCBI

[148] Google Scholar

[ref40] 40. Valdeolivas A, Tichit L, Navarro C, Perrin S, Odelin G, Levy N, et al. Random walk with restart on multiplex and heterogeneous biological networks. Bioinformatics. 2019;35(3):497–505. pmid:30020411

[ref41] 41. Cheng L, Wang P, Tian R, Wang S, Guo Q, Luo M, et al. LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic acids research. 2019;47(D1):D140–D144. pmid:30380072

[ref42] 42. Shim JE, Hwang S, Lee I. Pathway-dependent effectiveness of network algorithms for gene prioritization. PLoS One. 2015;10(6):e0130589. pmid:26091506

[ref43] 43. Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics. 2008;82(4):949–958. pmid:18371930

[ref44] 44. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010;6(1):e1000641. pmid:20090828

[ref45] 45. Le DH, Kwon YK. Neighbor-favoring weight reinforcement to improve random walk-based disease gene prioritization. Computational biology and chemistry. 2013;44:1–8. pmid:23434623

[ref46] 46. Zhu J, Qin Y, Liu T, Wang J, Zheng X. Prioritization of candidate disease genes by topological similarity between disease and protein diffusion profiles. In: BMC bioinformatics. vol. 14. Springer; 2013. p. 1–11.

[ref47] 47. Li Y, Patra JC. Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010;26(9):1219–1224. pmid:20215462

[ref48] 48. Yang P, Li X, Wu M, Kwoh CK, Ng SK. Inferring gene-phenotype associations via global protein complex network propagation. PloS one. 2011;6(7):e21502. pmid:21799737

[ref49] 49. Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Molecular systems biology. 2008;4(1):189. pmid:18463613

[ref50] 50. Singh-Blom UM, Natarajan N, Tewari A, Woods JO, Dhillon IS, Marcotte EM. Prediction and validation of gene-disease associations using methods inspired by social network analyses. PloS one. 2013;8(5):e58977. pmid:23650495

[ref51] 51. Xie M, Xu Y, Zhang Y, Hwang T, Kuang R. Network-based phenome-genome association prediction by bi-random walk. PloS one. 2015;10(5):e0125138. pmid:25933025

[ref52] 52. Zeng X, Ding N, Zou Q. Latent factor model with heterogeneous similarity regularization for predicting gene-disease associations. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2016. p. 682–687.

[ref53] 53. Hu G, Agarwal P. Human disease-drug network based on genomic expression profiles. PloS one. 2009;4(8):e6536. pmid:19657382

[ref54] 54. Liu CC, Tseng YT, Li W, Wu CY, Mayzus I, Rzhetsky A, et al. DiseaseConnect: a comprehensive web server for mechanism-based disease–disease connections. Nucleic acids research. 2014;42(W1):W137–W146. pmid:24895436

[ref55] 55. Ideker T, Sharan R. Protein networks in disease. Genome research. 2008;18(4):644–652. pmid:18381899

[ref56] 56. Mordelet F, Vert JP. ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC bioinformatics. 2011;12(1):1–15. pmid:21977986

[ref57] 57. Nitsch D, Gonçalves JP, Ojeda F, De Moor B, Moreau Y. Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC bioinformatics. 2010;11(1):1–16. pmid:20840752

[ref58] 58. Chen Y, Wu X, Jiang R. Integrating human omics data to prioritize candidate genes. BMC medical genomics. 2013;6(1):1–12.

[ref59] 59. Chen B, Wang J, Li M, Wu FX. Identifying disease genes by integrating multiple data sources. BMC medical genomics. 2014;7(2):1–12. pmid:25350511

[ref60] 60. Zhou H, Skolnick J. A knowledge-based approach for predicting gene–disease associations. Bioinformatics. 2016;32(18):2831–2838. pmid:27283949

[ref61] 61. Ata SK, Fang Y, Wu M, Li XL, Xiao X. Disease gene classification with metagraph representations. Methods. 2017;131:83–92.

[ref62] 62. Luo P, Li Y, Tian LP, Wu FX. Enhancing the prediction of disease–gene associations with multimodal deep learning. Bioinformatics. 2019;35(19):3735–3742. pmid:30825303

[ref63] 63. Tran VD, Sperduti A, Backofen R, Costa F. Heterogeneous networks integration for disease–gene prioritization with node kernels. Bioinformatics. 2020;36(9):2649–2656. pmid:31990289

[ref64] 64. Kim P, Park A, Han G, Sun H, Jia P, Zhao Z. TissGDB: tissue-specific gene database in cancer. Nucleic acids research. 2018;46(D1):D1031–D1038. pmid:29036590

[ref65] 65. Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, et al. Understanding multicellular function and disease with human tissue-specific networks. Nature genetics. 2015;47(6):569. pmid:25915600

[ref66] 66. Luo P, Tian LP, Chen B, Xiao Q, Wu FX. Predicting gene-disease associations with manifold learning. In: International Symposium on Bioinformatics Research and Applications. Springer; 2018. p. 265–271.

[ref67] 67. Zitnik M, Zupan B. Jumping across biomedical contexts using compressive data fusion. Bioinformatics. 2016;32(12):i90–i100. pmid:27307649

[ref68] 68. Zakeri P, Simm J, Arany A, ElShal S, Moreau Y. Gene prioritization using Bayesian matrix factorization with genomic and phenotypic side information. Bioinformatics. 2018;34(13):i447–i456. pmid:29949967

[ref69] 69. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.

[ref70] 70. Zong N, Kim H, Ngo V, Harismendy O. Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics. 2017;33(15):2337–2344. pmid:28430977

[ref71] 71. Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34(4):660–668. pmid:29028931

[ref72] 72. Fu TY, Lee WC, Lei Z. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management; 2017. p. 1797–1806.

[ref73] 73. Hu B, Fang Y, Shi C. Adversarial learning on heterogeneous information networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2019. p. 120–129.

[ref74] 74. Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics. 2020;36(11):3457–3465. pmid:32129827

[ref75] 75. Ata SK, Ou-Yang L, Fang Y, Kwoh CK, Wu M, Li XL. Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC systems biology. 2018;12(9):31–44. pmid:30598097

[ref76] 76. Zeng M, Zou B, Wei F, Liu X, Wang L. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In: 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). IEEE; 2016. p. 225–228.

[ref77] 77. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research. 2002;16:321–357.

[ref78] 78. Zeng M, Li M, Wu FX, Li Y, Pan Y. DeepEP: a deep learning framework for identifying essential proteins. BMC bioinformatics. 2019;20(16):1–10. pmid:31787076

[ref79] 79. Ellerkmann RK, Bruhn J, Soehle M, Kehrer M, Hoeft A, Kreuer S. Maximizing prediction probability PK as an alternative semiparametric approach to estimate the plasma effect-site equilibration rate constant ke0. Anesthesia & Analgesia. 2009;109(5):1470–1478. pmid:19713250

[ref80] 80. Jordan D, Steiner M, Kochs EF, Schneider G. A program for computing the prediction probability and the related receiver operating characteristic graph. Anesthesia & Analgesia. 2010;111(6):1416–1421. pmid:21059744

[ref81] 81. Wang H, Song M. Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R journal. 2011;3(2):29. pmid:27942416

[ref82] 82. Vargha A, Delaney HD. The Kruskal-Wallis test and stochastic homogeneity. Journal of Educational and Behavioral Statistics. 1998;23(2):170–192.

[ref83] 83. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Current protocols in bioinformatics. 2016;54(1):1–30. pmid:27322403

[ref84] 84. Coura JR, Borges-Pereira J. Chagas disease: 100 years after its discovery. A systemic review. Acta tropica. 2010;115(1-2):5–13. pmid:20382097

[ref85] 85. Monterroso J. Chagas disease: A review. Journal of Alternative Medicine Research. 2021;13(2):117–125.

[ref86] 86. Ellulu MS, Patimah I, Khaza’ai H, Rahmat A, Abed Y, Ali F. Atherosclerotic cardiovascular disease: a review of initiators and protective factors. Inflammopharmacology. 2016;24(1):1–10. pmid:26750181

[ref87] 87. Lin QY, Lang PP, Zhang YL, Yang XL, Xia YL, Bai J, et al. Pharmacological blockage of ICAM-1 improves angiotensin II-induced cardiac remodeling by inhibiting adhesion of LFA-1+ monocytes. American Journal of Physiology-Heart and Circulatory Physiology. 2019;317(6):H1301–H1311. pmid:31729904

[ref88] 88. Sumagin R, Brazil J, Nava P, Nishio H, Alam A, Luissint A, et al. Neutrophil interactions with epithelial-expressed ICAM-1 enhances intestinal mucosal wound healing. Mucosal immunology. 2016;9(5):1151–1162. pmid:26732677

[ref89] 89. Coto E, Reguero JR, Avanzas P, Pascual I, Martín M, Hevia S, et al. Gene variants in the NF-KB pathway (NFKB1, NFKBIA, NFKBIZ) and risk for early-onset coronary artery disease. Immunology letters. 2019;208:39–43. pmid:30902734

[ref90] 90. Jin SY, Luo JY, Li XM, Liu F, Ma YT, Gao XM, et al. NFKB1 gene rs28362491 polymorphism is associated with the susceptibility of acute coronary syndrome. Bioscience reports. 2019;39(4). pmid:30910844

[ref91] 91. Yenmis G, Oner T, Cam C, Koc A, Kucuk O, Yakicier M, et al. Association of NFKB1 and NFKBIA Polymorphisms in Relation to Susceptibility of Behçet’s Disease. Scandinavian journal of immunology. 2015;81(1):81–86. pmid:25367031

[ref92] 92. Kaustio M, Haapaniemi E, Göös H, Hautala T, Park G, Syrjänen J, et al. Damaging heterozygous mutations in NFKB1 lead to diverse immunologic phenotypes. Journal of Allergy and Clinical Immunology. 2017;140(3):782–796. pmid:28115215

[ref93] 93. Borm M, Van Bodegraven A, Mulder C, Kraal G, Bouma G. A NFKB1 promoter polymorphism is involved in susceptibility to ulcerative colitis. International journal of immunogenetics. 2005;32(6):401–405. pmid:16313306
