
Gene expression prediction based on neighbour connection neural network utilizing gene interaction graphs

  • Xuanyu Li,

    Roles Conceptualization, Investigation, Methodology, Software, Writing – original draft

    Affiliations School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China

  • Xuan Zhang ,

    Roles Funding acquisition, Writing – review & editing

    zhangx@cernet.edu.cn

    Affiliations Institute for Network Sciences and Cyberspace (INSC), Tsinghua University, Beijing, China, Zhongguancun Laboratory, Beijing, China

  • Wenduo He,

    Roles Funding acquisition, Writing – review & editing

    Affiliations Institute for Network Sciences and Cyberspace (INSC), Tsinghua University, Beijing, China, Zhongguancun Laboratory, Beijing, China

  • Deliang Bu,

    Roles Supervision, Writing – original draft

    Affiliation School of Statistics, Capital University of Economics and Business, Beijing, China

  • Sanguo Zhang

    Roles Funding acquisition, Supervision

    Affiliations School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China

Abstract

Having observed that gene expression values are correlated, the Library of Integrated Network-Based Cellular Signatures (LINCS) program selected 1,000 landmark genes from which the expression of the remaining genes can be predicted. Subsequent work has improved the prediction results by using deep learning models. However, these models ignore the latent structure of genes, which limits their accuracy. We therefore propose a novel neural network, the Neighbour Connection Neural Network (NCNN), to exploit gene interaction graph information. Compared with the popular GCN model, our model incorporates the graph information in a better manner. We validate our model under two different settings and show that it improves prediction accuracy compared with the other models.

Introduction

Gene expression data, which describe the process of converting DNA material into functional products [1], have become an important tool for medical diagnosis and for gaining insight into complex diseases [2, 3]. With advances in DNA microarray [4] and RNA-seq technologies [5, 6], the cellular response can be studied through thousands of expression profiles measured under a wide variety of conditions such as diseases, genetic mutations, and the intake of medicines and drugs. The corresponding study is called gene expression profiling.

Although large amounts of gene expression data have been collected and deposited [7, 8], whole-genome profiling is still too expensive for broad use, since it requires measuring a large number of genes under various conditions. For example, the initial phase of the CMap project produced only 564 genome-wide gene expression profiles [9]. One way to reduce the expense of whole-genome profiling is to exploit the high correlation among genes [10] and select a group of genes to represent the overall genome expression. Researchers from the LINCS program performed principal component analysis (PCA) and found that 1,000 carefully chosen genes (named landmark genes) were sufficient to recover 80% of the information in the whole genome [11]. They then developed the L1000 Luminex bead technology to measure the expression profiles of these 1,000 genes at a much lower cost. Much subsequent work has been built on this cost-effective strategy [12, 13].

Given the low cost of the L1000 assay, a natural question is how to infer the expression of the other genes, called target genes, from these landmark genes. The original LINCS approach adopts simple linear regression. Although classic and computationally efficient, linear regression cannot capture the nonlinear relationships between landmark genes and target genes. With the development of deep learning methods, Chen et al. [10] proposed a fully connected neural network method, D-GEX, and achieved better results than linear regression on both DNA microarray and RNA-Seq data.

Although D-GEX performs much better than traditional methods, it can be further improved. D-GEX uses a fully connected neural network, which implicitly assumes that landmark genes are interchangeable: the landmark gene expression data can be fed into the fully connected network in any order without affecting the final result. The motivation of this paper comes from asking whether this implicit assumption holds for gene expression data. As shown in many biological studies [14, 15], genes have an inherent structure, and cells can coordinate the regulation of many genes at once. Thus, the D-GEX model neglects the latent structure of the landmark genes, and it is beneficial to incorporate external information describing this structure into the deep learning method. The gene interaction graph, which depicts such coordination by recording functional biological interactions between pairs of genes, is a natural candidate. In a gene interaction graph, nodes represent genes and edges represent the functional biological interaction between two genes. Many gene interaction graphs have been constructed from different molecular levels (Szklarczyk et al. [16]; Warde-Farley et al. [17]): protein-protein interaction, transcription factors, and gene co-expression are common sources for constructing such graphs. In parallel, in the deep learning literature, the processing of graph data has recently drawn major interest [18]. Any neural network working on graph data can be categorized as a graph neural network (GNN); in particular, the graph convolutional network (GCN) has been a predominant approach [19].

In this paper, we briefly introduce the classical graph convolutional network structure and then compare the GCN with our method, which tackles some deficiencies of the original structure and improves the prediction accuracy. Our main contribution is a novel neural network architecture that utilizes the gene interaction graph. We explain why our method outperforms the popular graph convolutional network model and D-GEX, and we run experiments on two different datasets to validate that our model improves prediction accuracy with fewer parameters.

Related works

Deep neural network for gene expression prediction

D-GEX is a fully connected neural network that takes approximately 1,000 landmark genes as input and predicts approximately 9,000 target genes as output. For computational reasons, the target genes were separated into two parts, and D-GEX was trained independently on each part.

D-GEX was trained with the mean squared error (MSE) loss on GEO expression data, which we introduce in detail in the next section. The hyperbolic tangent (tanh) is chosen as the activation function, and other training techniques, including dropout and momentum acceleration, are also applied. Candidate models include neural networks with 1 to 3 hidden layers, each with 3,000, 6,000, or 9,000 hidden units; the models were validated on GTEx expression data with mean absolute error (MAE) as the evaluation metric.

Recently, several deep neural networks have been applied to gene expression prediction. Wang et al. [20] use conditional generative adversarial networks to model the conditional distribution of target genes given landmark genes. Kunc and Kléma [21] substitute transformative adaptive activation functions for the hyperbolic tangent to improve prediction accuracy. Wang et al. [22] use a recurrent neural network called L-GEPM to model the non-linear features of the landmark genes. However, just like D-GEX, these methods do not take the structure of the genes into consideration. Our work mitigates this issue by utilizing gene interaction graphs.

Graph neural networks using external information

Some previous work has used gene interaction graphs for gene expression prediction, yet their tasks differ somewhat from ours.

Without restricting the genes to the 943 landmark genes and 9,520 target genes, Dutil et al. [23] use a GNN to predict the expression of a single gene, which they term the Single Gene Inference task. The output in their task is the expression of only one gene, and its input genes are selected as the genes closest to the output gene in the gene interaction graph. This differs from the setting in D-GEX [10], which restricts the input and output to the predefined landmark and target genes.

A line of research has followed Dutil's work. Bertin et al. [24] find that incorporating randomly generated networks can improve prediction accuracy almost as much as incorporating biological networks, suggesting that for gene expression data biological networks may not be good prior knowledge. However, Crawford et al. [25] further show that after removing low-degree genes, biological networks perform better than random graphs. Trebacz et al. [26] use ontology embeddings of genes to improve accuracy on the Single Gene Inference task.

To the best of our knowledge, no study has incorporated the gene interaction graph to improve the prediction accuracy of the D-GEX task. In this paper, we therefore incorporate the gene interaction graph as prior information into the deep learning model to obtain better prediction accuracy.

Method

Data

Two datasets were used, the GEO dataset [10] and the GTEx dataset [27], both of which were also used in the D-GEX paper. After a pre-processing protocol similar to D-GEX, the GEO dataset consists of 111,009 gene expression profiles from the Affymetrix microarray platform, and the GTEx dataset consists of 2,921 gene expression profiles from the Illumina RNA-Seq platform. Both datasets contain 943 landmark genes as input and 9,520 target genes as output.

In the following section, we will conduct experiments on these two datasets and compare the performance of D-GEX, GCN, linear regression and our method. We conduct the task in two settings:

  1. For the microarray platform, we split the GEO dataset into 80% for training, 10% for validation and 10% for testing.
  2. For the RNA-Seq platform, we directly split the GTEx dataset into 80% for training, 10% for validation and 10% for testing.

The first setting is similar to D-GEX and illustrates that the proposed method produces better results on the microarray platform. The second setting, however, differs from D-GEX, which directly tested the GTEx dataset with a neural network trained on GEO data. As can be seen from the original D-GEX paper and from our results, the prediction error is much higher when the training set and test set come from different platforms. To further show that our method performs better on a different platform and with a different amount of data, we directly split the GTEx dataset into training and test sets.
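For concreteness, the sketch below shows how such an 80%/10%/10% split could be produced; the loading of the expression matrices and the random seed are placeholders, not details from the original experiments.

```python
# Minimal split sketch, assuming the expression matrices are NumPy arrays
# (landmark: N x 943, target: N x 9520). Seed and loading are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

def split_80_10_10(X, Y, seed=0):
    """80% train / 10% validation / 10% test, as used in both settings."""
    X_train, X_tmp, Y_train, Y_tmp = train_test_split(
        X, Y, test_size=0.2, random_state=seed)
    X_val, X_test, Y_val, Y_test = train_test_split(
        X_tmp, Y_tmp, test_size=0.5, random_state=seed)
    return (X_train, Y_train), (X_val, Y_val), (X_test, Y_test)

# Setting 1: split the GEO microarray data; setting 2: split the GTEx RNA-Seq data.
```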

In addition, we take the gene interaction graph as prior biological information when constructing the model.

In this paper, we use the STRING database [16], which is available online at https://string-db.org. The STRING database aggregates all publicly available sources of protein-protein interaction information. The edges in the STRING database have seven evidence types: three prediction channels based on genomic context information, and one channel each for co-expression, experimental evidence, text mining, and previously curated pathway and protein-complex knowledge ('databases'). In this paper, we use the co-expression protein interaction graph. With the gene-protein mapping rules, we can then obtain the gene interaction graph as prior information. The original co-expression graph contains 2,945,888 edges between 18,520 gene nodes. We obtain the sub-graph needed for our task by restricting it to the 943 landmark genes and the corresponding edges.
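The following is a hypothetical sketch of how such a landmark sub-graph could be assembled from a STRING links file; the space-separated file layout, the `coexpression` column name and the `protein2gene` mapping are assumptions rather than details taken from the paper.

```python
# Hypothetical sketch: build the landmark-gene co-expression adjacency matrix
# from a STRING links file. Column names follow the "protein.links.detailed"
# layout; the protein-to-gene-symbol mapping and the landmark gene list are
# assumed to be provided separately.
import numpy as np
import pandas as pd

def landmark_adjacency(links_file, protein2gene, landmark_genes):
    links = pd.read_csv(links_file, sep=" ")            # protein1, protein2, ..., coexpression, ...
    links = links[links["coexpression"] > 0]             # keep only co-expression evidence
    idx = {g: i for i, g in enumerate(landmark_genes)}   # 943 landmark genes
    A = np.zeros((len(landmark_genes), len(landmark_genes)))
    for p1, p2, w in zip(links["protein1"], links["protein2"], links["coexpression"]):
        g1, g2 = protein2gene.get(p1), protein2gene.get(p2)
        if g1 in idx and g2 in idx:                       # restrict to the landmark sub-graph
            A[idx[g1], idx[g2]] = A[idx[g2], idx[g1]] = w
    return A
```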

Neighbour connection neural network

The effectiveness of machine learning methods largely depends on a good representation of the data. A good representation should be concise yet informative enough to serve the learning task. Traditionally, representations were crafted by hand, a process called feature engineering [28]. Recently, deep learning has overtaken traditional machine learning in many areas, and its success is mainly due to its ability to automatically learn representations of the original data. The key to designing a successful new neural network architecture is therefore to learn a better data representation.

For machine learning tasks on image data, it is very important to learn a good representation of the image. Convolutional neural networks (CNNs), proposed by LeCun et al. [29], have achieved outstanding performance on such tasks [30].

By capturing local patterns in the image, a CNN learns a good representation. In general, image data can be viewed as a special graph consisting of a grid of nodes with RGB node attributes. It is therefore desirable to generalize the local pattern extraction technique used in CNNs to graph data in order to obtain better data representations.

However, the main difference between image data and graph data is that pixels have a regular structure: every pixel is adjacent to the 8 pixels surrounding it, so weight sharing can be used for every local-connection feature map. A general graph does not possess this property, because node degrees usually differ. We therefore propose a new method named the Neighbour Connection Neural Network (NCNN), which captures local patterns by connecting each node only to its neighbours.

We denote the graph as $\mathcal{G} = (V, E)$, where $V$ is the set of $d$ nodes and $E$ is the edge set. $A$ is a $d \times d$ adjacency matrix whose entries $A_{ij}$ (binary or weighted) denote the strength of connectivity between node $i$ and node $j$. In this paper, we consider a gene interaction graph where nodes represent genes and edges represent the biological association between genes. The attribute on each node is a univariate gene expression level, so we assume the attribute on a node is one-dimensional and denote the node attributes on the graph as $X = (X_1, X_2, \ldots, X_d)^T$.

In this paper, the original gene interaction graph we use has 943 nodes and 26,545 weighted edges. In this case, $d = 943$, $|E| = 26{,}545$, and $A_{ij}$ denotes the strength of the biological association between node $i$ and node $j$. The number of edges is so large that connecting all neighbours would not yield local pattern extraction. In order to capture the local patterns of the graph, we therefore first drop weak edges to control the sparsity of the network.

A natural way to sparsify the gene network is to set a threshold that filters out weak connections between nodes. Formally, we set

$$A_{ij} = \begin{cases} A_{ij}, & \text{if } A_{ij} \geq \delta \\ 0, & \text{if } A_{ij} < \delta \end{cases} \qquad (1)$$

The threshold is selected by running experiments and picking the value that yields the best prediction accuracy. With cross-validation over the candidate set {120, 121, …, 150}, we pick 137 as the threshold $\delta$. We use $\mathcal{N}(i)$ to denote the set of neighbours of node $i$ in the sparsified graph.
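A minimal sketch of this thresholding step, assuming the weighted adjacency matrix is held as a NumPy array:

```python
import numpy as np

def sparsify(A, delta=137.0):
    """Zero out weak co-expression edges (Eq 1); delta = 137 was chosen by validation."""
    A = A.copy()
    A[A < delta] = 0.0
    return A

def neighbours(A, i):
    """Indices j with A[i, j] > 0, i.e. the neighbour set N(i)."""
    return np.flatnonzero(A[i] > 0)
```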

Given a gene expression vector $X = (X_1, \ldots, X_d)^T$ as input, we embed it in the sparsified graph $\mathcal{G}$. Just like a multi-layer feed-forward neural network, the neighbour connection neural network has one input layer, several hidden layers and one output layer; however, the forward propagation is carried out on the graph $\mathcal{G}$. Starting from the input layer, we denote the node attributes as the layer-0 node representation $H^{(0)} = X \in \mathbb{R}^{d \times 1}$. Assuming there are $L$ hidden layers, the node representation of layer $l$ is denoted as $H^{(l)} \in \mathbb{R}^{d \times h_l}$, which differs from D-GEX: in D-GEX the hidden representation is a vector, whereas in our method the hidden features live on the nodes of the graph and take the form of a matrix ($H^{(l)}_{i,m}$ denotes hidden feature $m$ on node $i$). We denote hidden feature $m$ of all the nodes on the graph as $H^{(l)}_{:,m} \in \mathbb{R}^{d}$.

In a CNN, features are learned by locally connecting each pixel and its adjacent pixels to the next layer. Analogously, in our method the representation of a node is propagated together with that of its neighbours.

The neighbour connection forward propagation is

$$H^{(l+1)}_{:,m} = \sigma\!\left(\sum_{k=1}^{h_l} \left( W_{m,k,l}\, H^{(l)}_{:,k} + b_{m,k,l} \right)\right), \qquad (2)$$

where $W_{m,k,l} \in \mathbb{R}^{d \times d}$ is the sparse weight matrix connecting the neighbours of a node, along with the node itself, in hidden layer $l$ to that node in hidden layer $l+1$; $m$ indexes the $m$th dimension of the hidden feature vector in layer $l+1$ and $k$ indexes the $k$th dimension of the hidden feature vector in layer $l$. For every pair of one hidden-feature dimension in layer $l$ and one hidden-feature dimension in layer $l+1$, there is a weight matrix $W_{m,k,l}$ connecting them. The entries of $W_{m,k,l}$ are constrained by the sparse adjacency matrix $A$:

  1. If $A_{i,j} = 0$, then $(W_{m,k,l})_{i,j} = 0$.
  2. If $A_{i,j} > 0$, then $(W_{m,k,l})_{i,j}$ is a trainable parameter.

If $A_{i,j} = 0$, node $j$ is not a neighbour of node $i$, so there is no connection between them and $(W_{m,k,l})_{i,j} = 0$. If $A_{i,j} > 0$, node $j$ is a neighbour of node $i$, and the representation of node $i$ in layer $l+1$ incorporates node $j$'s information, so $(W_{m,k,l})_{i,j}$ is a trainable parameter. The hidden feature in layer $l+1$ is thus obtained by first linearly summing the hidden features in layer $l$, each scaled by a different weight matrix and shifted by a bias term $b_{m,k,l}$, and then transforming the linear sum with a non-linear activation function $\sigma$.

This formula can also be written in node-wise form:

$$H^{(l+1)}_{i,m} = \sigma\!\left(\sum_{k=1}^{h_l} \left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} (W_{m,k,l})_{i,j}\, H^{(l)}_{j,k} + b_{m,k,l}\right)\right) \qquad (3)$$

The node-wise form and the illustration make it clear that, by using the sparse linear transformation matrix $W_{m,k,l}$, the representation of node $i$ in layer $l+1$ is learned by locally connecting the representations of its neighbours, along with its own, in layer $l$. Moreover, for each connection from feature $k$ of node $j$ in layer $l$ to feature $m$ of node $i$ in layer $l+1$, there is a unique trainable parameter $(W_{m,k,l})_{i,j}$ that scales the corresponding neighbour information. To better present our method, we illustrate the neighbour connection of node $i$ in Fig 1.

Fig 1. Detailed illustration of neighbour connection.

This figure illustrates the neighbour connection of node $i$ in more detail. In this example, node $i$ has only one neighbour, $j$; the figure shows the linear scaling factor from hidden feature 1 in layer $l$ to hidden feature 1 in layer $l+1$ for node $i$ and its neighbouring node $j$.

https://doi.org/10.1371/journal.pone.0281286.g001
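To make the propagation rule concrete, the following is a minimal PyTorch sketch of one neighbour connection layer, implemented as a densely stored but adjacency-masked weight tensor following Eqs (2) and (3). It is an illustrative sketch rather than the authors' implementation; the tanh activation and the initialization scale are assumptions, and the dense storage is not memory-efficient for large feature dimensions.

```python
import torch
import torch.nn as nn

class NeighbourConnection(nn.Module):
    def __init__(self, adjacency, h_in=1, h_out=10):
        super().__init__()
        d = adjacency.shape[0]
        # Connections are allowed only between a node and its neighbours,
        # plus the self-connection (Eq 3 sums over N(i) and i itself).
        mask = (torch.as_tensor(adjacency, dtype=torch.float32) > 0).float()
        mask = ((mask + torch.eye(d)) > 0).float()
        self.register_buffer("mask", mask)                           # d x d
        # One trainable weight per allowed (i, j) pair, per (m, k) feature pair.
        self.weight = nn.Parameter(0.01 * torch.randn(h_out, h_in, d, d))
        self.bias = nn.Parameter(torch.zeros(h_out, h_in))

    def forward(self, H):                      # H: (batch, d, h_in)
        W = self.weight * self.mask            # zero out non-neighbour entries
        # out[b, i, m] = sum_k sum_j W[m, k, i, j] * H[b, j, k]
        out = torch.einsum("mkij,bjk->bim", W, H) + self.bias.sum(dim=1)
        return torch.tanh(out)                 # activation choice is illustrative
```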

To make full use of the representation learned by the neighbour connection layer, we flatten all the node attributes and feed them into a fully connected neural network.

In this paper, we use only one neighbour connection layer, for the following reasons:

  1. As noted above, the gene interaction signal is weaker than the expression signal, so stacking too many neighbour connection layers would degrade the gene expression information. A similar problem, called 'over-smoothing', also exists in GCNs.
  2. In practice, one neighbour connection layer is enough to capture the local pattern and already yields a significant improvement over D-GEX, so we use a single layer for simplicity.
  3. To compare with the fully connected network, we give our model the same number of layers (counting the neighbour connection layer), although with far fewer parameters. D-GEX has three layers in total, so if we used more than one neighbour connection layer, only one fully connected layer would remain; even with local features extracted, a single hidden layer would struggle to learn the global features needed for good prediction.

Since we use only one neighbour connection layer and the input representation on each node is a one-dimensional gene expression level, we rewrite the NCNN formula for $l = 0$ and $h_0 = 1$.

In this case $h_0$ in Eq (3) is 1, and the node-wise neighbour connection becomes

$$H^{(1)}_{i,m} = \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} (W_{m,1,0})_{i,j}\, X_j + b_{m,1,0}\right) \qquad (4)$$

We then flatten the matrix $H^{(1)} \in \mathbb{R}^{d \times h_1}$ into a vector:

$$H^{(1)} \leftarrow \mathrm{vec}\!\left(H^{(1)}\right) \in \mathbb{R}^{d h_1} \qquad (5)$$

The hidden feature of layer 1 is this flattened $H^{(1)}$, which is then fed into an MLP.

The structure of our NCNN model is illustrated in Fig 2.

Fig 2. Structure illustration.

The NCNN model is composed of a neighbour connection layer, a flattening operation, and fully connected layers.

https://doi.org/10.1371/journal.pone.0281286.g002
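Reusing the NeighbourConnection sketch above, the full architecture of Fig 2 can be assembled roughly as follows; the layer sizes shown are illustrative rather than the exact configurations reported later.

```python
# Sketch of the full NCNN: one neighbour connection layer, a flatten step
# (Eq 5), then a fully connected MLP with log-sigmoid activation.
import torch.nn as nn

class NCNN(nn.Module):
    def __init__(self, adjacency, n_targets=9520, h1=10, hidden=9000):
        super().__init__()
        d = adjacency.shape[0]                          # 943 landmark genes
        self.nc = NeighbourConnection(adjacency, h_in=1, h_out=h1)
        self.mlp = nn.Sequential(
            nn.Flatten(),                               # vec(H^(1)): d * h1 values
            nn.Linear(d * h1, hidden),
            nn.LogSigmoid(),
            nn.Linear(hidden, n_targets),
        )

    def forward(self, x):                               # x: (batch, 943)
        h = self.nc(x.unsqueeze(-1))                    # (batch, 943, h1)
        return self.mlp(h)                              # (batch, 9520)
```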

In order to train our model better, we adopt several training techniques including:

  1. Instead of the mean squared (L2) loss, the L1 loss is adopted as the loss function. The L2 loss is vulnerable to outliers and yields a smoothing effect [31], whereas the L1 loss is robust to outliers and can lead to more accurate predictions.
  2. We change the activation function in the MLP part of our model to the log-sigmoid function and find that it trains the NCNN better; a minimal training sketch is given after this list.
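The sketch below reflects these two choices; the optimizer, learning rate and epoch count are assumptions, not the paper's exact settings.

```python
import torch

def train(model, loader, epochs=200, lr=5e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()                  # L1 loss, robust to outliers (point 1)
    for epoch in range(epochs):
        for landmarks, targets in loader:          # landmarks: (B, 943), targets: (B, 9520)
            optimizer.zero_grad()
            loss = criterion(model(landmarks), targets)
            loss.backward()
            optimizer.step()
    return model
```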

Comparison with GCN

In the line of research on the Single Gene Inference task [23–26], the authors use graph convolutional networks (GCNs) to utilize graph information. Our method is partly motivated by the GCN but addresses some of its drawbacks, so we briefly introduce this method and illustrate the differences between the GCN and our proposed method.

Many graph convolutional networks have been proposed [19, 32–34]. Here we explain the graph convolutional network proposed by Morris et al. [34], one of the most widely used graph neural network architectures because of its simplicity and effectiveness.

In the graph convolutional network, the node representation in layer $l+1$ is propagated as

$$H^{(l+1)} = \sigma\!\left(H^{(l)}\, W_1^{(l+1)} + \hat{D}^{-1}\hat{A}\, H^{(l)}\, W_2^{(l+1)}\right) \qquad (6)$$

Here $\hat{A} = A + I$ is the adjacency matrix of the graph with self-connections, which incorporates each node's own representation into the propagation, and $\hat{D}$ is a diagonal matrix with $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$. $\sigma$ is an activation function such as the hyperbolic tangent, ReLU or sigmoid. $W_1^{(l+1)}, W_2^{(l+1)} \in \mathbb{R}^{h_l \times h_{l+1}}$ (where $h_l$ and $h_{l+1}$ are the dimensions of the node representation in layers $l$ and $l+1$, respectively) are two linear transformation matrices whose entries are all trainable parameters.

The propagation can be understood more clearly in node-wise form:

$$H^{(l+1)}_{i,:} = \sigma\!\left(H^{(l)}_{i,:}\, W_1^{(l+1)} + \frac{1}{\hat{D}_{ii}} \sum_{j \in \mathcal{N}(i) \cup \{i\}} \hat{A}_{i,j}\, H^{(l)}_{j,:}\, W_2^{(l+1)}\right) \qquad (7)$$

From this formula, we see that the graph convolutional network propagates the representation of a node by combining the neighbours' representations scaled by the edge attributes $A_{i,j}$; the shared weight matrices then transform the combined $h_l$-dimensional feature into an $h_{l+1}$-dimensional feature.

Just as for Eq (4), we consider the case $l = 0$ and $h_0 = 1$, for which the GCN formulation is

$$H^{(1)}_{i,:} = \sigma\!\left(X_i\, W_1^{(1)} + \frac{1}{\hat{D}_{ii}} \sum_{j \in \mathcal{N}(i) \cup \{i\}} \hat{A}_{i,j}\, X_j\, W_2^{(1)}\right), \qquad (8)$$

where $W_1^{(1)}, W_2^{(1)} \in \mathbb{R}^{1 \times h_1}$ are shared by all nodes.

We can see from Eqs (4) and (8) that the NCNN uses more parameters than the GCN: every node $i$ in the NCNN has its own connection weights $(W_{m,1,0})_{i,j}$, whereas in the GCN all nodes share the same weight matrices $W_1^{(1)}$ and $W_2^{(1)}$.

However, the graph convolutional network is designed to learn representations of a graph. In the settings where GCNs perform well, such as social networks [19], recommendation systems [35] and quantum chemistry [36], the edges are a key component of the graph; in other words, the edges carry a strong signal. In such settings, simply scaling node attributes with edge attributes and combining them learns a good representation efficiently.

In our task, the gene expression data play a much more important role, while the edges are only side information and do not carry a strong connection signal [16]. Linearly combining the representations of adjacent nodes scaled by the edge attributes therefore degrades the important signal of the original node input, because the node features are diluted by the less informative edge weights.
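For contrast with the neighbour connection layer sketched earlier, the following is a minimal dense sketch of the shared-weight graph convolution of Eq (6). The self-connection handling and degree normalization shown here are one common variant and are assumptions rather than the exact formulation used in the experiments.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, adjacency, h_in, h_out):
        super().__init__()
        A = torch.as_tensor(adjacency, dtype=torch.float32)
        A_hat = A + torch.eye(A.shape[0])               # add self-connections
        D_hat = A_hat.sum(dim=1, keepdim=True)          # node degrees
        self.register_buffer("A_norm", A_hat / D_hat)   # D^{-1} (A + I)
        self.w1 = nn.Linear(h_in, h_out, bias=False)    # transforms the node itself
        self.w2 = nn.Linear(h_in, h_out, bias=True)     # transforms aggregated neighbours

    def forward(self, H):                               # H: (batch, d, h_in)
        # All nodes share w1 and w2, unlike NCNN's per-edge weights.
        return torch.tanh(self.w1(H) + self.w2(self.A_norm @ H))
```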

Results and discussion

GEO dataset

We first compare the results of the different models, D-GEX, NCNN, GCN and linear regression (LR), in setting 1. To show the advantage of the proposed NCNN model over the others, we conduct the experiments with approximately the same numbers of hidden nodes and hidden layers, across different combinations of node and layer counts.

More precisely, as mentioned above, we use only one neighbour connection layer in order to compare with D-GEX. After the neighbour connection layer, the representation is fed into an MLP identical in form to D-GEX.

To match approximately the same number of hidden nodes and hidden layers, say $H$ hidden nodes and $L$ hidden layers, we use one neighbour connection layer whose $h_1$ is set to $\lceil H/d \rceil$. After neighbour connection and flattening, the length of $H^{(1)}$ is $d \lceil H/d \rceil$, approximately equal to the number of hidden nodes $H$ in the D-GEX model. $H^{(1)}$ is then fed into an MLP with $H$ hidden nodes per layer and $L - 1$ hidden layers. The GCN model is configured similarly to the NCNN.
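As a worked example of this size matching (with illustrative numbers):

```python
# With d = 943 landmark genes and a D-GEX setting of H = 9000 hidden nodes,
# one neighbour connection layer uses h1 = ceil(H / d) features per node.
import math

d, H = 943, 9000
h1 = math.ceil(H / d)        # 10 hidden features per node
flat = d * h1                # 9430, roughly matching H after flattening
print(h1, flat)
```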

Table 1 makes clear that our proposed model has better prediction performance than the other models. The linear regression model performs worst because it only captures the linear relationship between the input and output variables, whereas the deep learning models, D-GEX, NCNN and GCN, model the non-linearity between the variables and make better predictions.

Table 1. MAE comparison between D-GEX and NCNN model in the prediction of GEO data when varying the number of hidden layers and hidden size.

https://doi.org/10.1371/journal.pone.0281286.t001

Among the three deep learning methods, the best result of D-GEX is achieved with 3 hidden layers of 9,000 nodes, with a mean absolute error of 0.3204. GCN achieves its best result with 2 hidden layers of 9,000 nodes, with a mean absolute error of 0.3473, worse than the original D-GEX. Our proposed NCNN performs best with 2 hidden layers of 9,000 nodes, with a mean absolute error of 0.3023. NCNN thus shows its advantage over the other methods not only in prediction accuracy but also in the number of parameters. As analyzed above, although the GCN model utilizes the gene interaction information, the way it incorporates this side information degrades the original gene expression signal. The NCNN, in contrast, uses a distinct parameter to scale each piece of neighbour information, so it makes better use of the gene interaction graph. It needs only 2 hidden layers, one neighbour connection layer and one fully connected layer, to achieve the best results.

Fig 3 shows the density plot of the per-gene mean absolute error for the four models. The peak of the NCNN density lies at a smaller MAE than those of the competing methods, and the NCNN density is higher than the competing methods when the MAE is small and lower when the MAE is large, showing that NCNN makes precise predictions on more genes and poor predictions on fewer genes.

Fig 3. Density plot of different models with the best structure.

https://doi.org/10.1371/journal.pone.0281286.g003

Figs 4 and 5 compare the per-gene errors of the models. The NCNN clearly outperforms the other models at the gene-wise level: it makes better predictions than the MLP on 99.66% of genes and better predictions than the GCN on 99.57% of genes.

Fig 4. Comparison of the mean absolute error of predictions on each gene in the GEO dataset.

The y value is the mean absolute error of the NCNN model on each target gene; the x value is the mean absolute error of the MLP model on each target gene. Each dot below the diagonal means that the NCNN has a lower prediction error on that gene than the comparing method.

https://doi.org/10.1371/journal.pone.0281286.g004

Fig 5. Comparison of the mean absolute error of predictions on each gene in the GEO dataset.

The y value is the mean absolute error of the NCNN model on each target gene; the x value is the mean absolute error of the GCN model on each target gene. Each dot below the diagonal means that the NCNN has a lower prediction error on that gene than the comparing method.

https://doi.org/10.1371/journal.pone.0281286.g005

We analyze the complexity of the linear transformation parameters in the three models, as shown in Table 2. In the fully connected layer used in D-GEX, if the input length is $d$ and the output length is $n$, the number of parameters is $dn$. In the NCNN, we embed the $d$-dimensional input into a graph with $d$ nodes and perform neighbour connection. The number of parameters is $h_1 \sum_{i=1}^{d} \left(|\mathcal{N}(i)| + 1\right)$ with $h_1 = \lceil n/d \rceil$; since we use a sparse connection graph, $\sum_{i}\left(|\mathcal{N}(i)| + 1\right)$ has complexity $O(d)$, so the overall number of parameters has complexity $O(n)$, which is less than the $dn$ parameters of the fully connected layer. The GCN also embeds the $d$-dimensional input into a graph with $d$ nodes and performs convolution; it uses only $2\lceil n/d \rceil = O(n/d)$ parameters, the fewest among the three methods.
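A back-of-the-envelope version of this comparison; the sparsified adjacency below is a placeholder, since the number of edges retained after thresholding is not restated here.

```python
# First-layer parameter counts (cf. Table 2). d is the input length, n the
# output length of the layer, A_sparse the thresholded adjacency matrix.
import math
import numpy as np

d, n = 943, 9000
h1 = math.ceil(n / d)                            # features per node, so d * h1 ~ n

A_sparse = np.zeros((d, d))                      # placeholder sparsified adjacency
nonzero_pairs = int((A_sparse > 0).sum())        # retained neighbour pairs

full_connection = d * n                          # D-GEX dense layer
ncnn = h1 * (nonzero_pairs + d)                  # one weight per retained pair plus self, per feature
gcn = 2 * h1                                     # two shared 1 x h1 weight matrices
print(full_connection, ncnn, gcn)
```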

Table 2. The parameter complexity of different models and the number of parameters they actually use.

https://doi.org/10.1371/journal.pone.0281286.t002

GTEx dataset

We then compare the results of the different models, D-GEX, NCNN, GCN and linear regression (LR), in setting 2 (Table 3). As in setting 1, we conduct the experiments with approximately the same numbers of hidden nodes and hidden layers, across different combinations of node and layer counts.

Table 3. MAE comparison between D-GEX and NCNN model in the prediction of GTEx data when varying the number of hidden layers and hidden size.

https://doi.org/10.1371/journal.pone.0281286.t003

As in setting 1, the deep learning models outperform the traditional linear regression model.

Among the three deep learning methods, the best result of D-GEX is achieved with 2 hidden layers of 6,000 nodes, with a mean absolute error of 0.2084. GCN achieves its best result with 1 hidden layer of 6,000 nodes, with a mean absolute error of 0.2166, still worse than the original D-GEX. Our proposed NCNN performs best with 3 hidden layers of 9,000 nodes, with a mean absolute error of 0.2043.

The best-performing structures of the three deep learning models in setting 2 differ from those in setting 1: the D-GEX and GCN models achieve their best performance with fewer nodes and layers. This is probably due to the small sample size of the dataset, which leads to over-fitting. As mentioned above, the MLP model neglects the order of the input variables and over-fits easily. The GCN model takes the structure into account, but it uses few parameters and degrades the neighbour information by scaling with edge attributes. Our proposed NCNN not only takes the structure of the variables into account but also has distinct parameters for the neighbours of each node on the graph, so it does not over-fit as easily. In setting 2, where samples are few, it achieves the best results with more layers, indicating that it learns a better representation that supports better predictions.

The difference in best-performing structures indicates that the GEO and GTEx datasets are quite distinct. Nevertheless, our proposed NCNN model still outperforms the other models, demonstrating its advantage across platforms.

We also analyze the mean absolute error on each gene in the GTEx dataset. Fig 6 shows the density plot of the per-gene mean absolute error for the four models. The peak of the NCNN density still lies at a smaller MAE than those of the competing methods, although the difference between the MLP and NCNN is not as large as in the GEO dataset, mainly because the GTEx data were generated on a completely different platform from the GEO data. The NCNN density is higher than the competing methods when the MAE is small and lower when the MAE is large, showing that NCNN makes precise predictions on more genes and poor predictions on fewer genes.

Fig 6. Density plot of different models with the best structure.

https://doi.org/10.1371/journal.pone.0281286.g006

Figs 7 and 8 compare the per-gene errors of the models. The NCNN still outperforms the other models at the gene-wise level: it makes better predictions than the MLP on 59.64% of genes and better predictions than the GCN on 81.51% of genes. The margin is smaller than on the GEO data, which we again attribute to the difference between the two data collection platforms.

Fig 7. Comparison of the mean absolute error of predictions on each gene in the GTEx dataset.

The y value is the mean absolute error of the NCNN model on each target gene; the x value is the mean absolute error of the MLP model on each target gene. Each dot below the diagonal means that the NCNN has a lower prediction error on that gene than the comparing method.

https://doi.org/10.1371/journal.pone.0281286.g007

Fig 8. Comparison of the mean absolute error of predictions on each gene in the GTEx dataset.

The y value is the mean absolute error of the NCNN model on each target gene; the x value is the mean absolute error of the GCN model on each target gene. Each dot below the diagonal means that the NCNN has a lower prediction error on that gene than the comparing method.

https://doi.org/10.1371/journal.pone.0281286.g008

Conclusion and future work

Motivated by the special structure of genes, this paper proposes a novel neural network, the Neighbour Connection Neural Network, which utilizes gene interaction graph information. Our method not only learns a good representation of the gene expression values by imposing a reasonable inductive bias on the gene data, but also outperforms competing methods that either ignore the structure of genes, such as the MLP, or incorporate the structural information poorly, such as the GCN.

We validate the performance of our model on two datasets, GEO and GTEx. To achieve its best results on the GEO dataset, the proposed NCNN has far fewer parameters than D-GEX: it uses only 1.18% of the parameters of the first weight matrix in D-GEX and one fewer hidden layer, yet improves the mean absolute error by 5.6%. At the gene-wise level, our model has a lower mean absolute error than D-GEX on 99.66% of the genes. Although the NCNN has more parameters than the GCN, its mean absolute error is 12.8% lower, and at the gene-wise level it has a lower mean absolute error than the GCN on 99.57% of the genes.

To achieve its best results on the GTEx dataset, the proposed NCNN has more parameters than D-GEX: it uses only 1.18% of the parameters of the first weight matrix in D-GEX, but one more hidden layer and more hidden nodes. NCNN improves the mean absolute error of D-GEX by 5.6%. At the gene-wise level, NCNN has a lower mean absolute error than D-GEX on 60.23% of the genes. Compared with the GCN model, the mean absolute error of NCNN is 4.33% lower, and at the gene-wise level NCNN has a lower mean absolute error than the GCN on 99.78% of the genes.

This indicates that gene expression data may have an inner structure, and that this information can be exploited to help the deep learning model learn a better representation.

The proposed NCNN has the advantages of better representation learning and fewer parameters than the fully connected network, but it also has the drawback that, in practice, the running time is rather long due to the lack of an efficient implementation. Moreover, this is only a preliminary study of how to utilize the gene graph information with our model. A number of directions remain for future work:

  • There are many types of gene networks; in this paper we only use the STRING network. Integrating different types of graphs is a natural next step.
  • We use only one neighbour connection layer for simplicity. Although our method has fewer parameters than the fully connected layer, our PyTorch implementation is not computationally efficient; further work could accelerate the training process.
  • This paper only considers the graph structure of the input layer; whether the topology of the output variables can also be incorporated into the network for further improvement remains to be studied.
  • Our method generalizes the local pattern extraction of CNNs, while other CNN components such as pooling and sub-sampling have not yet been generalized and could be explored.

References

  1. Pirim H, Ekşioğlu B, Perkins AD, Yüceer Ç. Clustering of high throughput gene expression data. Computers & Operations Research. 2012;39(12):3046–3061. pmid:23144527
  2. Lee WC, Diao L, Wang J, Zhang J, Roarty EB, Varghese S, et al. Multiregion gene expression profiling reveals heterogeneity in molecular subtypes and immunotherapy response signatures in lung cancer. Modern Pathology. 2018;31(6):947–955. pmid:29410488
  3. Eaves IA, Wicker LS, Ghandour G, Lyons PA, Peterson LB, Todd JA, et al. Combining mouse congenic strains and microarray gene expression analyses to study a complex trait: the NOD model of type 1 diabetes. Genome Research. 2002;12(2):232–243. pmid:11827943
  4. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270(5235):467–470. pmid:7569999
  5. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344–1349. pmid:18451266
  6. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453(7199):1239–1243. pmid:18488015
  7. Shin G, Kang TW, Yang S, Baek SJ, Jeong YS, Kim SY. GENT: gene expression database of normal and tumor tissues. Cancer Informatics. 2011;10:CIN–S7226. pmid:21695066
  8. Zahn JM, Poosala S, Owen AB, Ingram DK, Lustig A, Carter A, et al. AGEMAP: a gene expression database for aging in mice. PLoS Genetics. 2007;3(11):e201. pmid:18081424
  9. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313(5795):1929–1935. pmid:17008526
  10. Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016;32(12):1832–1839. pmid:26873929
  11. Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017;171(6):1437–1452. pmid:29195078
  12. Choobdar S, Ahsen ME, Crawford J, Tomasoni M, Fang T, Lamparter D, et al. Assessment of network module identification across complex diseases. Nature Methods. 2019;16(9):843–852. pmid:31471613
  13. Niepel M, Hafner M, Duan Q, Wang Z, Paull EO, Chung M, et al. Common and cell-type specific responses to anti-cancer drugs revealed by high throughput transcript profiling. Nature Communications. 2017;8(1):1–11. pmid:29084964
  14. Yip AM, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics. 2007;8(1):1–14. pmid:17250769
  15. Dong J, Horvath S. Understanding network concepts in modules. BMC Systems Biology. 2007;1(1):1–20. pmid:17547772
  16. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research. 2019;47(D1):D607–D613. pmid:30476243
  17. Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research. 2010;38(suppl_2):W214–W220. pmid:20576703
  18. Wu L, Cui P, Pei J, Zhao L, Guo X. Graph neural networks: foundation, frontiers and applications. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; 2022. p. 4840–4841.
  19. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2016.
  20. Wang X, Ghasedi Dizaji K, Huang H. Conditional generative adversarial network for gene expression inference. Bioinformatics. 2018;34(17):i603–i611. pmid:30423066
  21. Kunc V, Kléma J. On transformative adaptive activation functions in neural networks for gene expression inference. PLoS ONE. 2021;16(1):e0243915. pmid:33444316
  22. Wang H, Li C, Zhang J, Wang J, Ma Y, Lian Y. A new LSTM-based gene expression prediction model: L-GEPM. Journal of Bioinformatics and Computational Biology. 2019;17(04):1950022. pmid:31617459
  23. Dutil F, Cohen JP, Weiss M, Derevyanko G, Bengio Y. Towards gene expression convolutions using gene interaction graphs. arXiv preprint arXiv:1806.06975. 2018.
  24. Bertin P, Hashir M, Weiss M, Frappier V, Perkins TJ, Boucher G, et al. Analysis of gene interaction graphs as prior knowledge for machine learning models. arXiv preprint arXiv:1905.02295. 2019.
  25. Crawford J, Greene CS. Graph biased feature selection of genes is better than random for many genes. bioRxiv. 2020.
  26. Trebacz M, Shams Z, Jamnik M, Scherer P, Simidjievski N, Terre HA, et al. Using ontology embeddings for structural inductive bias in gene expression data analysis. arXiv preprint arXiv:2011.10998. 2020.
  27. GTEx Consortium, Ardlie KG, Deluca DS, Segrè AV, Sullivan TJ, Young TR, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–660.
  28. Scott S, Matwin S. Feature engineering for text classification. In: ICML. vol. 99. Citeseer; 1999. p. 379–388.
  29. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
  30. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2017;60(6):84–90.
  31. Xue Y, Xu T, Zhang H, Long LR, Huang X. SegAN: adversarial network with multi-scale L1 loss for medical image segmentation. Neuroinformatics. 2018;16(3):383–392. pmid:29725916
  32. Defferrard M, Bresson X, Vandergheynst P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems. 2016;29.
  33. Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems. 2017;30.
  34. Morris C, Ritzert M, Fey M, Hamilton WL, Lenssen JE, Rattan G, et al. Weisfeiler and Leman go neural: higher-order graph neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33; 2019. p. 4602–4609.
  35. Ying R, He R, Chen K, Eksombatchai P, Hamilton WL, Leskovec J. Graph convolutional neural networks for web-scale recommender systems. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018. p. 974–983.
  36. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In: International Conference on Machine Learning. PMLR; 2017. p. 1263–1272.