GCNCDA: A new method for predicting circRNA-disease associations based on Graph Convolutional Network Algorithm

Growing evidence indicates that circular RNAs (circRNAs) are widely involved in the occurrence and development of diseases. Identifying associations between circRNAs and diseases plays a crucial role in exploring the pathogenesis of complex diseases and in improving their diagnosis and treatment. However, because the mechanisms linking circRNAs and diseases are complex, discovering new circRNA-disease associations by biological experiment is expensive and time-consuming. There is therefore an increasingly urgent need for computational methods that predict novel circRNA-disease associations. In this study, we propose a computational method called GCNCDA, based on the deep learning algorithm Fast learning with Graph Convolutional Networks (FastGCN), to predict potential disease-associated circRNAs. Specifically, the method first forms a unified descriptor by fusing disease semantic similarity information with disease and circRNA Gaussian Interaction Profile (GIP) kernel similarity information, based on known circRNA-disease associations. The FastGCN algorithm is then used to extract the high-level features contained in the fusion descriptor. Finally, new circRNA-disease associations are predicted by the Forest by Penalizing Attributes (Forest PA) classifier. In 5-fold cross-validation on the circR2Disease benchmark dataset, GCNCDA achieved 91.2% accuracy and 92.78% sensitivity, with an AUC of 90.90%. In comparison with different classifier models, feature extraction models and other state-of-the-art methods, GCNCDA shows strong competitiveness. Furthermore, we conducted case studies on breast cancer, glioma and colorectal cancer. Of the top 20 candidate circRNAs with the highest prediction scores, 16, 15 and 17, respectively, were confirmed by relevant literature and databases.
These results suggest that GCNCDA can effectively predict potential circRNA-disease associations and provide highly credible candidates for biological experiments.


Introduction
As a new type of endogenous non-coding RNA, circular RNA (circRNA) has a closed-loop structure lacking a 5' cap and a 3' polyadenylated tail [1][2][3]. As early as 1971, researchers discovered in potatoes a viroid genome composed of single-stranded closed RNA molecules [4]. In 1979, Hsu et al. [5] observed the presence of circRNA in the cytoplasm of eukaryotic cells by electron microscopy. In 1995, researchers [6] found that the mouse sex-determining gene Sry produces circular transcripts. These findings, however, attracted little attention at the time. It was not until 2012 that Salzman et al. [7] reported about 80 circRNAs with the help of high-throughput sequencing technology. Since then, a large number of circRNA molecules have been identified.
With the rapid development of bioinformatics and the continuous innovation of high-throughput sequencing technology, a large number of endogenous circRNAs have been found in eukaryotic cells. CircRNAs are widespread, conserved, tissue-specific and stable. Their unique structure allows them to act as microRNA sponges [8], regulators of RNA-binding proteins [9] and regulators of parental gene transcription [10]. In addition, they are involved in the development and progression of diseases such as cancer [11,12], diabetes [13], nervous system diseases [14] and atherosclerosis [15]. For example, Burd et al. [16] found that cANRIL (circular antisense non-coding RNA in the INK4 locus), an antisense transcript of the INK4/ARF gene, can inhibit the expression of INK4/ARF through specific polycomb group complexes, thereby affecting the risk of atherosclerosis. Du et al. [17] found that circ-Foxo3, a circular RNA generated from the transcription factor gene Foxo3, is highly expressed in myocardial samples from elderly patients and rats. It can arrest and relocate ID-1, E2F1, FAK and HIF1α in the cytoplasm, blocking their anti-aging functions. By establishing an HT22 cell model of oxygen-glucose deprivation/reoxygenation (OGD/R), Lin et al. [18] found that the expression of mmu-circRNA-015947 was higher than in normal cells, indicating that this circRNA is involved in OGD/R-induced neuron injury. Lukiw [19] found that the miRNA-circRNA system is dysregulated in the hippocampal CA1 region in Alzheimer's disease (AD). When the expression of CDR1as (ciRS-7) decreases or its ability to adsorb microRNA-7 weakens, the expression of miR-7 increases and directly leads to down-regulation of ubiquitin ligase A expression in the human central nervous system, thereby affecting the normal function of the central nervous system and causing serious damage to brain tissue.
Numerous studies have shown that circRNA can be a new clinical diagnostic marker or a potential target for human disease treatment. Therefore, the identification of disease-related circRNA may help to reveal the mechanism of disease occurrence and development, and further promote the understanding of complex human diseases.
As the number of detected circRNAs increases, multiple databases have been created to store information on circRNAs, such as Circ2Traits [20], circBase [21], deepBase [22] and CircNet [23]. Furthermore, researchers have gradually collected experimentally supported circRNA-disease associations and established databases such as circR2Disease [24], circRNADb [25], circRNADisease [26] and Circ2Disease [27]. The accumulation of these data provides an opportunity for computational methods to predict potential circRNA-disease associations. For example, Xiao et al. [28] proposed an integrated computational framework called MRLDC to identify disease-associated circRNAs, based on the hypothesis that circRNAs with similar functions are usually associated with similar diseases, and vice versa. Yan et al. [29] developed the DWNN-RLS method, which uses Regularized Least Squares with a Kronecker product kernel to predict circRNA-disease associations. In their experiments, this method achieved AUCs of 0.8854, 0.9205 and 0.9701 in 5-fold CV, 10-fold CV and LOOCV, respectively. Fan et al. [30] proposed the KATZHCDA model for predicting circRNA-disease associations based on a heterogeneous network constructed from disease phenotype similarity, circRNA expression profiles and Gaussian interaction profile kernel similarity. KATZHCDA reached AUC values of 0.7936 and 0.8469 in 5-fold cross-validation and LOOCV, respectively. Although these models have played important roles in the development of computational methods for circRNA-disease association prediction, they are limited by two problems: (1) the existing data are derived from incompletely related biological information, which cannot fully describe the complex association between circRNAs and diseases; (2) the experimentally verified circRNA-disease associations are limited in number and contain some noise, which easily leads the models to predict many false negative associations.
The purpose of this study is to propose a new computational model for predicting potential circRNA-disease associations that attempts to overcome these problems. The proposed model, GCNCDA, has the following advantages: (1) it makes comprehensive use of disease semantic similarity information, disease GIP kernel similarity information, circRNA GIP kernel similarity information and known circRNA-disease association information to predict potential circRNA-disease associations; (2) the high-level features of circRNA-disease associations are extracted by the deep learning FastGCN algorithm to reduce false negative associations and improve model performance. In the 5-fold cross-validation experiment on the benchmark dataset, GCNCDA achieved an AUC value of 90.90%. The results of comparative experiments show that GCNCDA is superior to other competing models and can effectively predict potential circRNA-disease associations. Furthermore, case studies show that GCNCDA can identify new circRNA-disease associations, which are validated by the latest literature and databases. It is worth noting that the performance of GCNCDA may be underestimated, because the number of experimentally verified circRNA-disease associations is limited.

Evaluation criteria
In this study, we used the 5-fold cross-validation (5-fold CV) method to evaluate the performance of the model. This method not only reduces over-fitting to a certain extent but also obtains as much effective information as possible from limited data [31]. More concretely, we first randomly divide the initial dataset into five subsets. In each round, one subset is reserved for validating the model and the other four subsets are used to train it. This process is repeated 5 times, so that each subset is used for validation exactly once. Finally, the average results of these 5 rounds are used as the performance indicators of the model. General evaluation criteria are used in this study to evaluate the performance of GCNCDA, including accuracy (Accu.), sensitivity (Sen.), precision (Prec.), F1-Score (F1) and Matthews Correlation Coefficient (MCC). They are defined as:
$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$
$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Prec.} \times \mathrm{Sen.}}{\mathrm{Prec.} + \mathrm{Sen.}}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Here, TP means true positive, TN means true negative, FP means false positive, and FN means false negative. Furthermore, we also plot the Receiver Operating Characteristic (ROC) [32,33] curves of the 5-fold CV generated by GCNCDA and calculate their average area under the ROC curve (AUC) [34].
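As a minimal sketch of how these criteria follow from the four confusion-matrix counts, the Python function below computes accuracy, sensitivity, precision, F1-Score and MCC using their standard definitions. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the GCNCDA experiments.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, precision, F1 and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts for a single validation fold (illustration only).
m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In a 5-fold CV setting, such a function would be called once per fold and the five results averaged. The sketch does not guard against empty classes (for example, a fold with no positive samples).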

Model performance evaluation
In the experiment, GCNCDA is implemented on the benchmark dataset circR2Disease to evaluate its ability to predict potential circRNA-disease associations. The detailed results of 5-fold CV are summarized in Table 1. As can be seen from the table, GCNCDA achieved an average accuracy of 91.20% with a standard deviation of 0.74%; the accuracies of the five folds were 91.86%, 91.19%, 90.85%, 90.17% and 91.95%, respectively. In terms of sensitivity, precision, F1-Score, Matthews correlation coefficient and area under the ROC curve, GCNCDA obtained 92.78%, 90.03%, 91.33%, 82.55% and 90.90%, with standard deviations of 3.03%, 2.37%, 0.78%, 1.60% and 0.81%, respectively. Fig 1 plots the ROC curves generated by GCNCDA using 5-fold CV on the circR2Disease dataset. From the experimental results, we can observe that GCNCDA performs well and can effectively predict potential disease-related circRNAs.
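The reported average and standard deviation of the accuracy can be checked directly from the five per-fold values quoted above. The short Python sketch below does this with the standard library; it assumes the sample (n − 1) form of the standard deviation, which is the form that reproduces the reported 0.74%.

```python
import statistics

# Per-fold accuracies (%) as quoted in the text.
fold_accuracy = [91.86, 91.19, 90.85, 90.17, 91.95]

mean_acc = statistics.mean(fold_accuracy)
std_acc = statistics.stdev(fold_accuracy)  # sample standard deviation (n - 1)

print(f"mean = {mean_acc:.2f}%, std = {std_acc:.2f}%")
```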

Comparison of different classifier models
To evaluate the impact of the Forest PA classifier on the overall performance of GCNCDA, we compared different classifier models in this experiment. Specifically, when constructing the different classifier models, we keep the other parts of the model unchanged, including the composition of the descriptors and the feature extraction, and only replace the Forest PA classifier with the state-of-the-art Support Vector Machine (SVM) and Random Forest (RF) classifiers, respectively. The SVM model and the RF model are thus constructed and implemented on the circR2Disease dataset using 5-fold CV. Table 2 lists the results of the 5-fold CV experiments performed by these two models.

Comparison of different feature extraction algorithms
In order to evaluate the effect of the FastGCN feature extraction algorithm on the overall performance of GCNCDA, we compared different feature extraction algorithm models in this experiment. Similar to the experiment with different classifiers, when we construct different feature extraction algorithm models, the other parts of the model are unchanged, including the composition of the descriptors and classifier. Only the Auto Covariance (AC) [35] and fast Fourier transform (FFT) [36] extraction algorithms are used instead of the FastGCN algorithm. The AC model and the FFT model are thus constructed and implemented on the circR2Disease dataset using 5-fold CV. Table 3 summarizes the results of the 5-fold CV obtained by the two models.   extract the advanced features of the fusion descriptor, thus helping to improve the performance of the model. In addition, from the comparison experiments of different classifiers and extraction algorithms, we can also see that the FastGCN algorithm is more helpful to the performance improvement of the model than the Forest PA classifier. This suggests that the FastGCN algorithm is the key to the GCNCDA model and plays an important role in predicting potential disease-associated circRNAs.

Comparison with other existing methods
At present, some researchers have established models for predicting circRNA-disease associations based on the benchmark dataset circR2Disease, including DWNN-RLS [29], KATZHCDA [30], PWCDA [37], GHICD [37] and RWRHCD [37]. To evaluate the performance of GCNCDA, we compared it with the 5-fold CV AUC results of these models. Table 4 summarizes the 5-fold CV AUC scores generated by the various models on the same benchmark dataset circR2Disease. From the table we can see that GCNCDA outperforms the other existing methods. This indicates that the GCNCDA model, which uses the FastGCN algorithm to extract features from the fused circRNA and disease information and combines them with the Forest PA classifier, can effectively improve the prediction of circRNA-disease associations.

Case studies
To demonstrate the capability of GCNCDA to predict new disease-associated circRNAs based on known circRNA-disease associations, the performance of GCNCDA was further evaluated. Specifically, all known circRNA-disease associations in the benchmark dataset $R^{+}$ were used to train GCNCDA, and the remaining unknown circRNA-disease associations were considered candidates for testing. All candidates were then ranked based on GCNCDA predictive scores for diseases including breast cancer, glioma and colorectal cancer. Finally, the predicted disease-circRNA associations were confirmed by searching the latest published literature and circRNA-disease databases.
Breast cancer is one of the most common malignant tumors in the world, and its incidence has been increasing since the late 1970s. There is increasing evidence that circRNAs can be used as effective biomarkers for the diagnosis of breast cancer. Therefore, we chose breast cancer for testing to verify the predictive ability of GCNCDA. The prediction results are shown in Table 5, from which we can see that 16 of the top 20 candidates with the highest prediction scores were confirmed by relevant literature and datasets. For example, hsa_circ_0007534, which has the highest prediction score, was confirmed by Zhou et al. [38], who reported that it can suppress the migration and invasion of the breast cancer cell line MCF-7 by targeting and down-regulating RFC3.
Glioma is one of the most common primary intracranial tumors, accounting for approximately 30% of all brain and central nervous system tumors and 80% of all malignant brain tumors. Table 6 lists the top 20 glioma-related candidate circRNAs predicted by GCNCDA with the highest scores, 15 of which were confirmed by relevant literature and datasets. For example, by combining in silico and in vitro approaches, Barbagallo et al. [39] identified CDR1-AS as a downstream target of miR-671-5p in human glioblastoma multiforme (GBM) that participates in the biopathological changes of GBM cells. This result is consistent with our prediction of the candidate with the second highest score.
Colorectal cancer is one of the common types of cancer in women, and its morbidity and mortality are among the highest in the world. According to statistics, colorectal cancer patients are widely distributed, especially in economically developed regions. We summarize in Table 7 the top 20 circRNAs predicted by the GCNCDA with the highest scores related to colorectal cancer, of which 17 were confirmed by relevant literature and datasets. For example, circ-KLDHC10 with the highest predicted score was confirmed by Yan et al. [40], and its expression level in cancer serum was significantly higher than that in the normal control group, which indicates that circ-KLDHC10 is enriched and stable in exosomes and can be a promising biomarker for cancer diagnosis.

Method overview
In this study, we propose a computational method called GCNCDA to predict potential circRNA-disease associations. The execution process of GCNCDA is divided into the following steps, and its framework is shown in Fig 6. First, we construct the disease semantic similarity matrix and the disease Gaussian interaction profile (GIP) similarity matrix according to the disease semantic similarity network and the circRNA-disease adjacency matrix. Second, we construct the circRNA GIP similarity matrix according to the circRNA similarity network and the circRNA-disease adjacency matrix. Third, the disease similarity matrix and the circRNA similarity matrix are fused by the fusion strategy to obtain a unified numerical descriptor. Fourth, we use the deep learning FastGCN algorithm to extract the high-level features of the fused data and generate the most expressive descriptor. Finally, we feed the extracted high-level features into the Forest PA classifier to predict the potential associations between circRNAs and diseases.

Benchmark dataset
In this study, we used the recently established dataset of experimentally verified circRNA-disease associations, circR2Disease [24], as the benchmark dataset to evaluate the performance of the various models. CircR2Disease is a dedicated database and comprehensive platform that collects disease-related circRNAs with experimental support. The database currently hosts 739 entries from published literature, covering 661 circRNAs and 100 diseases. The benchmark dataset can be expressed as:
$$R = R^{+} \cup R^{-}$$
where $\cup$ denotes the union symbol in set theory, $R^{+}$ represents the positive dataset, which contains 739 experimentally verified circRNA-disease associations, and $R^{-}$ represents the negative dataset, which contains 739 circRNA-disease associations without experimental verification. The circR2Disease dataset is available at http://bioinfo.snnu.edu.cn/CircR2Disease/. In the circR2Disease dataset, there are a total of 661 × 100 − 739 = 65361 circRNA-disease pairs without experimental verification. If they were all treated as negative samples, they would form an unbalanced dataset. To avoid bias in the prediction results caused by unbalanced data, we reduce the number of negative samples by downsampling. Specifically, we select 739 negative samples from all candidate negative samples using random sampling without replacement, and then combine them with the positive samples to form a balanced dataset. In theory, there may be true but unconfirmed circRNA-disease associations among these 65361 negative candidates. However, in the 739 negative samples we selected, this probability is much less than 739 / (661 × 100 − 739) ≈ 1.13%. In this way, we constructed a dataset containing 1478 samples, in which the number of positive samples is the same as that of negative samples. The known circRNA-disease associations and their names obtained from the circR2Disease database can be seen in Supplementary S1-S3 Tables.
The source code and data of GCNCDA model have been uploaded to https://github.com/look0012/GCNCDA/ for researchers to download and use.
Based on the circR2Disease dataset, we constructed a 661 × 100 adjacency matrix AM, where 661 is the number of circRNAs and 100 is the number of diseases. When circRNA c(i) is associated with disease d(j), element AM(i, j) of matrix AM is assigned a value of 1. Otherwise, it is assigned a value of 0.
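The construction of the adjacency matrix and the downsampling of negative samples can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions rather than the released GCNCDA code: the function names `build_adjacency` and `sample_negatives`, the toy association pairs and the random seed are all hypothetical.

```python
import numpy as np

def build_adjacency(pairs, n_circ, n_dis):
    """Binary adjacency matrix: AM[i, j] = 1 if circRNA i is associated with disease j."""
    am = np.zeros((n_circ, n_dis), dtype=int)
    for i, j in pairs:
        am[i, j] = 1
    return am

def sample_negatives(am, rng):
    """Draw as many unverified pairs as there are positives, without replacement."""
    positives = np.argwhere(am == 1)
    unverified = np.argwhere(am == 0)
    chosen = rng.choice(len(unverified), size=len(positives), replace=False)
    return positives, unverified[chosen]

# Toy example with 4 circRNAs and 3 diseases (hypothetical associations).
rng = np.random.default_rng(0)
toy_pairs = [(0, 1), (1, 0), (2, 2), (3, 1)]
am = build_adjacency(toy_pairs, n_circ=4, n_dis=3)
positives, negatives = sample_negatives(am, rng)
print(len(positives), len(negatives))

# Upper bound quoted in the text for the benchmark dataset.
bound = 739 / (661 * 100 - 739)
print(f"{bound:.2%}")
```

With the real dataset, `n_circ` would be 661 and `n_dis` would be 100, giving 739 positive and 739 sampled negative pairs.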

Construction of CircRNA similarity model
In this study, we used the Gaussian interaction profile (GIP) kernel similarity to construct the similarity model of circRNAs. Based on the hypothesis that circRNAs with similar functions are often associated with similar diseases, and vice versa, we established the GIP kernel similarity model of circRNAs according to the known circRNA-disease association network. Specifically, we define the binary vector V(c(i)) to represent the interaction profile of circRNA c(i). The dimension of the vector V(c(i)) is 100, which corresponds to the 100 diseases in the adjacency matrix AM. When circRNA c(i) is associated with one of the 100 diseases, the corresponding bit in vector V(c(i)) is set to 1. Otherwise, it is set to 0. That is to say, the interaction profile V(c(i)) is the row of the adjacency matrix AM corresponding to circRNA c(i). Thus, we can get the circRNA GIP kernel similarity GC(c(i), c(j)) of circRNA c(i) and circRNA c(j):
$$GC(c(i), c(j)) = \exp\left(-\theta_c \lVert V(c(i)) - V(c(j)) \rVert^{2}\right)$$
where $\theta_c$ is the width parameter, which can be calculated by normalizing the original parameter $\theta'_c$ with the following formula:
$$\theta_c = \theta'_c \Big/ \left(\frac{1}{n} \sum_{i=1}^{n} \lVert V(c(i)) \rVert^{2}\right)$$
where n is the column number of the adjacency matrix AM.
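The GIP kernel described above, exp(−θ‖V(x) − V(y)‖²) with a normalized width parameter θ, can be sketched with NumPy as follows. This is an illustration under assumptions and not the authors' implementation: the function name `gip_kernel` is hypothetical, the original width parameter is set to 1, and the normalization divides by the mean squared norm over all supplied profiles, which is the usual form of the GIP kernel.

```python
import numpy as np

def gip_kernel(profiles, theta_prime=1.0):
    """Gaussian interaction profile kernel between the rows of a binary profile matrix.

    theta_prime is the original (unnormalized) width parameter; 1.0 is an assumption.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Normalize the width by the average squared norm of the profiles.
    theta = theta_prime / sq_norms.mean()
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-theta * sq_dist)

# Toy adjacency matrix: 3 circRNAs (rows) x 3 diseases (columns), hypothetical values.
am = np.array([[1, 0, 1],
               [1, 0, 1],
               [0, 1, 0]])
gc = gip_kernel(am)      # circRNA similarity, from the rows of AM
gd = gip_kernel(am.T)    # disease similarity, from the columns of AM
print(gc.round(4))
```

CircRNAs with identical interaction profiles (rows 0 and 1 here) receive a similarity of 1, and the similarity decays as the profiles diverge. The sketch assumes at least one non-zero profile; otherwise the normalization would divide by zero.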

Construction of disease similarity model
The disease similarity model consists of two parts: the disease GIP kernel similarity and the disease semantic similarity. The disease GIP kernel similarity is constructed in the same way as the GIP kernel similarity of circRNAs. More concretely, we define a binary vector V(d(i)) to represent the interaction profile of disease d(i) according to the adjacency matrix AM derived from the circR2Disease dataset. The dimension of the vector V(d(i)) is 661, which corresponds to the 661 circRNAs in the adjacency matrix AM. When disease d(i) is associated with one of the 661 circRNAs, the corresponding bit in vector V(d(i)) is set to 1. Otherwise, it is set to 0. That is to say, the interaction profile V(d(i)) is the column of the adjacency matrix AM corresponding to disease d(i). With this definition, we can calculate the disease GIP kernel similarity GD(d(i), d(j)) of disease d(i) and disease d(j):
$$GD(d(i), d(j)) = \exp\left(-\theta_d \lVert V(d(i)) - V(d(j)) \rVert^{2}\right) \qquad (9)$$
$$\theta_d = \theta'_d \Big/ \left(\frac{1}{m} \sum_{i=1}^{m} \lVert V(d(i)) \rVert^{2}\right)$$
where $\theta_d$ is the width parameter, $\theta'_d$ is the original parameter, and m is the row number of the adjacency matrix AM.
For disease semantic similarity, we construct it from the MeSH database [41][42][43] of the National Library of Medicine (NLM), which can be downloaded at https://www.nlm.nih.gov/. The MeSH database provides a rigorous disease classification system that uses a Directed Acyclic Graph (DAG) to reflect the relationships between different diseases. The MeSH dataset can be seen in Supplementary S4 Table. In a DAG, a node represents a disease and an edge represents the relationship between diseases. The structure of a given disease d can be expressed as $DAG_d = (d, N_d, E_d)$, where $N_d$ represents the set of diseases associated with d, including disease d itself, and $E_d$ represents the relationships between these diseases. For a disease s within $DAG_d$, its contribution value $D_d(s)$ can be calculated by the following formula:
$$D_d(s) = \begin{cases} 1, & s = d \\ \max\{\mu \cdot D_d(s') \mid s' \in \text{children of } s\}, & s \neq d \end{cases}$$
where $\mu$ indicates the semantic contribution factor between disease s and its child disease s'. Following the previous study by Wang et al. [44], we set the semantic contribution factor $\mu$ to the optimal value of 0.5. Thus, by accumulating the contribution values of all diseases in $DAG_d$, we can get the semantic value DV(d):
$$DV(d) = \sum_{s \in N_d} D_d(s)$$
In general, the more nodes that are shared between the DAGs of different diseases, the more similar the diseases are. Based on this assumption, we construct the first disease semantic similarity model $SV_1$:
$$SV_1(d(i), d(j)) = \frac{\sum_{s \in N_{d(i)} \cap N_{d(j)}} \left(D_{d(i)}(s) + D_{d(j)}(s)\right)}{DV(d(i)) + DV(d(j))}$$
In disease semantic similarity model $SV_1$, we mainly consider the hierarchical relationship of the disease DAG, that is, diseases in the same layer of the DAG contribute the same value to the disease d. However, the number of DAGs in which a disease appears can also affect the semantic similarity. The fewer DAGs a disease appears in, the more important it is. Therefore, we constructed a second method for calculating the disease contribution value based on this hypothesis:
$$D'_d(s) = -\log\left(\frac{num(DAGs(s))}{num(diseases)}\right)$$
where num(DAGs(s)) denotes the number of DAGs that contain disease s, and num(diseases) denotes the number of all diseases.
Thus, the second disease semantic similarity model $SV_2(d(i), d(j))$ of disease d(i) and disease d(j) can be calculated as follows:
$$SV_2(d(i), d(j)) = \frac{\sum_{s \in N_{d(i)} \cap N_{d(j)}} \left(D'_{d(i)}(s) + D'_{d(j)}(s)\right)}{DV(d(i)) + DV(d(j))}$$
where DV(d(i)) and DV(d(j)) have the same meaning as in disease semantic similarity model $SV_1$ and can be calculated from formula 7.
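To make the first semantic similarity model more concrete, the sketch below computes contribution values and a DAG-based similarity for a toy hierarchy in Python. It follows the scheme of Wang et al. [44] as summarized in the text (a disease contributes 1 to itself, each step up the DAG multiplies the contribution by μ = 0.5, and the similarity is the shared contribution divided by the sum of the two semantic values). The function names and the four-node toy DAG are hypothetical, and only the first model is covered.

```python
def contributions(dag_parents, d, mu=0.5):
    """Contribution value of every disease in DAG_d (d itself plus all its ancestors)."""
    contrib = {d: 1.0}
    stack = [d]
    while stack:
        child = stack.pop()
        for parent in dag_parents.get(child, ()):
            value = mu * contrib[child]
            # Keep the largest contribution over all paths from d to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib

def semantic_similarity(dag_parents, a, b, mu=0.5):
    """Shared contribution of the two DAGs divided by the sum of their semantic values."""
    ca = contributions(dag_parents, a, mu)
    cb = contributions(dag_parents, b, mu)
    shared = ca.keys() & cb.keys()
    numerator = sum(ca[s] + cb[s] for s in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))

# Toy hierarchy (hypothetical): A and B are both children of C, and C is a child of R.
toy_dag = {"A": ["C"], "B": ["C"], "C": ["R"], "R": []}
sim_ab = semantic_similarity(toy_dag, "A", "B")
sim_aa = semantic_similarity(toy_dag, "A", "A")
print(sim_ab, sim_aa)
```

In this toy DAG, A and B share the ancestors C and R but not each other, so their similarity is below 1, while any disease compared with itself scores exactly 1.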

Multi-source data fusion
In order to make full use of information from different sources, we fuse the circRNA similarity information and the disease similarity information with the known circRNA-disease associations. The fused information absorbs the characteristics of the different data sources and thus describes the complex relationship between circRNAs and diseases more comprehensively.
For the circRNAs, we use the constructed circRNA GIP kernel similarity GC directly as the circRNA descriptor RSim. For the diseases, we need to fuse the disease semantic similarity models $SV_1$ and $SV_2$ with the disease GIP kernel similarity GD. Since the MeSH database provides strict disease associations, we use it as much as possible. More specifically, if there is a semantic similarity between disease d(i) and disease d(j), then the disease semantic similarity is used to construct the descriptor DSim. Otherwise, it is constructed using the disease GIP kernel similarity. This construction rule can be described by the following formula:
$$DSim(d(i), d(j)) = \begin{cases} \dfrac{SV_1(d(i), d(j)) + SV_2(d(i), d(j))}{2}, & \text{if } d(i) \text{ and } d(j) \text{ have semantic similarity} \\ GD(d(i), d(j)), & \text{otherwise} \end{cases}$$
Finally, we match the circRNA similarity RSim with the disease similarity DSim based on the known circRNA-disease associations to form a complete fusion descriptor. The fusion descriptor FV(c(i), d(j)) of circRNA c(i) and disease d(j) can be described as follows:
$$FV(c(i), d(j)) = [RSim(i), DSim(j)]$$
where RSim(i) indicates the i-th row vector, corresponding to circRNA c(i), in the circRNA similarity matrix RSim, and DSim(j) indicates the j-th column vector, corresponding to disease d(j), in the disease similarity matrix DSim.
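The fusion rule can be illustrated with a short NumPy sketch. It assumes that a combined disease semantic similarity matrix has already been computed and that a zero entry marks a disease pair without semantic similarity; both are assumptions made for this illustration, as are the function names and the toy matrices.

```python
import numpy as np

def fuse_disease_similarity(semantic, gip):
    """Use the semantic similarity where it exists, otherwise fall back to the GIP kernel.

    Assumption: a zero entry in `semantic` means no semantic similarity is available.
    """
    semantic = np.asarray(semantic, dtype=float)
    gip = np.asarray(gip, dtype=float)
    return np.where(semantic > 0, semantic, gip)

def fusion_descriptor(rsim, dsim, i, j):
    """Concatenate row i of the circRNA similarity with column j of the disease similarity."""
    return np.concatenate([rsim[i, :], dsim[:, j]])

# Toy matrices (hypothetical values): 3 circRNAs and 2 diseases.
rsim = np.eye(3)
semantic = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
gip = np.array([[1.0, 0.3],
                [0.3, 1.0]])
dsim = fuse_disease_similarity(semantic, gip)
fv = fusion_descriptor(rsim, dsim, i=0, j=1)
print(fv)
```

With the benchmark dimensions (661 circRNAs and 100 diseases), each descriptor built this way would have 661 + 100 = 761 entries.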

Feature extraction by fast learning with Graph Convolutional Networks
After obtaining the fusion descriptors, we used the Fast learning with Graph Convolutional Networks (FastGCN) algorithm to extract their features, in order to remove noise and improve the performance of the model. FastGCN is an efficient algorithm based on the original GCN and realized by importance sampling. It interprets graph convolutions as integral transforms of embedding functions under probability measures. To be specific, FastGCN interprets the graph vertices as independent and identically distributed (i.i.d.) samples of some probability distribution, and writes the loss and each convolution layer as integrals of vertex embedding functions. The integrals are then evaluated by Monte Carlo approximation to determine the sample loss and sample gradient. Finally, importance sampling is used to reduce the variance of the approximation.
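FastGCN not only eliminates the reliance on test data but also yields a controllable cost for each batch of computation. Suppose there is a graph G' with vertex set V' associated with a probability space (V', F, P). The given graph G is a subgraph of G' whose vertices are i.i.d. samples of V' drawn according to the probability measure P. For the probability space, V' is used as the sample space, and F can be any event space. The probability measure P defines a sampling distribution. Thus, the function generalization can be expressed as:
$$\tilde{h}^{(l+1)}(v) = \int \hat{A}(v, u)\, h^{(l)}(u)\, W^{(l)}\, dP(u), \qquad h^{(l+1)}(v) = \sigma\left(\tilde{h}^{(l+1)}(v)\right), \qquad l = 0, \ldots, M-1$$
where the function $h^{(l)}$ represents the embedding function of the lth layer, $W^{(l)}$ is the weight matrix of that layer, $\sigma$ is the nonlinear activation, and u and v are independent random variables with the same probability measure P. The embedding functions of two consecutive layers are related by convolution, expressed as an integral transformation whose kernel $\hat{A}(v, u)$ corresponds to the (v, u) element of the matrix $\hat{A}$. The loss L is the expected value of $g(h^{(M)})$ for the final embedding $h^{(M)}$, and can be expressed as:
$$L = E_{v \sim P}\left[g\left(h^{(M)}(v)\right)\right] = \int g\left(h^{(M)}(v)\right) dP(v)$$
For the lth layer, $t_l$ i.i.d. samples $u_1^{(l)}, \ldots, u_{t_l}^{(l)} \sim P$ are used to approximate the integral transformation:
$$\tilde{h}_{t_{l+1}}^{(l+1)}(v) = \frac{1}{t_l} \sum_{j=1}^{t_l} \hat{A}\left(v, u_j^{(l)}\right) h_{t_l}^{(l)}\left(u_j^{(l)}\right) W^{(l)}, \qquad h_{t_{l+1}}^{(l+1)}(v) = \sigma\left(\tilde{h}_{t_{l+1}}^{(l+1)}(v)\right), \qquad l = 0, \ldots, M-1$$
Here, $h_{t_0}^{(0)}$ is $h^{(0)}$. Therefore, the loss L is transformed into:
$$L_{t_0, t_1, \ldots, t_M} = \frac{1}{t_M} \sum_{i=1}^{t_M} g\left(h_{t_M}^{(M)}\left(u_i^{(M)}\right)\right)$$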
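The idea of replacing a full graph convolution by a Monte Carlo estimate over sampled vertices can be sketched as follows. This is a simplified, single-layer illustration in NumPy and not the FastGCN reference implementation. It estimates the product of the normalized adjacency matrix, the embeddings and the layer weights from `t` sampled vertices, drawing them with probability proportional to the squared column norms of the adjacency matrix (the importance distribution proposed for FastGCN) and applying a ReLU activation. The function name and toy inputs are hypothetical.

```python
import numpy as np

def sampled_gcn_layer(a_hat, h, w, t, rng):
    """One graph-convolution layer estimated from t importance-sampled vertices."""
    # Importance distribution: q(u) proportional to the squared norm of column u.
    col_norm = (a_hat ** 2).sum(axis=0)
    q = col_norm / col_norm.sum()
    sampled = rng.choice(a_hat.shape[1], size=t, p=q)
    # Unbiased Monte Carlo estimate of a_hat @ h @ w:
    # average of a_hat[:, u] * h[u] @ w / q(u) over the sampled vertices u.
    scaled = a_hat[:, sampled] / (t * q[sampled])
    return np.maximum(scaled @ h[sampled] @ w, 0.0)  # ReLU activation

# Toy inputs (random, for illustration only): 6 vertices, 4 input and 2 output features.
rng = np.random.default_rng(0)
a_hat = rng.random((6, 6))
h0 = rng.random((6, 4))
w0 = rng.random((4, 2))
h1 = sampled_gcn_layer(a_hat, h0, w0, t=3, rng=rng)
print(h1.shape)
```

Each call touches only `t` rows of the embedding matrix, which is what keeps the per-batch cost controllable; larger `t` reduces the variance of the estimate at a higher cost.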

Prediction by forest PA classifier
In the experiment, we feed the extracted features into the Forest by Penalizing Attributes (Forest PA) classifier to obtain the circRNA-disease association predictions. Forest PA is a decision forest building algorithm recently proposed by Adnan et al. [45]. It uses the complete attribute set to generate decision trees and imposes penalties on the attributes that participated in the latest decision tree. In addition, the participating attributes obtain random weights from the range of weights associated with their respective levels in the tree, so that the decision trees generated by the algorithm remain both individually accurate and diverse. The execution steps of the Forest PA algorithm are as follows:
1. Forest PA first generates a bootstrap sample D i from the original training data set D.
2. Forest PA then uses the attribute weights to generate a decision tree from the bootstrap sample. When choosing the splitting attributes, Forest PA uses the CART algorithm with merit values, where the merit value of an attribute is obtained by multiplying its classification ability by its weight.
3. The incremental values of the attribute weights and the gradient weight for the latest tree are updated iteratively. Here, only the weights of the attributes that appear in the latest tree are updated. The weights of attributes that do not appear in the latest tree remain unchanged.
The weight of an attribute is determined by the level λ at which it is tested in the latest tree: if an attribute appears at the root node, its value of λ is 1; if it appears at a child node of the root, its value of λ is 2. According to the value of λ, the weight of the attribute is generated randomly within the Weight-Range (WR) defined for that level [45].
4. Update the weights of the applicable attributes that do not appear in the latest tree with their corresponding weight increment values.

Conclusion
In this study, we proposed a new computational method called GCNCDA to predict potential circRNA-disease associations. The method makes full use of the disease semantic similarity, the disease and circRNA GIP kernel similarities and the known circRNA-disease association information, and extracts high-level abstract features from them with the deep learning FastGCN algorithm. The cross-validation results show that GCNCDA performs well on the benchmark dataset circR2Disease. In comparison with different classifier models, feature extraction algorithm models and other state-of-the-art methods, GCNCDA exhibited strong competitiveness. Furthermore, we also predicted new circRNA-disease associations based on known associations. As a result, 16, 15 and 17 of the top 20 candidate circRNAs with the highest prediction scores for breast cancer, glioma and colorectal cancer, respectively, were confirmed by relevant literature and databases. These experimental results indicate that GCNCDA is an effective method for predicting circRNA-disease associations and can provide highly reliable candidates for biological experiments. In future research, we will improve the FastGCN algorithm to help the model achieve better performance.