iCDA-CGR: Identification of circRNA-disease associations based on Chaos Game Representation

Recent research has shown that circular RNAs (circRNAs) regulate tumor cell invasion, proliferation, and other biological processes. Understanding the association between circRNAs and diseases is an important way to explore the pathogenesis of complex diseases and promote disease-targeted therapy. Most methods based on the analysis of high-throughput expression data, such as k-mer and PSSM, tend to assume that functionally similar nucleic acids lack direct linear homology; they ignore positional information and quantify only nonlinear sequence relationships. However, in many complex diseases, the nonlinear sequence relationship between pathogenic and ordinary nucleic acids differs very little. Therefore, analyzing positional information can help to predict the complex associations between circRNAs and diseases. To fill this gap, we propose a new method, named iCDA-CGR, to predict circRNA-disease associations. In particular, we introduce circRNA sequence information and, for the first time in a circRNA-disease prediction model, quantify the nonlinear sequence relationship of circRNAs with Chaos Game Representation (CGR) technology based on biological sequence position information. In the cross-validation experiment, our method achieved an AUC of 0.8533, which was significantly higher than that of other existing methods. In the validation on independent data sets including circ2Disease, circRNADisease and CRDD, the prediction accuracy of iCDA-CGR reached 95.18%, 90.64% and 95.89%, respectively. Moreover, in the case studies, 19 of the top 30 circRNA-disease associations predicted by iCDA-CGR on the circR2Disease dataset were confirmed by newly published literature. These results demonstrate that iCDA-CGR has outstanding robustness and stability, and can provide highly credible candidates for biological experiments.

Introduction
Circular RNA (circRNA) is a type of non-coding RNA without a 5' end cap or a 3' end poly(A) tail [1]. Since the discovery of circRNA in RNA viruses 40 years ago, more than 100,000 circRNAs have been found in cells [2]. With the rapid development of RNA sequencing (RNA-seq) technology and bioinformatics, more and more studies have shown that circRNA plays an important role in many cell activities, including affecting arteriosclerosis, participating in the regulation of mRNA expression, and regulating alternative splicing [3][4][5][6][7][8]. In addition, some evidence suggests that some diseases may be related to abnormal expression of circRNA. Zhou et al. found that circRNA_010567 promotes myocardial fibrosis by suppressing miR-141, which targets TGF-beta1 [9]. Meanwhile, Liang et al. discovered that circ-ABCB10 promotes breast cancer proliferation and progression by sponging miR-1271 [10]. Many scholars believe that many circRNAs can be used as tumor markers and therapeutic targets in clinical applications [11]. For these reasons, confirming potential circRNA-disease associations has gradually become a research hotspot in recent years. However, the high cost and long cycle of traditional experimental methods prevent them from verifying the association between circRNAs and diseases on a large scale. To solve this problem, computational methods have emerged [12][13][14][15][16].
In recent years, in order to unify the standards of circRNAs obtained by experiment, many databases have been established, such as circBase, CIRCpedia, deepBase, CircNet and circRNADb [17][18][19][20][21]. These databases provide essential biological information about circRNAs, such as sequencing data and gene targets. In addition, several databases collect circRNAs that have been shown to be associated with various diseases, including CircR2Disease, circRNADisease, circFunBase, and Circ2Disease [22][23][24][25]. These databases provide data support for selecting candidate circRNA-disease associations by computational methods. For example, Xiao et al. proposed a weighted dual-manifold regularized calculation model named MRLDC, which integrates geometric information and the intrinsic diversity of the circRNA and disease feature spaces [26]. Although this method has achieved good results, only 331 associations were available for training the model, and such a small number of training samples may lead to insufficient robustness. In addition, MRLDC only describes the behavior information in the circRNA-disease association network, and cannot directly and accurately measure circRNA similarity and disease similarity from the attributes of circRNAs and diseases. Fan et al. proposed a computational model of KATZ measures for human circRNA-disease association prediction (KATZHCDA) using a heterogeneous network [27]. This model also lacks sufficient training samples: it used 275 circRNAs, 36 diseases, and 312 associations. Although KATZHCDA uses circRNA expression profile information, its performance is still limited. Compared with the above two models, GHICD and RWRHCD have relatively sufficient training samples: they used 541 circRNAs, 83 diseases, and 592 associations [28]. It is worth noting that although they have achieved some success and used the circRNA-gene association network to describe the attribute information of circRNAs, their accuracy is still limited because the association network formed by circRNAs and genes is very sparse.
Through the above analysis, we can see that although current computational models have achieved good results, they also have some defects. First, the training data used by current models is limited, which affects the robustness of the models. The lack of training data also limits coverage: the potential associations that these models can predict number only around 10,000. Secondly, they are mainly based on a single data description method and do not integrate the behavior information and attribute information of circRNAs and diseases in the network to comprehensively define their features, resulting in limited prediction performance. Finally, they do not take circRNA sequence information into account and cannot accurately measure circRNA similarity. Therefore, to overcome the drawbacks of current computational models, we propose the iCDA-CGR model to identify circRNA-disease associations based on Chaos Game Representation. By introducing the circFunBase database and sequence information, the problems of limited model coverage and limited predictive performance are addressed. iCDA-CGR integrates multi-source information, including circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information. In particular, iCDA-CGR extracts biological sequence position information and quantifies the nonlinear sequence relationship of circRNAs by Chaos Game Representation (CGR) technology [29].

Specifically, iCDA-CGR first computes the disease semantic similarity and the disease Gaussian interaction profile kernel (GAS) similarity and combines them to construct the disease fusional similarity. Secondly, the method quantifies position and nonlinear sequence information through CGR technology and calculates the similarity and difference of circRNAs by the Pearson correlation coefficient. Thirdly, the circRNA sequence-based similarity, circRNA gene-based similarity and circRNA GAS similarity are integrated into the circRNA fusional similarity. Fourthly, feature descriptors are formed from the circRNA fusional similarity and the disease fusional similarity. Finally, iCDA-CGR feeds the feature descriptors into a support vector machine to predict potential circRNA-disease associations. The workflow of iCDA-CGR is shown in Fig 1. We verified the reliability of the method with five-fold cross-validation on the CircR2Disease database. The average area under the curve (AUC) of our method is 85.14% and the prediction accuracy is 81.12%.

Our source code and data can be downloaded from GitHub (https://github.com/look0012/iCDA-CGR). The repository contains the datasets, the algorithm code and the models. For the convenience of readers, we also provide an easy-to-use version: the user only needs to enter the circRNA and disease names to be predicted in the provided code to perform the prediction. The list of circRNAs and diseases is also included in the published files, and users can use the list to find the associations they need. There are two models in this version, trained on circR2Disease and CircFunBase respectively. Among them, iCDA-CGR (circR2Disease) can predict 46,825 unconfirmed associations, and iCDA-CGR (CircFunBase) can provide predictive scores for approximately 170,000 unconfirmed associations. We hope that these improvements will better serve circRNA researchers and help advance the field.

Data sets
Benchmark database of circRNA-disease associations. In the past year, a number of benchmark databases have been proposed for collecting circRNA-disease associations, such as circR2Disease, circRNADisease, circFunBase, and Circ2Disease, which contain experimentally validated associations between diseases and circRNAs [22][23][24]. In this article, circR2Disease and circFunBase are used as the benchmark data sets. The detailed description is as follows.

circR2Disease. To evaluate the reliability of our method, the widely used benchmark set circR2Disease was selected. The dataset was preprocessed because it contains duplicated entries and non-human circRNA-disease associations. Specifically, after removing the circRNAs whose gene symbol could not be found, we obtained 612 confirmed circRNA-disease associations involving 533 circRNAs and 89 diseases, as shown in Table 1. The base dataset circR2Disease can be defined as:

$$Z^1 = Z^1_p \cup Z^1_n$$

where $Z^1_p$ is the positive subset constructed from the 612 confirmed circRNA-disease associations, and $Z^1_n$ is the negative subset containing 612 associations selected from the unconfirmed pairs among all 47437 (533 × 89) possible circRNA-disease pairs.
$\cup$ denotes the set union. The known circRNA-disease associations and their names obtained from the circR2Disease database can be seen in S1-S3 Tables.

circFunBase. CircFunBase is a database that provides high-quality functional circRNA resources, yet few models have used it. To address the small prediction coverage of current models, we also performed experiments on this dataset. After removing circRNAs that did not match the gene symbols, 2984 confirmed circRNA-disease associations were obtained, involving 2597 circRNAs and 67 diseases, as shown in Table 1. The benchmark dataset circFunBase can be defined as:

$$Z^2 = Z^2_p \cup Z^2_n$$

where $Z^2_p$ is the positive subset constructed from the 2984 confirmed circRNA-disease associations, and $Z^2_n$ is the negative subset containing 2984 associations selected from all 168031 unconfirmed associations between diseases and circRNAs.

CircRNAs and their sequence information. Sequence information and gene symbol information for circRNAs are provided by many public databases such as circBase, CIRCpedia, deepBase, CircNet and circRNADb [17][18][19][20][21]. To construct a more complete circRNA sequence dataset, we downloaded circRNA sequence information from circBase, which is freely accessible via the web server http://www.circbase.org/.

Related work
Chaos Game Representation (CGR). CGR is an iterative mapping technique for processing sequences [29]. The first advantage of this algorithm is that the original sequence can be completely recovered from the coordinates, which means that no information is lost in the mapping. Secondly, each sequence has a unique mapping, which means that positional information is preserved. For these reasons, CGR is suitable for transforming nucleotide sequences. The position $P_i$ is calculated by:

$$P_i = P_{i-1} + \nu\,(g_i - P_{i-1}), \quad i = 1, \ldots, n_{seq}$$

where $\nu$ is the nucleotide contribution factor, which we set to 0.5, and $g_i$ is the nucleotide position factor of the ith nucleotide: A, C, G, T correspond to (0,0), (0,1), (1,1), (1,0) respectively. $n_{seq}$ is the length of the sequence and $P_0 = (0.5, 0.5)$.
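The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function name `cgr_points` is hypothetical, mapping U to T and skipping ambiguous symbols such as N are assumptions, and the update rule is the standard CGR midpoint step with contribution factor 0.5 and starting point (0.5, 0.5) as stated in the text.

```python
# Corner assignment taken from the text: A(0,0), C(0,1), G(1,1), T(1,0).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, nu=0.5, start=(0.5, 0.5)):
    """Return the CGR points P_1..P_n for a nucleotide sequence.

    Each step moves the current point a fraction `nu` of the way
    towards the corner of the next nucleotide.
    """
    x, y = start
    points = []
    # Assumption: RNA letters are normalised by mapping U to T.
    for base in sequence.upper().replace("U", "T"):
        if base not in CORNERS:
            continue  # assumption: skip ambiguous symbols such as N
        gx, gy = CORNERS[base]
        x += nu * (gx - x)
        y += nu * (gy - y)
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(cgr_points("ACGT"))
```

Because every step halves the distance to a corner, the last point alone determines the whole sequence when read backwards, which is the lossless property mentioned in the text.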

Similarity between diseases
Disease semantic similarity. The Medical Subject Headings (MeSH) database categorizes diseases rigorously, which helps to calculate the semantic similarity of diseases. It can be downloaded from https://www.nlm.nih.gov/ [30]. Based on semantic information from the MeSH database, a disease can be expressed as a directed acyclic graph (DAG). The nodes in the DAG represent diseases, and the edges represent their relationships. If two diseases are pathologically similar, they share a larger part of their DAGs. Wang et al. [31] proposed a method that has been widely used to calculate the semantic similarity of diseases in recent years. We define the contribution value of a disease $r$ as follows:

$$D(r) = -\log\left[\frac{n(DAGs(r))}{n(disease)}\right]$$

where $n(DAGs(r))$ is the number of DAGs that include disease $r$ and $n(disease)$ is the number of all diseases. The semantic similarity score $S^D_{sem}$ of disease $d(i)$ and disease $d(j)$ is then described as follows:

$$S^D_{sem}(d(i), d(j)) = \frac{\sum_{r \in N_{d(i)} \cap N_{d(j)}} \left(D_{d(i)}(r) + D_{d(j)}(r)\right)}{\sum_{r \in N_{d(i)}} D_{d(i)}(r) + \sum_{r \in N_{d(j)}} D_{d(j)}(r)}$$

where $N_{d(i)}$ is the set of all diseases that appear in the DAG of disease $d(i)$, and $D_{d(i)}(r)$ is the contribution value of disease $r$ in that DAG.

Disease GAS similarity. Many studies have applied the Gaussian interaction profile kernel (GAS) to measure the similarity between diseases, based on the assumption that pathologically similar diseases tend to be associated with functionally similar circRNAs. In this study, $S^D_{GAS}$ is used to describe the disease similarity information as follows:

$$S^D_{GAS}(d(i), d(j)) = \exp\left(-\tau_d \left\lVert A_{cd}(d(i)) - A_{cd}(d(j)) \right\rVert^2\right)$$

$$\tau_d = 1 \Big/ \left(\frac{1}{m}\sum_{i=1}^{m} \left\lVert A_{cd}(d(i)) \right\rVert^2\right)$$

where $\tau_d$ is the width parameter of the function, and $m$ and $n$ are the numbers of diseases and circRNAs, respectively. The association adjacency matrix $A_{cd}$ represents the positive subset $Z_p$: if circRNA $c(i)$ and disease $d(j)$ are associated, the element $t_{i,j}$ is set to 1, otherwise 0. $A_{cd}(d(i))$ is the association profile of disease $d(i)$, for which we use the ith column vector of the adjacency matrix.
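The Gaussian interaction profile kernel described above can be illustrated with a small, dependency-free sketch. The function name `gaussian_profile_similarity` and the toy matrix are hypothetical, and the width parameter is normalised by the mean squared norm of the profiles, which is a common convention assumed here rather than a detail confirmed by the text.

```python
import math


def gaussian_profile_similarity(profiles):
    """Pairwise Gaussian interaction profile kernel.

    profiles: list of equal-length 0/1 lists, one association profile
    per entity (for diseases, the columns of the adjacency matrix).
    """
    count = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / count
    if mean_sq_norm == 0:
        raise ValueError("all association profiles are empty")
    width = 1.0 / mean_sq_norm  # assumed normalisation of the width parameter
    sim = [[0.0] * count for _ in range(count)]
    for i in range(count):
        for j in range(count):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-width * sq_dist)
    return sim


# Toy adjacency matrix: rows are circRNAs, columns are diseases.
A_cd = [[1, 0],
        [1, 1],
        [0, 1]]
disease_profiles = [list(col) for col in zip(*A_cd)]  # column vectors
S_D_gas = gaussian_profile_similarity(disease_profiles)
```

Passing the rows of the same matrix instead of its columns would give the corresponding kernel between circRNAs.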
Disease fusional similarity. By analyzing the disease similarity measures from multiple perspectives, we obtain two similarity matrices, $S^D_{sem}$ and $S^D_{GAS}$. However, the semantic similarity cannot be calculated for a disease that does not have its own DAG. To compensate for this deficiency, we fuse $S^D_{sem}$ and $S^D_{GAS}$ as in previous research [32][33][34]. The disease fusional similarity $S^D$ between disease d(i) and d(j) is obtained by this fusion, and the final disease similarity matrix can be seen in S4.

Similarity between circRNAs
CircRNA gene-based similarity. Previous research has found that circular RNA regulates the activity of RNA polymerase and promotes the transcription of its parental genes. Moreover, RNAs that affect the same human disease tend to have similar functions [35][36][37]. In this work, we downloaded gene-circRNA association information from the circR2Disease database. The circRNA gene-based similarity matrix was constructed as follows:

$$S^C_{gene} = A_{cg}\, S^G_{gas}\, A_{cg}^{T}$$

where the elements of $S^C_{gene}$ are the functional similarity scores between circRNAs. The association adjacency matrix $A_{cg}$ represents the associations between genes and circRNAs: if a gene target and a circRNA are associated, the corresponding element of $A_{cg}$ is set to 1, otherwise 0. The GAS similarity matrix of the genes, $S^G_{gas}$, is constructed from the association adjacency matrix $A_{cg}$. $T$ is the transpose operator.
CircRNA GAS similarity. Many studies have used the Gaussian interaction profile kernel (GAS) to measure the similarity between biomolecules [38], because RNAs that affect the same human disease tend to have similar functions [35][36][37]. In this study, $S^C_{GAS}$ is used to describe the circRNA similarity information as follows:

$$S^C_{GAS}(c(i), c(j)) = \exp\left(-\tau_c \left\lVert A_{cd}(c(i)) - A_{cd}(c(j)) \right\rVert^2\right)$$

where $S^C_{GAS}(c(i), c(j))$ is the GAS similarity value between circRNA c(i) and circRNA c(j). The ith row vector of the adjacency matrix $A_{cd}$ is defined as the association profile $A_{cd}(c(i))$ of circRNA c(i), which is a vector composed of the relationships between circRNA c(i) and all diseases. $\tau_c$ is the width parameter.
CircRNA sequence-based similarity. Existing sequence alignment algorithms quantify either position information or nonlinear information, and few algorithms that combine both have been proposed. Therefore, we propose a new CGR-based method that quantifies both position and nonlinear information and measures the similarity and difference between sequences using the Pearson correlation coefficient. The specific calculation process is as follows.
Firstly, the CGR space is divided into $N_g$ grids ($N_g = 2^s \times 2^s$, $s = 3$), as shown in Fig 2. Each grid can be represented as formula 13:

$$Grid_k = \{\, point \mid point \text{ falls in the } k\text{th grid} \,\}, \quad k = 1, \ldots, N_g \tag{13}$$

Secondly, the abscissa $point.x$ and ordinate $point.y$ of the points in each grid are accumulated respectively to quantify position information:

$$X_k = \sum_{point \in Grid_k} point.x \tag{14}$$

$$Y_k = \sum_{point \in Grid_k} point.y \tag{15}$$

Thirdly, we calculate the z-score $Z_k$ of each grid to quantify nonlinear information, where $Num_k$ is the number of points in the kth grid:

$$Z_k = \frac{Num_k - \overline{Num}}{\sqrt{\frac{1}{N_g}\sum_{l=1}^{N_g}\left(Num_l - \overline{Num}\right)^2}}, \quad \overline{Num} = \frac{1}{N_g}\sum_{l=1}^{N_g} Num_l \tag{16}$$

Finally, each grid is described by three attributes, and we fuse the attributes to construct the descriptor $descriptors(c(i))$, where c(i) represents the ith circRNA:

$$descriptors(c(i)) = (X_1, Y_1, Z_1, \ldots, X_{N_g}, Y_{N_g}, Z_{N_g}) \tag{17}$$

The sequence similarity $S^C_{seq}(c(i), c(j))$ is then determined by the Pearson correlation coefficient. The workflow is shown in Fig 3.

$$S^C_{seq}(c(i), c(j)) = \frac{Cov(descriptors(c(i)), descriptors(c(j)))}{\sqrt{D(descriptors(c(i)))} \times \sqrt{D(descriptors(c(j)))}} \tag{18}$$

where $Cov(descriptors(c(i)), descriptors(c(j)))$ is the covariance of the two descriptors and $D(descriptors(c(i)))$ is the variance of $descriptors(c(i))$. The size of the circRNA sequence similarity matrix $S^C_{seq}$ is n×n. All sequence information used in this article was downloaded from circBase [17].
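The grid-based descriptor and the Pearson similarity described in this subsection can be sketched as follows. This is an illustrative reading of the text under stated assumptions: the function names `grid_descriptor` and `pearson` are hypothetical, the three per-grid attributes are simply concatenated (the exact fusion used by the authors is not recoverable from the text), the z-score uses the population standard deviation, and constant descriptors are given a similarity of 0.

```python
import math


def grid_descriptor(points, s=3):
    """Build a 3 * N_g descriptor from CGR points in the unit square.

    For each of the N_g = 2**s * 2**s grid cells the descriptor holds
    the sum of x, the sum of y and the z-score of the point count.
    """
    side = 2 ** s
    n_cells = side * side
    sum_x = [0.0] * n_cells
    sum_y = [0.0] * n_cells
    counts = [0] * n_cells
    for x, y in points:
        col = min(int(x * side), side - 1)  # clamp x == 1.0 into last column
        row = min(int(y * side), side - 1)
        k = row * side + col
        sum_x[k] += x
        sum_y[k] += y
        counts[k] += 1
    mean = sum(counts) / n_cells
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / n_cells)
    z = [(c - mean) / std if std > 0 else 0.0 for c in counts]
    descriptor = []
    for k in range(n_cells):
        descriptor.extend((sum_x[k], sum_y[k], z[k]))  # assumed ordering
    return descriptor


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: constant descriptors count as uncorrelated
    return cov / math.sqrt(var_u * var_v)


if __name__ == "__main__":
    d1 = grid_descriptor([(0.25, 0.25), (0.75, 0.75)])
    d2 = grid_descriptor([(0.25, 0.25), (0.1, 0.9)])
    print(round(pearson(d1, d2), 4))
```

With s = 3 the descriptor has 3 × 64 = 192 entries per circRNA, so the similarity matrix can be filled by calling `pearson` on every pair of descriptors.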
CircRNA fusional similarity. By analyzing circRNA characteristics from different perspectives, we obtain three similarity matrices: $S^C_{gene}$ (formula 8), $S^C_{GAS}$ (formula 9), and $S^C_{seq}$ (formula 18). Since the two adjacency matrices $A_{cd}$ and $A_{cg}$ are sparse, the two similarities $S^C_{gene}$ and $S^C_{GAS}$ obtained by collaborative filtering differ little in value and cannot effectively distinguish circRNAs. To overcome the small differences between circRNAs caused by the lack and limited availability of data, we describe circRNAs from a further perspective to make the representation more informative. To this end, the sequence similarity is introduced. However, some circRNAs lack the sequence information corresponding to the experiment, so the similarity information is completed by combining the three matrices. The fusional similarity $S^C$ is defined as follows, and the final circRNA similarity matrix can be seen in S5

[...] pattern recognition problems. Because the number of training samples used in iCDA-CGR is small, SVM is selected to build the model for predicting potential circRNA-disease associations. Prediction is divided into three steps: 1. Construct positive and negative sample sets; 2. Form the association descriptors based on the characteristics of the circRNA and disease; 3. Train models based on the descriptors to predict potential circRNA-disease associations. Each step is described in detail below. Firstly, we built the positive and negative sample sets. Specifically, the 612 experimentally supported circRNA-disease pairs in circR2Disease were chosen as positive samples. Meanwhile, we randomly selected the same number of associations without experimental support as negative samples.
Secondly, the association descriptors based on the characteristics of the circRNA and disease were formed. We calculated the semantic similarity $S^D_{sem}$ and the GAS similarity $S^D_{GAS}$ of the diseases separately, integrated them into a matrix $S^D$, and used the similarity of the disease $d(i_d)$ with all diseases including itself (the $i_d$th row of the matrix $S^D$) as the characteristic descriptor of the disease:

$$S^D(d(i_d)) = (v_1, v_2, v_3, \ldots, v_m)$$

where $S^D(d(i_d))$ represents the $i_d$th row of the matrix $S^D$ and $v_1$ is the similarity value between $d(i_d)$ and $d(1)$. The size of $S^D(d(i_d))$ is 1×m. At the same time, we calculated the gene-based similarity $S^C_{gene}$, the GAS similarity $S^C_{GAS}$ and the sequence-based similarity $S^C_{seq}$ of the circRNAs separately to form the circRNA fusional similarity $S^C$. The similarity of the circRNA $c(i_c)$ with all circRNAs including itself (the $i_c$th row of the matrix $S^C$) is used as the characteristic descriptor of the circRNA:

$$S^C(c(i_c)) = (w_1, w_2, w_3, \ldots, w_n)$$

where $S^C(c(i_c))$ represents the $i_c$th row of the matrix $S^C$ and $w_1$ is the similarity value between $c(i_c)$ and $c(1)$. The size of $S^C(c(i_c))$ is 1×n. Each circRNA-disease sample can then be defined as a 622-dimensional association descriptor combining $S^D(d(i_d))$ and $S^C(c(i_c))$:

$$\left(S^D(d(i_d)),\, S^C(c(i_c))\right) = (f_1, f_2, f_3, \ldots, f_m, f_{m+1}, f_{m+2}, \ldots, f_{m+n})$$

where $(f_1, f_2, f_3, \ldots, f_m)$ is the $i_d$th row of the disease fusional similarity $S^D$ and $(f_{m+1}, f_{m+2}, f_{m+3}, \ldots, f_{m+n})$ is the $i_c$th row of the circRNA fusional similarity $S^C$.

Finally, a support vector machine (SVM) is trained on the samples to build the predictive model. More specifically, we first set the labels of the training set: samples in $Z_p$ are labeled 1 and samples in $Z_n$ are labeled 0. We then fed the training data into the SVM to obtain the prediction model. The higher the predicted score of a circRNA-disease pair, the more likely it is a candidate for a potential association.
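The construction of the association descriptors and labels can be sketched as below. The function names and the toy matrices are hypothetical; the only facts taken from the text are that a descriptor concatenates one row of the disease similarity matrix with one row of the circRNA similarity matrix, and that positive and negative samples are labeled 1 and 0.

```python
def association_descriptor(S_D, S_C, disease_idx, circ_idx):
    """Concatenate the disease row (length m) and the circRNA row (length n)."""
    return list(S_D[disease_idx]) + list(S_C[circ_idx])


def build_training_set(S_D, S_C, positives, negatives):
    """Return (X, y) for lists of (circ_idx, disease_idx) pairs.

    Positive pairs receive label 1 and negative pairs label 0.
    """
    X, y = [], []
    for label, pairs in ((1, positives), (0, negatives)):
        for circ_idx, disease_idx in pairs:
            X.append(association_descriptor(S_D, S_C, disease_idx, circ_idx))
            y.append(label)
    return X, y


# Toy fused similarity matrices: m = 2 diseases, n = 3 circRNAs.
S_D = [[1.0, 0.2],
       [0.2, 1.0]]
S_C = [[1.0, 0.5, 0.1],
       [0.5, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
X, y = build_training_set(S_D, S_C, positives=[(0, 1)], negatives=[(2, 0)])
```

With the circR2Disease dimensions (m = 89 diseases, n = 533 circRNAs) each row of `X` has 89 + 533 = 622 entries, matching the descriptor size given in the text. The resulting `X` and `y` could then be passed to any SVM implementation, for example scikit-learn's `SVC`; the text does not say which library the authors used.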

Performance Evaluation
The five-fold cross-validation (5-CV). In this work, five-fold cross-validation (5-CV) is used to evaluate the effectiveness of iCDA-CGR in predicting disease-related circRNAs. We separated the base dataset $Z$ evenly into five parts:

$$\begin{cases} Z = Z_1 \cup Z_2 \cup Z_3 \cup Z_4 \cup Z_5 \\ Z_i \cap Z_j = \varnothing, \quad i \neq j \end{cases}$$

PLOS COMPUTATIONAL BIOLOGY
where $\varnothing$ is the empty set, and $\cup$ and $\cap$ denote set union and intersection. The subsets $Z_i$, $Z_p$ and $Z_n$ can be defined as:

$$\begin{cases} Z_i = Z^p_i \cup Z^n_i \\ Z_p = Z^p_1 \cup Z^p_2 \cup Z^p_3 \cup Z^p_4 \cup Z^p_5 \\ Z_n = Z^n_1 \cup Z^n_2 \cup Z^n_3 \cup Z^n_4 \cup Z^n_5 \end{cases}$$

The number of samples in the ith positive subset $Z^p_i$ is denoted $num(Z^p_i)$ and, in the same way, the number of samples in the ith negative subset $Z^n_i$ is denoted $num(Z^n_i)$; because $Z$ is split evenly, these sizes are approximately equal across the five folds. In iCDA-CGR, four of the positive subsets and four of the negative subsets are used as the training set and the remaining pair as the test set. The cross-validation is repeated 5 times so that each subset serves as the test set once, and the 5 results are averaged to obtain the final estimate.
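The five-fold partition described above can be sketched with the standard library only. The function name `five_fold_splits`, the fixed seed and the round-robin assignment to folds are illustrative assumptions; the text only states that positives and negatives are divided evenly into five disjoint parts and that each part serves once as the test set.

```python
import random


def five_fold_splits(positives, negatives, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Positives and negatives are shuffled separately and dealt out
    round-robin so that every fold keeps a similar class balance.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [sample
                 for j, fold in enumerate(folds) if j != i
                 for sample in fold]
        yield train, test


if __name__ == "__main__":
    pos = [("pos", i) for i in range(10)]
    neg = [("neg", i) for i in range(10)]
    for train, test in five_fold_splits(pos, neg):
        print(len(train), len(test))
```

Averaging a metric over the five `(train, test)` pairs gives the final cross-validation estimate.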
Evaluation criteria. Four evaluation criteria were introduced for assessing the performance of iCDA-CGR. Accu. is the ratio of the number of samples correctly classified by the classifier to the total number of samples:

$$Accu. = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP and FP are the numbers of true positive and false positive samples, respectively, and TN and FN are the numbers of true negative and false negative samples, respectively. Sen. is the ratio of the number of positive samples correctly classified by the classifier to the total number of positive samples:

$$Sen. = \frac{TP}{TP + FN}$$

Prec. is the ratio of the number of positive samples correctly classified by the classifier to the sum of true positive and false positive samples:

$$Prec. = \frac{TP}{TP + FP}$$

$F_1$ is a comprehensive evaluation index of Sen. and Prec.:

$$F_1 = \frac{2 \times Prec. \times Sen.}{Prec. + Sen.}$$
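The evaluation criteria can be computed from predicted and true labels as in the following sketch. The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the example total.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, precision and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1}


if __name__ == "__main__":
    print(classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```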

Assessment of prediction ability
To evaluate the capabilities of the model, we performed experiments on the circR2Disease and circFunBase datasets, respectively. The five-fold cross-validation results on the circR2Disease dataset are summarized in Table 2. The average accuracy, sensitivity, precision and F1-score are 81.95%, 88.08%, 78.46% and 82.97%, respectively.
On the circFunBase dataset, the mean and standard deviation over the five folds are reported as the experimental results in Table 3. The average accuracy, precision, sensitivity and F1-score are 78.03%, 79.96%, 74.94% and 77.31%, respectively.

Comparison among different classifiers
In the above experiment, iCDA-CGR achieved reliable results. To justify the choice of classifier, we compared the support vector machine (SVM) with random forest (RF), decision tree (DT) and k-nearest neighbor (KNN) on the benchmark database circR2Disease. The SVM is a binary classification model. Its purpose is to find a hyperplane that separates the samples with the maximum margin, which is finally transformed into a convex quadratic programming problem. The decision tree (DT) adopts a top-down recursive method. The basic idea is to construct a tree along which information entropy decreases as fast as possible, until the entropy at the leaf nodes is 0. The random forest (RF) is an ensemble learning method of the bagging type. By combining multiple weak classifiers and voting on or averaging their outputs, the whole model achieves higher accuracy and better generalization. The main idea of the k-nearest neighbor (KNN) algorithm is that if most of the k nearest samples in the feature space belong to a certain category, then the sample also belongs to this category and shares the characteristics of the samples in it.

In Table 4

Comparison with related models
To further evaluate the reliability of iCDA-CGR, we compared it to five related prediction models: KATZHCDA, GHICD, RWRHCD, CD-LNLP and ICFCDA. The details of the comparison are summarized in Table 5. From the table, we can see that KATZHCDA, GHICD, RWRHCD and our model iCDA-CGR are all based on the circR2Disease data set and use five-fold cross-validation, so iCDA-CGR can be directly compared with these three models. In terms of AUC, which reflects the overall performance of a model, KATZHCDA, GHICD and RWRHCD achieved 0.7936, 0.7290 and 0.6660 respectively, while the proposed model iCDA-CGR achieved 0.8533. The results show that iCDA-CGR is significantly better than these methods. In the last two rows of Table 5, we list the performance of CD-LNLP and ICFCDA, which is 0.9007 and 0.9460, respectively. However, because the datasets or assessment methods used by these two models are inconsistent with those of the proposed model, we cannot directly compare them, and they serve only as a reference for model performance. The specific reasons are as follows. CD-LNLP uses the circ2Disease database instead of the more commonly used circR2Disease database. Because the data sources differ, the evaluation results of the trained models are not comparable. Furthermore, CD-LNLP uses leave-one-out cross-validation (LOOCV) to evaluate model performance instead of the more commonly used five-fold cross-validation (5-CV). According to previous work, with the same model and data, LOOCV assessments are usually higher than 5-CV assessments [39]. Therefore, CD-LNLP cannot be directly compared with the proposed model.
ICFCDA uses the circR2Disease database, but removes more noisy data. The training data of ICFCDA includes 212 associations involving 200 circRNAs and 42 diseases. The predicted coverage of this model is 7976 associations, which is 17.25% of the coverage of iCDA-CGR. This filtering makes the model performance stronger, but sacrifices coverage. In addition, ICFCDA also uses LOOCV. Therefore, ICFCDA cannot be directly compared with the proposed model.
In summary, the proposed model has superior performance and coverage, which indicates that CGR-based sequence feature extraction, together with the characterization of intrinsic structure and circRNA-disease association information, can effectively improve the reliability of prediction.

Case study
To verify the performance of the model in predicting potential associations based on confirmed associations, we carried out a case study. Specifically, we define the training samples and test samples as follows: [...] verified in different literature, as shown in Table 6.

Performance on independent data set
The results indicate that this method is reliable for circRNA-disease association prediction. To further support this conclusion, we verified the method on other databases (CRDD, circRNADisease, and Circ2Disease). It is not possible to identify all potential circRNA-disease associations because each database is incomplete. We therefore assume that the associations in a database are the only known associations that have been experimentally verified, and treat the rest as unknown associations. The training samples and test samples are described as follows:

$$\begin{cases} Z^{1,train}_{database} = Z^1 \\ Z^{1,test}_{database} = \complement_U Z^1 \cap Z_{database} \end{cases}$$

where $Z^{1,train}_{database}$ and $Z^{1,test}_{database}$ are the training set and test set for the independent data sets, respectively, and $Z_{database}$ represents an independent data set, such as CRDD, circRNADisease, or Circ2Disease. In this experiment, iCDA-CGR was used to construct the prediction model from the base dataset $Z^1$. Since the diseases and circRNAs differ between data sources, the intersection of the set of all other possible associations $\complement_U Z^1$ with the independent data set $Z_{database}$ is used as the test set $Z^{1,test}_{database}$. It can be seen from Table 8 that the proposed method obtained prediction accuracies of 95.18% (Circ2Disease), 90.64% (circRNADisease) and 95.89% (CRDD) on the three databases, respectively. In addition, we performed the same experiment with circFunBase. The training samples and test samples are described as follows:

$$\begin{cases} Z^{2,train}_{database} = Z^2 \\ Z^{2,test}_{database} = \complement_U Z^2 \cap Z_{database} \end{cases}$$

It can be seen from Table 8 that the proposed method obtained prediction accuracies of 63.26% (Circ2Disease), 73.43% (circRNADisease) and 72.72% (CRDD) on the three databases, respectively. The experiment shows that iCDA-CGR has strong generalization ability.
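The construction of the independent test set can be expressed with plain set operations, as in the sketch below. The function names are hypothetical, identifiers are assumed to be already normalised so that the same circRNA or disease has the same name in every database, and `hit_rate` reflects one reading of the reported percentages (the share of independent associations that the model predicts as positive).

```python
def independent_test_set(circrnas, diseases, training_pairs, database_pairs):
    """Pairs from an independent database that the model can score
    but has not seen during training.

    The universe is every (circRNA, disease) pair the trained model
    covers; training pairs are removed before intersecting.
    """
    universe = {(c, d) for c in circrnas for d in diseases}
    return (universe - set(training_pairs)) & set(database_pairs)


def hit_rate(test_pairs, predicted_positive):
    """Share of test pairs that appear among the predicted positives."""
    if not test_pairs:
        return 0.0
    hits = sum(1 for pair in test_pairs if pair in predicted_positive)
    return hits / len(test_pairs)


if __name__ == "__main__":
    test = independent_test_set(
        circrnas=["c1", "c2"],
        diseases=["d1", "d2"],
        training_pairs=[("c1", "d1")],
        database_pairs=[("c1", "d1"), ("c2", "d1"), ("c3", "d9")],
    )
    print(test, hit_rate(test, {("c2", "d1")}))
```

Pairs such as `("c3", "d9")` drop out because the trained model has no descriptor for a circRNA or disease outside its own benchmark set.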

Discussion
In this study, we proposed the computational model iCDA-CGR, which quantifies positional and nonlinear sequence information to identify circRNA-disease associations. The model integrates circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information, and predicts the final results with an SVM classifier. In particular, for the first time in a circRNA-disease prediction model, we introduce circRNA sequence information, extract biological sequence position information and quantify the nonlinear sequence relationship of circRNAs by Chaos Game Representation. The model achieved outstanding results in five-fold cross-validation, in comparisons with other methods, and on independent data sets. Furthermore, 19 of the top 30 circRNA-disease associations predicted in the case study were confirmed by recently published literature. Owing to the addition of sequence information, iCDA-CGR exhibited strong reliability and stability in predicting potential circRNA-disease associations. These experimental results indicate that sequence information has sufficient coverage of nucleic acids, and that iCDA-CGR has great potential for nucleic acid function analysis.
Supporting information S1