Fusion of KATZ measure and space projection to fast probe potential lncRNA-disease associations in bipartite graphs

It is well known that numerous long noncoding RNAs (lncRNAs) are closely related to the physiological and pathological processes of human diseases and can serve as potential biomarkers. LncRNA-disease associations identified by computational methods can therefore serve as targeted candidates, reducing the cost of biological experiments and letting them focus on deeper study. However, inherent problems such as inaccurate construction of similarity networks and an inadequate number of known lncRNA-disease associations mean that many mature computational methods developed over the years still have limitations. This motivated us to explore a new computational method, KATZSP, that fuses the KATZ measure with space projection to quickly probe potential lncRNA-disease associations. KATZSP comprises the following key steps: combining all the global information to change the Boolean network of known lncRNA-disease associations into weighted networks; changing the similarity calculation into counting the number of walks that connect lncRNA nodes and disease nodes in bipartite graphs; and obtaining the space projection scores to refine the primary prediction scores. The process of fusing the KATZ measure and space projection is simple and needs only one attenuation factor. Leave-one-out cross validation (LOOCV) experiments showed that, compared with other state-of-the-art methods (NCPLDA, LDAI-ISPS and IIRWR), KATZSP achieved higher predictive accuracy, measured by the area under the curve (AUC), on the three datasets built, and that KATZSP worked well for inferring potential associations related to new lncRNAs (or isolated diseases). The results of real case studies (such as pancreas cancer, lung cancer and colorectal cancer) further confirmed that KATZSP has superior predictive ability and can be applied as a guide for traditional biological experiments.

Introduction
Long non-coding RNAs (lncRNAs), which are longer than 200 nucleotides (nt), have crucial roles in controlling gene expression during development and differentiation [1]. It is therefore no surprise that mutation and dysregulation of lncRNAs can contribute to the development of various complex human diseases [2], such as HOTAIR in breast cancer [3] and MALAT1 in early-stage non-small cell lung cancer [4]. LncRNAs can also drive many important cancer phenotypes through their interactions with other cellular macromolecules, including DNA, protein, and RNA [5][6][7][8]. There is an urgent need to discern the potential functional roles of lncRNAs in order to further study the pathology, diagnosis, therapy, prognosis and prevention of complex human diseases, and to detect disease biomarkers at the lncRNA level [9,10].
With strong data support from lncRNA-related databases (such as LncRNAdb [11], LncRNA-Disease [12], NRED [13], and NONCODE [14]) and similarity calculation based on miRNA information [15][16][17][18][19][20], computational prediction models built to infer lncRNA-disease associations can supply more accurate targeted candidates [21], which 1) saves cost and time for biological experiments; 2) lets bio-experiments focus on deeper study of targets; and 3) speeds up understanding of the pathogenesis of complex diseases. The computational models used for inferring lncRNA-disease associations fall into three main categories. 1) Machine learning-based models use a naive Bayesian classifier [22,23], support vector machines (SVM) [24,25], matrix completion [26,27] or matrix factorization [28][29][30] to infer potential lncRNA-disease associations. However, the models in this category are not able to achieve high predictive accuracy. 2) Network-based models rely on the biological premise that lncRNAs with similar functions tend to be associated with similar diseases [31,32] [15,30] to identify potential lncRNA-disease associations. Nevertheless, the models in this category rely heavily on information integrated from diverse biological data sources, and it is difficult to deeply integrate heterogeneous data from multiple sources. 3) Convolutional neural network (CNN) based models [40][41][42][43] are at an early research stage, have relatively high time complexity, and also depend on the quality of biological data from multiple sources. The above models therefore still have various limitations, such as needing negative samples, not being able to directly infer associations related to isolated diseases and new lncRNAs, and limited accuracy when using a single methodology.
Addressing these limitations, we explored a novel prediction method based on the fusion of KATZ Measure and Space Projection to infer potential lncRNA-disease associations in bipartite graphs, namely KATZSP.
The KATZ measure is a graph-based computational method that can transform the problem of calculating similarities between nodes into link prediction in a bipartite graph. In the context of lncRNA-disease association prediction, the heterogeneous networks are represented by matrices (also called bipartite graphs). Calculating similarities between lncRNA nodes and disease nodes is therefore further transformed into the problem of counting the number of walks that connect lncRNA-disease pairs in the bipartite graph. Furthermore, the number of walks, weighted by their lengths, determines the potential association probability of an lncRNA-disease pair [36,44]. The space projection method [45, 46] can easily improve predictive ability for lncRNA-disease associations with few tuning parameters, even though the known lncRNA-disease associations are inherently sparse. Through a simple fusion process, the KATZ measure and the space projection method were combined into an integrated computational model, KATZSP, which needs only one attenuation factor and avoids the limitations above.

Evaluation metrics
Leave-one-out cross validation (LOOCV) experiments were implemented to evaluate the predictive performance of KATZSP. We divided the dataset of known associations into two parts: the testing subset and the training subset. Each known association was used as the test sample in turn, and the remaining known associations formed the training subset. Under the LOOCV framework, we compared the prediction results against a specific threshold to obtain the following four metrics: true positive (TP), false positive (FP), false negative (FN) and true negative (TN). By varying the threshold, we calculated the true positive rate ($TPR = \frac{TP}{TP+FN}$) against the false positive rate ($FPR = \frac{FP}{TN+FP}$) to plot the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) was finally calculated to numerically evaluate the overall predictive performance of KATZSP.
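As an illustration of how TPR, FPR and AUC follow from a list of prediction scores, the sketch below sweeps a threshold over the scores and integrates the ROC curve with the trapezoidal rule. The function name `roc_auc` and the convention that unknown pairs count as negatives are assumptions made for this example, not details taken from the KATZSP implementation.

```python
def roc_auc(scores, labels):
    """Return (fpr, tpr, auc) by sweeping a threshold over the scores.

    scores: predicted association scores.
    labels: 1 for a held-out known association (positive),
            0 for an unknown pair (assumed negative in this sketch).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    # Sort by descending score; each distinct score acts as one threshold.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume all pairs tied at this score before emitting a ROC point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        tpr.append(tp / pos)  # TPR = TP / (TP + FN)
        fpr.append(fp / neg)  # FPR = FP / (TN + FP)
        i = j
    # Trapezoidal rule over consecutive ROC points.
    auc = sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
              for k in range(1, len(fpr)))
    return fpr, tpr, auc
```

In a LOOCV setting, `scores` would hold the score of each held-out association together with the scores of the candidate pairs it is ranked against.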

Impact with parameter selection
The coefficient β acts as an attenuation factor that controls how much walks of each length contribute to the similarity between any two nodes. According to the convergence properties of the series required by the KATZ method, the value of β should be less than the reciprocal of the maximum eigenvalue of the adjacency matrix A. To obtain the optimal value of β, we set β = (1/max(eig(A))) × K, where max(eig(A)) denotes the maximum eigenvalue of adjacency matrix A. The value of K was then increased from 0.1 to 0.9 with a step size of 0.1. For each value of K, LOOCV was implemented on all three datasets (dataset 1, dataset 2 and dataset 3). The results in Fig 1 showed that AUC achieved its maximum value on all three datasets when K = 0.1.
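The parameter grid described above can be sketched as follows. This is a minimal illustration that assumes A is a symmetric NumPy array (so that `numpy.linalg.eigvalsh` applies); the function name `beta_candidates` is hypothetical.

```python
import numpy as np

def beta_candidates(A, ks=None):
    """Return {K: beta} with beta = K / max(eig(A)) for each K in ks."""
    if ks is None:
        # K = 0.1, 0.2, ..., 0.9 as in the parameter study.
        ks = [round(0.1 * i, 1) for i in range(1, 10)]
    # Assumption: A is symmetric, so its eigenvalues are real.
    lam_max = float(np.max(np.linalg.eigvalsh(A)))
    return {k: k / lam_max for k in ks}
```

Every candidate satisfies β · max(eig(A)) = K < 1, which is the convergence condition stated in the text; each β would then be scored by a full LOOCV run.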

Compare predictive abilities under different solutions
To demonstrate that our selected technical solution performed better than the others, LOOCV experiments were implemented under the following four technical solutions: using only space projection (SP), using only KATZ (KATZ), using space projection first and then KATZ (SPKATZ), and using KATZ first and then space projection (KATZSP). The results on the three datasets (dataset 1, dataset 2 and dataset 3) are shown in Figs 2-4, respectively.
From the comparison results shown in Figs 2-4, the solution used in our model (KATZSP) achieved AUC values of 0.9324, 0.9403 and 0.9472 on dataset 1, dataset 2 and dataset 3, respectively. Among the four solutions, KATZSP showed the best predictive ability on all three datasets, with a distinct advantage over the other three.

Compare performance with other models
To further demonstrate the reliable predictive ability of our model, we chose some state-of-the-art computational models of a similar type (NCPLDA, LDAI-ISPS and IIRWR) to compare with our model in the LOOCV framework. To make the comparison fair, we configured the same experimental environment and conditions for all models on dataset 1, dataset 2 and dataset 3. From the comparison results shown in Figs 5-7, KATZSP achieved the highest AUC values on all three datasets, with detailed analysis shown in Table 1.

Verify predictive ability for new lncRNAs and isolated diseases
For the verification in this section, we simulated each lncRNA in the dataset of known lncRNA-disease associations as a new lncRNA by removing all known associations related to it. Similarly, we simulated each disease in the dataset as an isolated disease by removing all known associations related to it. Each simulated new lncRNA (or isolated disease) was used as the test sample for model evaluation, and the remaining lncRNAs (or diseases) in the dataset served as the training samples for model learning. After KATZSP inferred the associations between each new lncRNA and the diseases, or between the lncRNAs and each isolated disease, the results on dataset 1, dataset 2 and dataset 3 were shown in Fig 8. The AUC values in Fig 8 demonstrated that KATZSP can be effectively applied to infer associations related to new lncRNAs and associations related to isolated diseases.
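The simulation step amounts to zeroing one row or one column of the association matrix before training. The sketch below assumes the known associations are stored as a NumPy matrix `LD` with lncRNAs as rows and diseases as columns; the helper names are hypothetical.

```python
import numpy as np

def mask_lncrna(LD, i):
    """Copy of LD with every known association of lncRNA i removed."""
    train = LD.copy()
    train[i, :] = 0  # row i becomes a simulated new lncRNA
    return train

def mask_disease(LD, j):
    """Copy of LD with every known association of disease j removed."""
    train = LD.copy()
    train[:, j] = 0  # column j becomes a simulated isolated disease
    return train
```

The removed row (or column) of the original `LD` then serves as the ground truth against which the predicted scores for that lncRNA (or disease) are evaluated.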

PLOS ONE
Fusion of KATZ measure and space projection to fast probe associations

Case study for three specific diseases
To further demonstrate the predictive performance of KATZSP in real case studies, we selected three specific diseases (pancreas cancer, lung cancer and colorectal cancer) as the cases to examine. Using training samples composed of the known associations in dataset 2 and testing samples composed of the unknown associations, KATZSP inferred the potential lncRNAs related to these three cases. The lncRNAs with the five highest prediction scores for each case are listed in Table 2. If the associations predicted by KATZSP are also found in the literature or in the newest databases, such as LncRNADisease 2.0 (http://www.rnanut.net/lncrnadisease) and Lnc2Cancer 3.0 (http://www.biobigdata.net/lnc2cancer), this supporting evidence further validates the reliable predictive ability and practicability of KATZSP.
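Selecting the top five candidates for one disease can be sketched as a simple ranking over the unknown associations. The function name `top_candidates` and its arguments are hypothetical and only illustrate the selection step, not the scoring itself.

```python
def top_candidates(scores, known, names, k=5):
    """Rank lncRNAs for one disease, skipping associations already known.

    scores: predicted score per lncRNA for the chosen disease.
    known:  truthy where the association is already in the training data.
    names:  lncRNA identifiers aligned with scores.
    """
    unknown = [i for i in range(len(scores)) if not known[i]]
    ranked = sorted(unknown, key=lambda i: scores[i], reverse=True)
    return [names[i] for i in ranked[:k]]
```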

Case study for isolated diseases
In recent years, many new diseases without any known association with lncRNAs, namely isolated diseases, have gradually been discovered. It is important to verify whether KATZSP can be applied to infer the potential lncRNAs associated with such isolated diseases. The above three cases (pancreas cancer, lung cancer and colorectal cancer) were simulated as isolated diseases by removing all known associations related to them in dataset 2. KATZSP used only the remaining information to infer the potential lncRNAs associated with these three simulated isolated diseases. The five lncRNAs with the highest prediction scores for each disease are listed in Table 3, where only two prediction results (TC0101441 and KRASP1) had no supporting evidence in any database or published literature.
In Tables 2 and 3, all predicted results except two were confirmed by additional evidence, which validated that KATZSP can be effectively applied in practice to supply calculated candidates that guide biological experiments.

The data in the "Evidences" column of Table 2 (https://doi.org/10.1371/journal.pone.0260329.t002) showed that evidence for all the potential lncRNAs inferred for the three specific diseases has been found in LncRNADisease 2.0 or Lnc2Cancer 3.0. This validated the reliability of the results inferred by KATZSP.

has the following definition, denoted by $D_{d_i}(d_t)$: where Δ was set to the most suitable value of 0.5. Based on both the addresses of diseases in the DAGs and the semantic relations with ancestor diseases, the element $dd_{ij}$ in matrix $DD = (dd_{ij})_{nd \times nd}$ denotes the semantic similarity between diseases $d_i$ and $d_j$, with definition as follows: where $T_{d_i}$ is the set of all ancestor nodes of disease $d_i$ in the DAG, including node $d_i$ itself.
LncRNA-lncRNA functional similarity. How to accurately measure the functional similarity between two lncRNAs has been described in detail in the literature [47-49, 52]. The group of diseases associated with lncRNA $l_i$ is denoted by $D(l_i) = \{d_{i1}, d_{i2}, \cdots, d_{ik}\}$, and the similarity between any disease $d_t$ in $D(l_i)$ and the whole set $D(l_i)$ has the following definition: Similarly, the set $D(l_j) = \{d_{j1}, d_{j2}, \cdots, d_{jk'}\}$ denotes the group of diseases associated with lncRNA $l_j$. The similarity between any disease $d_t$ in $D(l_j)$ and the whole set $D(l_j)$ has the following definition: Functional similarities between the lncRNAs are denoted by $LL = (ll_{ij})_{nl \times nl}$, whose element $ll_{ij}$ represents the functional similarity between $l_i$ and $l_j$, calculated as follows:

Central similarity of the Gaussian interaction profile. Compared to the number of unknown lncRNA-disease associations, the number of known lncRNA-disease associations is very small, which makes the bipartite graph represented by the Boolean matrix of known lncRNA-disease associations sparse. To reduce the influence of this sparsity on prediction precision, the central similarities of the Gaussian interaction profile were calculated in accordance with the description in Laarhoven's work [53]. The central similarities of the Gaussian interaction profile between the diseases are denoted by $DD^{(g)} = (dd^{g}_{ij})_{nd \times nd}$, whose element $dd^{g}_{ij}$ represents the central similarity of the Gaussian interaction profile between diseases $d_i$ and $d_j$, with the following definition: where the ith column of matrix LD is denoted by LD(:,i), which represents all the known associations related to disease $d_i$. The Gaussian kernel bandwidth is denoted by $\gamma_d$, with the following definition in accordance with the previous study [54]: Similarly, the central similarities of the Gaussian interaction profile between the lncRNAs are denoted by $LL^{(g)} = (ll^{g}_{ij})_{nl \times nl}$, whose element $ll^{g}_{ij}$ represents the central similarity of the Gaussian interaction profile between lncRNAs $l_i$ and $l_j$, with definition as follows: where the ith row of matrix LD is denoted by LD(i,:), which represents all the known associations related to lncRNA $l_i$. The Gaussian kernel bandwidth is denoted by $\gamma_l$, with the following definition:

Integrated similarity of lncRNAs and diseases. The final similarity matrix of diseases, denoted by $DD^{(f)} = (dd^{f}_{ij})_{nd \times nd}$, comes from an integration of $DD$ and $DD^{(g)}$, and the final similarity matrix of lncRNAs, denoted by $LL^{(f)} = (ll^{f}_{ij})_{nl \times nl}$, comes from a similar integration of $LL$ and $LL^{(g)}$. When the original semantic similarity between diseases $d_i$ and $d_j$ is 0, the value of element $dd^{f}_{ij}$ in matrix $DD^{(f)}$ is set to the central similarity of the Gaussian interaction profile; otherwise it is set to the original semantic similarity between diseases $d_i$ and $d_j$. The value of element $ll^{f}_{ij}$ in matrix $LL^{(f)}$ is set in the same way. For clarity, the element values are formally defined as follows:

$$dd^{f}_{ij} = \begin{cases} dd_{ij}, & dd_{ij} \neq 0 \\ dd^{g}_{ij}, & dd_{ij} = 0 \end{cases} \qquad ll^{f}_{ij} = \begin{cases} ll_{ij}, & ll_{ij} \neq 0 \\ ll^{g}_{ij}, & ll_{ij} = 0 \end{cases}$$
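A minimal NumPy sketch of the Gaussian interaction profile similarity and of the integration rule is given below. The kernel form exp(-γ‖x − y‖²) and the bandwidth normalised by the mean squared profile norm follow the commonly used Gaussian interaction profile formulation and are assumptions here, as are the function names and the parameter `gamma_prime`; they are not confirmed details of KATZSP.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile similarity between the rows of `profiles`.

    Assumed form: exp(-gamma * ||p_i - p_j||^2), with
    gamma = gamma_prime / mean(||p_i||^2). Requires a non-zero profile matrix.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    gamma = gamma_prime / np.mean(sq_norms)
    # Squared Euclidean distance between every pair of rows.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def integrate(base, gip):
    """Keep the base similarity where it is non-zero, otherwise use GIP."""
    return np.where(base != 0, base, gip)
```

With an association matrix `LD` (lncRNAs as rows), `gip_kernel(LD)` would give the lncRNA-side similarity and `gip_kernel(LD.T)` the disease-side similarity, which `integrate` then merges with the functional or semantic similarity.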

Obtain primary prediction scores
Construct adjacency matrix. Based on the KATZ measure, the number of walks that connect lncRNA nodes and disease nodes in the original bipartite graph was calculated to measure the similarities between these nodes as potential association probabilities. Walks of different lengths between lncRNA nodes and disease nodes contribute differently to the similarities between these two kinds of nodes: shorter walks contribute more than longer ones. To make full use of the heterogeneous network constructed above, matrices $DD^{(f)}$, $LL^{(f)}$ and $LD$ were integrated into a new heterogeneous network $A_{(nl+nd) \times (nl+nd)}$ as the adjacency matrix, defined as follows:

$$A = \begin{bmatrix} LL^{(f)} & LD \\ LD^{T} & DD^{(f)} \end{bmatrix}$$

Calculate primary prediction score on KATZ measurement. By applying the KATZ measure, the potential association probability between node $l_i$ and node $d_j$, denoted by $S_{KATZ}(l_i, d_j)$, can be calculated as follows:

$$S_{KATZ}(l_i, d_j) = \sum_{w=1}^{m} \beta^{w} (A^{w})_{l_i, d_j}$$

where β is a non-negative coefficient that controls how much walks of each length contribute to the similarity between any two nodes, such as $l_i$ and $d_j$; $\beta^{w}$ is β raised to the power of w; $(A^{w})_{l_i, d_j}$ denotes the number of walks of length w between the corresponding node pair, such as $l_i$ and $d_j$; and m denotes the maximum walk length.
Because longer walks contribute less to the similarity between two nodes, the above formula can be written approximately in matrix form as the value of m tends to infinity ($m \to \infty$):

$$S_{KATZ} = \sum_{w \geq 1} \beta^{w} A^{w} = (I - \beta A)^{-1} - I$$

where the value of coefficient β was set in the range $(0, \min\{1, 1/\|A\|_2\})$ and matrix $S_{KATZ}$ has the same size as adjacency matrix A. Submatrix $S_{KATZ}[1:nl, nl+1:nl+nd]$ denotes the elements located at rows 1 to nl and columns nl+1 to nl+nd of matrix $S_{KATZ}$, which is the same location that matrix LD occupies in adjacency matrix A. For consistency, submatrix $S_{KATZ}[1:nl, nl+1:nl+nd]$ is denoted by matrix $LD^{(p)}_{nl \times nd} = (ld^{p}_{ij})_{nl \times nd}$ and represents the primary prediction results of the first stage.
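The first stage can be sketched in NumPy as below. The block layout of A (lncRNA nodes first, disease nodes second) follows the submatrix location described above; the closed form (I − βA)⁻¹ − I is the standard limit of the KATZ series. Function names are hypothetical, and the sketch assumes a non-zero A.

```python
import numpy as np

def katz_scores(LL_f, DD_f, LD, beta):
    """Primary prediction scores LD_p from the closed-form KATZ matrix."""
    nl, nd = LD.shape
    # Heterogeneous adjacency matrix: lncRNA nodes first, then disease nodes.
    A = np.block([[LL_f, LD], [LD.T, DD_f]])
    limit = min(1.0, 1.0 / np.linalg.norm(A, 2))  # spectral-norm bound
    if not 0 < beta < limit:
        raise ValueError("beta must lie in (0, min{1, 1/||A||_2})")
    I = np.eye(nl + nd)
    S = np.linalg.inv(I - beta * A) - I  # equals sum_{w>=1} beta^w A^w
    return S[:nl, nl:]                   # block aligned with LD

def katz_truncated(A, beta, m):
    """Finite sum over w = 1..m of beta^w A^w, for comparison."""
    S = np.zeros_like(A, dtype=float)
    term = np.eye(A.shape[0])
    for _ in range(m):
        term = beta * (term @ A)  # now holds beta^w A^w
        S += term
    return S
```

Because β^w shrinks geometrically, the truncated sum approaches the closed form quickly, so the matrix inverse is a convenience rather than a necessity.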

Refine primary prediction scores
To improve the prediction performance of the proposed model, matrix space projection was used to refine the primary prediction scores obtained in the first stage ($LD^{(p)}_{nl \times nd}$).

Project on lncRNA space. Project the final similarity matrix of lncRNAs ($LL^{(f)}$) on the matrix of primary prediction scores ($LD^{(p)}$) to obtain the projection scores on the lncRNA space, which are denoted by $LD^{(pl)}_{nl \times nd} = (ld^{pl}_{ij})_{nl \times nd}$, with detailed definition as follows: where $ld^{pl}_{ij}$ denotes the predicted score of the association between lncRNA $l_i$ and disease $d_j$ with lncRNA space projection, and $\|LD^{(p)}(:,j)\|$ is the 2-norm of vector $LD^{(p)}(:,j)$.

Project on disease space. Similarly, project the final similarity matrix of diseases ($DD^{(f)}$) on the matrix of primary prediction scores ($LD^{(p)}$) to obtain the projection scores on the disease space, which are denoted by $LD^{(pd)}_{nd \times nl} = (ld^{pd}_{ij})_{nd \times nl}$, with detailed definition as follows:

Integrate space projection scores. To fully capture the information of disease similarity, lncRNA similarity, and known lncRNA-disease associations, we integrated the projection scores on lncRNA space ($LD^{(pl)}_{nl \times nd}$) and the projection scores on disease space ($LD^{(pd)}_{nd \times nl}$) to obtain the final prediction scores ($LD^{(f)}_{nl \times nd}$), with detailed definition as follows:
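The two projection steps can be sketched in NumPy as follows. The formulas used here (a similarity row multiplied by a score vector and divided by that vector's 2-norm) are an assumption inferred from the matrix dimensions and the norm mentioned in the text; the function names are hypothetical, and the final integration of the two score matrices is deliberately not sketched.

```python
import numpy as np

def project_lncrna_space(LL_f, LD_p):
    """Assumed form: ld_pl[i, j] = LL_f[i, :] . LD_p[:, j] / ||LD_p[:, j]||_2.

    Returns an nl x nd matrix.
    """
    norms = np.linalg.norm(LD_p, axis=0)      # one norm per disease column
    norms = np.where(norms == 0, 1.0, norms)  # guard against all-zero columns
    return (LL_f @ LD_p) / norms[None, :]

def project_disease_space(DD_f, LD_p):
    """Assumed form: ld_pd[i, j] = DD_f[i, :] . LD_p[j, :] / ||LD_p[j, :]||_2.

    Returns an nd x nl matrix.
    """
    norms = np.linalg.norm(LD_p, axis=1)      # one norm per lncRNA row
    norms = np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    return (DD_f @ LD_p.T) / norms[None, :]
```

Note that the disease-space result is nd × nl, so it would need to be transposed before being combined element-wise with the nl × nd lncRNA-space result.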

Represent workflow model
Together with the related data preparation, the inference process of KATZSP for lncRNA-disease associations, with each key step, is graphically represented in Fig 9.

Conclusions
In recent years, even though many computational models for inferring lncRNA-disease associations have emerged, those methods still have limitations, which motivated us to propose a new model (KATZSP) to infer lncRNA-disease associations. The main contributions of KATZSP are: needing only one attenuation factor β to control the contribution of walk lengths between any two nodes in bipartite graphs; compensating for sparsity by simply integrating the KATZ measure and space projection; not needing negative samples; and being directly applicable to isolated diseases and new lncRNAs. Compared with some state-of-the-art methods of a similar type (NCPLDA, LDAI-ISPS and IIRWR), KATZSP achieved higher prediction accuracy on all three datasets (dataset 1, dataset 2 and dataset 3). The results of the case studies further confirmed the stronger predictive performance of KATZSP when applied to real cases. KATZSP still has the following limitations to be improved in the future: the bias of predicted results toward data with more known associations needs to be reduced further, and the prediction accuracy needs to be enhanced further by fusing more heterogeneous data.