iPiDA-LTR: Identifying piwi-interacting RNA-disease associations based on Learning to Rank

Piwi-interacting RNAs (piRNAs) are regarded as drug targets and biomarkers for the diagnosis and therapy of diseases. However, biological experiments cost substantial time and resources, and the existing computational methods only focus on identifying missing associations between known piRNAs and diseases. With the fast development of biological experiments, more and more piRNAs are being detected. Therefore, identifying the diseases associated with newly detected piRNAs has significant theoretical and practical value for understanding the pathogenesis of diseases. In this study, the iPiDA-LTR predictor is proposed to identify associations between piRNAs and diseases based on Learning to Rank. The iPiDA-LTR predictor not only identifies the missing associations between known piRNAs and diseases, but also detects diseases associated with newly detected piRNAs. Experimental results demonstrate that iPiDA-LTR effectively predicts piRNA-disease associations and outperforms the other related methods.

As more and more piRNA functions have been discovered, growing evidence indicates that dysfunction and abnormal expression of piRNAs are closely associated with the emergence and development of diseases [13][14][15][16][17]. Therefore, the identification of associations between piRNAs and diseases is important for the diagnosis and treatment of diseases [18,19]. Current approaches fall into two groups: biological experimental methods and computational methods. Among the biological experimental methods, Cabral et al. indicated that piRNAs play a role in the translational research of gastric cancer as potential biomarkers [20]. Krishnan et al. identified eight non-redundant piRNAs as breast cancer markers [21]. Roy et al. studied the reciprocal expression between piRNAs and their corresponding targets, and provided novel insight into the role of piRNAs in Alzheimer's disease [22]. Although biological experimental methods are highly reliable, they take substantial time and resources. Some computational methods have been proposed for identifying the associations between non-coding RNAs and diseases, such as miRNA-disease associations [23], circRNA-disease associations [24], etc. In this regard, computational methods have been proposed to predict piRNA-disease associations, which can serve as powerful auxiliary tools that save time and cost compared with biological experiments. For example, Wei et al. proposed the first computational predictor for identifying piRNA-disease associations based on the positive unlabelled learning algorithm, and established the first web server [25]. In another study, a convolutional neural network was utilized to extract association features between piRNAs and diseases, and then a Support Vector Machine was employed to construct the predictor [26]. Although computational methods have been proposed, they mainly target the application scenario of identifying missing associations between known piRNAs and diseases.
However, more and more new piRNAs are being detected [27][28][29]. Therefore, the application scenario of identifying piRNA-disease associations for newly detected piRNAs is very important for investigating piRNA functions and disease pathogenesis.
In recent years, information retrieval (IR) has become a widely used technology, whose ultimate goal is to rank documents based on their relevance to certain topics [30,31]. As a successful algorithm in information retrieval, Learning to Rank (LTR) has been applied to bioinformatics tasks such as predicting protein-phenotype associations [38], drug-target binding affinity prediction [39], etc. The core concept of LTR is to calculate the relevance score f(q, d) between query q and document d. This task is therefore closely analogous to the identification of piRNA-disease associations (see Fig 1). PiRNAs and diseases can be treated as queries and documents, respectively. Learning to Rank not only identifies associations between known piRNAs and diseases, but also ranks diseases associated with newly detected piRNAs.
In this study, we propose a new predictor, named iPiDA-LTR, to predict associations between piRNAs and diseases. The iPiDA-LTR predictor combines component methods with Learning to Rank, so that it can not only identify missing associations between known piRNAs and diseases, but also identify diseases associated with newly detected piRNAs. Experimental results indicate that iPiDA-LTR is promising for identifying piRNA-disease associations. A web server of iPiDA-LTR has been constructed to identify diseases associated with query piRNAs, and can be accessed at http://bliulab.net/iPiDA-LTR.

Materials
To imitate the two application scenarios, we construct two types of datasets based on the piRDisease v1.0 database [40], which collects 7939 piRNA-disease associations involving 4796 piRNAs and 28 diseases. Firstly, a standard dataset $S_{all}$ is constructed following [25], where $A_{all}$ represents the 5002 piRNA-disease associations from [25]. $P_{all}$ and $D$ contain the 4350 piRNAs and 21 diseases from $A_{all}$, respectively. $A_{all}^{+}$ and $A_{all}^{-}$ contain the known and the unknown piRNA-disease associations, respectively. Specifically, piRNA-disease associations contained in $A_{all}^{+}$ are labelled as 1, and all others as 0. To avoid overfitting, $S_{all}$ is further divided into a benchmark dataset and an independent dataset. The benchmark dataset is used to adjust parameters and train the model via cross-validation, and the independent dataset is employed to evaluate the performance of different methods.

For the first application scenario: predicting associations between known piRNAs and known diseases
The benchmark dataset $S^{a}_{ben}$ and the independent dataset $S^{a}_{ind}$ are constructed as follows: we randomly select 20% of the associations from $A_{all}^{+}$ and $A_{all}^{-}$ to construct $S^{a+}_{ind}$ and $S^{a-}_{ind}$, respectively, and the remaining associations in $A_{all}^{+}$ and $A_{all}^{-}$ are used to construct $S^{a+}_{ben}$ and $S^{a-}_{ben}$, respectively. $S^{a}_{ben}$ is the benchmark dataset, which is used to optimize parameters and train models; the trained models are then used to identify unknown associations in $S^{a}_{ind}$.

For the second application scenario: predicting the associations between newly detected piRNAs and known diseases
To imitate the second application scenario, we randomly select 80% and 20% of the piRNAs from $P_{all}$ as the known piRNA set $P^{known}_{all}$ and the newly detected piRNA set $P^{unknown}_{all}$, respectively, based on which the benchmark dataset $S^{p}_{ben}$ and the independent dataset $S^{p}_{ind}$ are constructed. PiRNAs contained in $S^{p}_{ben}$ and $S^{p}_{ind}$ belong to $P^{known}_{all}$ and $P^{unknown}_{all}$, respectively. Detailed information on $S^{a}_{ben}$, $S^{a}_{ind}$, $S^{p}_{ben}$ and $S^{p}_{ind}$ is shown in Table 1. The datasets can be obtained at http://bliulab.net/iPiDA-LTR/dataset/.
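The difference between the two splitting schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy piRNA and disease identifiers, and the fixed random seed are hypothetical and are not taken from the released datasets. In the first scenario the 20% hold-out is drawn from the associations (separately for the positive and the negative set), whereas in the second scenario it is drawn from the piRNAs, so that every association of a held-out piRNA goes to the independent dataset.

```python
import random

def split_associations(pairs, test_fraction=0.2, seed=0):
    """Scenario 1: hold out a random fraction of (piRNA, disease) pairs."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (benchmark, independent)

def split_by_pirna(pairs, test_fraction=0.2, seed=0):
    """Scenario 2: hold out every pair that belongs to a random fraction of piRNAs."""
    rng = random.Random(seed)
    pirnas = sorted({p for p, _ in pairs})
    rng.shuffle(pirnas)
    n_test = round(len(pirnas) * test_fraction)
    held_out = set(pirnas[:n_test])  # plays the role of the newly detected piRNAs
    benchmark = [(p, d) for p, d in pairs if p not in held_out]
    independent = [(p, d) for p, d in pairs if p in held_out]
    return benchmark, independent

# Toy data: 10 hypothetical piRNAs x 3 hypothetical diseases.
pairs = [(f"piR-{i}", f"disease-{j}") for i in range(10) for j in range(3)]
positives = pairs[::3]                       # hypothetical known associations
negatives = [pair for pair in pairs if pair not in positives]

# Scenario 1: split positives and negatives separately.
ben_pos, ind_pos = split_associations(positives)
ben_neg, ind_neg = split_associations(negatives)

# Scenario 2: split on piRNAs, then assign all of their pairs.
ben_p, ind_p = split_by_pirna(pairs)
```

With the second function, no piRNA of the independent set is ever seen during training, which is what makes that scenario a test of generalisation to newly detected piRNAs.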

Method overview
In this study, a novel ranking framework, named iPiDA-LTR, is proposed to solve two application scenarios. The workflow of iPiDA-LTR is shown in

PLOS COMPUTATIONAL BIOLOGY
The identification of piwi-interacting RNA-disease associations

Association feature extraction
PiRNA sequence similarities. PiRNA similarities play a vital role in RNA-disease association identification [24-26], and piRNA sequence similarities have been applied to piRNA-disease association identification [25,26]. Many methods have been proposed to calculate sequence similarities [41][42][43]. For example, the Smith-Waterman algorithm has been successfully applied to multiple sequence analysis tasks, including RNA sequence similarity analysis [25,26,44], protein sequence analysis [45,46], etc. In this study, we employ the Smith-Waterman algorithm [41,44] to calculate piRNA sequence similarities:
$$S_P(p_i, p_j) = \frac{SW(p_i, p_j)}{\sqrt{SW(p_i, p_i) \cdot SW(p_j, p_j)}}$$
where $S_P(p_i, p_j)$ is the similarity between piRNA $p_i$ and piRNA $p_j$, and $SW(p_i, p_j)$ represents the local alignment score between piRNA $p_i$ and piRNA $p_j$ based on the Smith-Waterman algorithm.
Disease semantic similarities. The disease semantic similarity calculation is a key component in RNA-disease association identification. The disease ontology [47] has been applied to RNA-disease association identification in order to calculate disease semantic similarities [48][49][50][51][52][53]. The disease ontology, organized as a directed acyclic graph (DAG), provides a hierarchical structure of diseases and their parent nodes [47]. Similar diseases share similar hierarchical structures in the DAG of the disease ontology. Therefore, the DAG of the disease ontology helps to measure the similarity between two diseases. In this study, we use the DAG of the disease ontology to calculate disease semantic similarities [54,55]:
$$S_D(m, n) = \frac{\sum_{k \in T_m \cap T_n} \left( S_m(k) + S_n(k) \right)}{\sum_{k \in T_m} S_m(k) + \sum_{k \in T_n} S_n(k)}$$
$$S_n(i) = \begin{cases} 1 & \text{if } i = n \\ \max\{0.5 \times S_n(j) \mid j \in \text{children of } i\} & \text{if } i \neq n \end{cases}$$
where $S_D(m, n)$ is the similarity between disease m and disease n, $T_k$ represents the node set containing the ancestor nodes of k and k itself, and $S_n(i)$ is the semantic value of node i to node n.
Association features and labels. The association feature between disease d and the query piRNA p is:
$$F(p, d) = [S_P(p,:), S_D(d,:)] \tag{7}$$
where $F(p, d)$ is the association feature of piRNA p and disease d. $S_P(p,:)$ and $S_D(d,:)$ represent the pth row of $S_P$ and the dth row of $S_D$, respectively. If piRNA p is associated with disease d, the label of $F(p, d)$ is equal to 1, otherwise 0.
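The feature extraction described above can be illustrated with a small pure-Python sketch. It rests on several assumptions that are not stated in the text: the Smith-Waterman scoring values (match 2, mismatch -1, linear gap -1), the normalisation of the alignment score by the two self-alignment scores, the semantic value of 1 for a disease with respect to itself, and the toy sequences and three-level ontology are all hypothetical choices made only to keep the example self-contained.

```python
import math

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (linear gap penalty; scoring values are assumed)."""
    prev, best = [0] * (len(b) + 1), 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

def sequence_similarity(a, b):
    """S_P: alignment score normalised by the two self-alignment scores."""
    return smith_waterman(a, b) / math.sqrt(smith_waterman(a, a) * smith_waterman(b, b))

def semantic_values(n, parents):
    """S_n(i) for every node i in T_n (n itself and all of its ancestors)."""
    values, frontier = {n: 1.0}, [n]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            # Keep the largest contribution reaching this ancestor (the max in S_n).
            if 0.5 * values[child] > values.get(parent, 0.0):
                values[parent] = 0.5 * values[child]
                frontier.append(parent)
    return values

def disease_similarity(m, n, parents):
    """S_D: shared semantic values over the total semantic values of m and n."""
    sm, sn = semantic_values(m, parents), semantic_values(n, parents)
    shared = sm.keys() & sn.keys()
    return sum(sm[k] + sn[k] for k in shared) / (sum(sm.values()) + sum(sn.values()))

# Hypothetical toy inputs: two short sequences and a three-level ontology.
sequences = ["ACGT", "ACGA"]
diseases = ["gastric cancer", "breast cancer"]
parents = {"gastric cancer": ["cancer"], "breast cancer": ["cancer"], "cancer": ["disease"]}

S_P = [[sequence_similarity(a, b) for b in sequences] for a in sequences]
S_D = [[disease_similarity(m, n, parents) for n in diseases] for m in diseases]

def association_feature(p, d):
    """F(p, d): row p of S_P concatenated with row d of S_D."""
    return S_P[p] + S_D[d]

feature = association_feature(0, 1)
```

The length of each feature vector equals the number of piRNAs plus the number of diseases, so on the full dataset it would have 4350 + 21 entries.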

Component methods
In this study, we select two types of component methods to calculate association scores: machine learning methods and collaborative filtering (CF). For the machine learning methods, Random Forest (RF) [56][57][58][59][60], Logistic Regression (LR) [61], and Support Vector Machine (SVM) [62][63][64] are employed, treating the identification of piRNA-disease associations as a classification problem. CF is a recommendation algorithm [65,66] that uses the guilt-by-association assumption to identify piRNA-disease associations, focusing on local information. In this study, the association features of the benchmark dataset (see Eq 7) are used to train the machine learning models, which are then used to calculate association scores for the $S_{all}$ dataset. Finally, the association scores calculated by the component methods serve as the association features between piRNA p and disease d (Eq 8).

[Figure caption, partial: ... of query piRNAs; (iii) Ranking diseases associated with query piRNAs: association scores of samples in the benchmark dataset are used to train the LambdaMART model, and then the trained LambdaMART model is employed to rank diseases associated with query piRNAs. https://doi.org/10.1371/journal.pcbi.1010404.g002]

Ranking diseases associated with query piRNAs
In this study, motivated by information retrieval, we employ Learning to Rank (LTR) to identify potential piRNA-disease associations [24,37,38,67]. LTR methods are generally classified into three categories: ListWise, PairWise and PointWise [68]. In this study, the ListWise method LambdaMART [32] is selected to obtain high-quality top-ranked diseases; it has been applied to identifying circRNA-disease associations [24], detecting protein remote homology [37], predicting protein-phenotype associations [38] and predicting drug-target binding affinity [39]. The number of trees, the truncation level k, the shrinkage and the number of leaves are its four main parameters. The truncation level k influences the quality of the top-ranked results as measured by Normalized Discounted Cumulative Gain (NDCG), which can be formulated as [32]:
$$NDCG@k = \frac{DCG@k}{IDCG@k}, \qquad DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \tag{9}$$
where k represents the truncation level and IDCG@k is the value of DCG@k for the ideal ranking. If the query piRNA is associated with the disease located at position i, $rel_i$ is equal to 1, otherwise 0. To obtain the final ranking results, the association scores calculated by Eq 8 for the training set are used to train the LambdaMART model, and the trained LambdaMART model is then employed to rank the diseases associated with query piRNAs based on the association scores of those query piRNAs.
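To make the role of the truncation level k concrete, the following sketch computes NDCG@k for a single query piRNA. It assumes the commonly used gain $2^{rel_i} - 1$ with a $\log_2(i + 1)$ discount, which coincides with $rel_i / \log_2(i + 1)$ for the binary labels used here; the function names and the example ranking are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked positions."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Ranked list of diseases for one query piRNA: 1 = associated, 0 = not associated.
ranking = [0, 1, 0, 1, 0]
score_at_5 = ndcg_at_k(ranking, 5)
score_at_1 = ndcg_at_k(ranking, 1)   # the top position is a miss
```

Because positions beyond k contribute nothing, a small k rewards only rankings that place associated diseases at the very top.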

Evaluation criteria
In this study, the benchmark dataset is employed to optimize the parameters of the models, and the independent dataset is used to evaluate the performance of the predictors. Evaluating both the ranking quality and the prediction performance is crucial for identifying piRNA-disease associations. Because the iPiDA-LTR predictor treats the identification of piRNA-disease associations as an information retrieval ranking task, we employ three important ranking criteria to evaluate the ranking quality of different predictors: Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and ROCk. In addition, Area Under the ROC Curve (AUC) and Area Under the Precision-Recall Curve (AUPR) are used to measure comprehensive performance [69][70][71][72]. The average values of these criteria over all query piRNAs are calculated to evaluate the performance of the predictors.
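As an illustration of the per-query ranking criteria, the sketch below computes Average Precision (averaged into MAP over queries) and ROCk. The text does not spell out ROCk, so the code assumes its common definition: the area under the ROC curve up to the k-th false positive, normalised by k times the number of positives. The padding rule for lists with fewer than k false positives is a further assumption, and all names are hypothetical.

```python
def average_precision(ranked_labels):
    """Mean of the precision values at each position holding a positive."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def roc_k(ranked_labels, k):
    """Normalised ROC area up to the k-th false positive (assumed definition)."""
    positives = sum(ranked_labels)
    if positives == 0:
        return 0.0
    seen_pos, false_pos, area = 0, 0, 0
    for label in ranked_labels:
        if label:
            seen_pos += 1
        else:
            false_pos += 1
            area += seen_pos          # positives ranked above this false positive
            if false_pos == k:
                break
    # Assumption: if the list holds fewer than k negatives, each missing
    # false positive is counted as ranked below every positive.
    area += (k - false_pos) * positives
    return area / (k * positives)

# Ranked labels for two hypothetical query piRNAs (1 = associated disease).
rankings = [[1, 0, 1, 0], [0, 1, 0, 1]]
map_score = mean_average_precision(rankings)
roc1_scores = [roc_k(r, 1) for r in rankings]
```

ROC1 is the strictest of these criteria: it only credits positives that are ranked above the first false positive.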

The effect of parameters for identifying piRNA-disease associations
The iPiDA-LTR predictor mainly contains the following four parameters: the number of trees, the truncation level k, the shrinkage and the number of leaves. Because the four parameters have a large number of combinations, we fix three parameters in turn and then find the locally optimal value of the remaining parameter according to AUPR. The influences of different parameter combinations for iPiDA-LTR on the $S^{a}_{ben}$ and $S^{p}_{ben}$ datasets are shown in Figs 3 and 4, respectively, from which we can see that the final optimized combinations of the four parameters (in the order listed above) on the $S^{a}_{ben}$ and $S^{p}_{ben}$ datasets are (120, 14, 0.22, 3) and (30, 15, 0.10, 29), respectively.
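The one-parameter-at-a-time strategy described above can be sketched as a simple coordinate search. Everything in this example is hypothetical: the parameter names, the candidate grids, and in particular `toy_aupr`, which is a made-up stand-in for the cross-validated AUPR and is peaked at the reported combination purely so that the search has something to find.

```python
def coordinate_search(grids, score, start, sweeps=2):
    """Tune one parameter at a time while the others stay fixed."""
    best = dict(start)
    best_score = score(best)
    for _ in range(sweeps):
        for name, values in grids.items():
            for value in values:
                trial = {**best, name: value}
                trial_score = score(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
    return best, best_score

def toy_aupr(params):
    """Hypothetical score surface standing in for cross-validated AUPR."""
    return -(abs(params["n_trees"] - 120) / 100
             + abs(params["truncation_level"] - 14) / 10
             + abs(params["shrinkage"] - 0.22)
             + abs(params["num_leaves"] - 3) / 10)

grids = {
    "n_trees": [30, 60, 90, 120, 150],
    "truncation_level": [5, 10, 14, 15, 20],
    "shrinkage": [0.05, 0.10, 0.22, 0.30],
    "num_leaves": [3, 10, 29],
}
start = {"n_trees": 30, "truncation_level": 5, "shrinkage": 0.05, "num_leaves": 29}
best, best_score = coordinate_search(grids, toy_aupr, start)
```

Such a search evaluates far fewer combinations than a full grid, at the cost of only guaranteeing a local optimum, which matches the "local optimal values" wording in the text.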

Complementary analysis for component methods
In this study, iPiDA-LTR incorporates two types of component methods: machine learning methods (LR, RF and SVM) and collaborative filtering (CF). LR, RF and SVM are implemented with the Python package Scikit-learn [73]. For LR, max_iter and solver are set to 300 and liblinear, respectively. For RF, n_estimators, max_leaf_nodes, n_jobs and max_features are set to 80, 10, -1 and 0.2, respectively. For SVM, kernel and probability are set to linear and True, respectively. We analyze the impact of the different types of component methods on identifying associations between piRNAs and diseases, and the results are shown in Tables 2 and 3, from which we can see the following: (i) the iPiDA-LTR predictor outperforms the iPiDA-LTR-ML predictor on both the $S^{a}_{ben}$ and $S^{p}_{ben}$ datasets; (ii) iPiDA-LTR clearly outperforms iPiDA-LTR-ML in terms of the ranking criteria (NDCG@5 and ROC1), especially for the second application scenario (see Table 3). Machine learning methods based on classification algorithms focus on global predictive performance, whereas collaborative filtering can identify specific piRNA-related diseases by focusing on local predictive performance. Therefore, machine learning methods and collaborative filtering are complementary. It is not surprising that the iPiDA-LTR predictor performs better than iPiDA-LTR-ML, because iPiDA-LTR shares the advantages of both types of methods.
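The three machine learning component methods and the parameter values listed above can be written down directly with Scikit-learn. The sketch below only illustrates that configuration: the helper names, the synthetic feature matrix, and the use of the positive-class probability as the association score are assumptions, and collaborative filtering is left out because its details are not given in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def build_component_classifiers():
    """Classifiers configured with the parameter values stated in the text."""
    return {
        "LR": LogisticRegression(max_iter=300, solver="liblinear"),
        "RF": RandomForestClassifier(n_estimators=80, max_leaf_nodes=10,
                                     n_jobs=-1, max_features=0.2),
        "SVM": SVC(kernel="linear", probability=True),
    }

def component_scores(models, X_train, y_train, X_query):
    """One column of positive-class probabilities per component classifier."""
    columns = []
    for model in models.values():
        model.fit(X_train, y_train)
        columns.append(model.predict_proba(X_query)[:, 1])
    return np.column_stack(columns)

# Synthetic stand-in for association features and labels (not real data).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = rng.normal(size=(60, 8)) + y[:, None]

scores = component_scores(build_component_classifiers(), X[:40], y[:40], X[40:])
```

Each row of `scores` then holds one score per classifier for a piRNA-disease pair, which is the kind of input a ranking model can be trained on.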
The usage frequencies of component methods measure the contribution of component methods for iPiDA-LTR.

Comparison with related methods
In this section, two state-of-the-art predictors, iPiDi-PUL [25] and iPiDA-sHN [26], are compared with the iPiDA-LTR predictor. The results are shown in Tables 4 and 5, from which we can see that iPiDA-LTR outperforms the other methods, indicating that iPiDA-LTR is more suitable for identifying piRNA-disease associations. In practical application scenarios, researchers tend to focus on the top-ranked predicted associations. Therefore, we analyze the quality of the predicted results (see Fig 6), from which we can see that iPiDA-LTR outperforms the other predictors in terms of ROC1-ROC10. This is not surprising, because the NDCG-based loss function of LambdaMART mainly focuses on the top-ranked known associations (see Eq 9).

Case study
To illustrate the predictive performance of the iPiDA-LTR predictor for identifying associations between new piRNAs and diseases, two piRNAs, piR-hsa-23210 and piR-hsa-15023, are selected as query piRNAs from the $S_{all}$ dataset. The remaining piRNAs in $S_{all}$ are used to train the iPiDA-LTR model, and the trained model is then employed to predict the diseases associated with piR-hsa-15023 and piR-hsa-23210.
The predicted results for piR-hsa-23210 and piR-hsa-15023 are shown in Tables 6 and 7, respectively, from which we can see the following: (i) The top five predicted piR-hsa-23210-associated diseases are supported by evidence from PubMed (https://pubmed.ncbi.nlm.nih.gov/). For example, the target gene of piR-hsa-23210 is SMC5, which plays crucial roles in human spermatogenesis, for instance in the synaptonemal complex between synapsed chromosomes and in the development of spermatogonial cells [74]. Roy et al. found that piR-33044 (piR-hsa-23210) is significantly abnormally expressed in Alzheimer's disease [22]. (ii) Four diseases in Table 7 have been shown to be associated with piR-hsa-15023. For example, Busch et al. found that piR-hsa-15023 is down-regulated in renal cell carcinoma [75], and piR-hsa-15023 showed significantly differential expression between gastric adenocarcinoma and non-malignant stomach tissue [76]. These results demonstrate that the iPiDA-LTR predictor is an effective approach for identifying diseases associated with newly detected query piRNAs.

Conclusion
In this study, we treat the identification of piRNA-disease associations as a search task based on Learning to Rank [32,68], where piRNAs and diseases are regarded as queries and documents, respectively. The following conclusions can be drawn: (i) iPiDA-LTR effectively handles both application scenarios compared with the other state-of-the-art methods, especially the identification of diseases associated with newly detected piRNAs, which is important for studying the pathogenesis of diseases and the functions of piRNAs; (ii) iPiDA-LTR incorporates component methods into Learning to Rank to improve the predictive performance; (iii) the corresponding web server of iPiDA-LTR is freely accessible at http://bliulab.net/iPiDA-LTR/. Although iPiDA-LTR effectively predicts piRNA-disease associations, it only integrates basic machine learning methods and collaborative filtering. In future studies, we will integrate other state-of-the-art methods and features to improve the prediction of piRNA-disease associations. The LTR-based framework discussed in this study is a general framework, which could have many other applications in bioinformatics, such as protein function prediction, remote homology detection, etc.

Author Contributions
Conceptualization: Bin Liu.