iPiDA-SWGCN: Identification of piRNA-disease associations based on Supplementarily Weighted Graph Convolutional Network

Accurately identifying potential piRNA-disease associations is of great importance in uncovering the pathogenesis of diseases. Recently, several machine-learning-based methods have been proposed for piRNA-disease association detection. However, they suffer from the high sparsity of the piRNA-disease association network and from the Boolean representation of piRNA-disease associations, which ignores the confidence coefficients. In this study, we propose a supplementarily weighted strategy to address these disadvantages. Combined with Graph Convolutional Networks (GCNs), a novel predictor called iPiDA-SWGCN is proposed for piRNA-disease association prediction. There are three main contributions of iPiDA-SWGCN: (i) Potential piRNA-disease associations are preliminarily supplemented in the sparse piRNA-disease network by integrating various basic predictors to enrich the network structure information. (ii) The original Boolean piRNA-disease associations are assigned different relevance confidences so that node representations are learned from neighbour nodes in varying degrees. (iii) The experimental results show that iPiDA-SWGCN achieves the best performance compared with the other state-of-the-art methods, and can predict new piRNA-disease associations.

Emerging evidence indicates that piRNAs participate in the genesis and prognosis of many diseases [7][8][9]. For example, piR-36712 is downregulated in breast cancer, where it suppresses cell proliferation, invasion and migration by combining with SEPW1P RNA [10]. piR-651 shows upregulated expression in gastric cancer, and tends to be associated with TNM stages [11]. Several studies highlight that piRNAs can be viewed as potential biomarkers for disease diagnosis and prognosis, supporting the design of effective therapeutic strategies [12,13]. Therefore, developing computational methods for identifying piRNA-disease associations is of great significance.
Several computational methods have been proposed to predict the associations between non-coding RNAs and diseases. These methods usually rely on network link prediction [14], recommendation systems [15], matrix completion [16], classical machine learning [17] and deep learning [18]. However, research on the detection of piRNA-disease associations is still in its preliminary stages. To unravel the complex interactions between piRNAs and diseases, several computational methods for predicting piRNA-disease associations have been proposed [19][20][21][22][23]. For example, iPiDi-PUL [19] and iPiDA-sHN [20] were proposed to predict piRNA-disease associations based on positive-unlabeled learning. APDA [21] and iPiDA-GBNN [22] employed a stacked auto-encoder to extract piRNA-disease pair features, which were then used to train random forests and a Gradient Boosting Neural Network, respectively, to predict new piRNA-disease associations. DFL-PiDA [23] combined a convolutional de-noising autoencoder and an extreme learning machine to identify potential associations. iPiDA-LTR [24] employed a ranking framework to integrate several component methods for piRNA-disease association detection. To further improve the representation ability of associations, iPiDA-GCN [25] designed Asso-GCN and Sim-GCN models for iteratively extracting features for piRNAs and diseases.
Although the aforementioned methods have contributed to piRNA-disease association detection, two main limitations hinder further improvement of piRNA-disease association prediction: (i) The high sparsity of piRNA-disease associations. For example, only about 5% of piRNA-disease associations have experimental validation in piRDisease v1.0 [26] and MNDR v3.0 [27]. The lack of association information prevents the predictors from accurately inferring piRNA-disease associations. (ii) The Boolean associations between piRNAs and diseases. Most of the existing methods for piRNA-disease association detection use Boolean values during training to denote whether a piRNA is related to a disease or not. However, piRNAs are related to diseases with different probabilities. Boolean associations, which focus only on connectivity, ignore this confidence information.
To overcome the above limitations, we propose a supplementarily weighted strategy that enriches the topological structure information of the piRNA-disease network so as to provide more information for piRNA-disease association detection. As shown in Fig 1, our goal is to infer whether the target piRNA is associated with the target disease or not. In the original network, owing to the lack of link information, it is difficult to detect whether the target association exists or not (Fig 1a). Three weighted piRNA-disease associations are then preliminarily supplemented (Fig 1b), so that three feasible paths can be generated to infer the association probability (Fig 1c). Finally, the prediction results are obtained by integrating the inference results from the different feasible paths (Fig 1d). As a result, supplementarily weighted associations enrich the connectivity information and yield more possible paths for comprehensively predicting the relationship between piRNAs and diseases.
With the development of representation learning, the Graph Convolutional Network (GCN) [28] was proposed to extend the CNN to graph-structured data, and it achieves a powerful representation learning ability by capturing rich structural information [29]. In this paper, we combine the supplementarily weighted strategy with GCN, and propose a novel predictor named iPiDA-SWGCN for piRNA-disease association detection with three major contributions: (i) The high-quality supplementarily weighted matrix computed by machine learning methods provides more information for the original sparse piRNA-disease association network, based on which GCN can aggregate more neighbour node information for expressive feature learning. (ii) The Boolean associations are replaced by weighted associations with the computed relevance confidence. With the piRNA-disease associations assigned different weights, GCN can learn node representations from neighbour nodes in varying degrees. (iii) The evaluation results indicate that iPiDA-SWGCN is able to effectively detect missing piRNA-disease associations, and outperforms the other state-of-the-art methods.
In this study, the machine learning predictors are trained in two phases: (i) training several basic classifiers to compute the weights of unknown piRNA-disease associations; (ii) training the GCN to capture piRNA and disease features. In the first phase, the dataset $D_{all}$ is split into a benchmark dataset $D^{1}_{ben}$ and an independent dataset $D^{1}_{ind}$. We train several basic predictors on the benchmark dataset, and then use the trained predictors to score the associations in the independent dataset. The datasets can be formulated as:

$$D_{all} = D^{+}_{all} \cup D^{-}_{all}, \qquad D^{1}_{ben} = D^{1+}_{ben} \cup D^{1-}_{ben}, \qquad D^{1}_{ind} = D^{1+}_{ind} \cup D^{-}_{all}$$

where $D^{+}_{all}$ is the positive set with 11981 known piRNA-disease associations, and $D^{-}_{all}$ contains 180850 unknown piRNA-disease associations. $D^{1+}_{ben}$ and $D^{1-}_{ben}$ are randomly selected from $D_{all}$ with an equal number of piRNA-disease associations, and represent the positive and negative subsets of $D^{1}_{ben}$, respectively. In order to assign weights to all unknown piRNA-disease associations, the negative independent set $D^{-}_{all}$ contains all unknown associations. The positive independent set $D^{1+}_{ind}$ is constructed by randomly selecting 20% of the positive associations in $D^{+}_{all}$. In the second training phase, we divide the dataset following previous studies [30,31]:

$$D^{2}_{ben} = D^{2+}_{ben} \cup D^{2-}_{ben}, \qquad D^{2}_{ind} = D^{2+}_{ind} \cup D^{2-}_{ind}$$

where the positive benchmark dataset $D^{2+}_{ben}$ contains 80% of the positive piRNA-disease associations randomly selected from $D^{+}_{all}$, and the remaining positive associations constitute the positive independent dataset $D^{2+}_{ind}$. The negative independent dataset $D^{2-}_{ind}$ is randomly selected from $D^{-}_{all}$ with the same size as $D^{2+}_{ind}$, and the rest of the negative samples constitute the sub-dataset $D^{2-}_{ben}$.
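The second-phase split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_phase_two`, the fixed seed, and the toy pairs are assumptions introduced here, and only the 80%/20% positive split and the size-matched negative independent set come from the text.

```python
import random

def split_phase_two(positives, negatives, train_fraction=0.8, seed=0):
    """Split association pairs into benchmark and independent subsets.

    Assumption: pairs are hashable (piRNA index, disease index) tuples.
    """
    rng = random.Random(seed)
    pos = list(positives)
    neg = list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # 80% of the known associations form the positive benchmark set
    n_pos_ben = int(round(train_fraction * len(pos)))
    pos_ben, pos_ind = pos[:n_pos_ben], pos[n_pos_ben:]
    # the negative independent set matches the positive independent set in size;
    # the remaining unknown pairs go to the negative benchmark set
    neg_ind, neg_ben = neg[:len(pos_ind)], neg[len(pos_ind):]
    return {"pos_ben": pos_ben, "pos_ind": pos_ind,
            "neg_ben": neg_ben, "neg_ind": neg_ind}

# Toy data (hypothetical): 10 known and 40 unknown piRNA-disease pairs.
toy_pos = [(i, 0) for i in range(10)]
toy_neg = [(i, 1) for i in range(40)]
split = split_phase_two(toy_pos, toy_neg)
print({name: len(pairs) for name, pairs in split.items()})
```

Because the four subsets are slices of two shuffled lists, they are disjoint by construction, which is the property needed to keep independent pairs out of training.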
To prevent overestimating the performance of the proposed method, the piRNA-disease associations in the independent dataset for model evaluation are removed from the training phases:

Method overview
In this study, we propose a novel method named iPiDA-SWGCN for piRNA-disease association detection. The overall process of iPiDA-SWGCN is shown in Fig 2 and consists of four parts: (i) Network construction (Fig 2a). A heterogeneous piRNA-disease network is constructed by integrating piRNA and disease information. (ii) Supplementarily weighted piRNA-disease network generation (Fig 2b). Different supplementary weights are assigned to unknown piRNA-disease pairs based on the scores computed by several predictors. (iii) GCN-based feature extraction (Fig 2c). GCN is performed on the supplementarily weighted piRNA-disease network to capture the structural information and extract feature representations of piRNAs and diseases. (iv) Association prediction (Fig 2d). Finally, fully connected layers are used for dimension reduction, and an inner product operation calculates the association scores between piRNAs and diseases.

Network construction
In this section, we construct a heterogeneous piRNA-disease network denoted as $G = \{V_{piR}, V_{disease}, E_{piR-disease}, E_{piR-piR}, E_{disease-disease}\}$. In detail, three types of edges and two types of nodes are included in the piRNA-disease network. The three kinds of edges are piRNA-disease associations, piRNA-piRNA edges and disease-disease edges, represented as $E_{piR-disease}$, $E_{piR-piR}$ and $E_{disease-disease}$, respectively. $E_{piR-disease}$ are the original piRNA-disease associations derived from the database MNDR v3.0. The other two kinds of edges are obtained based on the similarity between homogeneous biological entities. $V_{piR}$ and $V_{disease}$ represent the nodes of piRNAs and diseases. The specific calculation of the above edge and node representations is introduced in the following sections.
PiRNA-disease associations. The adjacency matrix $A_{PD}$ represents the associations for each piRNA-disease pair, denoted as:

$$A_{PD} = [a_{i,j}]_{m \times n}$$

where $m$ and $n$ are the numbers of piRNAs and diseases, respectively. The element $a_{i,j}$ is 1 if the association between the $i$-th piRNA and the $j$-th disease is confirmed by experimental verification; otherwise $a_{i,j} = 0$.

PiRNA-piRNA sequence similarity. For each pair of piRNAs, the sequence similarity is calculated, where the sequence information is downloaded from piRBase v3.0 (http://bigdata.ibp.ac.cn/piRBase/) [32]. PiRNA-piRNA sequence similarity is calculated via the Smith-Waterman local alignment algorithm [33], which is highly sensitive and can detect subtle similarities between sequences with low identity in a robust and accurate manner [34,35]. We compute the sequence similarity score for a given pair of piRNAs as [33]:

$$P_{seq}(p_i, p_j) = \frac{SW(p_i, p_j)}{\sqrt{SW(p_i, p_i) \cdot SW(p_j, p_j)}}$$

where $SW(p_i, p_j)$ denotes the sequence alignment score between $p_i$ and $p_j$ calculated by the Smith-Waterman alignment algorithm.

Disease-disease semantic similarity. Disease semantic similarity has been extensively used in ncRNA-disease association detection [36][37][38][39]. A Directed Acyclic Graph (DAG) describes the relationship among different disease terms, and disease semantic similarity can be calculated based on the Disease Ontology (DO) descriptors in the DAG [40][41][42]. The DAG-based algorithm not only uses a consistent standard that makes the calculated disease similarities uniform and comparable, but also captures complex relationships between diseases and therefore better represents the disease space and the interconnection of diseases. In this study, we adopt one of the most effective semantic similarity measurements [41,43,44] to compute the disease semantic similarity.
For disease $d_i$ and disease $d_j$, the disease semantic similarity $D_{sem}(d_i, d_j)$ is calculated by [44]:

$$D_{sem}(d_i, d_j) = \frac{\sum_{t \in T_i \cap T_j} \left( S_{d_i}(t) + S_{d_j}(t) \right)}{\sum_{t \in T_i} S_{d_i}(t) + \sum_{t \in T_j} S_{d_j}(t)}$$

$$S_{d_i}(t) = \begin{cases} 1, & t = d_i \\ \max \{ \theta \cdot S_{d_i}(t') \mid t' \in \text{children of } t \}, & t \neq d_i \end{cases}$$

where $T_i$ is composed of all subterms in the DAG of disease $d_i$, and $S_{d_i}(t)$ indicates the semantic contribution of disease $t \in T_i$ to the $i$-th disease according to [44]. $\theta$ is the semantic contribution factor, which is set to 0.5 following [44].
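To make the DAG-based similarity concrete, the following sketch computes it on a toy ontology. It assumes the commonly used formulation in which a disease contributes 1 to itself and each ancestor receives the maximum of θ times the contribution of its children, with similarity defined as the shared contribution divided by the total contribution. The term names and the `parents` mapping are hypothetical and do not come from the Disease Ontology.

```python
def semantic_contributions(disease, parents, theta=0.5):
    """Return {term: contribution} for a disease and all of its ancestors.

    `parents` maps each term to a list of its parent terms (assumed DAG).
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        term = frontier.pop()
        for parent in parents.get(term, ()):
            candidate = theta * contrib[term]
            # keep the largest contribution reaching this ancestor
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                frontier.append(parent)
    return contrib

def semantic_similarity(d_i, d_j, parents, theta=0.5):
    """Shared semantic contribution divided by total semantic contribution."""
    s_i = semantic_contributions(d_i, parents, theta)
    s_j = semantic_contributions(d_j, parents, theta)
    shared = set(s_i) & set(s_j)
    numerator = sum(s_i[t] + s_j[t] for t in shared)
    denominator = sum(s_i.values()) + sum(s_j.values())
    return numerator / denominator

# Toy ontology (hypothetical): C is a child of A; A and B are children of root.
toy_parents = {"A": ["root"], "B": ["root"], "C": ["A"]}
sim_ca = semantic_similarity("C", "A", toy_parents)
sim_ab = semantic_similarity("A", "B", toy_parents)
print(round(sim_ca, 4), round(sim_ab, 4))
```

In this toy case the parent-child pair (C, A) shares two terms and scores higher than the sibling pair (A, B), which shares only the root.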
Node feature extraction. The row vectors of the similarity matrix $P_{seq}$ or $D_{sem}$ can serve as the feature vectors of piRNAs or diseases. However, they ignore the connectivity information, especially for non-neighboring and higher-order connected nodes. Therefore, in order to introduce the global topology information of the similarity networks, we further apply random walk with restart (RWR) [45] on $P_{seq}$ and $D_{sem}$ to extract piRNA and disease features. The feature generation for node $i$ based on RWR can be formulated as [45]:

$$F_i^{k+1} = (1 - \alpha)\, F_i^{k} S + \alpha\, y_i$$

where $F_i^{k}$ is the row vector of node $i$, whose elements indicate the probabilities of walking from node $i$ to all the other homogeneous nodes after $k$ steps. $S$ is the probability transition matrix obtained from the similarity matrix ($P_{seq}$ or $D_{sem}$) by row-wise normalization, $y_i$ is a one-hot vector representing the initial probability vector of node $i$, and $\alpha$ is the restart probability. Finally, we obtain $F_i^{\infty}$ as the feature descriptor of node $i$.
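A minimal sketch of RWR-based feature extraction is given below. It assumes the standard update in which the walker follows the row-normalized similarity matrix with probability 1 − α and returns to the start node with probability α. The value α = 0.5, the tolerance, and the toy similarity matrix are illustrative assumptions, not settings reported in the text.

```python
def row_normalize(matrix):
    """Turn a non-negative similarity matrix into a row-stochastic matrix."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total > 0 else 0.0 for v in row])
    return normalized

def rwr_features(similarity, alpha=0.5, tol=1e-10, max_iter=1000):
    """Return one stationary RWR probability vector per node."""
    S = row_normalize(similarity)
    n = len(S)
    features = []
    for i in range(n):
        y = [1.0 if j == i else 0.0 for j in range(n)]  # one-hot start vector
        f = y[:]
        for _ in range(max_iter):
            # row vector times transition matrix, plus restart term
            f_next = [(1 - alpha) * sum(f[k] * S[k][j] for k in range(n))
                      + alpha * y[j] for j in range(n)]
            delta = max(abs(a - b) for a, b in zip(f_next, f))
            f = f_next
            if delta < tol:
                break
        features.append(f)
    return features

# Toy similarity matrix (hypothetical values) for three homogeneous nodes.
toy_similarity = [[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.0]]
toy_features = rwr_features(toy_similarity)
for row in toy_features:
    print([round(v, 4) for v in row])
```

Nodes 0 and 2 have zero direct similarity, yet each receives a non-zero probability from the other through node 1, which is the higher-order information the raw similarity rows miss.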

Supplementarily Weighted Graph Convolutional Network (SWGCN)
The many unknown pairs lead to the high sparsity of the piRNA-disease association network. Performing GCN on such a network may degrade performance because only limited neighbor information can be aggregated. To overcome this problem, we propose the Supplementarily Weighted Graph Convolutional Network (SWGCN). First, the supplementarily weighted adjacency matrix is computed to obtain an informative, high-quality piRNA-disease association network, and then GCN is adopted to capture hidden structure information for feature learning.

Weights computation for piRNA-disease association. In this section, we define the supplementarily weighted adjacency matrix in SWGCN for piRNA-disease association completion as follows:

Definition 1. Supplementarily weighted adjacency matrix. Given a network $G = \{V, E\}$, we train several basic machine-learning-based predictors to compute the average relevance scores for unknown associations among nodes. Specifically, $w_{i,j}$ represents the average score of edge $e_{i,j}$ between node $n_i$ and node $n_j$, and the adjacency matrix of the completed network can be represented as:

$$A'(i, j) = \begin{cases} 1, & \text{if } e_{i,j} \text{ is a known association} \\ w_{i,j}, & \text{otherwise} \end{cases}$$

To improve the quality of SWGCN, 15 different basic predictors based on Random Forest (RF) [46], Support Vector Machine (SVM) [47] and Gradient Boosting Decision Tree (GBDT) [48] are trained for weight computation. It should be noted that we train the predictors on different training sets, where the positive training set $D^{1+}_{ben}$ stays the same while the negative training set $D^{1-}_{ben}$ is randomly selected from $D^{-}_{all}$ five times. For a pair of piRNA $p_i$ and disease $d_j$, the element $A'_{PD}(i, j)$ is formulated as:

$$A'_{PD}(i, j) = \begin{cases} 1, & \text{if } A_{PD}(i, j) = 1 \\ \frac{1}{15} \sum_{k=1}^{5} \left( s^{k}_{RF} + s^{k}_{SVM} + s^{k}_{GBDT} \right), & \text{otherwise} \end{cases}$$

where, for example, $s^{k}_{SVM}$ denotes the prediction score computed by the $k$-th SVM classifier; $s^{k}_{RF}$ and $s^{k}_{GBDT}$ are defined analogously. The 15 predictors are employed to score the unknown associations, and the average scores are set as the weights of the unknown associations. Finally, the weighted edges are added to the original heterogeneous network to generate the supplementarily weighted piRNA-disease network.
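The weighting step can be sketched as follows, assuming the basic predictors have already produced one score matrix each. The function name `supplementary_weights` and the three toy score matrices are hypothetical; the method described above averages 15 predictors, whereas the toy uses three for brevity.

```python
def supplementary_weights(known, score_sets):
    """Build a weighted adjacency matrix from known links and predictor scores.

    known: m x n matrix of 0/1 labels (1 = experimentally verified).
    score_sets: list of m x n score matrices, one per basic predictor.
    """
    m, n = len(known), len(known[0])
    weighted = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if known[i][j] == 1:
                weighted[i][j] = 1.0  # known associations keep weight 1
            else:
                # unknown pairs receive the average predictor score
                total = sum(scores[i][j] for scores in score_sets)
                weighted[i][j] = total / len(score_sets)
    return weighted

# Toy example (hypothetical scores): 2 piRNAs x 2 diseases, 3 predictors.
toy_known = [[1, 0], [0, 0]]
toy_scores = [
    [[0.9, 0.2], [0.1, 0.6]],  # e.g. a random forest
    [[0.8, 0.4], [0.3, 0.6]],  # e.g. an SVM
    [[0.7, 0.6], [0.2, 0.6]],  # e.g. a GBDT
]
toy_weighted = supplementary_weights(toy_known, toy_scores)
print(toy_weighted)
```

Averaging across predictors trained on different negative samples is intended to reduce the influence of any single sampled negative set on the final weight.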
Feature learning based on Graph Convolutional Network. GCN has been widely used for capturing graph structural information and aggregating neighbour information so as to extract node features [49,50]. After weights are assigned to the piRNA-disease associations, GCN is performed on the supplementarily weighted piRNA-disease network to learn the feature representations of different nodes by aggregating information from neighbours. The node embedding learned by GCN in the $l$-th layer can be formulated as [28]:

$$H^{l+1} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A}_{all} \tilde{D}^{-\frac{1}{2}} H^{l} W^{l} \right), \qquad \tilde{A}_{all} = A_{all} + I$$

where the adjacency matrix $A_{all} \in \mathbb{R}^{(m+n) \times (m+n)}$ indicates the association adjacency matrix for the whole network, and $\tilde{A}_{all}$ is $A_{all}$ with self-loops added. $\tilde{D}$ denotes the degree matrix of $\tilde{A}_{all}$, $W^{l}$ represents the trainable parameters of the GCN model, and $\sigma(\cdot)$ denotes the nonlinear activation function. The input embedding $H^{0} \in \mathbb{R}^{(m+n) \times c}$ is initialized by concatenating the piRNA and disease representations, where $c$ denotes the common dimension of the piRNA and disease features obtained by the dense layer. It should be noted that a batch norm layer is added before each dense layer and convolution layer to alleviate the problem of vanishing or exploding gradients and to speed up convergence. The batch norm layer standardizes the mean and variance of the input to each layer based on statistics computed over a batch of training examples, which keeps the distribution of each layer's input relatively stable and further reduces the risk of overfitting.
Compared with the original piRNA-disease network, the supplementarily weighted edges allow GCN to aggregate more useful neighbor information. Fig 3 compares GCN performed on the different networks. Taking the disease node a as an example, on the original piRNA-disease network the first layer of GCN updates the feature representation of node a by aggregating the features of its first-order neighbor nodes b1 and b2 (Fig 3a and 3b), and fails to capture the information of the indirectly connected nodes c1, c2, c3, d1, d2 and d3. In contrast, on the supplementarily weighted piRNA-disease association network, the representation of node a can be learned by aggregating the information of all piRNA nodes with different degrees of attention (Fig 3c and 3d). As a result, SWGCN is able to capture deeper and wider neighbor information for informative feature learning.
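The effect of supplementary edges on one round of graph convolution can be illustrated with a small plain-Python sketch. It assumes the symmetric-normalized propagation rule with self-loops and a ReLU activation. The three-node network, the feature values, the identity weight matrix, and the edge weight 0.3 are hypothetical, and batch normalization is omitted.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adjacency, features, weights):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # add self-loops so that each node keeps its own features
    a_tilde = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    degree = [sum(row) for row in a_tilde]
    # symmetric normalization by the (weighted) node degrees
    a_hat = [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j])
              for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]

# Toy network: node 0 is a disease, nodes 1 and 2 are piRNAs.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
boolean_adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
weighted_adj = [[0.0, 1.0, 0.3], [1.0, 0.0, 0.0], [0.3, 0.0, 0.0]]
h_boolean = gcn_layer(boolean_adj, features, identity)
h_weighted = gcn_layer(weighted_adj, features, identity)
print(h_boolean[0], h_weighted[0])
```

On the Boolean network, piRNA node 2 is isolated and contributes nothing to the disease node; once the supplementary edge with weight 0.3 is added, its features enter the disease embedding in proportion to that weight.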

Association prediction
After the feature representations are learned by GCN, fully connected networks with three layers are constructed to separately reduce the dimensions of the representations of piRNAs and diseases. The probability that piRNA $p_i$ is associated with disease $d_j$ is calculated by an inner product as:

$$U(i, j) = f'_{p_i} \cdot \left( f'_{d_j} \right)^{T}$$

where $f'_{p_i}$ and $f'_{d_j}$ denote the feature representations of piRNA $p_i$ and disease $d_j$ after dimensionality reduction, and $U$ is the final predicted score matrix for the associations between piRNAs and diseases. It should be noted that we conduct Batch Normalization (BN) [51] following each convolution layer to mitigate internal covariate shift and increase stability.
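The scoring step can be sketched as a plain inner product between reduced feature vectors. The fully connected reduction layers are omitted here, and the feature values are hypothetical; the sketch only shows how a score matrix with one row per piRNA and one column per disease is obtained.

```python
def score_matrix(pirna_features, disease_features):
    """Inner product of every piRNA vector with every disease vector."""
    return [[sum(p * d for p, d in zip(p_vec, d_vec))
             for d_vec in disease_features]
            for p_vec in pirna_features]

# Toy reduced features (hypothetical): 2 piRNAs and 3 diseases in 2 dimensions.
toy_pirna = [[1.0, 2.0], [0.0, 1.0]]
toy_disease = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_scores_u = score_matrix(toy_pirna, toy_disease)
print(toy_scores_u)
```

Each row of the result can then be sorted to rank candidate diseases for a piRNA, and each column to rank candidate piRNAs for a disease.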
We use the mean square error as the loss function, which minimizes the Frobenius norm of the difference between the prediction score matrix $U$ and the label matrix $A_{PD}$. Nevertheless, the high sparsity of the piRNA-disease association matrix may bias the predictions towards the unknown associations. To alleviate this problem, we adopt the β-enhanced loss function [52] that enlarges the margin between the real label matrix $A_{PD}$ and the predicted score matrix $U$ with hyper-parameter α. The loss function is formulated as:

$$Loss = \left\lVert \hat{A}_{PD} - U \right\rVert_F^{2} + \mu \left\lVert W \right\rVert_F^{2}$$

where $\hat{A}_{PD}$ indicates the enhanced association matrix, $W$ denotes the trainable parameters, and $\mu$ is a decay factor controlling the regularization term of $W$ to prevent overfitting.

Performance evaluation
Two metrics are employed to comprehensively evaluate the predictor performance, including AUPR (area under the precision recall curve) [53] and AUC (area under the receiver operating characteristics curve) [54]. AUC measures the sensitivity and specificity of the model [55], and AUPR can avoid the impact of imbalanced data sets, and comprehensively reflect the quality of predictions [56].
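Both metrics can be computed from a list of labels and predicted scores, as in the sketch below. It assumes AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count as one half) and approximates AUPR by average precision, which is one common estimator of the area under the precision-recall curve. The labels and scores are toy values.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision; assumes no tied scores between classes."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    if hits == 0:
        raise ValueError("at least one positive is required")
    return total / hits

# Toy predictions (hypothetical): two positives and two negatives.
toy_labels = [1, 0, 1, 0]
toy_pred = [0.9, 0.8, 0.7, 0.1]
toy_auc = auc_score(toy_labels, toy_pred)
toy_ap = average_precision(toy_labels, toy_pred)
print(toy_auc, round(toy_ap, 4))
```

The pairwise AUC is quadratic in the number of samples, so for datasets of the size used here a rank-based implementation from an established library would be the practical choice.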

Combination of basic methods can improve the quality of SWGCN
In this study, three basic machine learning methods (RF, SVM and GBDT) are used to compute weights for the unknown piRNA-disease pairs. To analyze their contribution to weighting associations in the proposed model, we compare the predictors based on different basic methods and their combinations. Table 1 lists the results, and their performance differences are shown in Fig 4, from which we can draw the following conclusions: (i) Assigning weights to associations indeed contributes to performance improvement. (ii) Integrating several complementary methods for weight computation improves the performance compared with using a single basic method. (iii) iPiDA-SWGCN outperforms all the other variants because it integrates all the basic methods to compute the weights assigned to piRNA-disease associations.

Impact of different association adjacency matrices on the performance of iPiDA-SWGCN
Previous studies represent the piRNA-disease associations as Boolean values, where the element in the adjacency matrix is 1 if the association in the network is known, otherwise it is 0.
In this study, we compute weights for the unknown associations in the network instead of setting them to 0. To demonstrate the effectiveness and necessity of assigning weights instead of Boolean values to unknown edges, we perform experiments on five networks with different edge types. The experimental results are listed in Table 2, from which we can see that the model with the weighted network is superior to the models whose networks are supplemented with Boolean piRNA-disease edges. The reason is that assigning weights to associations not only enriches the network structural information, but also provides association confidence information for GCN so that it can learn expressive node representations. In contrast, Boolean edges only indicate whether the associations exist or not.

Performance comparison among different methods
To demonstrate the effectiveness of iPiDA-SWGCN for identifying piRNA-disease associations, we compared the performance of our method with five state-of-the-art approaches, including iPiDi-PUL [19], iPiDA-sHN [20], iPiDA-LTR [24], iPiDA-GCN [25] and piRDA [57]. The web servers or source codes of all these methods are accessible, enabling unbiased evaluation of their performance. The experimental results are displayed in Table 3, from which we can conclude that iPiDA-SWGCN outperforms all the other methods. iPiDA-SWGCN is superior to iPiDA-GCN by 10.29% and 11.15% in terms of AUC and AUPR, respectively. The performance improvement can be attributed to the fact that iPiDA-SWGCN is able to capture more expressive node representations by aggregating more neighbor information from the supplementarily weighted network.

Visualization of predicted associations by iPiDA-SWGCN
In order to visually illustrate the effectiveness of the supplementarily weighted network used in iPiDA-SWGCN, the prediction results of GCN performed on different networks are compared. We take two piRNA-disease associations for further analysis: <piR-hsa-22710, Parkinson's disease> and <piR-hsa-28405, renal cell carcinoma>. Parkinson's disease (PD) is a widely prevalent neurodegenerative disorder. The gross pathological findings reveal significant damage to dopaminergic neurons in the midbrain's substantia nigra (SN), leading to dopamine deficiency in the nerve terminals located in the basal ganglia [58]. piR-hsa-22710 has a length of 30 nt and is downregulated in PD-patient tissue samples [59]. Renal cell carcinoma (RCC) is a prevalent cancer, ranking sixth in incidence in men and tenth in women globally [59]. Recent research suggests that the expression levels of piRNAs are linked to the histological grade and pathological features of RCC, and to patient survival. Multiple studies have demonstrated that piRNAs are differentially expressed in benign versus malignant renal tumor tissues [59,60]. piR-hsa-28405 is a Homo sapiens piRNA with a length of 32 nt, showing about 4-fold downregulation in renal tumor tissue [60].
The results are shown in Fig 5, from which we can draw the following conclusions: (i) Owing to the shortage of verified piRNA-disease associations, the ability of GCN to capture network structural information is limited; (ii) GCN performed on the supplementarily weighted network can correctly predict the test piRNA-disease associations, because the weighted network not only provides more proximity structural information but also contains the association confidence.

Table 3 note: the results of iPiDi-PUL [19], iPiDA-sHN [20] and iPiDA-GCN are obtained from [25].

Case study
To demonstrate the practicability of iPiDA-SWGCN, we applied it to detect potential piRNAs associated with three important diseases ('Renal cell carcinoma', 'Parkinson's disease' and 'Cardiovascular disease'). The top five detected piRNAs and their corresponding literature evidence are listed in Table 4. From Table 4 we can see that all of the top five detected piRNAs associated with 'Renal cell carcinoma' and 'Parkinson's disease' have been validated in the literature, and four of the five detected piRNAs associated with 'Cardiovascular disease' are confirmed by the literature. For example, piR-hsa-23184 shows a 2.26-fold higher expression in metastatic compared to non-metastatic tumors [60]. piR-hsa-5389 is upregulated in cells and post-mortem tissue samples from Parkinson's disease patients compared with controls [59]. The expression of piR-hsa-25177 in cardiosphere cells is 3.38-fold higher than that in cardiosphere-derived cells [62]. Therefore, iPiDA-SWGCN can effectively detect potential piRNA-disease associations and is suitable for real-world applications. More detailed case study results are shown in Table B in S1 Text.

Conclusion
In this work, we propose a novel predictor named iPiDA-SWGCN for piRNA-disease association prediction by combining the supplementarily weighted strategy and GCN. iPiDA-SWGCN has the following main advantages: (i) Potential piRNA-disease associations are supplemented in the piRNA-disease network by integrating various basic predictors to provide an informative network, based on which GCN can capture deep proximity structure and fully utilize network information for feature learning. (ii) Different confidences are assigned to the piRNA-disease associations instead of Boolean values. Therefore, GCN aggregates node information and accurately learns node representations from neighbor nodes in varying degrees. (iii) Although iPiDA-SWGCN is proposed for predicting piRNA-disease associations, it can be extended to other link prediction tasks.
It is anticipated that the strategy of the supplementarily weighted adjacency matrix will be applied to other related fields to address the problem of limited positive samples, such as lncRNA-disease association detection and drug repositioning. In particular, link prediction tasks typically involve a large number of unknown associations, which results in a high sparsity problem. The supplementarily weighted strategy can be used to preliminarily enrich the association network so as to provide more useful information and improve the prediction performance.
Besides, the value of piRNA-disease associations without independent experimental validation is worth mentioning. PiRNAs are potential biomarkers that may provide new avenues of investigation into the pathogenesis of diseases. However, it should be noted that the associations predicted by iPiDA-SWGCN are only correlations and require experimental validation to confirm their significance. This is especially relevant for piRNA studies because of the high abundance of piRNAs and their poorly understood roles in diseases. Future studies should aim to validate these associations through a combination of biological experiments and bioinformatics analysis to ensure their reliability and accuracy, which would promote a better understanding of the complex mechanisms underlying diseases and the development of more effective strategies for prevention and treatment.
Supporting information S1 Text. Fig A. Parameter analysis of iPiDA-SWGCN.