SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions

LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identifying lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but these features are not available for all lncRNAs or proteins; moreover, most existing methods cannot predict interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which have no known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, "SFPEL-LPI", to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can identify novel lncRNA-protein interactions, which are confirmed by the literature. Finally, we construct a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.


Introduction
Long noncoding RNAs (lncRNAs) are a class of transcribed RNA molecules with a length of more than 200 nucleotides that do not encode proteins [1,2]. Since lncRNAs are involved in important biological regulations [3][4][5], lncRNAs have gained widespread attention. Studies [5][6][7][8][9] revealed that lncRNAs can interact with proteins, and then activate post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNAs' functions. There exist a large number of unexplored lncRNAs and proteins, which makes it impossible to examine their interactions efficiently and effectively through wet experiments.
In recent years, many computational methods have been proposed to predict lncRNA-protein interactions, in order to screen lncRNA-protein interactions and guide wet experiments. There are two types of computational methods: binary classification methods and semi-supervised learning methods. The binary classification methods take known interacting lncRNA-protein pairs as positive instances and non-interacting pairs as negative instances, and build binary classification-based models. Muppirala et al. [10] adopted the k-mer composition to encode RNA sequences and protein sequences, and used SVM and random forest to build prediction models. Wang et al. [11] used RNA-protein interactions as positive instances, randomly selected twice as many protein-RNA pairs without interaction information as negative samples, and then built prediction models by using naive Bayes. Suresh et al. [12] proposed a support vector machine-based predictor "RPI-Pred" to predict protein-RNA interactions based on their sequences and structures. Xiao et al. [13] used the HeteSim measure to score lncRNA-protein pairs, and then built an SVM classifier based on HeteSim scores. However, binary classification-based methods are influenced by the imbalance ratio between positive instances and negative instances, and selecting high-quality negative instances is challenging. Semi-supervised learning methods formulate the lncRNA-protein interaction prediction as semi-supervised learning tasks. Lu et al. [14] used matrix multiplication to score each RNA-protein pair for prediction. Li et al. [15] proposed a heterogeneous network-based method "LPIHN", which integrated the lncRNA-lncRNA similarity network, the lncRNA-protein interaction network and the protein-protein interaction network. Then, a random walk with restart was implemented on the heterogeneous network to infer lncRNA-protein interactions. Yang et al. [16] proposed the HeteSim algorithm, which can predict lncRNA-protein relations based on the heterogeneous lncRNA-protein network. Ge et al. [17] proposed a computational method "LPBNI" based on the lncRNA-protein bipartite network inference. Zheng et al. [18] constructed multiple protein-protein similarity networks to predict lncRNA-protein interactions. Zhang et al. [19] employed the KATZ measure to calculate similarities between lncRNAs and proteins in a global network, which was constructed based on lncRNA-lncRNA similarity, lncRNA-protein associations and protein-protein interactions. Hu et al. [20] presented the eigenvalue transformation-based semi-supervised link prediction method "LPI-ETSLP". Zhang et al. [21] proposed a linear neighborhood propagation method (LPLNP) by combining the interaction profiles, expression profiles and sequence composition of lncRNAs with the interaction profiles and CTD features of proteins. Moreover, there are related works about the DNA-protein binding prediction [22,23].
Existing computational methods utilize diverse lncRNA features and protein features, but these features are not available for all lncRNAs or proteins, and the methods cannot work when the required information is unavailable. In addition, many lncRNAs (or proteins) have no known interactions with any protein (or lncRNA); we refer to them as new lncRNAs (or proteins). Most existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins).
In this paper, we propose the sequence-based feature projection ensemble learning method, "SFPEL-LPI", to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. Computational experiments demonstrate that SFPEL-LPI predicts lncRNA-protein associations accurately and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can find out novel lncRNA-protein interactions.

Dataset
Several databases facilitate the lncRNA-protein interaction prediction. The NPInter database [24] includes experimental interactions among non-coding RNAs and biomolecules (i.e. proteins, genomic DNAs and RNAs). NONCODE is an integrated information resource for non-coding RNAs. SUPERFAMILY [25] is a database of structural and functional annotation for all proteins and genomes. As far as we know, lncRNA-protein interactions from the NPInter v2.0 database have been widely used in related studies [20,21,26-29]. Based on NPInter v2.0 interactions, we compiled a dataset containing 4158 lncRNA-protein interactions between 990 lncRNAs and 27 proteins. Moreover, we collected the sequences of these lncRNAs and proteins from NONCODE and SUPERFAMILY, respectively. We adopt the NPInter v2.0 dataset as the benchmark dataset to test the performances of prediction models.
Here, we introduce notations about the dataset. Given a set of lncRNAs $L = \{L_1, L_2, \cdots, L_s\}$ and a set of proteins $P = \{P_1, P_2, \cdots, P_t\}$, known lncRNA-protein interactions can be represented by an $s \times t$ interaction matrix $Y$, where $Y_{ij} = 1$ if the lncRNA $L_i$ interacts with the protein $P_j$, and $Y_{ij} = 0$ otherwise.
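The notation can be illustrated with a minimal Python sketch that builds the interaction matrix from a list of known pairs. The function name `build_interaction_matrix` and the toy identifiers are illustrative assumptions, not part of the benchmark dataset or of the SFPEL-LPI code.

```python
def build_interaction_matrix(lncrnas, proteins, known_pairs):
    """Return the s x t 0/1 interaction matrix Y as a list of rows."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(proteins)}
    Y = [[0] * len(proteins) for _ in lncrnas]
    for lnc, prot in known_pairs:
        Y[row[lnc]][col[prot]] = 1  # known interaction; every other entry stays 0
    return Y


# Toy identifiers for illustration only (hypothetical, not from NPInter).
lncrnas = ["L1", "L2", "L3"]
proteins = ["P1", "P2"]
pairs = [("L1", "P1"), ("L3", "P2"), ("L3", "P1")]
Y = build_interaction_matrix(lncrnas, proteins, pairs)
```

In this toy case `L2` has an all-zero row, which is the situation the paper refers to as a new lncRNA.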

Features for lncRNAs and proteins
In this section, we describe two lncRNA features and two protein features, based on lncRNA sequences, protein sequences and known lncRNA-protein interactions.
LncRNA features. The pseudo dinucleotide composition (PseDNC) [43][44][45][46] describes the contiguous local sequence-order information and the global sequence-order information of lncRNAs. The pseudo dinucleotide composition has several variants, and we use the parallel correlation pseudo dinucleotide composition, which contains the occurrences of different dinucleotides and the physicochemical properties of dinucleotides. The PseDNC feature vector of an RNA sequence $L$ is defined in terms of the following quantities: $f_k$ is the normalized occurrence frequency of the $k$-th dinucleotide in the RNA sequence $L$; the parameter $\tau$ is an integer, representing the highest counted rank of the correlation along an RNA sequence; $w$ is the weight factor ranging from 0 to 1; $\theta_j$ is the $j$-tier correlation factor reflecting the sequence-order correlation between all the $j$-th most contiguous dinucleotides along an RNA sequence. We obtain PseDNC feature vectors of lncRNAs by using the python package "repDNA", and more details about PseDNC are described in [40].
Moreover, we define the interaction profiles (IP) of lncRNAs based on known lncRNA-protein interactions. For a lncRNA $L_i$, its interaction profile is a binary vector encoding the presence or absence of interactions with every protein, denoted as $IP_{L_i}$. The interaction profile of a lncRNA corresponds to a row vector of the interaction matrix $Y$: $IP_{L_i} = Y(i,:)$.
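For reference, the parallel correlation PseDNC vector is commonly written in the following form. This is a sketch of the standard definition using the symbols above ($f_k$, $\tau$, $w$, $\theta_j$); the index ranges assume 16 dinucleotides, and the exact variant computed by repDNA may differ in detail.

```latex
D = [d_1, d_2, \ldots, d_{16}, d_{16+1}, \ldots, d_{16+\tau}]^{T},
\qquad
d_k =
\begin{cases}
\dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\tau} \theta_j}, & 1 \le k \le 16, \\[2ex]
\dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\tau} \theta_j}, & 17 \le k \le 16+\tau.
\end{cases}
```

The first 16 components carry the dinucleotide composition and the last $\tau$ components carry the sequence-order correlation, so $w$ sets the balance between the two parts.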
Protein features. The pseudo amino acid composition (PseAAC) [47-49] describes the amino acid composition and the sequence-order information of proteins, and has been widely used for tasks in bioinformatics. PseAAC contains 20 components reflecting the occurrence frequency of amino acids in a protein as well as additional factors reflecting sequence-order information. Thus, we use PseAAC as a feature to represent proteins. There are several variants of PseAAC, and we adopt the parallel correlation pseudo amino acid composition. The PseAAC feature vector of a protein sequence $P$ is defined in terms of the following quantities: $f_i$ is the normalized occurrence frequency of the $i$-th of the 20 amino acids in the protein sequence $P$; the parameter $\tau$ is an integer, representing the highest counted rank of the correlation along a protein sequence; $w$ is the weight factor ranging from 0 to 1; $\theta_j$ is the $j$-tier correlation factor reflecting the sequence-order correlation between all the $j$-th most contiguous residues along a protein sequence. We obtain the PseAAC feature vectors of proteins by using the web server "Pse-in-One", and more details are described in [37].
Similar to the lncRNA interaction profiles, the interaction profile (IP) of a protein $P_i$ is a binary vector specifying the presence or absence of interactions with every lncRNA, denoted as $IP_{P_i}$. The interaction profile of a protein corresponds to a column vector of the interaction matrix $Y$: $IP_{P_i} = Y(:,i)$.

Similarities for lncRNAs and proteins
In this section, we describe three lncRNA-lncRNA similarities and three protein-protein similarities.
LncRNA-lncRNA similarities. As introduced in Section "LncRNA features", we have two lncRNA features: PseDNC and IP, and thus use them to calculate two types of lncRNA-lncRNA similarities. There are different approaches to calculate similarity based on feature vectors, such as Jaccard similarity, Gauss similarity and cosine similarity. Here, we adopt the linear neighborhood similarity (LNS), which has been proposed in our previous work and successfully applied to many bioinformatics problems [21,34,50].
Moreover, we define the Smith Waterman subgraph similarity (SWSS) for lncRNAs. The Smith Waterman algorithm [51] is a powerful tool to calculate similarity between biological sequences, but it only takes the sequence information into account. By considering both sequence information and interaction information, we define the SWSS between lncRNA $L_i$ and lncRNA $L_j$ from the Smith Waterman scores of their interacting proteins: $SW(P_{o1}, P_{o2})$ is the Smith Waterman score between protein $P_{o1}$ and protein $P_{o2}$; $A(L_i)$ and $A(L_j)$ are the sets of proteins which interact with $L_i$ and $L_j$; $n_1 = |A(L_i)|$ and $n_2 = |A(L_j)|$. Therefore, we obtain three lncRNA-lncRNA similarities: PseDNC similarity, IP similarity and SWSS similarity.
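The idea behind SWSS can be sketched in Python as follows. Two points are assumptions made only for illustration: the pairwise scores are aggregated as a plain average over the $n_1 \times n_2$ protein pairs (the exact normalization is not reproduced in this text), and the Smith Waterman score uses a simple match/mismatch/linear-gap scheme rather than a substitution matrix. The function names are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def swss(partners_i, partners_j, sequences):
    """Assumed form: average SW score over all pairs of interacting partners."""
    if not partners_i or not partners_j:
        return 0.0  # no known partners, so no subgraph information
    total = sum(
        smith_waterman_score(sequences[p], sequences[q])
        for p in partners_i
        for q in partners_j
    )
    return total / (len(partners_i) * len(partners_j))
```

For example, with toy protein sequences `{"P1": "MKV", "P2": "MKL"}`, a lncRNA interacting with `P1` and `P2` can be compared with a lncRNA interacting only with `P1` by calling `swss(["P1", "P2"], ["P1"], sequences)`.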
Protein-protein similarities. As introduced in Section "Protein features", we have two protein features: PseAAC and IP. We also calculate two types of similarities from them by using the linear neighborhood similarity measure.
Similarly, we can calculate the Smith Waterman subgraph similarity (SWSS) between two proteins $P_i$ and $P_j$ from the Smith Waterman scores of their interacting lncRNAs: $SW(L_{o1}, L_{o2})$ is the Smith Waterman score between lncRNA $L_{o1}$ and lncRNA $L_{o2}$; $A(P_i)$ and $A(P_j)$ are the sets of lncRNAs which interact with protein $P_i$ and protein $P_j$; $m_1 = |A(P_i)|$ and $m_2 = |A(P_j)|$. Therefore, we obtain three protein-protein similarities: PseAAC similarity, IP similarity and SWSS similarity.

Feature projection ensemble learning method
Combining or fusing various features can usually lead to high-accuracy models [52-58]. We have $n$ features for lncRNAs (or proteins), denoted as $n$ feature matrices $\{X_i\}_{i=1}^{n}$; the known lncRNA-protein interaction matrix is denoted as $Y$. The flowchart of the feature projection ensemble learning method SFPEL-LPI is shown in Fig 1.
Objective function. First, the lncRNA (or protein) feature matrices $\{X_i\}_{i=1}^{n}$ are respectively projected to the predicted lncRNA-protein interaction matrix $R$ by using the projection matrices $\{G_i\}_{i=1}^{n}$. We estimate the projection matrices $\{G_i\}_{i=1}^{n}$ by minimizing the squared error, measured by the squared Frobenius norm $\|\cdot\|_F^2$, between the projected features and the predicted lncRNA-protein interaction matrix $R$; the projection matrices $\{G_i\}_{i=1}^{n}$ are required to be nonnegative.
Then, we introduce the $\ell_{1,2}$-norm regularization term $\|G_i\|_{1,2}$ of $\{G_i\}_{i=1}^{n}$ to ensure the smoothness of the projection matrices. The predicted matrix $R$ should also approximate the known interaction matrix $Y$. In the resulting formulation, $\lambda$ is the regularization coefficient and $\mu$ is a trade-off parameter.
Local structure of data can be maintained effectively by constructing a weighted graph or a similarity graph on a scatter of data points. Studies [64][65][66][67] revealed that the combination of multiple similarities helps to improve performances. Inspired by this pioneering work, we define a novel ensemble graph Laplacian regularization, in which $W_i$ is the $i$-th similarity matrix, $D_i$ is a diagonal matrix whose diagonal elements are the corresponding row sums of $W_i$, $\theta = [\theta_1, \theta_2, \cdots, \theta_i, \cdots, \theta_m]$ is a weight vector introduced to control the contribution of the different graph Laplacian regularizations, and $\mathrm{tr}(\cdot)$ is the trace of a matrix. $\eta > 1$ is the exponent of $\theta$, which ensures that all graph Laplacian regularizations contribute effectively to maintaining the local structures of the graphs. By combining (4) and (5), we obtain the objective function (6) of SFPEL-LPI.
We introduce the Lagrangian function (Lf) to solve the optimization problem in (6). We calculate the partial derivatives of Lf with respect to $R$, $G_i$ and $\theta_i$, and obtain the update rules (7), (8) and (9) for $R$, $\theta_i$ and $G_i$ (proof and deduction are provided in S1 File). In these rules, $e$ is a column vector with all elements equal to 1 that has the same column dimensions as $X_i$; $\odot$ denotes element-wise multiplication (also known as the Hadamard product), and the division in (9) is element-wise division. The rule (9) is multiplicative: $G_i$ is multiplied element-wise by the square root of an element-wise quotient. We separate the positive and negative parts of a matrix $A$ as $A^{+} = (|A| + A)/2$ and $A^{-} = (|A| - A)/2$. Thus, we update $R$, $\theta_i$ and $G_i$ based on (7), (8) and (9) alternately until convergence.
Algorithms. Following the method proposed in Section "Objective function", SFPEL-LPI can predict unobserved interactions between known lncRNAs and proteins. First, based on the lncRNA features, the lncRNA similarities and the lncRNA-protein interactions, the prediction matrix $R_l$ is obtained. Similarly, using the protein features, the protein similarities and the protein-lncRNA interactions, the prediction matrix $R_p$ is calculated. Then, SFPEL-LPI integrates the predictions based on lncRNAs and proteins as $M = (R_l + (R_p)^T)/2$. Therefore, the unobserved interactions are scored in the corresponding entries of $M$. Algorithm 1 describes how SFPEL-LPI predicts unobserved associations between known lncRNAs and known proteins.
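Two pieces of this section can be sketched in plain Python: the ensemble graph Laplacian regularization and the final integration step $M = (R_l + (R_p)^T)/2$. The Laplacian term is written under the assumption that it has the usual form $\sum_i \theta_i^{\eta}\,\mathrm{tr}(R^T (D_i - W_i) R)$, which matches the description of $D_i$, $W_i$, $\theta$ and $\eta$ above but is not copied from the paper's equation. Function names are hypothetical, and the alternating update rules (7) to (9) are not reproduced.

```python
def laplacian_penalty(R, W):
    """tr(R^T (D - W) R) for one similarity matrix W, with D = diag(row sums of W)."""
    n = len(W)
    total = 0.0
    for a in range(n):
        degree = sum(W[a])
        for b in range(n):
            l_ab = (degree if a == b else 0.0) - W[a][b]  # entry (a, b) of D - W
            total += l_ab * sum(x * y for x, y in zip(R[a], R[b]))
    return total


def ensemble_laplacian_penalty(R, similarities, theta, eta):
    """Assumed form: sum_i theta_i**eta * tr(R^T (D_i - W_i) R)."""
    return sum(t ** eta * laplacian_penalty(R, W) for t, W in zip(theta, similarities))


def integrate_predictions(R_l, R_p):
    """M = (R_l + R_p^T) / 2, where R_l is s x t and R_p is t x s."""
    s, t = len(R_l), len(R_l[0])
    return [[(R_l[i][j] + R_p[j][i]) / 2 for j in range(t)] for i in range(s)]
```

For a symmetric $W$, `laplacian_penalty` equals half the sum of $W_{ab}\,\|R_a - R_b\|^2$ over all pairs, so it is small when similar lncRNAs (or proteins) receive similar predicted interaction rows.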
In addition, SFPEL-LPI can also be applied to predict proteins (or lncRNAs) interacting with new lncRNAs (or proteins). After using Algorithm 1 to train the model, we obtain the projection matrices and the weighting parameters of the lncRNA features and the protein features: G lu, G lv, θ lu and θ lv. Then, we can use the features of new lncRNAs (or proteins) and the trained parameters to predict their interactions. Algorithm 2 describes how SFPEL-LPI performs this task.

Evaluation metrics
We adopt five-fold cross validation to evaluate the performances of prediction models. The proposed method SFPEL-LPI can predict unobserved interactions between known lncRNAs and known proteins, and can also make predictions for new lncRNAs (or proteins). In predicting unobserved lncRNA-protein interactions, all known lncRNA-protein interactions are randomly split into five subsets of equal size. Each time, four subsets are combined as the training set and the remaining subset is used as the testing set. In predicting proteins interacting with new lncRNAs, all known lncRNAs are split into five subsets of equal size. The model is constructed based on the lncRNAs in the training set and their interactions with all proteins, and is then used to predict proteins interacting with the testing lncRNAs. Similarly, we evaluate the performances of models in predicting lncRNAs interacting with new proteins. We introduce the following notations for these cross validation settings. $CV_{lp}$: known lncRNA-protein interactions are split into five folds in predicting unobserved interactions. $CV_l$: known lncRNAs are split into five folds in predicting interactions for new lncRNAs. $CV_p$: known proteins are split into five folds in predicting interactions for new proteins.
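The three settings differ only in what is split: interacting pairs, lncRNAs (rows of $Y$) or proteins (columns of $Y$). A minimal Python sketch, assuming $Y$ is stored as a list of rows and using hypothetical helper names, could look like this; the shuffling seed and the way folds are dealt out are illustrative choices.

```python
import random


def five_fold_split(items, n_folds=5, seed=0):
    """Shuffle items and deal them into n_folds folds of nearly equal size."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[k::n_folds] for k in range(n_folds)]


def cv_lp_folds(Y, seed=0):
    """CV_lp: split the known interacting (lncRNA, protein) index pairs."""
    pairs = [(i, j) for i, row in enumerate(Y) for j, v in enumerate(row) if v == 1]
    return five_fold_split(pairs, seed=seed)


def cv_l_folds(Y, seed=0):
    """CV_l: split lncRNA (row) indices; test rows act as new lncRNAs."""
    return five_fold_split(range(len(Y)), seed=seed)


def cv_p_folds(Y, seed=0):
    """CV_p: split protein (column) indices; test columns act as new proteins."""
    return five_fold_split(range(len(Y[0])), seed=seed)


def mask_pairs(Y, test_pairs):
    """Training matrix for CV_lp: the test interactions are hidden (set to 0)."""
    hidden = set(test_pairs)
    return [[0 if (i, j) in hidden else v for j, v in enumerate(row)]
            for i, row in enumerate(Y)]
```

Under $CV_l$ and $CV_p$ the held-out rows or columns would be removed from training entirely, which is what makes those settings a test of prediction for new lncRNAs or proteins.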
The area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) are popular metrics for evaluating prediction models. Since known lncRNA-protein interactions are far fewer than non-interacting lncRNA-protein pairs, we adopt AUPR as the primary metric, because it penalizes false positives more heavily in the evaluation process [68,69]. Moreover, we adopt several binary classification metrics, i.e. recall (REC), accuracy (ACC), precision (PR) and F1-measure (F1).
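These metrics can be sketched in plain Python as follows. The sketch makes two assumptions: AUPR is estimated by average precision (a common step-wise estimate; the integration scheme used in the paper is not specified here), and the binary metrics use a hypothetical score threshold. Tied scores are handled in the AUC but not treated specially in the average precision.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (AUPR estimate)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def binary_metrics(labels, scores, threshold=0.5):
    """REC, ACC, PR and F1 after thresholding the scores (threshold is an assumption)."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, pred) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"REC": recall, "ACC": (tp + tn) / len(labels), "PR": precision, "F1": f1}
```

Because average precision only accumulates at positive instances, a highly ranked false positive lowers every later precision value, which is why this family of metrics is sensitive to false positives on imbalanced data.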

Parameter setting
SFPEL-LPI has three parameters: μ, λ and η. μ is a parameter for the error between projected interactions and predicted lncRNA-protein interactions; λ controls the contribution of the projection matrices; η controls the relative strength of the different similarity measures.
The parameter η is the exponent of the similarity weights, and controls the relative contributions of different similarities. Fixing $\mu = 10^{-3}$ and $\lambda = 10^{-4}$, we analyze the relation between η and the weights of the lncRNA similarity measures $\theta_{lncRNA}$ (or protein similarity measures $\theta_{protein}$). As shown in Fig 3, similarities usually make different contributions to SFPEL-LPI models, and interaction profile similarities usually contribute more than the other similarities. As η increases, the different similarities tend to make equal contributions.
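The flattening effect of η can be illustrated with a short Python sketch. It assumes that, with everything else fixed, the weights minimize $\sum_i \theta_i^{\eta} a_i$ subject to $\sum_i \theta_i = 1$, where $a_i > 0$ stands for the $i$-th graph Laplacian term; setting the derivative of the Lagrangian to zero then gives $\theta_i \propto (1/a_i)^{1/(\eta-1)}$. This is a simplified stand-in for the update rule derived in S1 File, and the function name and the values of $a_i$ are hypothetical.

```python
def similarity_weights(penalties, eta):
    """Minimizer of sum(theta_i**eta * a_i) subject to sum(theta) == 1, for a_i > 0."""
    if eta <= 1:
        raise ValueError("eta must be greater than 1")
    raw = [(1.0 / a) ** (1.0 / (eta - 1.0)) for a in penalties]
    total = sum(raw)
    return [r / total for r in raw]


# Toy penalties (made up): the first similarity fits the predictions better.
for eta in (2, 3, 10, 100):
    print(eta, similarity_weights([1.0, 4.0], eta))
```

As η grows, the exponent $1/(\eta-1)$ approaches zero and the weights approach the uniform vector, which is consistent with the trend described for Fig 3.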

Performances of SFPEL-LPI
SFPEL-LPI can predict unobserved lncRNA-protein interactions between known lncRNAs and known proteins, and can also make predictions for new lncRNAs (or proteins). For different tasks, we adopt different evaluation schemes to split instances and implement five-fold cross validation under the settings $CV_{lp}$, $CV_l$ and $CV_p$. Table 1 displays AUPR scores and AUC scores of SFPEL-LPI evaluated by $CV_{lp}$, $CV_l$ and $CV_p$. According to previous studies [70][71][72], a prediction model that can accurately recover the true interacting proteins (or lncRNAs) is usually desired and useful for wet experimental validation. Thus, we calculate the proportion of correctly predicted true interactions at different top-ranked percentiles under $CV_l$ or $CV_p$. A new metric, "recall @ top-ranked k%", is defined as the fraction of true interacting proteins (or lncRNAs) that are retrieved in the list of top-ranked k% predictions for a lncRNA (or protein). As shown in Fig 4A, SFPEL-LPI performs effectively in predicting proteins (or lncRNAs) interacting with new lncRNAs (or proteins). Predicting lncRNAs interacting with new proteins performs worse than predicting proteins interacting with new lncRNAs because the number of lncRNAs (990) in our dataset is much larger than the number of proteins (27). Consequently, less information is available to train SFPEL-LPI models for new proteins.
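The "recall @ top-ranked k%" metric can be sketched in Python as follows. The function name is hypothetical, and rounding the size of the top-k% list up with `math.ceil` is an assumption, since the rounding rule is not stated.

```python
import math


def recall_at_top_percent(scores, true_indices, k_percent):
    """Fraction of true partners found among the top k% highest-scoring candidates."""
    true_set = set(true_indices)
    if not true_set:
        raise ValueError("at least one true interaction is required")
    n_top = math.ceil(len(scores) * k_percent / 100.0)  # assumed rounding rule
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    hits = sum(1 for idx in ranked[:n_top] if idx in true_set)
    return hits / len(true_set)
```

Here `scores` holds the predicted scores of all candidate proteins for one new lncRNA (or all candidate lncRNAs for one new protein), and `true_indices` lists the candidates that truly interact with it.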
To further test the capability of SFPEL-LPI for new proteins, we randomly select ten proteins to conduct experiments. In each experiment, one protein is used as the testing protein; the model is constructed based on the other proteins, all lncRNAs and their associations, and is then used to predict lncRNAs interacting with the testing protein. AUC scores and AUPR scores are calculated based on the results for each protein. As shown in Fig 4B, SFPEL-LPI produces AUPR values greater than 0.6 and AUC values greater than 0.7 for most proteins, indicating great potential for predicting lncRNAs interacting with new proteins.

Comparison with state-of-the-art prediction methods
Several state-of-the-art computational methods have been proposed to predict lncRNA-protein interactions. Here, we adopt RWR [17], LPBNI [17], KATZLGO [19], LPI-ETSLP [20] and LPLNP [21] for comparison. RWR implemented random walk with restart to predict lncRNA-protein interactions. LPBNI constructed a lncRNA-protein bipartite network based on known lncRNA-protein interactions, and then predicted lncRNA-protein interactions by using the resource allocation algorithm. KATZLGO constructed a heterogeneous network based on lncRNA-lncRNA similarity, lncRNA-protein interactions and protein-protein similarity, and then adopted the KATZ measure to calculate distances between lncRNAs and proteins in the network. LPI-ETSLP calculated lncRNA-lncRNA similarity and protein-protein similarity based on pairwise sequence Smith-Waterman scores, and then built a semi-supervised link prediction classifier based on these similarities. LPLNP calculated three lncRNA-lncRNA similarities and two protein-protein similarities by using the linear neighborhood similarity measure, and implemented label propagation to develop the integrated models.
First, we build the different prediction models on the benchmark dataset. The benchmark methods were designed to predict unobserved interactions between known lncRNAs and known proteins. Therefore, we implement these methods and mainly evaluate their performances in predicting unobserved interactions under $CV_{lp}$. As shown in Table 2, SFPEL-LPI outperforms these five methods, and makes 100.4%, 43.3%, 65.4%, 46.9%, 3.1% improvements in terms of AUPR scores and 8.2%, 7.5%, 21.1%, 3.5%, 1.1% improvements in terms of AUC scores over RWR, LPBNI, KATZLGO, LPI-ETSLP and LPLNP, respectively. Though SFPEL-LPI produces only slightly better performances than LPLNP in terms of AUPR and AUC, LPLNP utilizes more information than SFPEL-LPI for modeling. To be more specific, LPLNP uses three lncRNA features ("interaction profile", "expression profile", "sequence composition") and two protein features ("interaction profile", "CTD"), while SFPEL-LPI only uses lncRNA sequences, protein sequences and known lncRNA-protein interactions.
We conduct 20 runs of five-fold cross validation to evaluate the methods, and use the paired t-test to analyze the difference between SFPEL-LPI and the benchmark methods. Table 3 demonstrates that SFPEL-LPI produces significantly better results than state-of-the-art methods in terms of AUC and AUPR. The computational complexity is important for a computational method. To test the efficiency of SFPEL-LPI, we repeat 5-fold cross validation 20 times and compare the running time of the different methods on a PC with an Intel i7 7700k CPU and 16GB RAM. SFPEL-LPI has a reasonable running time (29.42s) when compared with RWR (25.83s), LPBNI (4.01s), KATZLGO (4.36s), LPI-ETSLP (4.56s) and LPLNP (1337.64s).
Further, we randomly perturb all known lncRNA-protein interactions to test the robustness of prediction methods. To be more specific, we randomly remove 5% of known lncRNA-protein interactions and add the same number of inexistent interactions, and then compile the perturbed dataset. We build different prediction models based on the perturbed dataset and evaluate their performances. Clearly, data perturbation brings noise, and decreases the performances of prediction models. As displayed in Fig 5, Table 2, SFPEL-LPI still produces satisfying results, and outperforms RWR, LPBNI, KATZLGO, LPI-ETSLP and LPLNP.
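The perturbation procedure can be sketched in Python as follows, assuming the interaction matrix is stored as a list of 0/1 rows. The function name, the seed and the rounding of the 5% count are illustrative assumptions.

```python
import random


def perturb_interactions(Y, fraction=0.05, seed=0):
    """Remove a fraction of known interactions and add as many spurious ones."""
    rng = random.Random(seed)
    ones = [(i, j) for i, row in enumerate(Y) for j, v in enumerate(row) if v == 1]
    zeros = [(i, j) for i, row in enumerate(Y) for j, v in enumerate(row) if v == 0]
    n_swap = min(round(len(ones) * fraction), len(zeros))  # assumed rounding rule
    removed = rng.sample(ones, n_swap)
    added = rng.sample(zeros, n_swap)  # drawn from pairs that were 0 in the original Y
    perturbed = [row[:] for row in Y]  # copy so the original matrix is untouched
    for i, j in removed:
        perturbed[i][j] = 0
    for i, j in added:
        perturbed[i][j] = 1
    return perturbed
```

The total number of interactions is unchanged, so any drop in performance reflects the injected noise rather than a change in the positive-to-negative ratio.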

Independent experiments
Here, we conduct independent experiments to evaluate the practical ability of SFPEL-LPI. As described in Section "Dataset", the NPInter v2.0 dataset was compiled from the V2.0 edition of the NPInter database. The NPInter database has since been updated to the V3.0 edition, which contains newly discovered lncRNA-protein interactions. Therefore, we train the prediction model on the NPInter v2.0 dataset, predict new lncRNA-protein interactions, and then check the predictions against the updated NPInter database. Fig 6 shows the results for SFPEL-LPI and the benchmark methods. Clearly, SFPEL-LPI finds more interactions than the benchmark methods. In addition, we observe that most of the novel interactions identified by SFPEL-LPI have low ranks in the predictions of the benchmark methods, indicating that SFPEL-LPI can find interactions missed by these methods. Top predictions and their ranks are provided in S1 Table.

Web server
We develop a web server based on SFPEL-LPI to facilitate the lncRNA-protein interaction prediction, available at http://www.bioinfotech.cn/SFPEL-LPI/. Users can input lncRNA sequences (or protein sequences) or upload a text file with FASTA-formatted lncRNA sequences (or protein sequences) for prediction, and can freely download the results and visualize the predicted lncRNA-protein interactions. Moreover, gene ontology (GO) terms of proteins are annotated to indicate lncRNAs' functions. Fig 7 displays the top 10 predictions for the lncRNA "NONHSAT041930". "NONHSAT041930", named OIP5-AS1 (OIP5 antisense RNA 1), is a mammalian lncRNA that is abundant in the cytoplasm [73]. OIP5-AS1 has gained wide attention. In 2011, it was first identified to be involved in brain and eye development [74]. In 2016, Kim et al. [75] found that it can prevent HuR binding to target mRNAs and thus suppress the HuR-elicited proliferative phenotypes. Moreover, the lncRNA was found to interact with GAK mRNA, promoting GAK mRNA decay and hence reducing GAK protein levels and lowering cell proliferation [76]. Among the top 10 predicted proteins interacting with OIP5-AS1, two proteins are already known to interact with OIP5-AS1 and are included in the NPInter dataset. In addition, we find evidence from the literature to support six other predicted proteins. For example, IGF2BP1, IGF2BP2, IGF2BP3, EWSR1 and TIA1 have already been shown to interact with OIP5-AS1 according to a lncRNA-protein interaction data report [77]. Protein Argonaute 2 (AGO2) is required for proper nuclear migration, pole cell formation, and cellularization during the early stages of embryonic development. Several studies [75,78] showed that OIP5-AS1 is associated with AGO2. Moreover, annotated GO terms of predicted proteins indicate the function of the lncRNA OIP5-AS1: mRNA binding (GO: 0005845, GO: 0035925, GO: 0036002, GO: 0048027, GO: 0098808) and cell proliferation (GO:0022013). More details are provided in S2 Table. These encouraging instances demonstrate that the proposed method can successfully predict novel lncRNA-protein interactions.
Moreover, the server can predict interacting lncRNAs for proteins. For example, the top 20 interacting lncRNAs of the protein "9606.ENSP00000240185" are shown in Fig 8, and details are provided in S3 Table.

Discussion
This paper presents a novel lncRNA-protein interaction prediction method, namely sequence-based feature projection ensemble learning (SFPEL-LPI). The novelty of SFPEL-LPI comes from integrating sequence-derived features and similarities with a feature projection ensemble learning frame. Specifically, SFPEL-LPI only utilizes lncRNA sequences, protein sequences and known interactions to extract features, and calculates lncRNA-lncRNA similarities and protein-protein similarities. Since sequences are usually available for lncRNAs and proteins, SFPEL-LPI can make predictions for almost all lncRNA-protein pairs. Moreover, the diversity of the information used contributes to the good performances of SFPEL-LPI.
To evaluate the performance of SFPEL-LPI, an extensive set of experiments was performed on the benchmark dataset under three CV settings, $CV_{lp}$, $CV_l$ and $CV_p$, and compared with state-of-the-art lncRNA-protein interaction prediction methods. The promising results validate the efficacy of the proposed algorithm for predicting lncRNA-protein interactions, especially for new lncRNAs or new proteins, which do not have known interactions. SFPEL-LPI outperforms five methods, RWR, LPBNI, KATZLGO, LPI-ETSLP and LPLNP, with improvements of 100.4%, 43.3%, 65.4%, 46.9% and 3.1%, respectively, in terms of AUPR scores. Further, we analyze the running time of SFPEL-LPI and the benchmark methods, and randomly perturb all known lncRNA-protein interactions to test the robustness of the prediction methods. A web server is constructed to predict interacting proteins/lncRNAs for given lncRNAs/proteins. We adopt the lncRNA "NONHSAT041930" as an example to predict interacting proteins, and find evidence to confirm novel lncRNA-protein interactions.
However, SFPEL-LPI still has several limitations. It has three parameters, and parameter tuning is time-consuming. In addition, known lncRNA-protein interactions are limited, and performances of SFPEL-LPI will be improved if more interactions are known.