PRAS: Predicting functional targets of RNA binding proteins based on CLIP-seq peaks

RNA-protein interaction plays important roles in post-transcriptional regulation. Recent advancements in cross-linking and immunoprecipitation followed by sequencing (CLIP-seq) technologies make it possible to detect the binding peaks of a given RNA binding protein (RBP) at transcriptome scale. However, it is still challenging to predict the functional consequences of RBP binding peaks. In this study, we propose the Protein-RNA Association Strength (PRAS), which integrates the intensities and positions of the binding peaks of RBPs for functional mRNA targets prediction. We illustrate the superiority of PRAS over existing approaches on predicting the functional targets of two related but divergent CELF (CUGBP, ELAV-like factor) RBPs in mouse brain and muscle. We also demonstrate the potential of PRAS for wide adoption by applying it to the enhanced CLIP-seq (eCLIP) datasets of 37 RNA decay related RBPs in two human cell lines. PRAS can be utilized to investigate any RBPs with available CLIP-seq peaks. PRAS is freely available at http://ouyanglab.jax.org/pras/.

It is important to identify the functional targets of RBPs, which are essential regulators in post-transcriptional processes. PRAS aims to predict RBP targets based on the intensities and positions of the binding peaks obtained from CLIP-seq studies. We demonstrate that PRAS score outperforms other existing methods not only in the prediction of the PCRvalidated targets of the RBP CELF4, but also in the correlation with the global expression change induced by CELF proteins. The better performance of PRAS on a group of RBPs associated with RNA decay in comparison to the existing methods indicates its potential for large-scale applications in detecting functional targets. Leveraging the position information of the binding peaks, PRAS is a bridge linking peak-calling methods and the interpretation of RBPs' biological functions, which strengthens the analysis of CLIP-seq data. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Introduction RNA-binding proteins (RBPs) are essential in many post-transcriptional regulatory processes, such as alternative splicing, stability, localization and editing [1]. For example, RBP Quaking plays important roles in pre-mRNA splicing and mRNA export [2]; RBP HuR is an mRNA stability and splicing regulator [3]; RBP Ataxin-2 promotes mRNA stability and protein expression [4]. RBPs achieve their functions via binding to RNAs; therefore, it is of vital importance to study RNA-protein interaction. Cross-linking and immunoprecipitation followed by sequencing (CLIP-seq) approaches have been widely used to detect the binding peaks of RBPs at the transcriptome scale [5][6][7][8][9]. Thus, the examination of CLIP-seq peaks informs us of the functional targets of RBPs.
In this paper, we develop a new approach named Protein-RNA Association Strength (PRAS), which incorporates the intensity and positional information of CLIP-seq peaks to quantitate the association between an RBP and its targets. We apply PRAS to study two CUGBP ELAV-like family proteins, CELF4 and CELF1 with both CLIP and perturbation RNA-seq data available. CELF4 (also known as Brunol4) is expressed as an mRNA regulator in the central nervous system across species [24,25]. The deficiency of CELF4 is associated with a complex neurobehavioral disorder including seizures and autism-like features in human [26,27] and in mice [28]. iCLIP studies revealed that CELF4 preferentially binds, almost exclusively in 3' untranslated regions (UTRs), to mRNAs encoding many important neurological functions, [29]. CELF1 is implicated in myotonic dystrophy [30]. CELF1 is highly expressed in early embryonic stages and are then down-regulated dramatically in skeletal muscle and the heart during development [31,32]. CELF1 has been reported to promote transcript deadenylation and the abnormal up-regulation of its protein level could contribute to the myotonic dystrophy pathology [33,34]. A more refined understanding of the functional targets of CELF RBPs is essential for understanding the impact of CELF in development and diseases, and may provide clues as to the mechanisms by which CELF impacts mRNA function. In addition, to demonstrate the robustness of PRAS, we examined its performance of detecting the functional targets in a large-scale collection of eCLIP data of RBPs in the integrated encyclopedia of DNA elements in the human genome (ENCODE). By applying PRAS to the eCLIP peaks of the RNA decay regulators, we demonstrate that PRAS outperforms other existing methods and also provide deeper understanding in the post-transcriptional regulation of these RBPs.

The framework of PRAS
The basis of PRAS is to score a potential functional target of an RBP based on both the intensities and positions of its binding sites. Our pipeline of calculating PRAS is shown in Fig 1. First,  Fig 1. Flowchart of the PRAS pipeline. There are mainly three steps in calculating the PRAS scores. First, we merge the significant cross-linking sites as the binding peaks. Then, we use user-provided or automatically selected reference position and score each transcript based on both the intensities given a CLIP-seq dataset, the significant cross-linking sites that are within a small interval of each other (default: 20 nt) are merged as RBP binding peaks. If the called binding peaks are provided, we will use them directly. Second, if a reference position is provided by the user based on known knowledge of the function of the RBP, PRAS will use it directly; if no reference position is given, PRAS will set it based on the RBP's binding preference, e.g., the distal end of the 3' UTR of the transcript (aka polyadenylation site). Finally, each transcript is scored as the sum of the intensities of the binding peaks weighted by the distances between the mid points of the binding peaks and the preselected reference position. All mRNAs are then ranked by the PRAS scores and can be tested for associations with functions.

PRAS score calculation
As described in Fig 1, the PRAS score is based on the weighted sum of the intensities of the binding given detected CLIP-seq peaks. In the study that analyzed the interaction between DNA and proteins with ChIP-seq datasets, the exponential decay function was used to characterize the decreasing effects of a transcription factor binding peak on its targets with increasing distances [35]. Therefore, we here construct the score to describe the regulatory effect of an RBP on its targets in a similar way. Specifically, we define the PRAS score for an mRNA as: where r i is the intensity (CLIP-seq read counts) of the ith peak cluster of the RBP, d i is the distance (number of nucleotides) between the reference position and the ith peak cluster, and d 0 is a constant. For both CELF4 and CELF1 in mouse, we set the reference position as the distal 3' UTR and the constant d 0 = 1000 nt. Note that d 0 = 1000 nt is the default setting, but not a hard-set option in PRAS. For the RNA decay regulators in human, we set the constant d 0 = 500 nt. The details of d 0 estimation for RBPs in mouse and human are described in the Results and Discussions sections.

PRAS implementation
PRAS is implemented in Python (version 2.7.14 or above) and R (version 3.3.2 or above) scripts and has minimum requirements for the inputs. To reformat the annotation file, PRAS takes use of gtfToGenePred, a toolkit from the UCSC Genome Browser [36]. PRAS also uses BEDTools [37] to efficiently obtain the overlapping between the binding sites and the annotation regions. The annotation file should be the Gene Transfer Format (GTF) format and the peak file (no special requirement for the peak caller) should be the Browser Extensible Data (BED) format as the required input files, which are both the standard file formats. Details of usage can be found on the instruction page of our website: http://ouyanglab.jax.org/pras/.

PRAS score is a strong predictor of PCR-validated mRNA targets of CELF4
CELF4 is expressed in excitatory neurons of the adult mouse brain, from which iCLIP data are available [25,28,29]. We collected the significant cross-linking sites detected by iCount (http://icount.fri.uni-lj.si) with false discovery rate (FDR) less than or equal to 0.05. We conducted a metagene analysis involving all 9,193 mRNAs that are bound by CELF4 and noted an enrichment of iCLIP reads at the distal (3' end) versus proximal (5' end) 3' UTR (S1 Fig). This preference suggests a potentially functional role of CELF4 binding close to the polyadenylation site. We calculated the PRAS scores for CELF4 binding mRNAs with the polyadenylation site as the reference position, which gives the binding sites closer to the polyadenylation site higher weights. We estimated the decay parameter d 0 in Eq (1) based on the strength of the peak intensity decay shown in S1 Fig. In detail, we defined the weighting formula as w ¼ e À d=d 0 according to Eq (1). The highest average peak density, 0.843, appears at 63 nt to the 3' end of 3' UTR and the average peak density at 1000 nt upstream to the 3' end of the 3' UTR is 0.285 (S1 Fig). We calculated w as the ratio between the average peak intensity at the 1000 nt upstream to the 3' UTR and that of the 3' end of the 3' UTR, which is 0.285/0.843 = 0.339. By plugging d = 937 nt (which is 1000 nt-63 nt) and w = 0.339 into the weighting formula, we obtained the estimation of 866 nt for d 0 , which is approximately the default of 1000 nt. For comparison, we applied the expressRNA procedure of Rot et al. [22], which sums the number of reads in CLIP peaks within 200 nt upstream and downstream flanking the polyadenylation sites (Fig 2). We also applied the procedure in Wang et al. [34], which calculated the score as the number of significant CLIP peaks per kilobase (noted as PPK; Fig 2). Each of the three measurements ranks CELF4 binding mRNAs from high to low scores.
We then evaluated the performance of PRAS, expressRNA, and PPK on a list of known functional targets previously validated by qPCR in wild-type and Celf4 null mouse brain, totaling 23 mRNAs [29] (see details in S1 Text). To investigate the ability of the three measurements to identify CELF4 functional targets, we performed receiver operating characteristic (ROC) analysis. We extracted the log fold change (LFC) of the qPCR values in Celf4 null mouse brain over wild-type. The mRNAs with positive and negative LFCs were labelled as CELF4-degraded and CELF4-stabilized genes, respectively. The area under the curve (AUC) of the ROC curve was used to measure the prediction performance of the methods. We found that PRAS perfectly distinguished the PCR-validated CELF4-degraded and CELF4-stabilized genes (AUC = 1), outperforming expressRNA (AUC = 0.867) and PPK (AUC = 0.7) (Fig 3A). This result suggests that given CLIP peaks, PRAS has greater ability to capture the functional  The scatter plot of the PRAS score against the qPCR log fold change (LFC). The X-axis represents the log2 fold change in qPCR from the wild-type to Celf4 null mouse brain. The Y-axis shows the log10 of PRAS score. Each black dot represents a validated target by the qPCR. The regression line is highlighted in blue color. The Pearson's correlation coefficient is indicated by the red text on the plot. (C) The heatmap of binding signals of the qPCR-validated targets. The X-axis represents the distance to the 3' end of the 3' UTR, and the Y-axis shows the genes in the validated list. The color shows the log2 of read counts of CELF4 iCLIP-seq within its significant peaks, where the warmer the color is the stronger the binding is. The black bars in each row shows the distance from the 5' end of the 3' UTR to the 3' end of the 3' UTR, which indicates the length of each 3' UTR. https://doi.org/10.1371/journal.pcbi.1007227.g003 Predicting RBP functional targets from CLIP-seq peaks targets of CELF4 compared to expressRNA and PPK. In addition, we examined the quantitative relationship between the PRAS scores and the qPCR LFCs of these known targets. A negative Pearson's correlation coefficient (-0.60) was obtained, suggesting that the more negative qPCR LFC a target has, the larger the PRAS score is (Fig 3B). The advantage of PRAS over expressRNA and PPK can be attributed to two factors. First, PRAS utilizes the binding bias of CELF4 towards the distal 3' UTRs of its validated targets (Fig 3C). expressRNA partially utilizes this bias by considering the 200 nt flanking region around the polyadenylation site, whereas PPK does not consider the binding bias. Second, unlike expressRNA which only considers a fixed flanking region, PRAS considers all binding peaks, which decreases loss of important RBP binding sites. The analysis of the validated targets of CELF4 suggests the importance of binding near the polyadenylation sites as a potential factor on how it regulates gene expression. By applying different decay parameter d 0 to PRAS, we found that PRAS obtained equally good performance over a reasonable range of d 0 s (S2A and S2B Fig).

PRAS score correlates with global mRNA change induced by CELF RBPs
To assess the ability of PRAS to detect RBP functional targets in the entire transcriptome, we extracted the top 500 genes ranked by permutation test p-values in the differential expression test between the wild-type and Celf4 null mouse brain based on existing microarray datasets [29]. We calculated the LFC for gene expression in Celf4 null over wild-type mouse brain. The mRNAs have lower abundance (LFC < 0) in Celf4 null genotype are more likely to be CELF4-stabilized targets, while the mRNAs with higher abundance (LFC > 0) in Celf4 null brain were more likely to be CELF4-degraded targets. We sought to assess the ability of PRAS on capturing CELF4-stabilized vs. CELF4-degraded targets. Specifically, we first set a sequence of cutoffs as the quantiles (from 0.05 to 0.95 with step size as 0.05) of the distribution of the absolute value of the expression LFCs. Second, for each cutoff, we extracted a subset of genes whose absolute expression LFC is larger or equal to the cutoff. Finally, for each subset of potential CELF4 targets, we calculated the Spearman's correlation coefficient between the expression LFCs and the PRAS scores, in which the magnitude and sign of the correlation reflect the association between the two. For comparison, we also applied the same correlation analysis to expressRNA and PPK ranking scores. Line-charts of the Spearman's correlation coefficient of the three methods are shown in Fig 4A. We observed that the more stringent the expression LFC cutoff for the gene subset was set, the stronger the negative correlation between the PRAS score and the expression LFC was obtained, which suggests that PRAS is more powerful in capturing more reliable CELF4-stabilized targets. In addition, the expressRNA score is less correlated with the expression LFC, and the direction of the correlation between the PPK score and the expression LFC flips at different cutoffs. The results suggest that PRAS has greater ability to select the regulated mRNA targets compared to expressRNA and PPK.
We also extracted the top 500 genes ranked by their adjusted p-values in the differential expression test between the wild-type and Celf1 over-expression in mouse muscle based on published RNA-seq datasets [34]. In this dataset, mRNAs that have higher abundance upon Celf1 over-expression (LFC > 0) are more likely to be CELF1-stabilized targets while those that have lower abundance upon Celf1 over-expression (LFC < 0) were more likely to be CELF1-degraded targets. We evaluated the performance of the three aforementioned methods using the same analysis as with CELF4. We used the 3' end of the 3' UTR as the reference site The lower X-axis represents the different cutoffs applied to extract the subset of genes, the upper X-axis represents the number of genes in PRAS to rank the mRNA targets based on the reported binding preference of CELF1 [34]. PRAS has a stronger negative correlation with the expression LFC compared to expressRNA and PPK for each subset of the potential CELF1 targets (Fig 4B). These results suggest that PRAS is more powerful in capturing the reliable CELF1-degraded targets, consistent with the main regulatory function of CELF1 [34].
Next, we used different reference sites in PRAS for scoring functional targets of CELF4 and CELF1 in order to examine the effect of the reference site selection. We scored the targets of CELF4 using the 5' end of the 3'UTR as the reference site in PRAS (PRAS 5') and did a similar correlation analysis as above. We observed that the PRAS 5' score is also negatively correlated with the expression LFC and the magnitude of correlation improves with increasingly stringent cutoffs (S3A Fig). However, the magnitude of the correlation is not as high as that of PRAS with the 3' end of 3' UTR as the reference site (PRAS 3') ( Fig 4A). We also similarly analyzed the targets of CELF1 using PRAS 5'. Again, the PRAS 3' has stronger negative correlation with the expression LFC than PRAS 5' for the more reliable CELF1 targets (S3B Fig). The results indicate that known biological knowledge can aid in reference site selection in PRAS for identifying the functional targets of the CELF proteins. The results also suggest that both the CELF4 and CELF1 proteins may regulate mRNAs via the distal 3' UTRs while having opposite effects on their targets. Indeed, this is plausible because CELF proteins play various roles in both co-transcriptional and post-transcriptional RNA regulation, as well as translation inhibition in different cellular contexts [38][39][40].
To examine the difference of taking the raw or the normalized read density of the CLIP peaks as the input of PRAS, we then used the Celf4 null iCLIP-seq as the negative control for the wild-type CELF4 iCLIP to score the functional targets of CELF4 with the 3' end of the 3'UTR as the reference site. Specifically, we replaced the iCLIP-seq read counts r i in Eq (1) by the enrichment ratio r i � log 2 r i c i � � as suggested by Van Nostrand et al. [41], where c i is the Celf4 null iCLIP-seq read counts of the ith peak cluster. We noted the PRAS score using the raw read intensity and the enrichment ratio of peaks as PRAS-raw and PRAS-norm, respectively. By applying the correlation analysis as above, we found that PRAS-norm has achieved stronger negative correlation with the expression LFC than PRAS-raw (S4 Fig). This improvement of performance indicates the important role of the negative control in reducing the noise, which is consistent with the results in [42]. Even though PRAS-raw cannot achieve as good performance as PRAS-norm, the difference in the performance between them is small (S4 Fig), which indicates that PRAS can handle the situation where the negative control of CLIP-seq is not available, such as the CELF1 data in our study.

PRAS identified targets are strongly enriched in functional categories
To further compare the functional relevance of the targets identified by PRAS, expressRNA and PPK, we performed gene ontology (GO) analysis on the top 500 mRNA targets of CELF4 ranked by each score (Fig 5A-5C), which is similar to the analysis shown in Wagnon et al. [29]. There is much greater enrichment (5 to 40 orders based on p-values) of the categories related to suspected CELF4 function in the targets identified by PRAS than those identified by expressRNA and PPK. For example, in the class of "Biological Process", most of the top 10 corresponding to the applied cutoffs, and the Y-axis shows the value of Spearman's correlation coefficient. The corresponding curves for PRAS, expressRNA, and PPK are indicated by red, blue, and green lines, respectively. Each dot in the plot is for one subset of genes selected based on the absolute LFC cutoff. (B) Similar line-chart to A, but for the Celf1-regulated list. These two line-charts show that the higher ranked targets by PRAS have higher enrichment in the regulated lists comparing to the top ranked lists of expressRNA and PPK. https://doi.org/10.1371/journal.pcbi.1007227.g004 Predicting RBP functional targets from CLIP-seq peaks significant categories for PRAS top-ranked targets are related to neuron or synaptic functions and ion transport, consistent with prior studies on CELF4 [29]. These results suggest that PRAS captures CELF4 functional targets more precisely than the other methods being compared.

Functional targets are identified in a large-scale PRAS application to human RBPs
To demonstrate that PRAS has the potential for wide adoption, we further applied PRAS to the eCLIP data [42] in two human cell lines, K562 and HepG2, from the ENCODE consortium [43]. Specifically, we selected the RBPs that are related to the RNA decay function [41] because this function can be clearly quantified at gene level in the differential expression (DE) analysis between the RBP knockdown and the wild-type RNA-seq samples. We collected the DE analysis results by DESeq [44] from ENCODE and obtained 37 distinct RBPs, which include 28 and 32 RBPs in HepG2 and K562 cell line, respectively. We then applied PRAS to the eCLIP data using the enrichment ratio over the control sample described above as the peak intensities. In the parameter settings in PRAS, we selected the reference site for each RBP from 4 candidates: transcription start site, translation initiation site, translation termination site, and transcription end site, based on eCLIP peak intensity distribution along the transcript. S5 Fig presents four example RBPs assigned with 4 different reference sites. To simplify the analysis, we applied d 0 = 500 nt to all the selected RBPs according to the distribution (S6 Fig) of the estimated decay parameters as described previously. This general selection of d 0 may not achieve the best performance of PRAS but is likely to be comparable with the best d 0 selection as discussed in the CELF4 data. After obtaining the PRAS scores, we did the correlation analysis of the DE (adjusted p-value < = 0.05) genes for each RBP. We found PRAS scores achieved significantly stronger correlation with the LFC in gene expression in comparison to expressRNA Predicting RBP functional targets from CLIP-seq peaks and PPK, with p-value equal to 3.8e-9 and 4.4e-4, respectively (Fig 6A). We then separated the RBPs by their reference site usage and found that the translation termination site and the transcription end site, both of which are related to the 3' UTR, constitute the majority of the RNA decay regulators' reference sites (Fig 6B). It suggests the essential association between the 3' UTR of transcripts and the regulation of their fates by RBPs. In addition, we found that the correlation can reflect important biological functions of RBPs. For example, the 5' poly(A) site (transcription end site) is used as the reference site for DDX6 in the HepG2 cell line (S5C Fig)  and the PRAS score is negatively correlated with the LFC of DDX6's target gene expression (Fig 6B), which indicates that DDX6 may stabilize its targets via binding near to the poly(A) site. Interestingly, DDX6 is known to be an important regulator in mRNA decapping and degradation [45,46], which supports our claim that PRAS has the ability to identify the biologically functional targets of the RBP regulators. All these results demonstrate that PRAS has the potential for wide adoption in RBP functional targets identification.

Discussions on biological insights from the use of PRAS
In this study, we developed PRAS, a position dependent scoring method for identifying and prioritizing RBP functional targets. Weighting the proximity of RBP binding sites to a given reference position exponentially and combining the strengths of the binding signals, we obtained the PRAS scores and the ranking of all the mRNAs that have reliable binding sites of the RBP. We applied this approach to the iCLIP dataset of a neuronal disease-related RBP, CELF4 and to the CLIP dataset of a DM disease-related RBP, CELF1 -both belonging to the CELF family of RBP. We report a much stronger association between CELF4 and its targets at the distal 3' UTRs compared to internal 3' UTR positions. We also demonstrate that PRAS performs much better in predicting the mRNA targets stabilized by CELF4, compared to the other existing methods such as expressRNA and PPK. We further observe that PRAS performs much better at predicting the mRNA targets degraded by CELF1. These results not only suggest the importance of incorporating the positional information of the binding sites into target identification, but also suggest the important roles of the distal 3' UTRs in CELF protein regulated mRNAs. The binding preferences of RBPs have been noticed in previous studies [3,29]. However, the link between positional biases of RBP binding sites and their functional consequences has not been well established. PRAS reveals that the distal end of 3' UTR binding is predictive of CELF4-stabilized targets. The distal end bias of CELF4-stabilized targets suggests possible molecular mechanism(s) by which CELF4 regulates its mRNAs. It has been reported that poly (A) tails enhance the stability of mRNAs [47]. The proximity between poly(A) tails and the distal 3' UTRs suggests possible connections with poly(A) tail functions, such as mRNA stability, polyadenylation itself or promotion of translational reinitiation-possibilities to be explored in future experimental studies. CELF1 is known to recruit cytoplasmic deadenylases [48] and the extent of mRNA degradation is positively correlated to CELF1's binding magnitude to the 3' UTRs [34]. Based on the finding in the previous study [34] that CELF1 binding is enriched in the 3' end of the 3' UTR, we further found that this binding bias shows strong predictive ability to CELF1-degraded targets (Fig 3B). We also demonstrated the potential of PRAS in the largescale applications by showing the better performance of PRAS than other methods in identifying the targets of RNA decay related RBPs from ENCODE [43]. These results again strengthen the relationship between the regulatory functions of the RBPs and their binding positions.

Availability and future directions
PRAS is implemented in Python and R and is freely available at http://ouyanglab.jax.org/pras/. PRAS can be applied widely to identify the functional targets of any RBPs with CLIP-seq peaks. For RBPs with a known post-transcriptional function, the functional targets may be identified with a corresponding reference position that is related to that function (e.g. splicing sites for alternative splicing). PRAS can also be combined with other types of information, such as sequence motifs, conservation, and perturbation data to predict RBP functional targets using integrative approaches such as [49]. In addition, future versions of PRAS can be extended to study the co-regulations of multiple RBPs by being applied to a set of interested RBPs simultaneously and evaluating the importance of different reference sites on the targets.

S1 Text. Datasets collection.
(DOCX) S1 Fig. CELF4 binding characteristics in 3' UTRs. Shown are distributions of the distances between the iCLIP reads and the proximal/distal end of 3' UTRs in mRNAs. X-axis represents the distance (number of nucleotide) to the proximal/distal end of 3' UTRs. Y-axis represents the average iCLIP read counts within the significant peaks at that position across all the genes. The curve for the distal end is highlighted by red color and that for the proximal end is highlighted by blue. The density curves are highlighted by red and blue for RBPs in K562 and HepG2, respectively. The estimation is done based on the eCLIP peak intensities around the selected reference sites as described in the subsection "PRAS score is a strong predictor of PCR-validated mRNA targets of CELF4". (TIF)