Genome-wide identification of clusters of predicted microRNA binding sites as microRNA sponge candidates

The number of discovered natural miRNA sponges in plants, viruses, and mammals is increasing steadily. Some sponges like ciRS-7 for miR-7 contain multiple nearby miRNA binding sites. We hypothesize that such clusters of miRNA binding sites on the genome can function together as a sponge. No systematic effort has been made in search for clusters of miRNA targets. Here, we, to our knowledge, make the first genome-wide target site predictions for clusters of mature human miRNAs. For each miRNA, we predict the target sites on a genome-wide scale, build a graph with edge weights based on the pairwise distances between sites, and apply Markov clustering to identify genomic regions with high binding site density. Significant clusters are then extracted based on cluster size difference between real and shuffled genomes preserving local properties such as the GC content. We then use conservation and binding energy to filter a final set of miRNA target site clusters or sponge candidates. Our pipeline predicts 3673 sponge candidates for 1250 miRNAs, including the experimentally verified miR-7 sponge ciRS-7. In addition, we point explicitly to 19 high-confidence candidates overlapping annotated genomic sequence. The full list of candidates is freely available at http://rth.dk/resources/mirnasponge, where detailed properties for individual candidates can be explored, such as alignment details, conservation, accessibility and target profiles, which facilitates selection of sponge candidates for further context specific analysis.


Introduction
MicroRNAs (miRNAs) are a class of endogenous small non-coding RNAs of about 20 nucleotides in length, which play crucial roles in transcriptional and post-transcriptional control of gene expression through interacting with other RNAs [1][2][3][4]. To date, more than 2000 human mature miRNAs (miRBase v20) [5] have been discovered. These mature miRNAs are formed from pre-miRNAs, which are processed by DICER in the cytoplasm [6]. They are estimated to regulate more than 60% of all human protein-coding genes (PCGs) [7] and have been a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 implicated in many human diseases [8,9]. With more and more miRNAs being discovered, identifying their functions is becoming increasingly important for understanding the molecular mechanisms of diseases [10].
The miRNAs can themselves be regulated by so-called miRNA sponges, which are RNAs with many miRNA binding sites that compete with the target sites for binding of one or more miRNAs of interest. Denzler et al. analyzed the relationship between sponge activity and number of binding sites and found that more binding sites enhance miRNA sponge effect on releasing mRNA target repression regulated by that miRNA [11]. Artificial miRNA sponges have been used to generate loss-of-function phenotypes for miRNAs in cell culture [12] and to discover miRNA functions in vivo [13]. It has advantages over genetic knock-outs and antisense oligonucleotide inhibitors by being cheaper and less time consuming [14]. They are also of therapeutic interest [15,16].
Natural miRNA sponges with many miRNA binding sites separated by linker regions also exist [17] (S1 Table). They have also been called competing endogenous RNA (ceRNA), and the ceRNA hypothesis suggests that RNAs regulate each other by competing for shared miRNAs [18]. Recently, a circular RNA (circRNA) with more than 70 binding sites was shown to function as a sponge for miR-7 [19,20]. This natural sponge, named ciRS-7 and CDR1as, has been implicated in cancer-related pathways [21]. A circRNA derived from the gene encoding zinc finger protein 91 (circRNA-ZNF91) with 24 miR-23 binding sites has similarly been identified as a possible miRNA sponge [22]. However, some studies consistently mention that only few circRNAs can function as miRNA sponges [22,23]. Other types of transcripts can also serve as natural miRNA sponges, such as the pseudogene PTENP1 [24] and the long non-coding RNAs (lncRNAs) H19 [25] and lincRNA-RoR [26]. It has been estimated that there are thousands of RNA transcripts functioning as potential miRNA natural sponges [27], but despite increasing evidence for the ceRNA hypothesis, it still attracts some skepticism [28]. Although there exist several compilations of putative ceR-NAs derived from predicted miRNA target sites, CLIP-Seq data, or both [29][30][31], reviewed in [32], none of the studies to date have systematically analyzed the genome for clusters of miRNA binding sites.
In an attempt to shed more light on this, we here analyze the complete human genome for clusters of predicted miRNA target sites, which may represent natural miRNA sponges. To this end, we identify statistically significant clusters by comparing the numbers of binding sites in the clusters obtained from the real genome and from shuffled genomes, retaining the local sequence composition. We further filter the resulting clusters based on evolutionary conservation and binding energies. With this approach we rediscover one known miRNA sponge ciRS-7 for miR-7 and identify 3672 novel sponge candidates.

Pipeline for clustering of miRNA target sites
To identify genomic clusters of predicted miRNA target sites for a given miRNA, we have developed a pipeline as outlined in Fig 1. RIsearch2 screen for miRNA target sites. The computational screen on hg19 for target sites of the miRBase miRNAs was performed with a preliminary version of RIsearch2 [37]. It is a seed-and-extend approach to predict RNA-RNA interactions, applying suffix arrays in the first stage to locate initial seed matches (allowing for G-U wobble matches) and using dynamic programming (DP) to extend those matches with the simplified energy model as introduced in RIsearch [38]. The seed was specified to require a stretch of six consecutive bases within the first eight bases of the miRNA sequence to be paired. The window for DP extension was set to always include the entire remaining query sequence outside the seed, and the same number of nucleotides extended by five from the target. This parameter from the preliminary version of RIsearch2 used in here has been replaced with a maximum extension length in the released version. The default value of 20 nt should yield comparable results. The maximum hybridization energy was set to −10 kcal/mol. Overlapping target sites were merged in post-processing.
Markov clustering of predicted miRNA target sites. To identify genomic regions with a high density of predicted binding sites for a given miRNA, we used the Markov Cluster (MCL) algorithm [39], which does not need specify the number of clusters in advance. In our pipeline, we run clustering for individual miRNAs, and each miRNA has different number of clusters.
To this end, we represented the predicted binding sites for each miRNA as a weighted network, in which the weight of the edge between two sites on the same strand of the same chromosome is defined based on their nucleotide distance (x) as follows: Fig 1. Flowchart of the analysis pipeline. For each mature miRNA in miRBase v20, we ran RIsearch2 against both the real repeat-masked genome and a shuffled version to predict binding sites. We then used the Markov Cluster (MCL) algorithm to identify genomic clusters of binding sites and identified statistically significant clusters by comparing the results for the real and shuffled genomes. Finally, the significant clusters were further filtered by conservation and binding energy.
where the constant C = 1000 determines over which distance the weights decay. The value of C was chosen to allow identification of large clusters while limiting the computational cost. Clusters within this weighted network were then identified using the MCL algorithm with a range of different values for the inflation factor parameter, which influences the size and number of clusters.
Shuffling of genome sequence. We evaluate the statistical significance of the identified clusters by creating a background model from randomized genome sequence and repeating the RIsearch2 and MCL steps described above. To preserve the local dinucleotide content, the non-masked sequence segments of the human genome are shuffled by uShuffle [40] in nonoverlapping windows of 120 nt, the typical size of structured RNA [41,42].
Statistical significance of clusters. For each miRNA, we estimate a cutoff on the number of predicted target sites in a cluster that is required for statistical significance. This is done by fitting the size distribution of the top-10% largest clusters obtained for the miRNA in question on the randomized genome, assuming an exponential tail.
where x is the cluster size and y is the number of clusters of a given size. Based on this fit, we extrapolate the largest cluster one would expect to observe in 1000 randomizations.
Only clusters larger than or equal to this cutoff are considered statistically significant. We used this approximation because it would take prohibitively long time to run MCL clustering on the RIsearch2 output for thousands of randomized genomes for every miRNA.
Filtering by conservation and binding energy. To further improve the quality of the predictions, we apply two additional filtering criteria to the statistically significant clusters. First, we extract the evolutionarily conserved subset of target site predictions on the real genome, by requiring that the miRNA seed site has at least five continuous nucleotides with a phyloP score greater than 0.3 [20]. For individual miRNAs, clusters with the percent of conserved sites smaller than 2 times as one would expect to observe in whole genome (the number of conserved binding sites vs the number of all binding sites for this miRNA) are excluded. Second, we filter out statistically significant clusters that are caused by repetitive sequences not masked in the downloaded genome, because RepeatMasker and Tandem Repeats Finder by default only mask simple repeats with a unit length up to 12 and curated repeats from Repbase [43,44]. Instead of rerunning RepeatMasker with different parameters, we use the binding energies already calculated by RIsearch2 to eliminate clusters for which many sites have the exact same predicted binding energy. To this end, we calculate the normalized Shannon entropy of the binding site energies in each cluster as follows: where m is the number of different binding energy values found in the given cluster, p i is the relative frequency of a particular energy in there, n is the number of binding sites in the cluster. We used an entropy threshold of 0.6 to filter out clusters that have many sites caused by repetitive sequences.

Characterization of sponge candidates
To characterize the sponge candidates within the web interface, we annotate them with overlapping genes based on their coordinates and strand using genes from GENCODE v19 and circBase. Thus we create a reference annotation in which GENCODE annotation is primary and all annotation from circBase not overlapping that of GENCODE is included as additional annotation. We further calculate a number of properties for each sponge candidate, which are described in the following sections. SNP density ratio. CircRNAs have significant lower SNP density at miRNA seed sites than in their flanking regions and other sites, suggesting selective pressure to maintain those binding sites [45]. We calculated the SNP density ratio (SDR) between miRNA seed sites and the remaining sequence. For every binding site within a sponge, the seven nucleotides basepairing with the miRNA seed region (positions 2-8) are defined as miRNA seed site. The gap region between every two seed sites is defined as flanking region, hence also including regions base-pairing with the miRNA outside the seed. The SDR is calculated as ratio between the SNP density in miRNA seed sites and flanking region, shown in Eq (5). SNP data was downloaded from Ensembl 75 [46].
Fraction of binding sites within exons. It is furthermore important whether the predicted binding sites are intronic or exonic [22]. We thus calculate the fraction of binding sites in the sponge candidate that fall within exons (FBSE) based on GENCODE v19 as follows: Energy and fraction of paired nucleotides for binding sites. For each sponge candidate, we plot the predicted binding site energies and compare them to the distribution of binding site energies for known targets of the same miRNA. We calculated the latter by running RIsearch2 on the 3' UTRs from Ensembl 75 of the known and experimentally identified miRNA targets in the RAIN database [47]. We similarly plot the fraction of paired nucleotides for the binding sites, which is defined as the number of base-pairings between the binding site and the miRNA divided by the length of the mature miRNA.
Visualization of sponge candidates in genome browser. To allow for visualization of sponge candidates in the UCSC browser [33], we display tracks with the position, binding energy, conservation, and accessibility of each binding site. The accessibility track contains the probability of each nucleotide being unpaired within the internal structure of the transcript, estimated using RNAplfold [48] with a maximum base pair span of 120 nt and window size 170 nt.

Clusters of predicted miRNA binding sites
The first step in searching for clusters of predicted miRNA binding sites is to predict the binding sites themselves. To do that several tools can potentially be employed; however, doing a largescale screen on the complete human genome requires speed. We primarily focus on RIsearch2 as it was benchmarked against miRNA tools and found to be substantially faster than other RNA-RNA interaction prediction tools [37]. We justify this choice by comparing the RIsearch2-based screen to screens based on a GUUGle search (using a minimum match size of six nucleotides) [49] and a relaxed BLAST search (E-value < 10 000) [50]. For all three tools we subsequently employ the MCL clustering algorithm (with the default inflation factor of 2.0) to identify clusters in the genome with a high density of predicted binding sites for a given miRNA.
We used RIsearch2 to search the 2578 mature miRNAs from miRBase (v20) against the human genome (hg19) and its shuffled counterpart. For each miRNA, we subsequently used MCL to identify clusters of predicted miRNA binding sites and compared the number of clusters obtained on the real and shuffled genomes as function of cluster size (S2 Table). Whereas we were able to perform this analysis for all miRNAs using RIsearch2, this was not feasible for the GUUGle screen due to the large number of predicted miRNA binding sites.
We thus instead, as an example, compare the results from the three methods for miR-7 (Fig 2). For the RIsearch2 screen (Fig 2A) we obtain far more large clusters on the real genome than on the shuffled one. For the BLAST search (Fig 2B) we see a similar but much weaker trend with fewer clusters both on the real and shuffled genome. A possible explanation is that BLAST does not allow G-U base pairing and thus predicts much fewer binding sites. In contrast, GUUGle predicts many more binding sites resulting in more and larger clusters ( Fig 2C). However, we observe only very small differences between the real and the shuffled genomes. It should be noted, that GUUGle is intended to be used as a prefilter for more sophisticated but computationally expensive methods. However, as RIsearch2 evaluates the thermodynamic strength of the predicted binding sites and is nonetheless faster [37], we opted to use RIsearch2.
To further illustrate the predictive power of RIsearch2, we compare its predicted miR-7 binding sites to those of TargetScan [51] for the natural sponge ciRS-7 [19]. Both RIsearch2 and TargetScan predict 73 binding sites, of which 72 are in common. The one binding site found by TargetScan, but not by RIsearch2, has a predicted binding energy of −9.87 kcal/mol, which is only slightly above the energy cutoff (−10 kcal/mol). By contrast, the BLAST screen predicts only seven binding sites. While giving comparable results to TargetScan, RIsearch2 runs approximately four times faster.
Whereas RIsearch2 is our preferred miRNA binding sites predictor, we can further improve the results by changing the main parameter in MCL, the inflation factor, which affects the granularity of the clusters. To this end, we ran our pipeline with inflation factors 1.5, 2.0, 2.5, 3.0, 3.5 and 4.0 for each miRNA on both the real and shuffled genomes. These runs took roughly three months to compute on a cluster with 16 nodes, each equipped with two Intel Xeon E5-2650 processors, having a total of 256 cores. For each inflation factor, we pooled the resulting clusters for all miRNAs and plotted the cluster size distributions (S1 Fig). As anticipated, higher inflation factors led to smaller clusters. Based on visual inspection of the differences between real and shuffled genomes, we decided to use an inflation factor of 3.5 (Fig 3).
The more predicted miRNA binding sites a cluster contains, the less likely it is to occur in a shuffled genome. We can thus estimate the statistical significance of a cluster based on its size. Since different miRNAs give rise to different cluster size distributions on shuffled genomes (S2 Fig), significance analysis is done separately for each miRNA. In principle, one could estimate the false discovery rate corresponding to a given cutoff on cluster size based on the empirical distribution obtained from thousands of shuffled genomes. However, as this is not computationally feasible, we instead fit the tail of the distribution obtained from a single shuffled genome to find a cluster size cutoff for each miRNA. For example, although the largest miR-7 cluster found in the shuffled genome contains 15 predicted binding sites, we extrapolate that a miR-7 cluster must contain at least 20 binding sites to have less than 0.1% probability of appearing by chance alone. We refer to this as the empirical p-value. As a control we also checked if we obtained the same number of predicted miRNA binding sites in the real and shuffled genome. These numbers (at the −10 kcal/mol cutoff) are 3 430 423 948 and 3 386 114 664, with a ratio of 1.013 constituting no bias in the volume of binding sites in the two genomes.

Sponge candidates
Using the pipeline shown in Fig 1 to the second last step, we obtained a total of 71 106 statistically significant (P < 0.001) clusters of binding sites for 2543 mature miRNAs (for 35 mature miRNAs we obtained no significant clusters). To identify the ones most likely to be of biological relevance, we employ the last step by filtering for conservation and binding energy (see Methods for details). This reduced the clusters to 3673 sponge candidates for 1250 miRNAs, which can all be viewed and downloaded via our web resource (http://rth.dk/resources/ mirnasponge).
To annotate the sponge candidates with presumed genes of origin, we compared the genome coordinates of each sponge to annotated circRNAs from circBase and other genes from GENCODE. We annotated a sponge candidate to a gene if the larger of these two genomic regions covered at least 50% of the smaller one. Given this criterion, the majority of our candidates (2162 out of 3673) fall in unannotated genomic regions, which is not surprising considering that 85.9% of the genome is not covered by either GENCODE or circBase (Fig 4,   Fig 4. Genomic context of the sponge candidates. The upper bar chart shows the percentage for the different types of transcripts in the genome based on GENCODE and circBase, and their percentage within our sponge predictions are calculated after we assign annotations to the predicted sponge candidates. For each type of transcript, we calculate the percentage of their nucleotides under whole genome and annotated sponges. Then we can evaluate the enrichment via comparing the percent between sponges and whole genome. There is big overlap between PCGs and circRNAs, so we further divide them into "PCG not circRNA", "circRNA not PCG" and "circRNA and PCG". They refer to PCGs not overlapping with circRNAs, circRNAs not overlapping with PCGs, and PCGs overlapping with circRNAs, respectively. The lower bar chart shows the percentage of nucleotides located in intron, exon, 3' UTR, and 5' UTR for all annotated PCG sponge candidates. All percentages are calculated based on the number of nucleotides, excluding masked repeats, and are strand-sensitive. upper panel). Most notably, the predicted sponges are enriched for overlaps with circRNAs and protein-coding genes. In particular the circRNA sponges overlapping with PCGs have four times (22.9%) as many sponges as what one would expect by chance (5.7%) from the genomic annotation. The overlaps of sponge candidates with specific parts of protein-coding genes (intron, exon, 3' UTR, and 5' UTR) are shown in Fig 4 (lower panel).
To identify a subset of sponge candidates of particular interest, we first select the 768 sponge candidates that have at least 10 predicted binding sites more than what is required for the significance cutoff. Of these we focus on the subset that could be annotated with known genes and further require that at least 9% of the predicted binding sites reside within exons (FBSE > 0.09, which is the mean value of FBSEs for sponge candidates annotated with known genes). These sponge candidates are listed in Table 1 and include the known natural sponge ciRS-7.

Web interface
The predicted sponge candidates are freely available through a web interface at http://rth.dk/ resources/mirnasponge. The interface provides the ability to search for sponge candidate for a particular miRNA of interest as well as to download the full set of sponge candidates for all miRNAs. An example for miR-7 is shown in Fig 5A. For each sponge candidate, we provide detailed information related to properties of natural miRNA sponges to assist in prioritization, including alignment details from RIsearch2, FBSE, SDR, and accessibility and target profiles with links to the UCSC genome browser (e.g. Fig 5B for miR-7 sponge ciRS-7). The help page provides a detailed explanation of all these properties. Table 1. Filtered miRNA sponge candidates. The table provides an overview of miRNA sponge candidates that have at least 10 binding sites more than what is required for statistical significance, can be annotated with known genes, and have more than 9% of predicted miRNA binding sites within exons. The column cluster size lists the number of binding sites in the given cluster for real genome and the cluster size cutoff for statistical significance obtained from shuffling. The sponge candidates are sorted based on the difference between these two cluster sizes.

Discussion
Natural miRNA sponges have recently attracted much attention due to the discovery of increasing numbers of novel miRNA sponges, some of them have been linked to diseases [24,52]. Experimental verification is still highly time-consuming and prohibitively expensive to perform systematically on a genome-wide scale. For this reason databases such as starBase v2.0 and lnCeDB rely on computational predictions of miRNA sponges; however, because the methods employed focus purely on annotated genes, they are unable to identify other genomic regions that may function as miRNA sponges.
Our study is based on the hypothesis that groups of adjacent binding sites may function together as miRNA sponges. This is consistent with a recent study showing that the number of miRNA binding sites within a sponge correlates with its ability to derepress targets of the miRNA in vivo [11]. We have made a genome-wide computational screen that detected potential sponge candidates through clustering of nearby miRNA binding sites. In total, we identified 3673 sponge candidates spanning 1250 miRNAs. The genome-wide analysis was made possible by the RNA-RNA interaction prediction tool RIsearch2, which can predict miRNA binding sites with high speed and with accuracy comparable to that of other methods. Although we compare cluster sizes between real and shuffled genomes and select candidates based on the differences in sizes, further refinement involves calculating p-values over all cluster sizes. However, this will require comparison across clusters based on different miRNAs as some due to their composition might have different clusters sizes than others. To the best of our knowledge, this is the first genome-wide computational approach to predict sponge candidates specifically based on high binding site density in genomic regions.
There are, however, still many limitations to prediction of miRNA sponges. Like all tools for predicting miRNA binding sites, RIsearch2 produces many true and false predictions. The latter can give rise to false predictions of miRNA sponges if they appear clustered in the genome, although our filtering steps do much to alleviate this problem.
In conclusion, we have presented a computational pipeline for discovery of clusters of putative miRNA binding sites. Interestingly, we observe an enrichment (*2.5-fold) of clusters in protein-coding sequence which is not also annotated as circular RNA. For clusters overlapping sequence annotated both circular RNA and protein-coding sequence we observed an even stronger enrichment (*4-fold). Both competing endogenous RNA (mRNA) and circular RNA have previously been reported to compete for miRNA binding. Hence, we consider our clusters of miRNA binding sites as miRNA sponge candidates and we in particular obtain intriguing candidates overlapping known genes.  Table. Estimated cutoff for significant clusters of 2578 miRNAs. For each miRNA, the number of clusters obtained on the shuffled genomes are fitted as linear functions of cluster size, where the counts are transformed using log 10 . In each line, it gives the fitted parameters and cutoff for significant clusters of each miRNA, and the cutoff is estimated using the fitted parameters by extrapolating the largest cluster one would expect to observe in 1000 randomizations.