Crinet: A computational tool to infer genome-wide competing endogenous RNA (ceRNA) interactions

To understand driving biological factors for complex diseases like cancer, regulatory circuity of genes needs to be discovered. Recently, a new gene regulation mechanism called competing endogenous RNA (ceRNA) interactions has been discovered. Certain genes targeted by common microRNAs (miRNAs) “compete” for these miRNAs, thereby regulate each other by making others free from miRNA regulation. Several computational tools have been published to infer ceRNA networks. In most existing tools, however, expression abundance sufficiency, collective regulation, and groupwise effect of ceRNAs are not considered. In this study, we developed a computational tool named Crinet to infer genome-wide ceRNA networks addressing critical drawbacks. Crinet considers all mRNAs, lncRNAs, and pseudogenes as potential ceRNAs and incorporates a network deconvolution method to exclude the spurious ceRNA pairs. We tested Crinet on breast cancer data in TCGA. Crinet inferred reproducible ceRNA interactions and groups, which were significantly enriched in the cancer-related genes and processes. We validated the selected miRNA-target interactions with the protein expression-based benchmarks and also evaluated the inferred ceRNA interactions predicting gene expression change in knockdown assays. The hub genes in the inferred ceRNA network included known suppressor/oncogene lncRNAs in breast cancer showing the importance of non-coding RNA’s inclusion for ceRNA inference. Crinet-inferred ceRNA groups that were consistently involved in the immune system related processes could be important assets in the light of the studies confirming the relation between immunotherapy and cancer. The source code of Crinet is in R and available at https://github.com/bozdaglab/crinet.


A Supplementary Methods
A.1 Our novel scoring for sufficiency of interactions drastically dropped the target distribution of miRNAs.
In this section, we evaluated our novel Interaction Regulation (IR) score that we used to eliminate miRNA-target interactions if there is no sufficient expression for miRNA or target. Since we had a much smaller number of miRNA as compared to targets, this step will affect the miRNAs more in terms of a decreased number of interactions. If a miRNA has many targets but low abundance, we expected a drastic change in the number of targets of this miRNA. To analyze the number of decrease in target numbers, we got the top 25% miRNAs (222 out of 888 miRNAs) with the highest number of targets in miRNA-target interactions set before abundance filtering and after abundance filtering. As in S1A Fig, it is clear that log median expression of top miRNAs are much higher after abundance filtering, suggesting that our scoring works well to eliminate the interactions with many targets without sufficient abundance.
Overall, the median of top 25% miRNAs with a high number of targets before abundance filtering had a minimum of 877 targets and maximum of 2428 targets with median 1155.5, however, the log median expression of these miRNAs had a median of 0.42 (having minimum 0 and maximum 67484). On the other side, after abundance filtering, top miRNAs with the highest number of targets had a minimum 222 and a maximum of 1333 targets with 431 as the median. With abundance filtering, naturally, the overall target numbers are dropped with the same number of miRNAs kept. However, the log median expression for top 25% miRNAs had a median of 45.74 (having minimum 0 and maximum 251417), much higher as compared to before abundance filtering. As expected from our novel scoring, miRNAs had a high number of targets if their abundance was sufficient.
Specifically, we got the top 6 miRNAs with the highest number of targets before the abundance step shown in S1A Fig. Namely, hsa-mir-939 with 2428 targets, hsa-mir-628 with 1994 targets, hsa-mir-4726 with 1978 targets, hsa-mir-616 with 1971 targets, hsamir-3191 with 1884 targets, and hsa-mir-4677 with 1806 targets. However, the median of these miRNAs are pretty low (0.9, 31.8, 0.0, 3.1, 0.2, 8.7, respectively). When we checked their target numbers after abundance filtering, they significantly dropped to 519, 729, 300, 424, 218, and 692 respectively. Similarly, when we got the top 6 miRNAs with the highest targets after abundance filtering, we had hsa-mir-22 with 1333 targets, hsa-mir-142 with 1323 targets, hsa-mir-210 with 1281 targets, hsa-mir-148b with 1233 targets, hsa-mir-17 with 1176 targets, and hsa-mir-183 with 1173 targets. When we checked their median miRNA expression, they have 67484.1, 1899.9, 385.1, 186.8, 465.7, and 15124.2 respectively. These expression values are high enough, especially when compared to top ones before abundance filtering. When we checked the target numbers before abundance filtering, they are still high with the numbers 1641, 1551, 1694, 1783, 1376, and 1445, respectively. These analyses showed that our scoring works well to exclude miRNA-target interactions from interactions set considering the sufficiency of abundance.
Moreover, we analyzed the number of targets overall before and after abundance filtering. Before the abundance step, we had miRNAs having targets starting from 8 to 2428 with a median of 519. Even keeping all the miRNAs in the set after abundance filtering, the target number range was starting from 6 to 1333 with a median 99.5 (S1B Fig).
Also, we analyzed the number of reduced targets for each miRNA after abundance filtering. When we just considered the miRNAs that are reducing more than half of their targets after abundance filtering, their number of targets had a median of 591. When we checked the median target number for the miRNAs reducing more than 75% of their targets, it increased to 622 targets per miRNA (with minimum 39 and maximum 2,428), while it dropped to 305 (with minimum 8 and maximum 1,694) for miRNAs reducing less than 25% of their targets. This shows the miRNAs with a higher number of targets reducing their targets even as a percentage, showing the preference is not because that they have a higher number of targets, but the ratio of reduced target number relative to all targets before abundance filtering is also high for them.
We also observed that the density plot showed the density of the reduced number of miRNA targets with a peak occurring around 0, is much lower than the expected reduction for any miRNA, which is 415.9 (mean of overall reduced target number per miRNA with minimum 0 and maximum 1,909) (S1C Fig). These analyses suggested that we eliminated miRNA-target interactions having low abundance and many targets applying our novel scoring to the miRNA-target interaction set. Density of ceRNA network was slightly more sparse after losing many genes following network deconvolution filtering: Also, we investigated the density of the final network as compared to before network deconvolution (ND) step.
Since we applied ND to eliminate the amplifying effect of ceRNA interactions, we expected to have a more sparse network, even we expected to lose some ceRNAs totally from the network. We examined the remained network's edge density (the ratio of the number of edges and the number of possible edges) and transitivity (the probability that the adjacent nodes of a node are connected, also called the clustering coefficient). When we applied ND filtering, degree distribution in the final network was significantly lower as compared to the network before applying ND (p-value < 10 −15 ). Even we lost almost 3,000 ceRNAs from the network after applying the ND method, we had lower degree distribution, lower transitivity (dropped from 0.22 to 0.14), and slightly lower edge density (dropped from 0.0020 to 0.0017) for our final network.

A.2 Inferred network was scale-free as a biological network and specific to Crinet
Since biological networks generally exhibit scale-free property, we checked whether our inferred ceRNA network is scale-free by computing its degree probability distribution function. Following the power-law rule, we fitted linear regression for the log of ceRNA's degree probability to the log of ceRNA's degree. The resulting plot of the inferred ceRNA network (S2 Fig) had a negative slope with high fitness (R 2 = 0.93), indicating that the inferred ceRNA network was scale-free.
To show the specificity of Crinet, we compared our inferred ceRNA pairs from different regulatory layers, namely protein-protein interactions (PPIs) and transcription factor (TF)-gene interactions. We collected 1,663,810 TF-target interactions from TRRUST v2 [3] database and from the ENCODE Transcription Factor Targets dataset [1]. Within all inferred ceRNA interactions, very few interactions were TF-gene interactions (0.46% of all inferred interactions). Similarly, we collected 1,847,774 PPIs from BIOGRID v3.5.186 [4]. We also found that very few interactions were also PPIs (0.51% of all inferred interactions).

A.3 Results were reproducible and robust to hyperparameter selection
To check the reproducibility of Crinet based on different datasets and hyperparameters, we ran whole pipeline multiple times with different datasets and compared the overlapping of the inferred networks. In addition to that, we examined that our pipeline is robust to hyperparameter selection running the whole pipeline with different thresholds for each step, and checking the overlapping ceRNA pairs. Reproducibility of Crinet: We ran Crinet separately for randomly selected two different breast tumor sample sets. We divided all the breast tumor samples we had in our dataset into two equal-sized random sets with similar subtype distribution (namely basal-like, normal-like, luminal-A, luminal-B, Her2-enriched) to avoid bias in results. We ran the pipeline for both subsets and obtained inferred ceRNA interactions. We repeated this process for other random subsets of our samples, again having approximately the same number of samples for subtypes. We compared the results of two runs with two subsets. We had high overlapping among different runs and different subset for interactions and unique ceRNAs (S4 Table). Furthermore, we got the overlapping pairs for all subsets having 17,419 ceRNA interactions between unique 5223 ceRNAs. When we checked the distribution of overlapping ceRNAs in our inferred network, the degree distribution of these ceRNAs were much higher as compared to the overall degree distribution before filtering (p-value < 2.10 −16 ) suggesting that consistently inferred ceRNAs were the important ceRNAs highly involving in our inferred ceRNA network.

3/11
Robustness of hyperparameter selection: To show the hyperparameter selection in Crinet is robust to hyperparameter selection, we modified hyperparameters in ceRNA inference including network deconvolution step, and analyzed the results.
To show the robutness of ceRNA inference, we slightly changed hyperparameters eliminating more interactions with respect to one hyperparameter each time keeping similar number of interactions. The hyperparameters we used were hypergeometric p-value for common miRNA regulator among ceRNA pair, partial correlation and partial correlation bootstrapping threshold, and collective regulation correlation and its bootstrapping correlation, and we eliminated approximately one-third of the interactions concerning only one threshold for each run. Then we checked the overlapping interactions for them before network deconvolution filtering to see the effect. We got the overlapping of three different runs changing only one hyperparameter each time, and they had high overlapping having 17,419 interactions between unique 5223 ceRNAs, still keeping half of the interactions the same with highly-overlapping ceRNAs. (S6 Table). This analysis suggests that even each hyperparameter had significant effect on the results, they consistently kept similar interactions with high number of unique overlapped ceRNAs.
To analyze that selection of ranking score in the incorporated network deconvolution algorithm does not change results much, we tried several values, and evaluated the results with LINCS data in a same way with pairwise ceRNA inference. Since the incorporated network deconvolution algorithm [2] gives a ranking score for interactions, but this ranking score is not showing exact significant, we had a wide range for ranking selection, but still the evaluation of obtained network was around the same accuracy (S5 Table). These analysis showed that the network deconvolution step is important to eliminate indirect ceRNA interactions, but the selection of ranking did not significantly change Crinet results.

A.4 Skipping individual steps made a substantial difference in inferred results
In this section, we assessed three major steps of ceRNA inference in our pipeline, namely: partial correlation filtering, collective regulation filtering, and exclusion of indirect ceRNA interactions. To examine the importance of each step, we skipped these steps to see how it would modify the inferred ceRNA network. Specifically, we kept all steps just skipping partial correlation filtering, referred to as "No CORR". Since we had much more interaction at this step, we used -0.2 for collective regulation threshold, which is around 1.quartile of correlation distribution inferring 180,668 ceRNA interaction. As a second step, we just skipped collective regulation filtering, referred to as "No COLL", and ran the pipeline similarly inferring 146,174 ceRNA interactions. Also, we got the results of our network before applying network deconvolution filtering, referred to as "No ND". We also checked the skipped correlation and abundance filtering without network deconvolution. We called our results with whole steps as "Crinet", and compare it in S1 Table with other runs using LINCS data that we used for network accuracy assessment. Based on these results, skipping any step drastically changed the results. For the case "Crinet-no CORR", the evaluation was based on a very limited number of perturbagens. On the other hand, even we had a slightly better result for "Crinet-no CORR no ND", we had a huge number of inferred pairs, and a small number of perturbagen to evaluate with respect to the inferred number of pairs. For other cases, the accuracies of inferred networks were lower than the original results. Overall, these results suggest that the ceRNA interaction obtained from the modified network was not good at predicting gene expression change of its ceRNA partner, potentially inferring 4/11 more false ceRNA interactions. S1 Table. Analysis to check the accuracy of inferred ceRNA interactions for modified pipeline using LINCS-L1000 shRNA-mediated gene knockdown experiment in breast cancer cell line  Table. GO Slim terms from the biological process ontology enriched for the inferred ceRNA groups with coding genes. We put the GO Slim terms with at least half of the genes in the group. Group indicates the group we analyzed deeply shown in S4 Fig). Definition is the name of GO term. Gene ratio is the fraction of inferred genes in the term to total applicable genes in the term. We used the tool named GO Term Mapper from URL https://go.princeton.edu/cgi-bin/GOTermMapper Group # GO term ID Definition Gene Ratio