Transcriptome-Wide Prediction of miRNA Targets in Human and Mouse Using FASTH

Transcriptional regulation by microRNAs (miRNAs) involves complementary base-pairing at target sites on mRNAs, yielding complex secondary structures. Here we introduce an efficient computational approach and software (FASTH) for genome-scale prediction of miRNA target sites based on minimizing the free energy of duplex structure. We apply our approach to identify miRNA target sites in the human and mouse transcriptomes. Our results show that short sequence motifs in the 5′ end of miRNAs frequently match mRNAs perfectly, not only at validated target sites but additionally at many other, energetically favourable sites. High-quality matching regions are abundant and occur at similar frequencies in all mRNA regions, not only the 3′UTR. About one-third of potential miRNA target sites are reassigned to different mRNA regions, or gained or lost altogether, among different transcript isoforms from the same gene. Many potential miRNA target sites predicted in human are not found in mouse, and vice-versa, but among those that do occur in orthologous human and mouse mRNAs most are situated in corresponding mRNA regions, i.e. these sites are themselves orthologous. Using a luciferase assay in HEK293 cells, we validate four of six predicted miRNA-mRNA interactions, with the mRNA level reduced by an average of 73%. We demonstrate that a thermodynamically based computational approach to prediction of miRNA binding sites on mRNAs can be scaled to analyse complete mammalian transcriptome datasets. These results confirm and extend the scope of miRNA-mediated species- and transcript-specific regulation in different cell types, tissues and developmental conditions.

Introduction miRNAs are endogenous short (,22 nt) RNAs that exert regulatory control of many cellular processes, negatively regulating specific mRNAs via complementary base-pairing at a target site. miRNAs of plants bind the targeted mRNA with high complementarity and thereby mark it for degradation, whereas animal miRNAs more typically bind with sub-optimal complementarity and inhibit or diminish productive translation [1].
Target sites for complementary base-pairing by miRNAs can be inferred using computational methods based on empirically determined features of how known miRNAs bind in vivo to validated target sites. For example, perfect Watson-Crick (WC) matching over a 6-or 7-nt ''seed'' region at the 59 end of the miRNA [2][3][4] is very important for target recognition and can by itself repress translation [4]. Base-pairing elsewhere in the miRNA, including at a so-called T1A site extending the seed region [3] and in the 39 region [3,4], has also been variously proposed to contribute to target binding.
Although not every validated miRNA-mRNA pair exhibits a short region of perfect WC complementarity [5][6][7][8][9] and computational methods based entirely on such seeds will fail to find every likely target, many experimentally validated target sites exhibit perfect WC matches in the seed region.
Several distinct algorithmic approaches are utilized to predict miRNA targets. The most widely used approach is a sequence-based search, in which a local sequence alignment tool is used to find target sites with near-perfect WC complementarity to the query miRNA [10][11][12][13][14]. In this approach, the existence of this seed region is often assumed. Because miRNA target sites are short and may exhibit only limited complementarity outside the seed region, additional criteria based on abstractions of in vivo processes are then employed to improve the efficiency of the extension step; these include requiring that targets be conserved across homologous mRNAs from different species, or requiring multiple matches by one or more miRNAs. Target sites predicted using this approach may subsequently be ranked according to hybridization energy score [12,13], but the actual search is alignment-based.
In a second approach, machine learning [15] and pattern discovery [9] have been applied to capture features of mRNAs that are known or suspected to bind mRNAs. This approach builds on numerous other applications in genomics and bioinformatics, but has two important limitations. As relatively few miRNA target sites have been functionally validated, such predictors cannot yet be precise, and discovery must continually be re-run on the entire database as new miRNAs are discovered and new binding sites validated. Further, biologically relevant features that contribute to the predictions may not be captured or, if captured, may remain unknown to the user.
Both of the above approaches are complicated by regions of unpaired nucleotides (loops and bulges) known to be present in many miRNA-target interactions [16]. To accommodate these regions of imperfect pairing in duplexes between the 39 end of a miRNA and its mRNA target site, sophisticated sequence-based methods may search for near-perfect WC complementarity in the seed region (miRNA positions 2-7, 2-8 or 1-7) [11,14] then extend the search e.g. to the 39 end of the miRNA, rather than attempting to align the entire miRNA in a single operation.
A third approach [16,17] is based on the perspective that it is the thermodynamic stability of the RNA-RNA duplex, not its abstraction as a string-match or model, which allows miRNAs to bind targets in vivo. Potential target sites are identified as those at which the free energy of hybridization is minimized; complex structures involving loops and bulges can readily be accommodated in this calculation. We adopt this third approach in the work reported here. We introduce the FASTH (FAST Hybridization) program and, as a first stage, use it to predict potential miRNA-mRNA interaction sites. The name was chosen to be similar to FASTA [18] because both programs employ similar search strategies and both allow bulges (gaps) in the local alignment. With FASTH, however, we follow a purely biophysical path, searching for binding targets taking into account (a) RNA basepair and base-pair-stacking free energies [19] in perfect helices, (b) the unfavorable contributions from interior loops and bulges, (c) energy contributions from single-stranded bases and from mismatched pairs adjacent to a base pair, and (d) the initiation energy required to bring the two molecules together and initiate duplex formation. We assume that the population of duplexes actually formed between a miRNA and its target region on an individual mRNA is best represented by the duplex with the minimum free energy possible, even if this implies the presence of non-WC base pairs, loops, bulges and/or mismatches.
In many cases, RNA-RNA hybridization results in the formation of a population of duplexes, one (or a few) with minimum free energy and many others with increasingly suboptimal energies. miRNA-mRNA duplexes with slightly suboptimal free energies probably occur in vivo, perhaps more transiently or at lower frequencies, and might be biologically relevant. Finding these suboptimal structures is computationally expensive, and previous work [16,17] employed dynamic programming to search the target database. FASTH uses a fast heuristic (i.e. not dynamic programming) for the database search, with search time scaling sub-linearly with database size; it can return a user-specified number (hundreds or thousands) of suboptimal sites, and process multiple databases with rigorous minimum free-energy calculations, in a single search.
In the second stage of our approach, we then select those FASTH results that satisfy empirically derived rules unrelated to energy minimization per se, including perfect WC complementarity to the target at the 59 end of miRNAs, minimum numbers of WC base pairs at the 39 end, and/or specified free energy score thresholds. As we demonstrate, further criteria can be imposed as well, capturing features e.g. of the specific mRNA region bound (59UTR, CDS, or 39 UTR), presence of orthologs in other transcriptomes, and/or levels of sequence conservation. Energybased prediction has recently been extended to include a penalty term that reflects the free-energy cost associated with disruption of pre-existing secondary structure at potential target sites on the mRNA [20][21][22]; such a term could, if desired, be easily incorporated into our approach.
With few exceptions, miRNA target-prediction methods have until now been applied to EST data, focusing on only the 39UTR. Here we apply our approach to the complete human and mouse transcriptomes as represented in RefSeq. We parameterized the second-stage requirements within biologically reasonable ranges, and observe the effects on number of predicted sites and on statistical significance. We demonstrate that short strings of matching nucleotides (usually 6 or 7 in length) appear more frequently in the 59 end than in the 39 end of human miRNAs. We show that many energetically favourable binding sites with perfect seed matches, i.e. potential miRNA targets, occur in all mRNA regions, and that a substantial number of these sites are reassigned between mRNA regions depending on the specific transcript isoform. This opens the possibility that, at least in human, alternative transcripts of many genomic regions may be differentially regulated by miRNAs. Selected results were thereafter taken forward into laboratory-based validation. Although the results reported below focus on results with 313 human miRNAs, very similar results were also obtained for 233 mouse miRNAs and are presented in the Supplementary Material S1.

Predicted target sites for an individual miRNA
For each miRNA it is always the case that our approach predicts a set of targets, distributed through a range of free energy scores.
To assess their quality, we compare them statistically against the targets predicted, under the same criteria, for a set of control miRNAs ( Figure 1A-E). To this end, for each miRNA we employ two sets of controls (mononucleotide shuffled, or MS, and firstorder Markov, or FOM) derived to reflect different assumptions, as described in the Supplementary Text S1; each control miRNA finds a set of targets, likewise distributed over a range of free energies. The targets predicted for the real miRNA may be distributed through better, similar, or worse free energies than are the targets predicted for its controls. For example, targets predicted for let-7a are shifted toward better free energies than are those predicted for both sets of controls ( Figure 1B), whereas targets predicted for miR-17-5p and its controls show similar energy distributions ( Figure 1C). In some cases the true miRNA found more target sites than did, on average, the corresponding controls (miR-324-3p, Figure 1D), whereas in other cases each control, on average, found more targets than did the miRNA (miR-129, Figure 1E).
Over all these miRNAs, the sets of predicted targets also vary by free energy score. The targets predicted for some miRNAs (and for the corresponding controls) show high (poor) energies, with miR-1 among the most extreme in this regard ( Figure 1A), whereas better energies are predicted for others. Panels B, C and D of Figure 1 show the increasingly better energy-score ranges for predicted targets of let-7a, miR-17-5p and miR-324-3p. The diversity of these examples with respect to the range (kcal/mol) and shape of the distribution of energy scores, and number of target sites predicted with real versus control sequences, reflects an estimate of the diversity of molecular interactions over the transcriptome. We return below to the distribution of predicted targets over all human miRNAs.
For binding sites experimentally validated in human [23], the actual hybridisation energies returned by FASTH are similar to those calculated by Tafer and Hofacker using RNAduplex and RNAplex [17], but are quite different from those computed by Rehmsmeier et al. using RNAhybrid [16].

Predicted target sites over all human miRNAs
As we have seen, the set of predicted targets for an individual miRNA can show diversity in number (frequency) and free energy profile. In Figure 2A-D this case-to-case variation is aggregated over all sites predicted for the 313 human miRNAs considered in this study; panels A and C show results with minimal second-stage filtering (requiring only perfect WC complementarity at nucleotide positions 2-8), while panels B and D reflect the imposition of an additional criterion (that there be ,6 mismatches and GU pairs at positions $15).
In either case, at the lower free energies in aggregate these miRNAs find more predicted target sites than do control sequences of the same length and nucleotide composition (Figure 2A,B). At free energies better than about 215 kcal/mol the signal-to-noise ratio (S:N, see Methods) remains above 1.5, but this ratio diminishes rapidly at higher free energies ( Figure 2C,D). At predicted free energies better than about 212 kcal/mol, on average these miRNAs find more target sites than do the MS controls. In comparison with the FOM controls, however, the miRNAs find more target sites only at energies between about 213 and 223 kcal/mol (Figure 2A). The greatest fold difference reaches 3.33 against the MS controls (albeit for small numbers of predictions, at 238 kcal/mol) but only 1.51 against the FOM set ( Figure 2C). Similar behaviour is seen for the subset of predictions that survive more-rigorous second-stage filtering ( Figure 2B,D). Thus at least for miRNA-mRNA binding sites with perfect WC complementarity in a 7-nt seed region, dinucleotide frequency bias is a main contributor to stability of the predicted duplex.
The S:N ratio diminishes rapidly at free energies greater (i.e. worse) than about 215 kcal/mol regardless of whether MS or FOM controls are used, falling below 1.00 at energies above about 214 and 211 kcal/mol when calculated against the FOM and MS control sets respectively ( Figure 2C). Very similar results are seen after second-stage filtering ( Figure 2D), with S:N falling below 1.00 at energies above about 212 kcal/mol. These results indicate that an energy-based approach will be increasingly unproductive as the free energy of duplex formation becomes progressively weaker beyond a threshold determined, in part, by details of the second-stage filtering criteria; conversely, ignoring such weak duplexes should reduce the frequency of false-positive predictions. Restricting allowable free energies by introduction of a higher energy threshold improves the overall S:N ratio by around 10-20% for this set of miRNAs (Supplementary Table S1,  Supplementary Table S2, and Methods).
At energies more-favorable than about 217 kcal/mol, the S:N ratio either decreases or increases depending on whether the FOM or MS control set is used ( Figure 2C-D). As argued above, we suspect that use of FOM underestimates S:N, while use of MS may overestimate S:N.
Free energies of predicted target duplexes are correlated with the potential binding energy of each miRNA Figure 3 shows, over all these 313 human miRNAs, the free energies of our predicted target duplexes plotted against the potential binding energy if each miRNA were bound to an ideal target site with perfect WC complementarity. These miRNAs bind at their potential target sites within a wide band of energies, bounded on one side by the energy score at perfect WC base pair complementarity and on the other by the score associated with the minimal extent of interaction we consider (e.g. only the seed regions). The actual predicted free energies are broadly correlated with the best possible energies but with a wide range of suboptimality.
Experimentally validated targets are not restricted to mRNAs with the energetically most-favorable target sites. For example, five experimentally validated targets of miR-1 ( Figure 1A) have energy scores between 210.9 and 214.5 kcal/mol (for details see Table 1). Thus a high (weak) energy score does not by itself disqualify a predicted site. We are not hypothesizing that true miRNAs are those that bind most stably, only that stability (unlike its abstraction as a string) is important in prediction of potential target sites. Indeed a degree of energetic sub-optimality may be necessary, where single miRNA has to bind many different target sites in different mRNAs.
For known miRNAs, occurrence of a 6-7 nucleotide region of perfect WC base-pairing with mRNAs is biased toward the 59 end of miRNAs The minimum sequence complementarity and thermodynamics of hybridization required for a region of mRNA to serve as a miRNA target site in vitro are not fully understood. However, many computational and experimental studies have addressed how miRNAs recognize and pair with their mRNA target sites [2,4,10,[12][13][14]24]. One of the most important determinants of this interaction is the so-called seed region, a stretch of 6 or 7 contiguous nucleotides at or near the 59 end of a miRNA involved in perfect WC base-pairing with the mRNA [2].
We examined whether our predicted targets possess such a 6-7 nt region of perfect WC complementarity, and if so, whether it is preferentially complementary to the 59 end or to the 39 end of the corresponding miRNA. To compare preferences to the two ends, we made perfect WC complementarity the second-stage filtering criterion, and sorted the results into one set of potential targets with perfect complementarity to the 59 ends of miRNAs, and another set with perfect complementarity to the 39 ends. As FASTH identifies targets based only on free energy of hybridization, without information about or bias toward any particular region of the miRNA, our null hypothesis is that all miRNA regions (including 39 and 59 ends) should show equal numbers of perfectly complementary target sites. We then did the same using randomized miRNA sequences as controls (see Methods). We carried out these experimental and control procedures using four sets of conditions that represent 6-nt and 7-nt regions of perfect WC matching, with and without conditions on base-pairing at the 39 end of miRNA and on binding-energy threshold (see legend to Figure 4 for details). Figure 4 shows that predicted target sites complementary to the 59 end of the 313 human miRNAs are much more numerous than those complementary to the 39 end, whereas little or no such difference is observed with the corresponding MS or FOM controls. For these miRNAs, the 59 end is favored by a ratio of 1.26-1.52 (depending on second-stage filtering conditions) compared with 0.97-1.03 for the MS and 0.90-1.00 for the FOM controls ( Figure 4A-D). Figure 5 presents a second perspective on these results. The number of target sites predicted using known miRNAs, divided by the number predicted using the randomized controls, can be considered as a signal-to-noise ratio. In S:N ratio as well as in number of predicted targets in mRNAs, we observe a clear preference for perfect WC base pairs involving the 59 end of miRNAs ( Figure 5A-D). The number of high-quality potential binding sites with a 6-7 nt region of perfect complementarity at their 59 end (i.e. a seed region) is substantially (1.34-1.87 times) greater than expected under the null model reflected in the MS control set, and slightly (1.06-1.33 times) greater than expected under the null model reflected by FOM. By contrast, the corresponding ratios for complementarity at the 39 end are 1.04-1.20 and 0.79-0.84 for the MS and FOM control sets respectively. The ratio of these ratios (1.28-1.55 with MS, 1.26-1.68 with FOM) describes the selectivity, here favoring the 59 end, at each filtering condition. Thus motifs with perfect WC complementarity to a 6-7 nt region at the 59 end of known miRNAs occur preferentially in mRNAs, relative to motifs complementary to the 39 end.
For the simplest (although not necessarily biologically meaningful) case considered here, i.e. perfect WC base-pairing of the six nucleotides constituting positions 2-7 from either the 59 or the 39 end of the (real or randomized) sequences, similar numbers of target sites are predicted at each end of the MS controls and at the 39 end of real miRNAs ( Figure 5A). When positions 2-8 are considered instead ( Figure 5B), only predicted sites at the two ends of the MS controls remain similar in number; and as second-stage filtering is made more-stringent by imposing mismatch and free energy thresholds ( Figure 5C-D) the predicted target-site numbers, S:N ratios and 59-over-39 selectivity deviate farther. For the conditions reported in Figure 5A-D, (1) more-stringent filtering always reduces the number of predicted target sites; (2) except for 39 ends under FOM, more-stringent filtering increases the S:N ratio; (3) more-stringent filtering tends to increase 59-over-39 end selectivity; and (4) the FOM controls find more predicted targets than do the corresponding real miRNAs at 39 ends of mRNA, resulting in S:N,1.00. We explore some of these parameter conditions in further detail below.
Exact matching in seed regions does not, of course, tell the whole story. A high-quality match may not only involve, but also extend, a seed region. It may also be the case that some sites with perfect WC complementarity in the seed region show poor overall free energy scores and consequently are not identified, by our approach, as potential targets. As demonstrated previously [3], mismatches (including GU pairs) in the seed region may be compensated by optimal complementarity elsewhere. In agreement with many previous studies, though, we are able to use the recognition of a seed region in the 59 end of miRNAs to identity many energetically favorable candidate mRNA target sites.
Parameters of binding at the 39 end of miRNAs affect the number and S:N ratios of predicted target sites We examined the effect of binding parameters at the 39 end of miRNAs by constraining the total number of mismatches and GU pairs at nt positions $15 to be ,6 ( Figure 6). Except at the weakest energies this improved the mean S:N ratio, particularly as calculated against the MS control set ( Figure 6C versus D), although at the cost of a ,50% reduction in number of predicted  Other parameters of binding affect the number and signal:noise ratios of predicted target sites Identifying the seed as nt positions 2-8 instead of positions 2-7 decreased the number of predicted targets but improved the S:N ratio of prediction by around 10% (Figure 6A-B). As allowing one GU pair in the seed region very greatly increased the number of potential target sites, for each miRNA we ordered our target-site predictions by free energy score and restricted our examination to those scoring $40% of the best score observed ( Figure 6C). The results demonstrate that allowing up to one GU pair in the seed region increases the number of predicted targets (parameter conditions 1 and 3 above), but decreases the S:N ratio of prediction; applying more-stringent conditions improves the S:N ratio; and allowing only a single loop in the middle of the binding site does not affect the S:N ratio (although many predicted sites contain such a loop). These results are summarized in Supplementary Table S1.
Energetically favorable binding sites are not restricted to 39UTRs of mRNAs For short interfering RNAs (siRNAs), whether a target site is located within a translated region or a non-coding region has only marginal effects on RNA interference [25]. Plant miRNAs bind with near-perfect complementarity at sites that are primarily, though not exclusively, located within the CDS [26]. In animals, all previously validated miRNA target sites lie in the 39UTR, and searches are often, although not always [2], restricted to 39UTR data; modest targeting on CDS has, however, been reported [27]. For this set of known human and mouse miRNAs, however, the S:N ratios of our predicted binding sites are very similar, regardless of location in the 59UTR, CDS or 39UTR (Table 2 and Table 3). Energetically favorable binding sites were more-frequent in the CDS than in any other region even after normalization by relative size (nt) of region. Predicted sites were slightly under-represented in 39UTRs relative to mRNA length, perhaps due to their lower GC content, but show a higher S:N ratio regardless of parameterization. Thus energetically favorable candidate target sites may occur in all mRNA regions. Interestingly, a recent report shows that in mouse, miRNAs regulate cell development through target sites in CDS regions [28]. Evidence has been presented that for miRNAs with perfectly WC-complementary 8-nt seed regions, target sites in the 39UTR are more efficacious in depressing mRNA levels than are sites in the 59UTR or CDS [24]. Efficacious sites tend to be flanked ,30 nt both upstream and downstream by AU-enriched regions [24]. Our data confirm that predicted target sites in higher-GC regions (59UTR and CDS) have lower (more-favorable) energy scores than our predicted target sites in lower-GC regions (39UTRs) (Figure 7). Higher-GC regions are expected to contain more-stably folded local structure, whereas sites in lower-GC regions may be, on average, more accessible. As a consequence, miRNAs targeting lower-GC regions may form duplexes at relatively weaker energies and still be functional, whereas those targeting the 59UTR or CDS may require more-favorable energies to compete with pre-existing structure, and this may require hybridization that extends well beyond the seed region and near perfect complementary binding to the target sites. An approach based on hybridization energy, such as ours, tends to give prominence to target sites in higher-GC regions such as 59UTR or CDS. Optimizing the prediction of target sites in regions of morestably folded local structure (i.e. higher-GC regions) may therefore require more-strict conditions at the secondary filtering stage.
Individual binding sites are predicted in the coding region or the UTR, or be absent altogether, depending on mRNA isoform As described immediately above, many potential energetically favorable binding sites can be found in 59UTRs and coding regions, as well as in the 39UTRs. RefSeq (www.ncbi.nlm.nih.gov/ RefSeq) identifies 2943 protein-coding genes as having more than one transcriptional isoform in human; these 2943 genes are represented by 7965 RefSeq mRNAs, among which we predict no miRNA binding site in any transcript of 64 genes. Removing these 64 genes and their 146 transcripts, we are left with 66381 predicted binding sites distributed across 7819 transcript isoforms of 2879 multi-transcript human genes. For each of these genes in turn, we next asked whether a predicted binding site was found in the same mRNA region in every transcript isoform, reassigned to a different mRNA region in at least one isoform, or lost altogether in at least one isoform (Table 4).
Using the parameterization that yields the highest S:N ratio ( Figure 6C), in this set we found 2658 genes (92.3% of 2879) with 7023 isoforms (89.8% of 7819) in which at least one predicted miRNA target site remains in the same mRNA region in every isoform; 44972 target sites (67.7% of 66381) thus resist reassignment or loss across known transcript isoforms. We found 777 genes (27.0%) with 2485 isoforms (31.8%) in which at least one predicted target site is reassigned to a different mRNA region in at least one isoform, with 6519 (9.8%) of predicted miRNA targets in this category; most of these cases involve a site found in the 59UTR of one isoform but in the CDS of a second (e.g. the energetically most-favorable miR-17-5p target site on TNFSF12 mRNAs: Supplementary Figure S1), or in the CDS of one but in the 39UTR of a second, although in a small number of isoform sets a binding site is reassigned among all three regions. Finally, in 2022 genes (70.2%) with 5823 isoforms (74.5%) at least one predicted target site is lost (or gained) among different isoforms; 14890 target sites (22.4%) are in this category. Clearly, many genes (thus many sets of transcripts) contain target sites that fall into more than one of these situations. A proportionally similar distribution of fates is seen whether the seed region is taken to be nt 2-7, or nt 2-8 (Table 4).
It is interesting to ask how these frequencies of predicted targetsite reassignment and gain/loss among transcript-isoform sets compare with the corresponding frequencies for nucleotides (whether associated with a predicted miRNA target site or not). Calculation of nucleotide reassignment and gain/loss is straightforward for transcript sets with only two isoforms (see Methods), although more-complicated for larger isoform sets. Of these 2943 genes with multiple isoforms, 1947 have only two isoforms and the other 996 have three or more (range 3 to 23). Among these 1947,   on average 3.4% of nucleotides are reassigned among mRNA regions, while nucleotide loss averages 9.8% based on the longer isoform and 18.5% based on the shorter (Supplementary Table  S3). By contrast, for predicted target sites we observe 6.0% reassignment and 15.2-15.6% gain/loss among transcript sets for genes with only two isoforms (Table 4). Correction for edge effects scarcely alters these proportions (Table 4 and Supplementary  Table S3). Thus the observed frequency of target site reassignment exceeds that expected under a length-proportional model based on these data, whereas the frequency of gain/loss may not be significantly different from expectation.

miRNA targets with orthologs in mouse
One motivation for this work has been to predict miRNA target sites (and thus the mRNAs in which these sites exist) without taking into account their conservation across different species. However, most known miRNAs and many validated target sites are conserved across species, and we can use this information to improve our prediction accuracy.
Using those 181 pairs of miRNAs orthologous between human and mouse for which sequence positions 1-8 are identical, we predicted target sites using two parameterization conditions and the same sets of randomized human sequences as controls in each case. Requiring orthology improves the S:N ratio (1.87 compared to 1.50 for one parameterization, 2.97 compared to 1.81 for the other against MS controls; 1.81 compared to 1.21 for one parameterization, 2.61 compared to 1.34 for the other against FOM) although at the cost of a 72-81% reduction in number of predicted sites (Table 5).
Under these two conditions, 83-87% of the predicted target sites present in orthologous mRNAs (human and mouse) occur in the same mRNA region (e.g. 39UTR); these values are uncorrected for the few cases in which an mRNA in one species is annotated as having two or more orthologs in the other. To a first approximation, miRNA target sites in the same region of orthologous mRNAs can be considered orthologous sites. Thus most miRNA target sites present in homologous mRNAs between human and mouse are themselves orthologous. miRNA sequences can be highly conserved across different species, but this does not imply that their target sites are necessarily similarly conserved in sequence. Under these two parameterizations, 72-81% of predicted target sites do not have a counterpart in an orthologous mRNA (Table 5). Many taxon-specific miRNA-mRNA interactions may be thus available to regulate taxonspecific developmental processes in human and mouse.

Method evaluation
In our hands, FASTH is about 30-fold faster than RNAhybrid version 2.1 [16] in finding target sites for a single miRNA among 500 mRNAs (,1.5 Mbp); on a single AMD64 core (1 Gb memory) FASTH completes this in 4 seconds, and RNAhybrid in 125 seconds. RNAplex version 1.0.0 is claimed to be 10-27 times faster than RNAhybrid [17] but could not process even our 500-mRNA database, as it cannot deal with large sequences (.,4100 nt) at least on the above-mentioned hardware. Search time scales linearly with database size for RNAhybrid and RNAplex, whereas with FASTH it scales sub-linearly (see Supplementary Text).
We examined the efficiency of our approach in recovering those miRNA target sites that have been experimentally validated in human [23] (Table 1) or mouse [9]. Of 82 miRNA binding sites experimentally validated in human (all of which are in the 39UTR), 65 occur in RefSeq and were available for discovery (see Methods). Of these we recovered 40 using the most permissive search criterion (Table 1), i.e. recall was 62%. However, many of these 40 have poor free energy scores, and about half failed to meet one or more of the secondary filtering criteria we used to improve the S:N ratio.
To provide a broader test, we applied the same computational approach to the RefSeq mouse mRNA database (see Methods), asking what proportion of the target sites experimentally validated by Miranda et al. [9] we recovered. As Miranda et al. set different criteria for target prediction (e.g. allowing GU pairs and a mismatch in the seed regions) and selected sites for validation on that basis, not all of their validated sites could have been discovered by our approach; we correct for this difference, to Table 4. Numbers of predicted target sites reassigned to a different mRNA region, or gained or lost, among different isoforms of the same transcriptional region (gene), for the 2943 human protein-coding genes annotated (RefSeq) as having more than one transcriptional isoform, under six sets of defining conditions (A-F). the extent possible, by removing from consideration those targets (in [9]) that have GU pairs and/or mismatches in the seed region. Miranda et al. also give details of their predicted sites that failed validation (i.e. were false positive predictions); in principle this makes it possible for us to calculate the sensitivity as well as specificity of our approach (see Methods). On this basis, our approach shows high specificity (range 0.50-0.93) but moderate sensitivity (range 0.33-0.80 depending on miRNA and filtering criteria) ( Table 6). As the number of false positives is small and the data apply to mouse only, these specificity values should be interpreted as only indicative for our approach.

Comparison with other methods
We compared our predicted target sites to those identified using the commonly used methods PicTar [14], TargetScan [11] and MiRanda [13], restricting our comparison to 39UTR regions of human mRNAs. Overlap between our predicted target sites and those of the other methods was 15-19% with PicTar, 13-16% with TargetScan, and 4-5% with MiRanda depending on the parameter values used in our second-stage filtering. The overlap with MiRanda is smaller in part because the number of targets predicted using MiRanda is smaller (22,896) than with PicTar (61,820) or TargetScan (44,657) (see Supplementary Table S4). Pairwise overlaps among the other three methods ranged from 8% (PicTar vs MiRanda) to 55% (PicTar vs TargetScan) (Supplementary Table S5). Substantial overlap between predictions of PicTar and TargetScan has been observed by others [21].

Experimental validation
The Gene Ontology [29] terms most-enriched among our lowestenergy targets include protein kinase cascade, signalling pathway, negative regulator of transcription, and anti-apoptosis (Supplementary Table S6).
We selected, for experimental validation, targets to three of these miRNAs: hsa-miR-17-5p (one target: see below), hsa-miR-15a (two targets), and hsa-miR-324-3p (three targets). This selection was made on the basis of functional association with cancer (hsa-miR-17-5p, hsa-miR-15a) or predicted targets in the Wnt signalling pathway (hsa-miR-324-3p) as described in Supplementary Table S7. Sequence corresponding to individual predicted target sites was cloned into the 39UTR of a luciferaseexpressing vector, and luciferase activity (directly proportional to translation from the plasmid) was measured in the presence of either the test miRNA, or a control miRNA-like sequence. For replication, normalization and statistical analyses see Methods.
For five of the six targets subjected to experimental validation, expression was inhibited by the corresponding miRNA ( Figure 8). Assessed by p-value, this decrease was significant for the TSPYL2 and WNT9B constructs at 10 and 50 nM, and for the BCL2 and TNFSF12 constructs at 50 nM. BCL2 mRNA has previously been validated as a target for has-miR-15a inhibition [30]. The CREBBP construct showed ,32% mean reduction at 50 nM, although this reduction is not significant as assessed by p-value. The sixth mRNA construct, for DVL2, showed . 40% mean reduction with small variance at 10 nM (p,0.467), but an increased expression at 50 nM ( Figure 8B). Table 5. Total numbers of predicted targets and signal-to-noise (S:N) ratios among human (RefSeq) mRNAs for the 181 human miRNAs with an ortholog in mouse, with and without requiring that the orthologous mouse miRNA have a target (under the same second-stage criterion) in the orthologous mouse mRNA. Five of the target sites (hsa-miR-15a/TSPYL2, hsa-miR-15a/ BCL2, hsa-miR-17-5p/TNFSF12, hsa-miR-324-3p/CREBBP and hsa-miR-324-3p/WNT9B) exhibit perfect WC complementarity in the seed regions, while has-miR-324-3p/DVL2 has one GU pair in the same region (Supplementary Figure S1). Although the binding site of hsa-miR-324-3p/DVL2 has very low (stable) free energy, the presence of a GU pair in the seed region means that binding at the 39 end of the miRNA would have to be optimal to compensate (if indeed 39-end hybridization can compensate at all); in this case, however, two unpaired bases and two GU pairs are predicted in the 39 end, blocking even the possibility of adequate compensation.

Methodological approach
Computational prediction may be motivated by different aims, e.g. to annotate sequences, to guide experimental validation, or to describe a potential solution space. Common to these is the goal of generating a ranked list of candidates that contains most (ideally all) true interactions and excludes most (ideally all) noninteractions. Biological relevance may, however, be contingent on features other than those being maximized or minimized. Animal miRNAs, for example, often bind to mRNAs with suboptimal complementarity as assessed by string-matching [1] Table 6. For three miRNAs, our predictions (condition: perfect Watson-Crick complementarity at nt 2-7) on targets experimentally validated by Miranda et al. [9]. Of 158 genes experimentally tested for regulation by miR-134, 85 occur in our database, as do 14 of 24 tested for regulation by miR-296, and 22 of 44 tested for regulation by miR-375. As Miranda et al. set different criteria for target prediction (e.g. allowing GU pairs and a mismatch in the seed regions) and selected sites for validation on that basis, not all of their validated sites could have been discovered by our approach. To correct (to the extent possible) for this difference, we excluded a further 20 target sites with GU pairs and mismatches in the seed region for miR-134, and 8 target sites for miR-296 (all target sites for miR-375 have WC matches in the seed region), and report the results of 65 target genes examined for miR-134, 6 for miR-296 (for miR-296, all six sites examined were validated), and 22 for miR-375. Miranda et al. also give details of predicted sites that failed validation (i.e. were false positive predictions); in principle this makes it possible for us to calculate the specificity of our approach. In some cases, our 40% free energy threshold is more-stringent than that of Miranda et al. because in vivo, potential binding sites are arrayed dynamically within double-stranded regions, single-stranded loops, and bulges that are imperfectly represented as strings but nonetheless contribute to duplex stability. Other features, e.g. post-translational editing of miRNAs and spatiotemporal co-location within the cell, may be even more-difficult to represent, or be unknown.
Here we present a scalable approach and software for predicting miRNA-mRNA interactions and interaction sites, based on minimizing the free energy of duplex structure. Our method finds and can report not only the energetically optimal duplex, but also arbitrarily many energetically suboptimal structures, in genomescale data. In the specific context at hand, for each miRNA we initially identify a large set of candidate sites, and subsequently remove those that fail to meet additional criteria that reflect biologically relevant parameters of known or inferred miRNA-mRNA interactions. In other contexts, the FASTH software might better be paired with machine-learning or other technologies, e.g. to assess the relative contributions of specified parameters.
Short regions perfectly complementary to mRNA targets are preferentially found in the 59 end of miRNAs In many validated miRNA-mRNA interactions, regions of 6-7 contiguous nucleotides having (near-)perfect WC complementarity to the mRNA target occur at the 59 end of the miRNA. Our results confirm this observation, and extend it to energetically favorable potential target sites not only for validated interactions but more broadly. Sequence motifs complementary to the 59 end of miRNAs are preferentially distributed in mRNAs, compared to motifs that are complementary to the 39 end.

Energetically favorable binding sites are not restricted to 39UTRs
Miranda et al. predicted that a significant number of targets occur in 59UTRs and CDS and hypothesized that many are functional, although they could not estimate a false positive rate of prediction [9]. We likewise find many high-quality binding sites in all mRNA regions (59UTR, CDS and 39UTR), with the number of sites and S:N ratio of prediction stable over the range of filtering parameters examined. For the human and mouse transcriptomes, target abundance roughly matches availability (length of mRNA regions). As false-positive as well as true-positive target sites in higher-GC regions tend to have lower (more-favorable) energies (above), additional filtering or weighting may be required to predict target sites in these regions.

Binding sites are often reassigned or lost in alternative transcript isoforms
At least in mouse [31] and human [32], most protein-coding genes are transcribed in different isoforms. Here we report that among isoform sets of multi-transcript human genes, almost 10% of predicted miRNA target sites are situated in one mRNA region in one isoform but in a different region in another isoform, i.e. reassigned between isoforms, as the result of differential splicing. Within those transcript isoform sets having exactly two isoforms, the frequency of such reassignment is some 76% (6.0%/3.4%) greater than expected under a simple transcript length-proportional model. In almost all such cases, reassignment is observed between the 59UTR and CDS, or between the CDS and 39UTR. We also find that more than 20% of our predicted miRNA target sites are present in some but not all isoforms that arise from the same transcriptional region. As many more mRNA isoforms exist than are represented in existing databases, the instances of miRNA binding site reassignment or gain/loss that we predict here almost certainly under-represent the true number in human (and, by extension, other complex animals) that arise as a consequence of transcriptional complexity.

Implication for transcriptional regulation
Some investigations have tried to capture the intersection of miRNA and transcriptional regulation. For example, Stark et al. reported that in Drosophila, miRNAs and genes encoding predicted miRNA targets are expressed in a largely mutually exclusive manner, and that many genes for basic cellular processes have short 39UTRs and thereby avoid miRNA regulation [33]. Farh et al. reported that mRNAs expressed in the same tissue as miRNAs are selectively depleted in sites that match these miRNAs [34].
Our results show that high-quality potential miRNA binding sites exist in all regions of mRNAs. Indeed, in a complex transcriptional program such as that of human or mouse, there is some ambiguity in assigning a target site to a specific mRNA region; as shown above, potential target sites are frequently reassigned among mRNA regions, or indeed lost or gained entirely, in alternative transcript isoforms, opening the possibility that miRNAs can mediate transcript-specific regulation, e.g. in different cell types, tissues or developmental conditions. The observed frequency of site involvement in gain/loss (22.4% of predicted sites) appears somewhat greater than expected under a simple length-proportional model (a maximum of 18.5% of nucleotides are similarly gained/lost, based on the shorter transcript in pairwise comparisons); further analysis is required to determine whether this difference is real, and if so whether it might reflect positive selection in these transcriptional regions or a subset thereof. For one of the experimentally validated miRNAs in this study, miR-17-5p, we predicted target sites in both known transcript isoforms of TNFSF12, but in the CDS of one and the 39UTR of the other.
Most predicted miRNA targets in human and mouse have no counterpart in the orthologous mRNA of the other species Predicted miRNA targets in human mRNAs that have orthologs in mouse show a higher S:N ratio than do targets in human mRNAs that lack mouse orthologs. Of these predicted sites, most (83-87%) occur in the same mRNA region in both species, i.e. most miRNA targets in orthologous mRNAs are in this sense themselves orthologous. However, limiting our analysis to mRNAs with orthologs in the other species diminishes the number of predicted sites by 72-81%: most potential miRNA target sites found by searching only human have no counterpart in an orthologous mouse mRNA (and vice-versa), implying that many miRNA-mRNA interactions are potentially taxon-specific. Similar or greater ratios of genome-specific to evolutionarily conserved potential target sites have been inferred by others [34].

Experimental validation of miRNA-mRNA interaction
We selected, for experimental validation, six miRNA-mRNA interactions predicted by our method, including one previously validated interaction [29] and another identified as a possibility [14]. Using a luciferase assay in HEK293 cells, we showed that four of these six transcripts are bound by endogenous miRNAs, with the corresponding mRNA level reduced by an average of 73% (range 59.6%-86.2%) when assayed at 50 nM miRNA. These four include the previously validated interaction, plus three of the five (60%) newly predicted interactions for which the mRNA level was reduced by an average of 77.5% (range 65.4%-86.2%) when assayed at 50 nM miRNA. A fifth construct showed a ,32% mean reduction, which could be biologically (function-ally) relevant but falls short of statistical significance as assessed by the p-value criterion.
Regardless of where the predicted target site is located in the mRNA, in our experimental protocol the (synthetic) site was inserted into the construct so it could be expressed from the 39UTR of luciferase mRNA. Thus we can validate that a target site is bound by endogenous miRNA and that the mRNA level is reduced as a consequence, but we do not specifically validate interaction with the site in its original location or native folded environment. Duplex formation is prerequisite to biological function, but these assays examine only the effect of binding on mRNA level, not its functional consequences.

Conclusions
It has been an open question whether low (strong) energy of the miRNA-mRNA duplex can serve as a good indicator of miRNA targets. We have developed and implemented an energy-based computational approach to miRNA target prediction, applied it to a standard set of known human mRNA sequences at wholetranscriptome scale, and experimentally measured the change in mRNA level for a small number of selected predictions. We showed that short (6-7 nt) sequence motifs matching mRNAs with perfect WC complementarity occur more frequently in the 59 end of miRNAs than in the 39 end, and that high-quality binding sites are present in all regions of mRNA (59UTR, CDS and 39UTR) in both human and mouse. We also present evidence that many high-quality potential miRNA binding sites become located in different mRNA regions, or are gained or lost altogether, depending on transcript isoform. These observations open the possibility that further complexity of genetic regulation arises at the interface among miRNAs, expression and alternative splicing in morphologically complex animals. Future directions include incorporating predicted mRNA structure at the target site, miRNA and mRNA expression levels, and correlations with protein expression and phenotype.

Materials and Methods
Data 313 mature human and 233 mature mouse miRNA sequences were obtained from miRBase release 7.0 [35] (www.sanger.ac.uk/ software/Rfam) and are available as Supplementary Table S8. mRNA and gene data were obtained from hgdownload.cse.ucsc.edu/downloads.html (file mrnaRefseq.txt) and represent 22947 unique mRNAs and 17751 protein-coding genes in human, and 17510 and 16627 in mouse. mRNAs were mapped to gene identifiers using the files refFlat and refLink also from UCSC, and coordinates of miRNA genes and miRNA targets were assigned based on the NCBI Human (hg17, Build 35) and Mouse (mm6, Build 34) Genome Sequencing Consortium genome builds. Relative sizes of mRNA regions (59UTR, CDS and 39UTR) were calculated using annotations in the files rna.gbff downloaded from NCBI (ftp.ncbi.nih.gov/refseq).

Computational approach
We implemented a two-stage prediction process. First we used FASTH to search for potential targets in the mRNAs by duplex free energy, and to rank the results by energy. Then in a second stage we remove (filter) results that fail to meet further criteria not expressed in terms of energy score per se, e.g. the minimum number of contiguous Watson-Crick base pairs in the so-called seed region, or the maximum number of unpaired bases, bulges and GU pairs. This second-stage filtering was implemented outside FASTH via Perl scripts. Detailed descriptions of our computational approach, including FASTH, are presented in Supplementary Text. Our target-site predictions with selected parameters in human and mouse are available as Supplementary Table S9, Supplementary Table S10,  Supplementary Table S11, and Supplementary Table S12.

Statistical description of results
To estimate the false positive rate of prediction we calculate a signal-to-noise (S:N) ratio, dividing the number of target sites predicted using known (empirical) miRNAs by the mean number predicted using controls. Our motivation for generating these controls by two different approaches, mononucleotide shuffling (MS) and a first-order Markov process (FOM), is presented in the Supplementary Text. For each known miRNA, the MS control set was constructed by random permutation, without replacement, of its nucleotides, preserving both length and nucleotide composition (but not dinucleotide frequencies, except by chance). The FOM control set, generated using Sean Eddy's 'shuffle' program in the Squid package (http://selab.janelia.org/software.html), preserves length and dinucleotide frequencies (but not mononucleotide count, except by chance). For each miRNA we generated 10 MS and 10 FOM control sequences. In all, 26 3130 control sequences were generated for human, and 262330 for mouse. These control sequences were then used to search against the target database under the same secondstage filtering conditions as for the real miRNA. Results are presented in Table 5 and Supplementary Table S1.
We additionally used, as a further set of controls, the shuffled sequences generated by Lewis et al. so as to preserve the expected frequency of random matching between miRNA seed sequences and complementary 39UTR sequences [11]. Results are presented in Table 5 and Supplementary Table S2.
Sensitivity is calculated as true positives/(true positives + false negatives), and specificity as true negatives/(true negatives + false positives).

Mapping target locations to regions of mRNAs
Numbers of predicted targets located in the three regions of mRNAs (59UTR, CDS and 39UTR) were determined by retrieving sizes (lengths measured by number of nucleotides nt) of each region in each mRNA from the rna.gbff files for human and mouse (above), and mapping target sites to these regions using the coordinates provided.
Modelling the expected frequencies of site reassignment and gain/loss among isoforms Genomic coordinates of the first and last nucleotides for each transcript region (59UTR, CDS, 39UTR, introns if any) in human were obtained from the refFlat file downloaded from UCSC (hgdownload.cse.ucsc.edu/downloads.html). The number of nucleotides in each mRNA was computed from these coordinates. For transcriptional regions (genes) with exactly two known isoforms in this file, comparison is straightforward. For genes with $3 isoforms, we made all pairwise comparisons, i.e. a gene with four isoforms requires 4*(421)/2 = 6 pairwise comparisons. To simplify calculation we set transcript length (including introns) equal to that of the longest transcript, so the total number of nucleotides shown in the Intron/Intron cells is slightly exaggerated (because missing nucleotides in the shorter transcripts are counted as introns); otherwise the computation proceeded as for binary transcript sets. In summarising results (Supplementary Table S3), pairwise comparisons relative to the longer member of each mRNA pair were grouped, as were comparisons relative to the shorter member of each pair. Corresponding calculations were also made correcting for edge effects (disregarding predicted miRNA target sites that overlap any nt position within 5 or 10 nt of the edge of a miRNA region); see Supplementary Table S3 for details. miRNA target sites with orthologs in mouse Most of the 233 miRNAs known in mouse have counterparts extremely similar in sequence (i.e. putative orthologs) among the 313 miRNAs known for human. From among these we extracted the 181 pairs that have identical sequences at positions 1-8 inclusive (Supplementary Table S13). The sequence of each human miRNA was randomized, and the S:N ratio calculated, as described above. Given the computational complexity of the target search, we used the same sets of random sequences to search both human and mouse mRNA databases.
Ortholog pairs between the human and mouse mRNA sets were identified using BioMart (www.biomart.org). A miRNA target site (e.g. in human) is considered to have an ortholog in the other species (e.g. mouse) if the miRNAs and targeted mRNAs are orthologs in the two species. For example, the human miRNA has-miR-378 is predicted to have a target site in the mRNA corresponding to human gene BCL7A, while mmu-miR-378, the mouse ortholog of has-miR-378, has a target site in the mRNA for Bcl7a (the orthologous gene in mouse). We thus consider the has-miR-378 target site in BCL7A to have an ortholog in mouse, and infer that regulation of these genes by this miRNA has been evolutionarily conserved between human and mouse.

GO terms
GO terms were retrieved and analysed using DAVID [36].

Experimental validation
To determine the effect of each miRNA on its predicted targets, synthetic oligonucleotides corresponding to 60 nt around the target sequence were cloned into the SpeI and HindIII sites of pMIR-REPORT Luciferase (Ambion). All constructs were verified by sequencing. HEK293 cells were maintained in DMEM (GibcoBRL) containing 10% (v/v) foetal calf serum in a 5% CO 2 atmosphere at 37uC. Cells were transfected using Effectene (Qiagen) according to manufacturer's instructions. In each well, 5610 4 cells were co-transfected with 100 ng of a pMIR-REPORT Luciferase construct, 100 ng of pMIR-REPORT b-galactosidase (Ambion), and either 10 or 50 nM of the appropriate pre-miR miRNA precursor (Ambion). After transfection, cells were incubated for 42 hours prior to harvesting.
Luciferase activity was measured in the presence of either the test miRNA or a control miRNA. p-values were derived using Student's t-test on the mean and standard deviations of the normalized luciferase signal of test miRNAs, and on the negative control miRNAs for the same reporter construct. Before averaging, luciferase activity was normalized to the corresponding signal from the b-galactosidase reporter. Luciferase activity was assayed using the Luciferase Assay System (Promega) and detected on a Wallac 1420 luminometer (Perkin Elmer). b-galactosidase activity was determined using the b-Galactosidase Enzyme Assay System (Promega) and detected on a PowerWave XS spectrophotometer (BioTek). Each assay was repeated at least three times.
Predicted mRNA targets were selected for experimental validation based on our groups' ongoing interest in cancer and in Wnt signalling, as described further in the text.

Availability
The FASTH source code is available by request from MZ (zukerm@rpi.edu). It is configured to run in a Unix or Linux environment.