ExprTarget: An Integrative Approach to Predicting Human MicroRNA Targets

Variation in gene expression has been observed in natural populations and associated with complex traits or phenotypes such as disease susceptibility and drug response. Gene expression itself is controlled by various genetic and non-genetic factors. The binding of a class of small RNA molecules, microRNAs (miRNAs), to mRNA transcript targets has recently been demonstrated to be an important mechanism of gene regulation. Because individual miRNAs may regulate the expression of multiple gene targets, a comprehensive and reliable catalogue of miRNA-regulated targets is critical to understanding gene regulatory networks. Though experimental approaches have been used to identify many miRNA targets, due to cost and efficiency, current miRNA target identification still relies largely on computational algorithms that aim to take advantage of different biochemical/thermodynamic properties of the sequences of miRNAs and their gene targets. A novel approach, ExprTarget, therefore, is proposed here to integrate some of the most frequently invoked methods (miRanda, PicTar, TargetScan) as well as the genome-wide HapMap miRNA and mRNA expression datasets generated in our laboratory. To our knowledge, this dataset constitutes the first miRNA expression profiling in the HapMap lymphoblastoid cell lines. We conducted diagnostic tests of the existing computational solutions using the experimentally supported targets in TarBase as gold standard. To gain insight into the biases that arise from such an analysis, we investigated the effect of the choice of gold standard on the evaluation of the various computational tools. We analyzed the performance of ExprTarget using both ROC curve analysis and cross-validation. We show that ExprTarget greatly improves miRNA target prediction relative to the individual prediction algorithms in terms of sensitivity and specificity. We also developed an online database, ExprTargetDB, of human miRNA targets predicted by our approach that integrates gene expression profiling into a broader framework involving important features of miRNA target site predictions.


Introduction
Gene expression is a fundamental phenotype that affects complex cellular, physiological and clinical phenotypes including disease risk as well as individual response to therapeutic treatment. For example, gene expression alterations have been implicated in the etiologies of common diseases such as cancers [1][2][3], cardiovascular diseases [4], and psychiatric disorders [5]. Previous studies using the International HapMap Project [6,7] lymphoblastoid cell lines (LCLs) derived from individuals of European (CEU: Caucasians from Utah, USA), African (YRI: Yoruba people from Ibadan, Nigeria) and Asian (CHB: Han Chinese from Beijing, China; JPT: Japanese from Tokyo, Japan) ancestry have shown that common genetic variants including single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) account for a substantial fraction of variation in gene expression within a population and between populations [8][9][10][11][12][13][14]. Furthermore, pharmacogenomic studies based on these HapMap cell lines [6,7] strongly suggest that response to therapeutic treatment is likely to be a complex phenotype affected by genetic factors that alter gene regulation, especially in the form of eQTLs (expression quantitative trait loci) [15][16][17].
In addition to eQTLs, more recently, microRNAs (miRNAs) (,700 known in humans to date), a family of small (21-23 nucleotides), single-stranded, non-coding RNAs, have been shown to be an important class of gene regulators that generally downregulate gene expression through sequence-specific binding to the 39 untranslated regions (UTRs) of target mRNAs [18]. In humans, miRNAs are predicted to potentially target up to one third of protein-coding genes [19]. Global microRNAome profiling has demonstrated significant changes in the expression of multiple miRNAs in a growing list of human diseases, including neurodegenerative diseases [20], heart diseases [21] and cancer [22]. Because an individual miRNA may regulate multiple mRNAs, a comprehensive and reliable catalogue of miRNA gene targets should enhance our understanding of the complexity of gene regulatory networks in a cell, as well as facilitate the shifting of focus from miRNA gene identification to functional characterization. Due to the lack of high throughput experimental technique for identifying the targets of miRNAs, only a small proportion of the targets of the potentially more than 1000 human miRNAs -such as those in TarBase [23,24] (a manually curated database of experimentally supported miRNA targets) -have been confirmed experimentally. Therefore, several computational and bioinformatic approaches have been developed for large-scale prediction of miRNA targets including miRanda [25,26] (based on sequence complementarity, free energy of the RNA-RNA duplex, extent of conservation), TargetScan [27][28][29] (based on seed complementarity, thermodynamic free energy of binding, conservation over different species), and PicTar [30] (based on seed complementarity, thermodynamics and a combinatorial prediction for common targets in sets of coexpressed miRNAs). The various computational prediction algorithms have been found to suffer from significant false positive and false negative rates. For example, the miRanda algorithm [25,26] was estimated to have a high false-positive rate at 24-39% in an early study [31]. A high false-negative rate is also expected for the current miRNA target prediction programs, largely due to their requirements for evolutionary conservation, while many of the experimentallysupported targets (e.g., those from TarBase [23,24]) may not be conserved in other species [32]. Though using combinations of two or more of these computational approaches may improve the sensitivity or specificity of the prediction, the degree of overlap of their predictions is poor [32]. To date, a few algorithms have been developed to leverage existing approaches involving relevant miRNA binding site considerations such as thermodynamics, sequence complementarity, conservation, and gene expression profiles [32]; however none of these algorithms provide a genomewide map of predictions that utilizes post-transcriptional regulation within a broad framework of relevant target site prediction features as well as utilizes the experimentally validated binding sites as training set.
Though experimental testing is critically important for validating any putative miRNA targets, we propose here a novel bioinformatic approach, ExprTarget ( Fig. 1), to predicting human miRNA targets by integrating select computational algorithms and our recently-generated miRNA expression data and previously published mRNA data [33] on 58 unrelated HapMap CEU LCLs [6,7]. We demonstrate that a significant improvement of performance in terms of sensitivity and specificity can be achieved by integrating both the computational algorithms as well as the experimental miRNA expression data. Furthermore, the study we report here of miRNA-mRNA relationships in the HapMap samples [6,7] extends, in the direction of miRNA-mediated gene regulation, earlier studies on these same samples that have successfully been used as models for studies of complex traits [34] and for pharmacogenomic studies [15][16][17]. We developed an online resource, ExprTargetDB (http://www.scandb.org/apps/ microrna/) to: 1) enable user-friendly queries of a comprehensive catalogue of miRNA targets using this integrative approach; 2) provide a reference dataset of miRNA-mRNA relationships on the HapMap samples and; 3) advance our understanding of gene regulatory networks and of the contribution of miRNAs to complex traits.

Pair-wise comparisons between prediction algorithms
Pair-wise comparisons show that the miRNA targets predicted by different algorithms are generally not correlated ( Fig. 2A, 2B, 2C) in the sense that good candidates for one algorithm do not tend to be good candidates for the other algorithms. PicTar [30] scores, TargetScan [27][28][29] scores and miRanda [25,26] scores are not correlated for the same miRNA targets predicted by these algorithms. We compared the distribution of experimentallyvalidated targets (from TarBase [23,24]) using the targets' miRanda [25,26] scores (Fig. 2D), PicTar [30] scores (Fig. 2E) and TargetScan [27][28][29] scores ( Fig. 2F) with the corresponding overall distribution of scores. This analysis enables us to compare, for each score bin defined by an algorithm, the proportion of target site predictions and the proportion of experimentally validated predictions falling within the bin; particularly, it shows the proportion of experimentally supported targets that overlap with the highest scoring bins.

Performance of individual prediction algorithms
Individual prediction algorithms were evaluated for performance (see Materials and Methods) using either TarBase [23,24] alone or TarBase [23,24] combined with the expressioncorroborated miRNA targets (TarBase+LCL) as gold standard. The receiver operating characteristic (ROC) curves, which plot the true positive rate (sensitivity) and false positive rate (1-specificity), for the individual algorithms are shown in Fig. 3. Each point on the ROC curve is a specificity/sensitivity pair corresponding to a score threshold. A comparison of the diagnostic performance of the three prediction algorithms using only TarBase [23,24] as gold standard is shown in Fig. 3 (Fig. 3A, 3B, 3C). The same analysis was repeated using (TarBase+LCL) as gold standard to gauge the effect of the choice of gold standard; the results for the three prediction approaches are shown in Figure S1.
Integrating different algorithms and miRNA expression in miRNA target prediction By incorporating the computational approaches included in this study (miRanda [25,26], PicTar [30] and TargetScan [27][28][29]) as well as the expression-based prediction model into an integrative model ExprTarget based on sigmoidal modeling (see Materials and Methods), we observed that the performance in terms of sensitivity and specificity is greatly improved using TarBase [23,24] as gold standard (Fig. 4). A similar ROC-based performance evaluation shows that ExprTarget is a much better classifier in discriminating the experimentally verified targets from the non-experimentally supported ones than the random guessing procedure indicated by the line of no-discrimination ( Fig. 4) or, indeed, any of the existing computational solutions (Fig. 3) evaluated in this study. Reduced models (e.g., the combination of two individual methods as well as the expression-based approach) lead to decreased predictive performance, as measured by the area under the ROC curve.
Restricting the gold standard to the subset of TarBase [23,24] that excludes the high-throughput assays, we observed that the improvement in predictive performance for ExprTarget relative to the individual methods continues to hold robustly ( Figure S2). From each computational algorithm incorporated by ExprTarget, a parameter that quantifies the confidence of each binding site is used so that the final model is target-site based. The use of a score from the expression data that is gene-based rather than target-site based (e.g., when the score from the expression data is defined as the minimum of all p values for the gene) demonstrates the robustness of our primary finding, namely, incorporating these algorithms through sigmoidal modeling improves predictive performance ( Figure S3). To evaluate whether the fitted model can be generalized to as-yet-unseen data (given that ExprTarget uses the experimentally validated targets as training set), we proceeded to do cross-validation (see Materials and Methods) on ExprTarget. Crossvalidation enabled us to evaluate how well our predictive model, which was defined by the use of training data, would perform on future data. Using 10-fold cross validation, ExprTarget resulted in a mean prediction error of 0.000277. ExprTargetDB, a database for human miRNA targets We developed a web-based database, ExprTargetDB (http:// www.scandb.org/apps/microrna/), to provide user-friendly queries of our miRNA-mRNA association data in the context of other computationally-predicted miRNA targets including miRBase [25,26], TargetScan [27][28][29] and PicTar [30] as well as the experimentally-supported miRNA targets [23,24]. The miRNA expression profiling is, to our knowledge, the first such reported dataset on the HapMap samples. Though the initial dataset was generated from the samples of European descent only, ExprTar-getDB will hold the results assayed from other HapMap populations. Since previous studies on the HapMap samples have  shown that complex traits [34] and pharmacologic phenotypes [15][16][17] are affected by genetic variants that alter gene regulation, this dataset on miRNA-mediated gene regulation should be a tremendous resource to the scientific community. ExprTargetDB supports downloading of the complete list of expression-corroborated miRNA targets. In addition, ExprTargetDB can be queried by either miRNA or gene target, in single or batch mode. A query of miRNA should utilize the nomenclature of the miRBase [25,26] (e.g., ''hsa-miR-138''), while a query for gene target should use the official gene symbols (e.g., ''PAPD5''). A successful search outputs a table of miRNA targets (miRNA-centric) or miRNAs (gene targetcentric). A link to our SCAN database (SNP and CNV Annotation, http://www.scandb.org/) [35] is also provided for more information on eQTLs of the gene targets to supplement the information on miRNA-mediated gene regulation. A search example is discussed in the online tutorial.

Discussion
The identification of the transcript targets regulated by miRNAs promises to greatly enhance our understanding of gene regulation and to provide important insights into the genetic/epigenetic basis of various human diseases such as those that have been found to be associated with altered or abnormal miRNA expression [20][21][22]. As a result of the multiplicity of computational algorithms that have been developed, two distinct problems arise: (1) the problem of inter-method reliability, and (2) the problem of integrating the results obtained from the various methods into a single optimal score. The present study conducted a comparative analysis of the most frequently used target prediction algorithms and developed an integrative approach, with certain analytically attractive properties, that improves the predictive performance in relation to the foundational prediction methods.
The problem of inter-method reliability is concerned with the degree of agreement between the various computational methods and with the predictive performance of each algorithm. Many available computational prediction methods take advantage of the biochemical or thermodynamic properties of the binding between miRNAs and their cognate mRNA transcripts. For the three frequently used algorithms we tested (i.e., miRanda [25,26], PicTar [30], TargetScan [27][28][29]), pairwise comparisons reveal that their prediction results are generally not correlated. This unanticipated lack of correlation between these prediction algorithms presumably reflects the differing emphasis on biochemical/thermodynamic factors as well as the extent of the use of evolutionary conservation. Surprisingly, we found very different distributional patterns for the existing prediction approaches when comparing the experimentally-validated miRNA targets (from TarBase [23,24]) predicted by an algorithm with the full set of targets for the same algorithm. For example, Fig. 2D shows that the distribution of miRanda [25,26] scores for predicted targets is similar to the distribution of miRanda [25,26] scores for only the experimentally supported targets; each is bell-shaped and peaks in the same score bin. In contrast, the distributions of the scores from the other two methods show quite distinct patterns.
An ROC analysis using TarBase [23,24] on each of the existing computational methods would seem to suggest that TargetScan [27][28][29] may yield slightly better performance than the other computational methods, based on the area under the ROC curve (AUC). However, caution must be exercised in the interpretation of these results. We sought to evaluate the effect of the choice of gold standard on the performance evaluation. A comparative ROC analysis for the three methods, using either TarBase [23,24] or (TarBase+LCL), shows the dependence of the performance evaluation on the use of a different gold standard. TarBase [23,24] is a manually-curated database with results that assume particular prediction approaches and that were derived during the experimental verification of a prediction algorithm; on the other hand, it has also been shown that a substantial proportion of TarBase predictions are non-conserved target sites and would not have been predicted by computational approaches that assume evolutionary conservation [32]. Due to these biases, it is informative to assess the performance of the individual computational approaches using another benchmark. The miRNA-mRNA associations generated from our HapMap LCL [6,7] data are genome-wide, facilitate a comparative analysis, and may serve as a reference dataset on the regulatory effects of miRNAs. However, this gold standard could reflect the possible biases from the highthroughput expression microarrays.
The problem of integrating existing computational approaches is concerned with identifying a parsimonious model, which utilizes the accumulated knowledge base of experimentally supported miRNA target sites in defining the training algorithm. The parsimonious model is selected from a potential hierarchy of models derived from various combinations of existing algorithms. ExprTarget is inspired by a multivariate logistic regression model for a binary outcome and predictor variables from existing computational solutions. ExprTarget has certain attractive features besides the relevant criterion of parsimony. In relation to existing computational methods, ExprTarget shows a greater predictive power to discriminate the experimentally verified targets based on ROC curve analysis. To gauge the accuracy, we conducted K-fold cross-validation (K = 10) on our proposed algorithm, resulting in a mean prediction error of 0.000277. There are, of course, other ways of combining individual computational approaches such as taking various unions or various intersections of the corresponding result sets. However, while unions of computational approaches may achieve a higher level of sensitivity than the individual approaches, this gain comes at the cost of a reduction in specificity. In the same vein, while intersections of computational approaches may achieve a higher level of specificity, they also generally achieve a much reduced sensitivity [32]. Alternatively, non-linear statistical models may provide an excellent fit; however, the parameters in these models are often difficult to interpret. The application of a complex model with many degrees of freedom to produce a satisfactory fit may come at the cost of failure to replicate in as-yet-unseen data. The model we propose in ExprTarget enables us to evaluate the relative importance of the predictors in miRNA target site prediction.
We acknowledge that miRNA-mediated regulation of gene expression may be tissue-specific and population-specific. Our miRNA expression data on LCLs represent one tissue type in one population (the CEU samples); therefore, it is likely that we might miss some information on miRNAs not expressed in these samples or differentially expressed between human populations. Nevertheless, the approach we propose here is completely extensible. With the availability of more data sets on other tissues (e.g., liver) and the development of new methods that may utilize novel characteristics of the miRNA-mRNA relationships, ExprTarget is amenable to refinement to guide the prediction of miRNA targets.
Finally, the prediction results of our integrative approach are provided through an online searchable database, ExprTargetDB, which can also be used to compare the prediction results (i.e., p values, scores) from the various computational solutions we tested in this study (i.e, miRanda [25,26], PicTar [30], TargetScan [27][28][29], TarBase [23,24], and our LCL data). In particular, the expression dataset in ExprTargetDB constitutes the first reported study of miRNA-mediated gene regulation using the HapMap samples [6,7] that have been so important in transcriptome-based studies of complex traits [34] and pharmacologic phenotypes [17]. Given the role of miRNAs on the regulation of expression, we also provide additional relevant information (e.g., eQTL annotation through the SCAN database [35]) in ExprTargetDB to inform studies of the complex networks of gene expression. ExprTar-getDB has been designed to seamlessly integrate into existing genomic and pharmacogenomic resources (e.g., PharmGKB [36]).

MiRNA profiling of unrelated HapMap LCLs
MiRNA expression was measured in unrelated HapMap LCLs [6,7] including 58 CEU (Caucasians from Utah, USA) samples using the Exiqon miRCURY TM LNA Array v10.0 (,700 human miRNAs, updated to miRBase 11.0 annotation [25,26]) (Exiqon, Inc., Denmark). The HapMap cell lines [6,7] were purchased from Coriell Institute for Medical Research (Camden, NJ). The details for cell line culture were described in a previous publication [10]. Total RNA was extracted using miRNeasy Qiagen Kit (Qiagen, Inc., Valencia, CA) according to manufacturer's protocol. Array hybridization was performed by Exiqon. The quantified signals were background corrected using normexp with offset value 10 based on a convolution model [37] and normalized using the global Lowess regression algorithm. In total, 225 miRNAs were found to be expressed in these samples (unpublished data).

Expression-supported human miRNA targets
Associations between the 225 expressed miRNAs and potential gene targets were evaluated using the miRNA profiling data and our previously-generated Affymetrix Human Exon 1.0ST array data on the same cell lines (,10,000 mRNA transcripts with reliable expression) (NCBI Gene Expression Omnibus Accession: GSE9703) [33]. Association analyses in the CEU samples were carried out using the lm function of the R Statistical Package [38].

Experimentally-supported human miRNA targets
TarBase [23,24] (http://diana.cslab.ece.ntua.gr/tarbase/), which houses a manually curated collection of experimentally tested miRNA targets in a variety of species including human, mouse and several other model organisms. Each target site is described by the miRNA that binds it, the gene in which it occurs, the nature of the experiments that were conducted to test it, the sufficiency of the site to induce translational repression and/or cleavage, and the paper from which all these data were extracted. The current TarBase [23,24] v.5c (June, 2008) covers 1122 distinct miRNA-gene pairs for 143 human miRNAs.
ExprTarget: An integrative approach to miRNA target prediction We constructed a computational prediction model that integrates select miRNA computational algorithms using the experimentally validated targets as training data. Following a multivariate logistic regression model, let the x j,i be the prediction score of algorithm j on the i-th miRNA target site prediction: log it(p i )~ln ( p i 1{p i )~b 0 zb 1 x 1,i z:::zb k x k,i where p i defines the score for ExprTarget that may be derived from a sigmoidal transformation of a weighted sum of the scores from select computational algorithms or target site features: Pictar [30], TargetScan [27][28][29], miRanda [25,26], and our HapMap-based expression microarray dataset were selected as providing the x j,i , although the approach we describe here may extend to a larger list provided certain assumptions are met. Estimated from the data, each beta b k describes the size of the contribution from a target site feature (encapsulated by the prediction algorithm k). The model, as stated, characterizes the relationship between the various predictors and a binary variable, expressed as a probability, showing whether the prediction is experimentally supported or not. A high absolute magnitude for the parameter b k implies that the incorporation of the respective target prediction feature increases the probability of experimental support. The PicTar [30] score, which has a maximum = 1000, has been normalized to lie between 0 and 1 by dividing by this maximum value. The score for the expression-corroborated miRNA targets is set to the p value for the general linear model between miRNA and mRNA normalized (log 2 -transformed) expression intensities, provided the estimated coefficient is negative; otherwise, the expression-based score is set to 1. For the miRanda [25,26] contribution, we used the algorithm's Score, which is based on complementary base pairing as well as the presence of mismatches, gap-opening, and gap-extension. For TargetScan [27][28][29], we utilized the probability of preferentially conserved targeting (P CT ), a Bayesian estimate of the probability of site conservation due to selective maintenance of miRNA targeting. Alternatively, one could utilize the TargetScan context score, which is meant to provide complementary information to P CT and is derived from information orthogonal to site conservation [27][28][29]; however, our analysis shows that the use of context scores does not lead to higher predictive performance based on AUC (see next section). The change in ExprTarget score p with respect to the change in score x i for a given prediction method is given by:

Performance evaluation using ROC curve analysis and cross-validation
The various prediction algorithms were assessed using the receiver operating characteristic (ROC) curves, which plot the true positive rate (sensitivity) and the false positive rate (1-specificity) for a binary classifier at various score thresholds [39]. The performance of a prediction algorithm can be evaluated using the area under the ROC curve (AUC). The area estimate V has the following standard error (SE): where n A and n N are the number of ''failures'' and ''successes'', h is the ''true'' area, and Q 1 and Q 2 are distribution specific quantities [40]: Two prediction algorithms with AUC values V 1 and V 2 may be compared. We can determine the standard error for the difference in areas as follows [41]: SE(V 1 {V 2 )~(SE 2 (V 1 )zSE 2 (V 2 ){2rSE(V 1 )SE(V 2 )) 1 2 where r is the estimated correlation between V 1 and V 2 . The test statistic for the comparison of areas is defined as follows: *N(0,1) Since the experimentally supported targets in TarBase may represent biases (e.g., relative to sequence conservation), we chose to evaluate the performance of each algorithm (miRanda [25,26], Pictar [30] and TargetScan [27][28][29]) using TarBase [23,24] first as gold standard followed by using TarBase [23,24] and expressioncorroborated predictions (TarBase + LCL) as gold standard. We also evaluated the predictive performance of ExprTarget using as gold standard the subset of TarBase [23,24] that includes only the target sites validated by individual experiments rather than by high-throughput assays.
To further assess the predictive power of our integrative approach, we conducted K-fold cross-validation on the training algorithm in ExprTarget. This method partitions the entire dataset into K subsets or folds F i , i = 1,2,…K. One of the K subsets is used as a validation set, F test where test {1,2,…,K}, and the remaining K21 subsets are combined into a training set F train to fit our weighted logistic regression model. The training algorithm is repeated K times so that each subset is used as a validation set once, and an average error is calculated. All observed data are thus used in the training and the validation. We utilized the cv.glm function for cross-validation in generalized linear models in the boot package available in R [38]. For K = 10, the mean error was 0.00027725 while for K = 3, the mean error was 0.00027719. Figure S1 Individual performance of foundational prediction algorithms using (TarBase + LCL) as gold standard. The three prediction algorithms were evaluated using ROC curve analysis, using (TarBase + LCL) as benchmark. Found at: doi:10.1371/journal.pone.0013534.s001 (2.30 MB TIF) Figure S2 When the subset of TarBase that excludes the highthroughput assays is used as gold standard, the improvement in predictive performance for ExprTarget relative to the individual methods continues to hold robustly. Found at: doi:10.1371/journal.pone.0013534.s002 (1.02 MB TIF) Figure S3 The use of a score from the expression data that is gene-based rather than target site-based (e.g., the score is defined as the minimum of all p values for miRNA correlations with the gene) shows that the incorporation of the individual algorithms improves predictive performance.