The authors have declared that no competing interests exist.

Conceived and designed the experiments: BG DH CC. Analyzed the data: BG. Contributed reagents/materials/analysis tools: BG CC. Wrote the paper: BG DH.

MicroRNAs (miRs) are known to play an important role in mRNA regulation, often by binding to complementary sequences in “target” mRNAs. Recently, several methods have been developed by which existing sequence-based target predictions can be combined with miR and mRNA expression data to infer true miR-mRNA targeting relationships. It has been shown that the combination of these two approaches gives more reliable results than either by itself. While a few such algorithms give excellent results, none fully addresses expression data sets with a natural ordering of the samples. If the samples in an experiment can be ordered or partially ordered by their expected similarity to one another, as in time series or studies of developmental processes, stages, or types (e.g., cell type, disease, growth, aging), there are unique opportunities to infer miR-mRNA interactions that may be specific to the underlying processes, and existing methods do not exploit this. We propose an algorithm that specifically addresses [partially] ordered expression data and takes advantage of sample similarities based on the ordering structure. This is done within a Bayesian framework that specifies posterior distributions, and therefore statistical significance, for each model parameter and latent variable. We apply our model to a previously published expression data set of paired miR and mRNA arrays in five partially ordered conditions, with biological replicates, related to multiple myeloma, and we show how considering potential orderings can improve the inference of miR-mRNA interactions, as measured by existing knowledge about the involved transcripts.

MicroRNAs (miRs) are short RNA sequences which are known to affect expression of messenger RNA (mRNA), often by binding to complementary sequences and either inhibiting translation or directing cleavage of that mRNA. A large database of miR information and annotation can be found at

Recently, some of the most successful attempts to identify likely target pairs have integrated expression data (most often from microarrays) with sequence-based target prediction algorithms that consider the binding affinities between a particular miR and a complementary or near-complementary section of an mRNA sequence. Each data source by itself is prone to error: expression data are noisy, correlation does not imply causation, and prediction algorithms are rife with false positives. However, the combination of information from two very different sources has led to vast improvements in the ability to identify likely candidate target pairs. A nice review of the topic can be found in

Most algorithms that combine target predictions with expression data require such data for both miRs and mRNA, but even when miR expression data are unavailable, it is possible to infer miR activity and effective regulation under various experimental conditions using gene expression data and calculated binding strengths from target prediction algorithms

When miR expression data is available,

Another Bayesian model proposed by Stingo,

In some cases, a basic Pearson correlation is used to rank putative targets, possibly in combination with prediction algorithms

In this paper, we propose a Bayesian model for inferring miR-mRNA targeting interactions based on target prediction algorithms and expression data, which we fit using variational Bayesian methods like the

However, in contrast to one or more of the aforementioned algorithms, our model:

considers both positive and negative interactions between miR and mRNA.

uses a normal distribution to characterize interaction strength.

optimizes the weights/coefficients placed on sequence/prediction information via the same variational Bayesian algorithm that estimates the rest of the model parameters.

accounts for data replicates, biological or technical, and propagates uncertainty throughout the model parameter estimates.

can consider a partial ordering of the samples.

With respect to these points, we enumerate how the three algorithms described–

All three algorithms consider only negative interactions, but we chose to consider both positive and negative interactions since some positive indirect effects may, in some cases, better explain changes in expression values than negative effects only

We chose to use a normal distribution to characterize the interaction coefficients where

Both our model and the Stingo model estimate the influence of external target prediction information in the same manner as other parameters (variational Bayes and MCMC, respectively) while

Based on their descriptions and implementations, none of the algorithms explicitly account for technical/replicate variance or otherwise allow for grouping of samples without taking their average value before starting the algorithm.

With the exception of the Stingo model, which in its “time-variant” version allows some interaction parameters to change over time, none of the models considers an ordering of the expression samples.

In the following sections, we specify our model and demonstrate its ability to reliably infer miR-mRNA interactions in an expression data set of samples taken from multiple myeloma patients in different stages of the disease. We use the

We have developed a Bayesian model of miR-mRNA interactions for matched expression data (

We have developed a model for partially ordered samples because prior work on using expression data to infer miR-mRNA targeting interactions has focused on methods that do not depend on the order of the samples, the most common of which is Pearson correlation. We find it both theoretically and practically desirable to consider an ordering of samples because in most cases we expect that samples whose sources are more similar (in this case by disease type or stage) should also have the most similar expression values. If we consider such a natural ordering, we should be more likely to infer significant targeting interactions that occur from one stage to the next but whose expression levels are not necessarily the most correlated throughout the entire data set.
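The intuition in this paragraph can be illustrated with a small, hedged sketch. It is not the model proposed in this paper: it simply contrasts Pearson correlation over all samples with the correlation of changes between consecutive samples in a fully ordered toy series. The data are hypothetical, and the use of first differences as a stand-in for "stage-to-stage" effects is an assumption made only for illustration.

```python
import math

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def diffs(x):
    """Changes between consecutive samples of an ordered series."""
    return [later - earlier for earlier, later in zip(x, x[1:])]

# Hypothetical fully ordered data: the miR alternates between two levels,
# while the mRNA is repressed by the miR but also drifts upward for
# reasons unrelated to the miR (the 2.0 * t term).
mir = [0.0, 1.0] * 4
mrna = [-m + 2.0 * t for t, m in enumerate(mir)]

r_all = pearson(mir, mrna)                   # ignores the sample ordering
r_steps = pearson(diffs(mir), diffs(mrna))   # uses stage-to-stage changes

print(f"correlation over all samples:    {r_all:+.3f}")
print(f"correlation of consecutive steps: {r_steps:+.3f}")
```

In this toy series the unrelated drift masks the repression when all samples are pooled, whereas the step-to-step changes recover the negative relationship. Real data are noisier, and differencing also amplifies measurement noise, so this should be read only as motivation for order-aware modeling.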

Let us consider a simple example of how using an ordering of data can help infer interaction coefficients. Assume a fully-ordered data set of

With the assumed perfect negative correlation and unit standard deviation, both estimates (2) and (5) for

We can also generalize a bit from these simple examples. Since the summation in (8) is an approximation of the variance of the

In addition to the possibility of higher statistical significance for highly varying miRs, considering the order of samples in an experiment allows us to detect positive or negative trends in expression value with respect to the process being investigated, a feature that may prove useful in identifying the main drivers of a developmental process such as disease, growth, or aging. The model also includes scores from existing prediction algorithms for miR-mRNA targeting to better determine the existence of a targeting interaction. In this paper, we have used data (including prediction scores) from the

Below, we define and fit our model to a previously published multiple myeloma data set using variational Bayesian methods. Then, we check our results against the

We demonstrate our model using the multiple myeloma data set from

In our analyses, we compare three different partial orderings, which we show in

The graphs above show the three different partial orderings of the data that we explore in this paper. Arrows give the direction of the ordering, where sample A can be said to

The second partial ordering is the same as the first, but with the individuals separated. We call this the individual-ordered (

The third partial ordering considers the stage IA samples to be references, while all other samples come directly after them. Individuals are still considered separately. Hence, this is called the individual-reference (

Prior to the main analysis, we performed quantile normalization across all arrays of the data set using the
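For readers unfamiliar with quantile normalization, the following is a minimal sketch of the general technique in plain Python. It is not the software used for the analysis in this paper; the function name `quantile_normalize` and the toy values are hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Force every array to share the same empirical distribution.

    `arrays` is a list of equal-length lists, one per microarray.
    The reference distribution is the mean of the sorted arrays.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across arrays, for each k.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays)
        for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Indices of probes ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities for three arrays and four probes.
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normed = quantile_normalize(raw)
for row in normed:
    print(row)
```

After normalization each array contains exactly the same set of values, so differences between arrays reflect only the rank of each probe within its array.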

For these analyses, we included miR-mRNA target prediction data from both

Let us define a stage as a set of expression levels

Then, we assume the log expression value

Note that in the formula for the stage’s prior precision

The distributions we have assumed for mRNA expression are identical to those of the miR, except that the developmental trend component (the product

Specifically, we assume the log expression value

and precision

We assume two technical precisions (inverse variances) in our model. One precision corresponds to an expression set (i.e. the precision/variance between microarrays from the same stage) and one corresponds to replicates within one expression set (i.e. multiple spots for the same probe/set or transcript within a microarray).
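The two-level structure described above can be sketched as a small simulation. This is only an illustration under assumptions that are not taken from the paper: both levels are drawn from normal distributions, and the names `simulate_stage`, `tau_set`, and `tau_rep`, as well as all numerical values, are hypothetical.

```python
import random
import statistics

def simulate_stage(stage_mean, tau_set, tau_rep, n_arrays, n_spots, rng):
    """Draw replicate spot values for several arrays from one stage.

    tau_set: precision (inverse variance) between arrays of the same stage.
    tau_rep: precision between replicate spots within a single array.
    """
    sd_set = tau_set ** -0.5
    sd_rep = tau_rep ** -0.5
    arrays = []
    for _ in range(n_arrays):
        array_level = rng.gauss(stage_mean, sd_set)   # array-specific level
        spots = [rng.gauss(array_level, sd_rep) for _ in range(n_spots)]
        arrays.append(spots)
    return arrays

rng = random.Random(1)
tau_set, tau_rep = 4.0, 25.0
n_spots = 5
arrays = simulate_stage(0.0, tau_set, tau_rep,
                        n_arrays=2000, n_spots=n_spots, rng=rng)

# Average within-array variance estimates 1 / tau_rep.
within_var = statistics.mean([statistics.variance(a) for a in arrays])
# Variance of array means estimates 1 / tau_set + 1 / (tau_rep * n_spots).
array_means = [statistics.mean(a) for a in arrays]
between_var = statistics.variance(array_means)

print(f"within-array variance:  {within_var:.4f} (1/tau_rep = {1 / tau_rep:.4f})")
print(f"variance of array means: {between_var:.4f}")
```

The point of the sketch is that averaging replicates before analysis discards the within-array variance, whereas keeping both precisions in the model lets that uncertainty propagate into the parameter estimates.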

For the miR and mRNA expression levels,

Furthermore, within each expression set

Each element

If one of the included algorithms predicts that miR

We chose conjugate prior distributions for each model parameter that required a prior. Thus, we use vaguely informative normal distributions on the parameters

The development parameters

To estimate the parameters of our model, we use variational Bayesian methods. These methods are closely related to expectation-maximization algorithms

In short, variational Bayesian methods find a probability distribution

The result of variational Bayesian calculations is, as with most Bayesian methods, a set of estimated posterior probability distributions over the model parameters. Unlike Markov chain Monte Carlo and related methods, we need not worry much about convergence of the estimated parameter distributions, since, if implemented properly, a variational Bayesian algorithm guarantees an improvement in every iterative update. Of course, calculations can be quite slow when compared to non-Bayesian methods.
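To make the iterative character of these updates concrete, here is a minimal sketch of variational Bayes on a much simpler problem than the model in this paper: a single normal sample with unknown mean and precision, a normal prior on the mean (scaled by the precision), and a gamma prior on the precision, using a factorized approximation q(mu)q(tau). The priors, the function name `vb_normal`, and the simulated data are all assumptions chosen for illustration.

```python
import random

def vb_normal(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, n_iter=50):
    """Coordinate-ascent VB for x_i ~ N(mu, 1/tau).

    Priors (assumed for this sketch):
      mu | tau ~ N(mu0, 1 / (lam0 * tau)),  tau ~ Gamma(a0, rate=b0).
    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and
    q(tau) = Gamma(a_n, rate=b_n), plus the trace of E[tau].
    """
    n = len(x)
    xbar = sum(x) / n
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)   # does not change
    a_n = a0 + (n + 1) / 2.0                      # does not change
    e_tau = 1.0                                   # initial guess for E[tau]
    trace = []
    for _ in range(n_iter):
        # Update q(mu) given the current E[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu): expected squared deviations under q(mu).
        data_term = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        prior_term = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (data_term + prior_term)
        e_tau = a_n / b_n
        trace.append(e_tau)
    return mu_n, lam_n, a_n, b_n, trace

rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(500)]   # true precision is 4
mu_n, lam_n, a_n, b_n, trace = vb_normal(data)

print(f"posterior mean of mu: {mu_n:.3f}")
print(f"E[tau] after 1, 2, and 50 updates: "
      f"{trace[0]:.4f}, {trace[1]:.4f}, {trace[-1]:.4f}")
```

Each pass alternates between the two factors, holding one fixed while updating the other, and the expected precision settles after a handful of passes in this toy problem. The full model has many more factors, but the updates follow the same alternating pattern.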

All parameters were estimated using variational Bayesian (VB) methods, with the exception of the

This algorithm was coded in the

We applied our model to the multiple myeloma data set from

Below, we evaluate the results and compare them with the interactions rankings one obtains from

First, we checked the

Among our data set, only five putative miR-mRNA targeting interactions have been validated, according to the validation database

All five of the validated, predicted target pairs involve the well-studied miR-17. These five interactions appear at positions 63, 229, 234, 273, and 612 on our interaction ranking based on the

If we divide our ranking positions, in increasing order, by the rankings by correlation (

Though there are too few existing validations for us to draw strong conclusions, the fact that the rankings of these by our model are much closer to the top of the list than those by the correlation (negative or absolute value) indicates that there is at least some advantage to our partially ordered model.
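The rank comparison described above can be sketched as follows. The pair names and both rankings are hypothetical toy data, not results from this study, and the pairing of sorted positions before division is one reading of the procedure, stated here as an assumption.

```python
def rank_positions(ranking, pairs):
    """1-based positions of the given (miR, gene) pairs in a ranked list."""
    position = {pair: i + 1 for i, pair in enumerate(ranking)}
    return [position[p] for p in pairs]

def position_ratios(model_ranking, baseline_ranking, validated):
    """Divide sorted model positions by sorted baseline positions.

    Ratios below 1 mean the validated pairs sit closer to the top of
    the model ranking than of the baseline ranking.
    """
    ours = sorted(rank_positions(model_ranking, validated))
    base = sorted(rank_positions(baseline_ranking, validated))
    return [o / b for o, b in zip(ours, base)]

# Hypothetical rankings of six candidate pairs, best first.
model_ranking = [
    ("miR-a", "G1"), ("miR-a", "G2"), ("miR-b", "G3"),
    ("miR-b", "G1"), ("miR-c", "G2"), ("miR-c", "G4"),
]
baseline_ranking = [
    ("miR-c", "G4"), ("miR-b", "G3"), ("miR-a", "G1"),
    ("miR-a", "G2"), ("miR-c", "G2"), ("miR-b", "G1"),
]
validated = [("miR-a", "G2"), ("miR-b", "G1")]

ratios = position_ratios(model_ranking, baseline_ranking, validated)
print(ratios)
```

With only a handful of validated pairs, as in this study, such ratios are descriptive rather than a formal test.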

We show in

G–O | I–O | I–R | TaLasso | Neg.Cor | |

Number of unique genes in the top 100 interactions | 41 | 58 | 53 | 85 | 56 |

05200 :Pathways in cancer | 3 | ||||

05215 :Prostate cancer | 2 | 3 | 3 | ||

05219 :Bladder cancer | 2 | 2 | 2 | ||

05222 :Small cell lung cancer | 2 | 2 | |||

05216 :Thyroid cancer | 2 | ||||

05214 :Glioma | 2 | 2 | |||

05218 :Melanoma | 2 | 2 | |||

05016 :Huntington’s disease | 3 | 4 | 3 | ||

05014 :Amyotrophic lateral sclerosis (ALS) | 2 | 2 | 2 | ||

05010 :Alzheimer’s disease | 3 | ||||

04976 :Bile secretion | 2 | ||||

04730 :Long-term depression | 2 | ||||

04115 :p53 signaling pathway | 2 | 3 | 4 | ||

04210 :Apoptosis | 2 | 2 | 2 | ||

04010 :MAPK signaling pathway | 3 | 3 | |||

04722 :Neurotrophin signaling pathway | 2 | ||||

04110 :Cell cycle | 2 | ||||

04120 :Ubiquitin mediated proteolysis | 2 | 4 | |||

04622 :RIG-I-like receptor signaling pathway | 2 | 2 | |||

04144 :Endocytosis | 4 | ||||

04914 :Progesterone-mediated oocyte maturation | 3 | ||||

04114 :Oocyte meiosis | 3 | ||||

04142 :Lysosome | 3 | ||||

03060 :Protein export | 3 | ||||

04141 :Protein processing in endoplasmic reticulum | 4 |

Amongst the three partial orderings we have considered here, there is marginally more pathway enrichment and fewer genes involved in the top 100 targeting interactions for the

Ultimately, our goal with this analysis is to enable the identification of the most promising candidates for further biological investigation. In

In the above diagram, we show the miRs (top row) and genes (bottom row) involved in the 10 most significant targeting interactions based on the

The inclusion of trend parameter

trending miR | direction | z-score |

miR-18b | + | 3.88 |

miR-367 | − | 3.80 |

miR-18a | + | 3.61 |

miR-194 | + | 3.57 |

miR-133b | + | 3.54 |

miR-92a | + | 3.38 |

miR-554 | − | 3.24 |

miR-551a | + | 3.23 |

Top candidates from the table include miR-18a and miR-18b, which, as discussed above, are well known to play a role in cancer development. Both of these showed increased expression in advanced stages of multiple myeloma. Another candidate is miR-194, which has been shown to be p53-dependent and a positive regulator of this well-known tumor suppressor, creating a positive feedback loop. Furthermore, down-regulation of miR-194 has been demonstrated to play a key role in multiple myeloma development through its modulation of p53 signaling

Combining miR-mRNA target prediction algorithms with expression data has proven to be one of the best strategies for high-throughput target pair inference. However, the exact way in which to do this has been the subject of some discussion. Though many methods have addressed specific issues in target inference, and others have attempted a more general approach, none has fully addressed ordered and partially ordered data sets. We tried three different partial orderings in our model, as shown in

As illustrated in the

Both the mRNA targets and targeting miRs from our top-ranked interactions have been previously implicated in multiple myeloma development, suggesting that our analysis has successfully identified biologically-relevant pairs from this data set. Furthermore, some of the miRs that we have identified as having a significant trend through the ordering of the stages have been verified by literature as key players in both cancer and, more specifically, multiple myeloma. The remaining un-verified top interactions and trending miRs may be good candidates for further investigation.

Interestingly, though we did not restrict our interactions to be non-positive, virtually all of the top 1000 interactions were negative. This is likely an effect of using the prediction algorithms in the prior distributions for the interaction parameters, since our model estimates coefficients for the inclusion in (and targeting score of) each included prediction algorithm. It is well known that miRs typically down-regulate target mRNAs, and though there have been some reports of up-regulation, we would expect the estimated coefficients for predicted targets to lead to a negative prior distribution (see

One potential weakness of our model, which is shared by virtually all recent models of miR-mRNA targeting, is that we attempt to explain all changes in mRNA expression using miR targeting interaction coefficients. The assumption that miR targeting accounts for all gene expression changes is patently untrue. There are other direct processes (involving transcription factors, for instance) as well as indirect processes that can affect mRNA expression. Though it would be quite cumbersome in both data and calculation, an expanded model taking into account other potential influences could prove very useful in inferring true interactions between the various nucleic acids, proteins, etc.

Lastly, though much literature has been published on the topic, we have a lot to learn about high-throughput inference of miR-mRNA target pairs. Experimental validations are so sparse that it is impossible to prove conclusively which prediction or inference techniques routinely give the best results, and in which cases each is most appropriate. Perhaps in the near future we will see a vast increase in the number of targets being validated, possibly through cooperation or organization between research groups to create more complete databases (both of positive and negative results) with which we can compare inference approaches to further refine our methods and in turn more efficiently focus our experimental efforts into the most promising areas.