Conceived and designed the experiments: MA MCZ JJM AB. Analyzed the data: MA MCZ JJM. Wrote the paper: MA MCZ JJM AB.

Current address: Department of Psychiatry, University of California San Diego, La Jolla, California, United States of America

The authors have declared that no competing interests exist.

Expression quantitative trait loci (eQTL) mapping is a widely used technique to uncover regulatory relationships between genes. A range of methodologies have been developed to map links between expression traits and genotypes. The DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative is a community project to objectively assess the relative performance of different computational approaches for solving specific systems biology problems. The goal of one of the DREAM5 challenges was to reverse-engineer genetic interaction networks from synthetic genetic variation and gene expression data, which simulates the problem of eQTL mapping. In this framework, we proposed an approach whose originality resides in the use of a combination of existing machine learning algorithms (committee). Although it was not the best performer, this method was by far the most precise on average. After the competition, we continued in this direction by evaluating other committees using the DREAM5 data and developed a method that relies on Random Forests and LASSO. It achieved a much higher average precision than the DREAM best performer at the cost of slightly lower average sensitivity.

The analysis of ‘genetical genomics’ data is an important step towards a systems-level understanding of molecular genetics data. It seeks to describe how natural genetic variability influences gene expression on a genome-wide level. Loci that are linked to the expression variation of a gene are called expression quantitative trait loci (eQTL). The advantage of this kind of analysis is its ability to elucidate causal regulatory relationships between genes without the need to actively perturb the system, e.g. by gene knock-outs or knock-downs.

Much effort has been invested in the development of approaches for (e)QTL analysis.

The field of ensemble learning comprises all approaches in which a collection of possibly weak prediction models, so-called base learners, is combined into a robust and powerful model. The concept rests on the observation that combining disparate prediction algorithms has the potential to markedly improve prediction results.

There is a need to systematically compare the performance of eQTL mapping methods under different scenarios to reveal which approach works best in which context. However, due to the lack of trusted gold-standard gene-regulatory networks, it is not straightforward to evaluate the methods using real data.

We have decided to address this challenge using ensemble approaches. In particular, we developed filtered and unfiltered committees by combining the predictions of several machine learning methods. As part of the DREAM5 PLoS One collection, we present an overview of the results obtained with several machine learning approaches and show that any combination of the methods outperforms the individual methods. We also show that our proposed approaches lead to a much higher average precision than the other DREAM challenge contributions, at the cost of slightly lower average sensitivity. Finally, we discuss the importance of precision compared to sensitivity in eQTL mapping.

Multivariate mapping approaches such as Random Forests

Since the true model is unknown, and will be different for different genes, we decided to combine several multivariate eQTL mapping methods into committees in order to capture different regulatory mechanisms and average out false positive findings due to noise in the data. We tested different committees of the following methods: Random Forests with two different variable importance measures (permutation importance and selection frequency), the LASSO, and the Elastic Net.

Random Forests

We used the reference implementation of Random Forests in R.

Tibshirani developed the least absolute shrinkage and selection operator (LASSO) to improve variable selection for linear regression with regard to prediction accuracy and interpretation. The method penalizes the regression coefficients through their L_{1} norm. LASSO thus tends to set many regression coefficients to 0 in order to retain the most important predictors and to produce an accurate and interpretable model.

We used the LASSO implementation from the elasticnet package, with the penalty (expressed as a fraction of the L_{1} norm) determined by 10-fold cross-validation, with an imposed minimum of 0.25. The absolute coefficients of the resulting model were used as the importance score for each predictor.

If there is a group of correlated predictors that all predict the expression trait equally well, LASSO will give a high importance score to only one of them (the predictor most highly correlated with the response). All other predictors in the group will drop out of the model.

The Elastic Net is a combination of the LASSO and Ridge regression, the latter of which uses an L_{2} regression penalty. It has been shown that, compared to the LASSO, the Elastic Net is better suited to situations in which the number of predictors greatly exceeds the number of observations.

Again, we used the absolute coefficients of the best model (found by ten-fold cross validation with the elasticnet package) as importance scores for the predictors, this time setting λ to 1.

Each method assigns some kind of importance score to each predictor – gene pair. We combined these in committees by averaging the scaled and centered scores. We tested different combinations of methods leading to slightly different performance on the DREAM data. Of particular interest is using all importance scores:

and using only RF.sf and LASSO:
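The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `standardize` and `committee` and the example scores are hypothetical, and the use of the sample standard deviation for scaling is an assumption.

```python
from statistics import mean, stdev

def standardize(scores):
    """Center scores to mean 0 and scale to unit sample standard deviation."""
    mu = mean(scores)
    sd = stdev(scores)  # assumption: sample standard deviation
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sd for s in scores]

def committee(score_lists):
    """Average the standardized scores of several methods, pair by pair."""
    scaled = [standardize(scores) for scores in score_lists]
    return [mean(column) for column in zip(*scaled)]

# Hypothetical importance scores for four predictor-gene pairs.
rf_sf = [0.9, 0.1, 0.4, 0.0]
lasso = [1.2, 0.0, 0.0, 0.0]

combined = committee([rf_sf, lasso])
# Rank the pairs from most to least supported by the committee.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

Standardizing first keeps a method with a wide score range from dominating the average; passing all four score lists instead of two gives the full committee.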

We also investigated a modified version of our committee approach, which seeks a very sparse solution, i.e. only a very limited number of regulators per gene. In this case, the scores from all methods except the LASSO were scaled and an average score was calculated for each regulator – target gene pair. Subsequently, each average score was set to zero if the corresponding LASSO score was equal to zero. In other words, only variables that were chosen by the very sparse LASSO algorithm got a nonzero final score.

This filtering leads to a very different treatment of markers in linkage disequilibrium. Whereas the unfiltered scoring above will give all markers in a linked region relatively high scores, the filtering results in the selection of only one marker from this region.
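The effect of the filter on linked markers can be illustrated with a small sketch. The function names and scores below are hypothetical, and scaling each method by its maximum score is an assumption made here so that all scores stay non-negative; the text only states that the scores were scaled.

```python
from statistics import mean

def scale_by_max(scores):
    """Scale non-negative scores to [0, 1] (assumed scaling, see lead-in)."""
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]

def average_scores(score_lists):
    """Average the scaled scores of several methods, pair by pair."""
    scaled = [scale_by_max(scores) for scores in score_lists]
    return [mean(column) for column in zip(*scaled)]

def filtered_committee(score_lists, lasso_scores):
    """Keep the average score only where the LASSO coefficient is nonzero."""
    averaged = average_scores(score_lists)
    return [avg if coef != 0 else 0.0
            for avg, coef in zip(averaged, lasso_scores)]

# Hypothetical scores: markers 0 and 1 are tightly linked, marker 2 is unrelated.
rf_sf = [0.8, 0.7, 0.1]
rf_pi = [0.6, 0.6, 0.05]
elnet = [0.5, 0.4, 0.0]
lasso = [1.1, 0.0, 0.0]  # the LASSO keeps only one marker of the linked pair

unfiltered = average_scores([rf_sf, rf_pi, elnet])
filtered = filtered_committee([rf_sf, rf_pi, elnet], lasso)
```

In this toy case both linked markers score highly without the filter, whereas the filtered committee retains only the marker selected by the LASSO.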

In the DREAM challenge, the area under the Receiver Operating Characteristic (ROC) curve (AUROC) and the area under the precision-recall curve (AUPR) were used to evaluate the performance of the prediction methods.

The DREAM5 systems genetics

The simulated genotypes of 1,000 markers, each corresponding to a mutation in exactly one of the 1,000 genes, imitate the architecture of recombinant inbred lines (RIL). RILs are lines derived from a cross between two genetically distinct inbred parental lines, and are homozygous at every locus as a result of inbreeding for multiple generations. Each of these RILs is homozygous for the allele of one of the parents (i.e. each RIL genotype vector can be coded in a 0/1 scheme), and each RIL has inherited different combinations of parental alleles. The RILs constitute a genetically randomized population, meaning that the gene expression pattern of each RIL is the result of a different multifactorial genetic perturbation (quoted from the DREAM web-site:

As in an eQTL study, the aim of the DREAM5 SYSGEN A challenge was to retrieve the regulatory relationships of each network using i) the simulated gene expression levels of the 1,000 genes in each RIL, and ii) the simulated genotype data of the RILs. Results had to be presented as an ordered list of edges between pairs of genes, where the edge scores were only used for ranking and did not necessarily represent any kind of statistical significance of the inferred edges.
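As an illustration of the required output, the sketch below turns a matrix of scores into an ordered list of edges. The function name `ranked_edges`, the gene labels, and the scores are hypothetical, and skipping self-edges is an assumption of this sketch rather than a documented rule of the challenge.

```python
def ranked_edges(scores, genes):
    """Turn a regulator-by-target score matrix into a ranked edge list.

    scores[i][j] is the score for the edge genes[i] -> genes[j].
    Self-edges are skipped, which is an assumption of this sketch.
    """
    edges = [
        (genes[i], genes[j], scores[i][j])
        for i in range(len(genes))
        for j in range(len(genes))
        if i != j
    ]
    # Only the order matters for the evaluation, so sort by decreasing score.
    edges.sort(key=lambda edge: edge[2], reverse=True)
    return edges

genes = ["G1", "G2", "G3"]
scores = [
    [0.0, 0.9, 0.1],
    [0.2, 0.0, 0.7],
    [0.0, 0.3, 0.0],
]
edge_list = ranked_edges(scores, genes)
```

Because the scores serve only to order the edges, any monotone transformation of them would yield the same submission.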

Following the conclusion of the DREAM5 challenge, the reference networks were released. We used these data to evaluate the performance of the four multivariate eQTL mapping methods comprising our committee approach, individually and in combination.

ROC curves and precision-recall curves obtained for the prediction of one representative network with an intermediate number of edges and 300 RILs (network 3 of sub-challenge A300). We compare the performance of all four individual methods (RF.sf, RF.pi, LASSO and ElNet), the combination of Random Forests selection frequency and LASSO (RF.sf+LASSO), the combination of all four approaches (RF.sf+RF.pi+ElNet+LASSO) as well as the filtered committee we submitted to the challenge ({RF.sf+RF.pi+ElNet}|LASSO). Left: ROC curves. Right: precision-recall curves. The differences in performance between the methods are more apparent in the precision-recall curves.

The organizers of the DREAM challenge used both the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC) to assess how well the predicted networks approximate the gold standard networks.

For each of the 15 networks of the DREAM5 SYSGEN A challenge, we evaluated the performance of the different methods using the AUROC and AUPR as metrics. To make AUROC and AUPR values comparable across networks, they were scaled to the maximum value obtained across methods for each network. Results were then summarized over all 15 networks. The bars show the mean AUROC (left-oriented bars) and AUPR (right-oriented bars) per method; error bars indicate one standard deviation. RF.sf+LASSO outperforms the DREAM best performer and all our tested approaches in terms of AUPR. Differences between the methods in AUROC values are less pronounced.
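The scaling and summary used for this figure can be sketched as follows. The method names and metric values are hypothetical, and the sample standard deviation is an assumption; only the per-network scaling to the best method is taken from the caption.

```python
from statistics import mean, stdev

def scale_to_network_max(metric):
    """Divide each value by the best value any method reached on that network.

    metric maps a method name to a list with one value per network.
    """
    n_networks = len(next(iter(metric.values())))
    best = [max(values[k] for values in metric.values())
            for k in range(n_networks)]
    return {method: [v / b for v, b in zip(values, best)]
            for method, values in metric.items()}

def summarize(scaled):
    """Mean and sample standard deviation of the scaled values per method."""
    return {method: (mean(values), stdev(values))
            for method, values in scaled.items()}

# Hypothetical AUPR values for two methods on three networks.
aupr = {
    "method_A": [0.2, 0.4, 0.3],
    "method_B": [0.1, 0.4, 0.6],
}
scaled = scale_to_network_max(aupr)
summary = summarize(scaled)
```

After scaling, the best method on each network has the value 1, so networks of different difficulty contribute equally to the mean.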

For each of the 15 networks of the DREAM5 SYSGEN A challenge (5 for each sample size), the performance of the different methods was ranked using both the AUROC (A) and the AUPR (B). For each method, ranks are plotted horizontally across all networks. Sample sizes (number of RILs) and network complexity (number of edges) used for simulating the network are shown between the panels. While the DREAM best performer always ranks best based on the AUROC (Panel A), RF.sf+LASSO ranked first in all but one network based on AUPR (Panel B).

The method we proposed for the challenge was a filtered committee of the four tested multivariate eQTL mapping methods ({RF.sf+RF.pi+ElNet}|LASSO). This approach was designed to identify a small number of regulators per gene with high accuracy. It consists of a combination of three variable importance measures (two from Random Forests and one from the Elastic Net) filtered based on the presence of a nonzero LASSO coefficient. The AUPR obtained by our filtered committee approach was the highest among all the methods competing in the challenge, and for all networks (

We have investigated the predictions of the committees consisting of all the possible unfiltered combinations of the four methods. Most of our committees outperform the best DREAM performer in terms of AUPR at the cost of a slightly worse AUROC (

In this article, we have tested several methods to reverse-engineer eQTL networks from synthetic expression and genotype data.

When the amount of training data is limited (as is the case in eQTL mapping), many models can explain the data equally well. In machine learning this is well known as the “small

We evaluated the committees composed of all possible pairs of the four single variable selection methods (RF.sf, RF.pi, ElNet and LASSO). In order to assess if committees were beneficial, we compared their performance to the performance of their constituent methods. For each combination of method pairs, we calculated the ratio of the AUPR and AUROC of the constituent methods over the AUPR and AUROC of the committee. We used this ratio to compute the gain of AUROC (A) and AUPR (B) obtained by the committees over the constituent methods and averaged this over the 15 networks of the DREAM challenge. Error bars represent the standard deviation. This figure shows that the committees are almost always more predictive than the constituent methods.

When groups developing algorithms are also the ones validating them, the benchmark data and the assessment metrics can be biased (knowingly or not) in favor of the proposed algorithm.

The evaluation of the performance of the methods competing in the DREAM5 challenge relies on the AUROC and AUPR. The Receiver Operating Characteristic (ROC) curve shows how the fraction of correctly classified positive instances (True Positive Rate, TPR) varies with the fraction of incorrectly classified negative instances (False Positive Rate, FPR).

We showed that our approaches yield a much higher AUPR at the cost of a slightly lower AUROC than the other competing methods of the DREAM challenge. We argue here that in the case of eQTL mapping, the AUPR may better assess the performance of the competing methods, because it penalizes the detection of false positive edges among the top scoring edges more heavily than the AUROC does. Indeed, in practice the prediction of a regulatory relationship is only the first step of the analysis. The predicted relationships can be used as a basis to study a biological process, or be validated in a follow-up experiment, or (more commonly) be integrated with other data to make biological inferences. Depending on the downstream analysis, erroneous prediction of an interaction may be much more expensive than missing an interaction.
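A toy example makes the difference between the two metrics concrete. The sketch below computes the AUROC as the fraction of correctly ordered positive-negative pairs and approximates the AUPR by the average precision at each true edge; this step-wise estimate is an assumption and may differ from the interpolation used by the DREAM organizers. The two rankings are hypothetical.

```python
def roc_pr_areas(labels):
    """Return (AUROC, AUPR estimate) for labels listed in ranked order.

    labels holds 1 for a true edge and 0 for a false edge, best score first.
    Ties between scores are not handled in this sketch.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    correctly_ordered = 0   # positive-negative pairs ranked the right way round
    precision_sum = 0.0     # precision accumulated at each true edge
    for label in labels:
        if label:
            tp += 1
            precision_sum += tp / (tp + fp)
        else:
            fp += 1
            correctly_ordered += tp
    return correctly_ordered / (n_pos * n_neg), precision_sum / n_pos

# Two hypothetical rankings of ten candidate edges, two of which are true.
ranking_a = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # a true edge at the very top
ranking_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # two false edges at the top

auroc_a, aupr_a = roc_pr_areas(ranking_a)
auroc_b, aupr_b = roc_pr_areas(ranking_b)
```

Both rankings reach the same AUROC of 0.75, but the ranking that places false edges first has a clearly lower AUPR estimate (about 0.42 versus 0.67), which is the behavior the argument above relies on.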

Data simulations are a well-established means to test new approaches for data analysis and compare them to state-of-the-art methods in the field. However, the more complex the data to be analyzed, the more difficult it is to mimic these data with simulations. While the DREAM5 SYSGEN A data were designed to simulate the complex regulatory relationships between genetic loci and gene expression, there are several considerations missing from the data-generating model. Epistatic interactions between loci (non-additive effects) greatly complicate the structure of eQTL networks.

Area under the ROC curve (AUROC) and area under the precision-recall curve (AUPR) for each of the 15 networks of the DREAM5 SYSGEN A challenge. The bars show the AUROC (left-oriented bars) and AUPR (right-oriented bars) for each method and each network. Top panel, 100 RILs. Middle panel, 300 RILs. Bottom panel, 999 RILs. Complexity of the networks (number of edges) increases from left to right in each panel.


Number of interactions predicted by the filtered committee ({RF.sf+RF.pi+ElNet}|LASSO) for each of the 15 networks of the DREAM5 SYSGEN A challenge. The challenge was divided into three sub-challenges with varying sample sizes (100, 300 and 999 RILs, respectively), and each sub-challenge consisted of 5 different networks with growing numbers of edges. The number of predicted interactions correlates positively with sample size and network complexity. For the evaluation of the challenge, the top 100,000 scoring interactions were considered. The {RF.sf+RF.pi+ElNet}|LASSO method was very restrictive in the number of predicted network edges. Since it predicted fewer than 100,000 interactions for every network, the evaluators added random interactions.
