
Analyzed the data: MG MH. Contributed reagents/materials/analysis tools: MG. Wrote the paper: MG MH.

The authors have declared that no competing interests exist.

Predicting gene expression is an important endeavour within computational systems biology. It can serve both as a way to explore how drugs affect the system and as a framework for finding which genes are interrelated in a certain process. A practical problem, however, is how to assess and discriminate among the various algorithms that have been developed for this purpose. In 2008, the DREAM project therefore issued a challenge for predicting gene expression values, and here we present the best-performing algorithm.

We develop an algorithm by exploring various regression schemes with different model selection procedures. It turns out that the most effective scheme is based on least squares, with a penalty term of a recently developed form called the “elastic net”. Key components in the algorithm are the integration of expression data from other experimental conditions than those presented for the challenge and the utilization of transcription factor binding data for guiding the inference process towards known interactions. Of importance is also a cross-validation procedure where each form of external data is used only to the extent it increases the expected performance.

Our algorithm demonstrates both that information relevant for predicting gene levels can be extracted from large-scale expression data, and that integrating different data sources benefits the inference. We believe the former is an important message to those still hesitant about the possibilities of computational approaches, while the latter is part of an important way forward for the future development of the field of computational systems biology.

The massive growth of high-throughput data within molecular biology during the last decade has sparked an interest in systems biology and generated a great variety of suggestions on how to infer knowledge from these data sets. That is, whether the data belong to the genomics, transcriptomics, proteomics or metabolomics domain, they still need to be structured before one can learn anything from them. Here, networks have proved to be a unifying language for different biological systems involving genes, proteins, metabolites and small molecules. These networks, defined by protein-protein, protein-to-gene, metabolic interactions etc., determine cellular responses to input signals and govern cellular dynamics.

Integration of data, which this challenge implicitly called upon, has been the subject of much attention recently; see for example the review by Hecker et al.

The DREAM challenge for predicting gene expression is of great importance for exploring the benefits and bottlenecks of state-of-the-art algorithms in a fair competition. It represents a solution to the non-trivial problem of designing relevant challenges which at the same time address biologically and computationally interesting problems. From the DREAM web-site [

Gene expression time course data is provided for four different strains of yeast (

Expression levels were assayed separately in all four strains following the addition of 3-aminotriazole (3AT). 3AT is an inhibitor of an enzyme in the histidine biosynthesis pathway and, in the appropriate media (which is the case in these experiments) inhibition of the histidine biosynthetic pathway has the effect of starving the cells for this essential amino acid.

Data from eight time points was obtained from 0 to 120 minutes. Time t = 0 means the absence of 3AT.

This challenge is biologically relevant, and the fact that a gold standard exists but is hidden makes the challenge objective and fair. Further, the probe names were given, which allows for integration of publicly available experiments and a priori knowledge, making the challenge even more realistic in describing a situation that can occur in one's own laboratory. However, the problem is somewhat different from the normal setting in systems biology, where the aim is not only to predict future experiments but also to obtain interpretable models from which we can gain an increased biological understanding

The goal of the DREAM challenge was to predict the order of the chosen 50 genes within the gat1Δ strain for the eight time points at which they were measured. All details about the algorithm we utilized and how it was developed can be found in the

Comparing with the training results of

Training strains: wt, gcn4Δ, leu3Δ.

| Norm | Data | wt | gcn4Δ | leu3Δ | Overall |
| L1 | values | 0.616 | 0.632 | 0.686 | 0.670 |
| L2 | values | 0.587 | 0.747 | 0.699 | 0.712 |
| L1 | values, rates | 0.647 | 0.604 | 0.662 | 0.663 |
| L2 | values, rates | 0.587 | 0.747 | 0.699 | 0.712 |
| L1 | rates | 0.244 | 0.260 | 0.434 | 0.360 |
| L2 | rates | 0.448 | 0.542 | 0.611 | 0.570 |

Training strains: wt, gcn4Δ, leu3Δ.

| Algorithm | wt | gcn4Δ | leu3Δ | Overall |
| RLAD | 0.486 | 0.799 | 0.684 | 0.681 |
| LS, Elastic net | 0.687 | 0.828 | 0.764 | 0.794 |

External data: Expression, TF-binding (x marks the data included). Training strains: wt, gcn4Δ, leu3Δ.

| Expression | TF-binding | wt | gcn4Δ | leu3Δ | Overall |
| x | | 0.793 | 0.881 | 0.789 | 0.856 |
| x | x | 0.793 | 0.880 | 0.791 | 0.857 |

An observation here is that the submitted prediction for the gat1Δ strain correlates less well with the gold standard than each of the series explored during the development of the algorithm. Neil Clarke points out in his referee report (published on-line accompanying this article) that he picked some of the genes to be predicted because of their surprising or non-trivial expression pattern in the gat1Δ strain. This fact, combined with a general observation that cross-validation often underestimates the error

Considering the result for each time point,

The correlations are all with respect to the gold standard. The upper blue curve (stars) is our result; the green curve slightly below (rings) belongs to Ruan

If we instead consider the rank correlations with the gold standard per gene, rather than per time point, we get the result in

The correlations are all with respect to the gold standard. The upper blue curve (stars) is our result; the green curve slightly below (rings) belongs to Ruan

The importance of challenges such as DREAM lies to a large extent in their objectivity. When an inference algorithm comes from the same laboratory as the one that performed the assessment experiments, sometimes even in the same article, it is likely that the algorithm has been tuned to fit the expected outcome. This most probably often amounts to overfitting, which decreases its performance on other data sets. Also, the value of this procedure as an assessment is questionable, since testing only a few of the algorithm's predictions has a clear anecdotal flavor, especially when the researchers can choose for themselves which parts to present. In contrast, the DREAM challenges provide the community with workbenches where all are welcome to submit the predictions of their algorithms, and thereby get the opportunity to assess and compare them with the performance of others. No one knows the gold standard beforehand, and even though the evaluation data are limited, they are well defined and no fine-tuning can be carried out. This makes a huge difference compared with the case mentioned above, where the same laboratory both performs the experiments and presents inference algorithms with alleged generalizability.

However, this appreciated objectivity and fairness of DREAM of course holds only as long as the gold standard is hidden. As soon as it is revealed, one can start improving one's algorithm to better fit the expected outcome, but at the risk of overfitting. Any “improvements” at this stage must be very well motivated in order to make any sense at all. For example, for our algorithm, we could consider using local fitting parameters instead of a global one for the prior, or further pruning the model by choosing parameter values not at the cross-validation minimum, but one standard deviation below, etc. For the reasons mentioned above, we refrain from such actions and instead look forward to the next round of DREAM.

The algorithm presented here represents one efficient way of predicting rankings of expression values. A key component in its development has been the inclusion of measurements not directly associated with the experimental condition for which the expression values should be predicted. Whether this inclusion concerns more expression data or prior knowledge of TF-DNA bindings, a cross-validation scheme has helped us not to rely more on these measurements than the original data allow. We denote this “soft integration”, and it forms a cornerstone of our work. The success of the algorithm clearly shows that prediction of expression levels is feasible, even when the number of genes in the system exceeds the number of experiments 100-fold.

Surprisingly, the inclusion of a priori knowledge of TF-DNA bindings did not improve the performance of the algorithm substantially. The reason for this requires further research, since the quality of this kind of data is generally believed to be reasonably high. A hypothesis is that our choice to have just one global parameter

Interestingly, the second-best performing algorithm, by J. Ruan, is based on a very different philosophy with respect to data integration. There, only the data provided by DREAM are utilized, and the algorithm is based on profile similarities measured by Euclidean distances and predictions from

The reasons for the successes and failures of these two philosophies for predicting gene expression require further research. In particular, the cases where one algorithm is successful and the other is not deserve extra attention. As a final remark, we stress again that the integration of data, which is important for our algorithm, did not appear at all in the Ruan algorithm, yet the results are similar. That is, a simple method such as KNN can still be as effective as a more sophisticated algorithm where TF-DNA bindings etc. are taken into account. This suggests that there is probably considerable room for improvement, which is a challenge for the future development of computational systems biology.

The quest for modelling gene networks has taken many different forms during the last decade

An often utilized approach for large-scale modelling of gene regulatory networks is to only consider the transcripts, and thereby letting all interactions be projected onto the space of genes only

Here, however, we are not primarily interested in the dynamics or in the derivatives of the 50 genes in the gat1Δ strain which were removed from the data file, but in the prediction of their expression levels. By denoting this set of 50 genes as

In order to obtain a ranking list based on the expressions of the 50 genes in

The DREAM data are measured by Affymetrix chips with 9335 probes, and obtained from two biological and two technical replicates. We map the probe names onto unique gene names, which leaves 7804 units, where we use mean values when more than one probe corresponds to one gene. Furthermore, we approximate the derivatives by central differences, except at the end points of each series. At time

Throughout the article, we utilize cross-validation (CV) to discriminate among models. We hold one of the three time series provided by DREAM out of the inference, and utilize the other two, and occasionally also other data sets, to estimate the parameters. We then use data from the left-out strain to predict the expression values of the 50 genes to be predicted for each time point in this series, rank them according to the predicted levels such that the most highly expressed gene obtains rank one, the second highest rank two, etc., and calculate the Spearman rank correlation with the observed ranking of the same series. This is repeated three times, holding out each of the provided time series in turn. We end up with 24 different ranking lists for the 50 genes in
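The cross-validation loop described above can be sketched as follows. This is a minimal illustration, not the code used for the challenge: the data layout (a dictionary of series, each a list of time points with explanatory levels `x` and observed target levels `y`) and the `fit` and `predict` callables are hypothetical placeholders for whichever regression scheme is being assessed.

```python
from statistics import mean


def ranks_descending(values):
    """Rank 1 goes to the highest value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(predicted, observed):
    """Spearman rank correlation: Pearson correlation of the two rankings."""
    return pearson(ranks_descending(predicted), ranks_descending(observed))


def leave_one_series_out(series, fit, predict):
    """Hold each series out in turn and score every one of its time points.

    series  -- dict: series name -> list of time points, each a dict with
               'x' (explanatory levels) and 'y' (observed target levels)
    fit     -- hypothetical callable: list of training time points -> model
    predict -- hypothetical callable: (model, x) -> predicted target levels
    """
    scores = {}
    for held_out in series:
        training = [tp for name, points in series.items()
                    if name != held_out for tp in points]
        model = fit(training)
        scores[held_out] = [spearman(predict(model, tp["x"]), tp["y"])
                            for tp in series[held_out]]
    return scores
```

With three series of eight time points each, the returned dictionary holds 3 × 8 = 24 correlations, one per ranking list.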

Before we start exploring various versions of the penalty term in (3), we try to simplify the model (2). The strategy is to primarily work with the DREAM data, in order to reduce the model. When this first reduction is obtained, we will utilize also other publicly available data in order to further strengthen the predictive power of our mathematical model. This first model selection is performed among the models with perfect fits, i.e., the ones where the terms for the first sum of absolute values in (3) all are zero, making the exponent

Only expression values are included

Both expression values and expression rates (derivatives) are included

Only expression rates (derivatives) are included

By picking the solution with zero value for the objective function (without penalty term) and choosing the coefficients

We see that the highest correlations are obtained when we only include the expression levels. Inclusion of the expression rates makes the result slightly worse, except for least squares, where the correlations are equal. However, with the same predictive power, we apply Occam's razor and prefer the simplest model. Using only the rates gives the least satisfying result of all. Therefore, in the sequel, we choose to discard all derivative terms and determine the parameters according to

By choosing the exponent

The penalty term can take many different forms. A review, for least squares, of more classical forms such as Mallows' Cp, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), Minimum Description Length (MDL), etc., in the context of gene networks can be found in

This convex combination is a compromise between the two goals of a good solution, to have good predictive power and to be interpretable
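To make the convex combination concrete, the sketch below fits a least-squares model with an elastic-net penalty by cyclic coordinate descent. It is an illustration under stated assumptions, not the implementation used for the challenge: the parametrization (an overall strength `lam` and a mixing weight `alpha` between the L1 and squared L2 norms), the 1/(2n) scaling of the residual sum of squares, and the absence of an intercept are choices made here and may differ from the form used in the article.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; the L1 part of the update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)


def elastic_net_fit(X, y, lam, alpha, n_sweeps=200):
    """Minimize (1 / (2 n)) * ||y - X beta||^2 + elastic_net_penalty(beta).

    X is a list of n rows with p columns, y a list of n responses. No
    intercept is fitted, so the data are assumed to be centred (assumption).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Setting `alpha = 1` gives a pure L1 (lasso-type) penalty that drives coefficients exactly to zero, which favours interpretability, while `alpha = 0` gives a pure ridge penalty that only shrinks them; intermediate values trade the two off.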

We show in

One way to improve the performance of the algorithm is to include more data. This is a challenging problem which is crucial for all kinds of large-scale inference problems

A straightforward way to include other types of expression sets is to extend the sum of squares in (4) over more data sets. We integrate two collections of expression sets, which further reduces the number of genes available as explanatory variables to 4140 (since the genes measured differ across the experiments, and we only utilize those genes for which we have measurements in all experiments). The final collection comprises:

A set of 515 steady-state profiles in total, from a collection of gene knock-out experiments

A set of 256 time-series profiles, comprising a collection of time-series experiments downloaded from the NCBI Gene Expression Omnibus

However, the experimental conditions can vary a lot, and most of them are probably distant from the conditions we actually are interested in. It is therefore likely that these profiles have less impact on the actual problem than the primary profiles presented for the actual problem. We therefore introduce an extra coefficient
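One way to picture such down-weighting is sketched below. It assumes, which the text here does not spell out, that the extra coefficient acts as a non-negative weight on the squared residuals of the external profiles; the function names and the data layout (a list of `(x, y)` pairs per collection) are hypothetical.

```python
def weighted_sum_of_squares(beta, primary, external, weight):
    """Residual sum of squares with external profiles scaled by weight.

    primary, external -- lists of (x, y) pairs: x is a list of explanatory
                         levels, y the observed level of the target gene
    weight            -- assumed non-negative coefficient; 0 ignores the
                         external profiles, 1 treats them like primary ones
    """
    def sse(profiles):
        return sum((y - sum(b * x for b, x in zip(beta, xs))) ** 2
                   for xs, y in profiles)

    return sse(primary) + weight * sse(external)


def stack_with_weight(primary, external, weight):
    """Fold the weight into the data by scaling external rows by sqrt(weight).

    An unweighted least-squares fit on the returned rows has the same
    residual term as weighted_sum_of_squares on the original rows.
    """
    scale = weight ** 0.5
    rows = [(list(xs), y) for xs, y in primary]
    rows += [([scale * x for x in xs], scale * y) for xs, y in external]
    return rows
```

Under this assumption, the weight can be chosen like any other tuning parameter, by evaluating a grid of values in the cross-validation and keeping the one with the best held-out rank correlation, so that external data are used only to the extent that they help.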

The TF-DNA binding data are taken from Yeastract

The reason why we focus on co-regulation rather than regulatory interactions is that the inference is based on transcript levels, and TFs are known to be expressed at a low level. Also, their activity is often determined by phosphorylation and other effects rather than by their amount

We utilize as prior information that genes which are co-regulated are likely to be effective predictors of each other; the more TFs in common, the more likely to be co-regulated.
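The quantity underlying this prior, the number of TFs two genes have in common, can be computed as in the sketch below. The input format (a mapping from each TF to the genes it binds) and all names are hypothetical, and how the counts are translated into weights in the objective function is deliberately left out.

```python
def shared_tf_counts(tf_targets, genes):
    """Count, for every ordered pair of distinct genes, the TFs binding both.

    tf_targets -- dict: TF name -> iterable of gene names it binds
                  (hypothetical format for the TF-DNA binding data)
    genes      -- list of gene names to consider
    """
    regulators = {gene: set() for gene in genes}
    for tf, targets in tf_targets.items():
        for gene in targets:
            if gene in regulators:
                regulators[gene].add(tf)
    counts = {}
    for a in genes:
        for b in genes:
            if a != b:
                counts[(a, b)] = len(regulators[a] & regulators[b])
    return counts
```

A pair with a high count would then be treated as a more plausible predictor pair than a pair with no TF in common.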

To summarize the discussion above, the objective function takes the form

Effectively, for each target gene

Finally, from a computational point of view, we remark that all implementations and calculations have been performed on an ordinary laptop in the languages R and Matlab. That is, the problem is computationally light enough to be handled in any laboratory.

We thank the organizers of the DREAM project, and especially the DREAM3 challenges, Gustavo Stolovitzky and colleagues