Figure 1.
Rationale of the experimental outline.
The experimental set-up comprised the two Arabidopsis thaliana accessions C24 and Col-0, each of which served in turn as the first parent of the testcrosses. One of 359 recombinant inbred lines (RILs) derived from the two original accessions acted as the second parent. The analysis used the metabolic profiles and genetic markers of the RILs to predict the relative mid-parent heterosis (rMPH) in biomass. The latter is defined as the relative biomass gain of the testcross compared to the mean biomass of its parents (cf. Methods section for details) and therefore does not manifest itself until the next generation. P1 denotes the shoot biomass of the first parent, RIL the shoot biomass of the particular recombinant inbred line and TC_RIL the shoot biomass of the corresponding testcross. The function mean(•, •) refers to the arithmetic mean of the respective values.
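The verbal definition above can be written out as a small calculation. The following is a minimal sketch, assuming the conventional form rMPH = (TC_RIL - mean(P1, RIL)) / mean(P1, RIL); the Methods section remains the authoritative definition. The function name and the biomass values are hypothetical and serve only as illustration.

```python
def relative_mid_parent_heterosis(p1, ril, tc_ril):
    """Relative biomass gain of the testcross over the mid-parent value.

    p1, ril and tc_ril are shoot biomasses of the first parent, the RIL
    and the corresponding testcross (assumed to share the same unit).
    """
    mid_parent = (p1 + ril) / 2.0  # arithmetic mean of the two parents
    if mid_parent == 0:
        raise ValueError("mid-parent biomass must be non-zero")
    return (tc_ril - mid_parent) / mid_parent


# Hypothetical biomass values, not taken from the study.
example = relative_mid_parent_heterosis(p1=10.0, ril=14.0, tc_ril=15.0)
```

A positive value indicates that the testcross outperforms the mean of its parents; a negative value indicates the opposite.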
Figure 2.
Predictive power of genetic (A, B), metabolic (C, D) and combined (E, F) marker sets for C24- and Col-heterosis.
The diagrams illustrate the trade-off between overfitting and loss of information in the different models. The x-axis represents the number of predictors used to train the respective model. Red dots display the predictive power of the particular model in leave-one-out validation (LOOV). Panels A, C and E correspond to C24-heterosis, the remaining ones to Col-heterosis. The number of predictors that maximizes the predictive power is referred to as Opt, and it differs between the models. In each case, the predictive power decreases when too many predictors are incorporated in the corresponding model. This effect is due to overfitting. Conversely, loss of information occurs if too few predictors are selected. Min refers to the minimal number of predictors that does not yet imply a significant loss of predictive power, i.e. the corresponding predictive power is still within the estimated confidence interval (gray lines) of the maximal predictive power. Black dots illustrate the estimation of these confidence intervals. They represent the predictive power of the optimal predictor set when using jackknife resampled data (cf. Methods section for details).
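The scan over predictor numbers described in this legend can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: ordinary least squares stands in for the model actually used, predictors are ranked by absolute correlation with the response as a stand-in for the VIP ranking, and predictive power is taken to be the Pearson correlation between leave-one-out predictions and observed values. The data are synthetic and all names are hypothetical.

```python
import numpy as np


def loo_predictive_power(X, y, k):
    """Leave-one-out predictive power using the k top-ranked predictors."""
    n = len(y)
    predictions = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        X_train, y_train = X[train], y[train]
        # Rank predictors on the training fold only (stand-in for VIP).
        Xc = X_train - X_train.mean(axis=0)
        yc = y_train - y_train.mean()
        score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        top = np.argsort(score)[::-1][:k]
        # Ordinary least squares with intercept on the selected predictors.
        A = np.column_stack([np.ones(train.sum()), X_train[:, top]])
        coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        predictions[i] = coef[0] + X[i, top] @ coef[1:]
    # Assumed measure: correlation of held-out predictions with observations.
    return np.corrcoef(predictions, y)[0, 1]


# Synthetic data: only the first two of 20 predictors carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=60)

power_by_k = {k: loo_predictive_power(X, y, k) for k in (1, 2, 5, 10, 20)}
opt = max(power_by_k, key=power_by_k.get)  # analogue of "Opt" in the legend
```

Ranking the predictors inside each fold keeps the held-out sample from influencing the selection, which would otherwise inflate the estimated predictive power.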
Table 1.
Predictive power (PP) in leave-one-out validation of the optimal predictor selection for the relative mid-parent heterosis in each of the two testcross set-ups.
Figure 3.
Discrete VIP of genetic markers for Col-heterosis and C24-heterosis prediction, including overlap with QTL.
Each of the 110 genetic markers and its position on a chromosome (Chr) is represented by two circles. The circle size indicates high (large circles), medium and low (small circles) VIP in the genetic model for Col-heterosis (circle left of chromosome) and C24-heterosis prediction (circle right of chromosome). We refer to the VIP as high if the corresponding marker is contained in the minimal genetic model and as medium if it is contained in the optimal genetic model. Markers specific for Col-heterosis prediction and those specific for C24-heterosis prediction are coloured in dark blue and in light blue, respectively. To allow positional comparison, support intervals of biomass QTL and three kinds of heterosis-related QTL detected by Meyer et al. (2008, submitted) are plotted as coloured boxes along the chromosomes (grey: biomass, green: Z2 [40], dark blue: absolute mid-parent heterosis concerning Col-0, referred to as aMPH_Col, light blue: absolute mid-parent heterosis concerning C24, referred to as aMPH_C24). Horizontal lines represent the positions of genes directly involved in reactions including metabolites that are contained in the respective minimal metabolite model and the combined genetic-metabolic model (Col: left of chromosomes, C24: right of chromosomes). For specific metabolites, genes were coloured accordingly.
Table 2.
Metabolic markers highly predictive (pred) in both testcross (TC) populations and those specific for heterosis (het) prediction in one particular testcross set-up, each in alphabetical order.
Figure 4.
Histograms of particular metabolic markers over all 359 investigated RILs.
Unit area histograms are presented here, i.e. each curve shows proportions rather than absolute numbers and thus constitutes a simple density estimate. The x-axis shows normalized metabolite levels and is divided into equidistant intervals. The y-axis represents the relative frequency per interval. Panels A and B show the two metabolic markers with the highest VIP in each investigated model, i.e. Unknown 31 (a functional group prediction service offered by the Golm Metabolome Database [41] predicted at least one hydroxyl group to be present in Unknown 31) and Cellobiose. The levels of these highly predictive metabolic markers clearly deviate from normal distributions; namely, they display bimodal distributions. The deviation from a normal distribution seems to abate with decreasing importance of the particular metabolic marker in the models. This is demonstrated by the two examples C and D of metabolic markers, which have on average the lowest VIP in our models.
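The unit area normalization described above can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic bimodal data, not the measured metabolite levels; the bin count of 20 and the mixture parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic bimodal "metabolite levels" for 359 lines (illustrative only).
levels = np.concatenate([
    rng.normal(-1.5, 0.5, 180),
    rng.normal(1.5, 0.5, 179),
])

# density=True rescales the bar heights so that the total area equals one.
density, edges = np.histogram(levels, bins=20, density=True)
widths = np.diff(edges)  # equidistant intervals along the x-axis
area = float(np.sum(density * widths))
```

Because the bar heights are divided by both the sample size and the bin width, histograms of markers with different value ranges remain directly comparable.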
Figure 5.
Simplified display of the idea behind the modelling for the different predictor sets.
The chain of causality from genes to phenotype is displayed here. Since genes are at the starting point of the causal chain, one established way to model a phenotype Y is to use genetic markers X and combine them linearly (genetic model in red). Using instead predictor variables Z that are close to the phenotype, such as metabolites, presents another promising way to predict a complex phenotype, since those variables already integrate parts of the complex gene interactions (represented by products Xi*Xj). The advantage is that we do not need to know which genes actually interact in which way, and that the model can stay simple. It just linearly combines metabolite variables, thus integrating non-linear interactions indirectly (metabolite model in blue). In the end, one might use a combination of both approaches, i.e. combining different levels of the causal chain, to explain as much as possible of the complex phenotype. Hence one integrates linear relationships concerning the response as well as non-linear gene interactions, while retaining the simple model ansatz of a linear combination of genetic predictors X and metabolite variables Z (combined model in violet).
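The three model types described in this legend can be summarized schematically. The notation below is a sketch: Y, X, Z and the products Xi*Xj are taken from the legend, whereas the coefficient symbols (alpha, beta, gamma, delta), the functions f_k and the index ranges p and q are assumptions introduced only for illustration.

```latex
\begin{align*}
  \text{genetic model:}    \quad & Y \approx \alpha_0 + \sum_{i=1}^{p} \alpha_i X_i \\
  \text{metabolite model:} \quad & Y \approx \beta_0 + \sum_{k=1}^{q} \beta_k Z_k,
      \qquad Z_k = f_k(X_1, \dots, X_p) \ \text{possibly containing terms } X_i X_j \\
  \text{combined model:}   \quad & Y \approx \gamma_0 + \sum_{i=1}^{p} \gamma_i X_i + \sum_{k=1}^{q} \delta_k Z_k
\end{align*}
```

All three expressions are linear in their predictors; any non-linear gene interaction enters only implicitly through the metabolite variables Z_k.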