Conceived and designed the experiments: JAH MMWBH JAW AS. Performed the experiments: MJvdW. Analyzed the data: JAH. Contributed reagents/materials/analysis tools: MJvdW. Wrote the paper: JAH MMWBH JAW AS. Group Leader: AS RB MJvdW.

Current address: Biometris, Wageningen University, Wageningen, The Netherlands

The authors have declared that no competing interests exist.

One of the new, expanding areas in functional genomics is metabolomics: measuring the metabolome of an organism. Data generated in metabolomics studies are very diverse in nature, depending on the design underlying the experiment. Traditionally, variation in measurements is conceptually broken down into systematic variation and noise, where the latter contains, e.g., technical variation. There is increasing evidence that this distinction does not hold (or is too simple) for metabolomics data. A more useful distinction is between informative and non-informative variation, where informative relates to the problem being studied. Most common methods for analyzing metabolomics (or any other high-dimensional x-omics) data ignore this distinction, which severely hampers the analysis. This leads to poorly interpretable models and may even obscure the relevant biological information. We developed a framework from first data analysis principles by explicitly formulating the problem of analyzing metabolomics data in terms of informative and non-informative parts. This framework allows for flexible interactions with the biologists involved in formulating prior knowledge of underlying structures. The basic idea is that the informative parts of the complex metabolomics data are approximated by simple components with a biological meaning, e.g. in terms of metabolic pathways or their regulation. Hence, we termed the framework ‘simplivariate models’, which constitutes a new way of looking at metabolomics data. The framework is given in its full generality and exemplified with two methods, IDR analysis and plaid modeling, that fit into the framework. Using this strategy of ‘divide and conquer’, we show that meaningful simplivariate models can be obtained using a real-life microbial metabolomics data set. For instance, one of the simple components contained all the measured intermediates of the Krebs cycle of

Modern instrumental methods have driven significant advances in biological research. Especially in the field of functional genomics, transcriptomics and proteomics measurements have provided fundamental insight into many biological processes. The missing link between these measurements and the phenotype is called metabolomics

Traditionally, a set of measurements is analyzed by postulating a model that describes the systematic variation and treating the leftovers (residuals) as random. Due to the complexity of metabolomics data, this concept breaks down. There are many sources of variation in the data

Our assumption is that the studied biological phenomena are not represented by all measured metabolites, but that simple structures (subsets of related metabolites) in (parts of) the data exist, each simple structure or component describing an underlying biological phenomenon. In the development of our discovery tool we are aiming for a method that fulfills the following requirements: i) being able to identify simple structures, in which just a limited number of metabolites are represented by the structure; ii) representing each simple structure by a model, the type of model depending on the data collected and driven by

We have called this new approach

From left to right: the univariate, simplivariate and multivariate approach.

Although the simplivariate framework is general and can be used in exploratory analysis, regression analysis and discriminant analysis, in this paper we will focus on exploratory methods. Usually, exploratory data analysis of metabolomics data uses one of two types of techniques: projection (dimension reduction) methods or clustering methods. The first type of technique (with Principal Components Analysis (PCA) as an example) searches for structures consisting of highly co-varying metabolites to construct new representations of the data

First, the simplivariate modeling framework will be presented in its full generality. Next, two techniques that fit into that framework will be discussed using real-life metabolomics data. Finally, shortcomings of these methods will be discussed and suggestions for improvement will be given.

A flexible framework is built by defining a

In which every element _{ij}_{ijk}_{ij}_{ijk}_{ij}_{ij}_{ijk}

Here _{jk}_{ik}_{jk} = 1 if variable _{ik} = 1 if object

For simplicity we have used the same symbol _{ijk}

When decomposing

The components _{ijk}_{ijk}

Combinations of representations Eq. 4 and 5 are also possible resulting in mixed models:

In equation (2) _{jk}_{ik}_{ijk}_{ik} is always 1:

The type of preprocessing applied to the data is influencing the outcome of an analysis

Several algorithms described in the literature can create simple models according to our definition in the previous sections. In this paper, we have chosen two algorithms, one representing the multiplicative and one the additive model class. In the following section, a short explanation of both methods will be given.

IDR

Set the

Look for the absolute lowest non zero value of

Calculate the inner product of the original loadings vector

Convert the inner product to an angle with the inverse cosine.

Repeat steps 2–5 until only the largest absolute value is left over.

The simplified

Calculate scores (

Repeat this procedure for all IDR components.

The final IDR model has the form:_{ik}_{jk}_{jk} are zero. This can be made explicit by writing_{jk}_{jk}
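The simplification steps listed above can be illustrated with a minimal sketch. It is not the authors' Matlab implementation. It assumes that retained loadings are replaced by their sign (+1 or −1) and dropped loadings by 0, that candidates are generated by zeroing the smallest absolute loadings one at a time, and that the candidate with the smallest angle to the original loadings vector is kept. The function names `simplify_loadings` and `scores`, and the projection used for the scores, are hypothetical choices made for this illustration.

```python
import math


def simplify_loadings(loadings):
    """Return (simple_vector, angle_in_degrees) closest in angle to `loadings`.

    Assumption: kept loadings become +1/-1 (their sign), dropped ones become 0.
    The loadings vector is assumed to be non-zero.
    """
    norm_p = math.sqrt(sum(v * v for v in loadings))
    # Indices ordered from the largest to the smallest absolute loading.
    order = sorted(range(len(loadings)), key=lambda j: -abs(loadings[j]))
    best = None
    for n_keep in range(len(loadings), 0, -1):
        simple = [0.0] * len(loadings)
        for j in order[:n_keep]:
            simple[j] = math.copysign(1.0, loadings[j])
        # Inner product of the two normalized vectors, then the inverse cosine.
        inner = sum(s * v for s, v in zip(simple, loadings))
        cosine = inner / (math.sqrt(n_keep) * norm_p)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
        if best is None or angle < best[1]:
            best = (simple, angle)
    return best


def scores(data, simple):
    """Project every row of `data` onto the normalized simple component."""
    norm = math.sqrt(sum(s * s for s in simple))
    return [sum(x * s for x, s in zip(row, simple)) / norm for row in data]


# Made-up loadings vector, for illustration only.
simple, angle = simplify_loadings([0.70, -0.65, 0.25, 0.10, -0.05])
print(simple, round(angle, 1))  # simple is [1.0, -1.0, 0.0, 0.0, 0.0]
```

With these made-up loadings, keeping only the two largest absolute values gives the smallest angle, so the simple component selects two variables with opposite signs.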

Plaid

The plaid model consists of a series of additive layers intended to capture the underlying structure of matrix _{k} is introduced to serve as a general mean (model (4) is essentially the same as model (10)

Here, _{ijk}_{ij}_{ij}_{0} is the background layer model for entire the entire data matrix

Choose starting values for

Update layer effects using plaid cluster estimate _{k}

Update cluster membership

Repeat steps 2–3 for s iterations.

Compute final layer effects as in step 2.

Prune the plaid cluster to remove ill-fitting metabolites.

Test _{k}_{k}

Subtract _{k}

Apply backfitting for each obtained plaid cluster

Apply pruning to remove ill fitting metabolites and continue at step 2

The above algorithm is the original Plaid algorithm. We used it with some adaptations to our circumstances:

We did not apply significance testing but selected 6 plaids for illustration.

We applied a one-step backfitting procedure.

We used γ_{j} = 1 throughout and, hence, did not have to optimize those values.

When residuals of selected metabolites after the plaid fit are larger than the prune fraction (0.70, see

Setting | Value |
Maximum iterations | 50 |
Number of permutations in significance testing | 25 |
Backfitting | one step |
Maximum number of layers | 6 |
Prune fraction | 0.7 |

Prune fraction: minimum proportional reduction in residual sum of squares required for cluster membership.
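The layer-effect update and the pruning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the original Plaid code: the layer is taken to be an additive model μ + α_i + β_j fitted by row and column means on a fixed set of objects and metabolites, and a metabolite is kept only if the layer reduces its sum of squares by at least the prune fraction. The function names `layer_effects` and `prune_columns` and the example data are hypothetical.

```python
def layer_effects(data, rows, cols):
    """Fit mu + alpha_i + beta_j on the submatrix given by `rows` and `cols`."""
    n, m = len(rows), len(cols)
    mu = sum(data[i][j] for i in rows for j in cols) / (n * m)
    alpha = {i: sum(data[i][j] for j in cols) / m - mu for i in rows}
    beta = {j: sum(data[i][j] for i in rows) / n - mu for j in cols}
    return mu, alpha, beta


def prune_columns(data, rows, cols, prune_fraction=0.7):
    """Keep the columns whose sum of squares the layer reduces enough.

    Assumption: the criterion is 1 - RSS_after / SS_before >= prune_fraction,
    evaluated per column (metabolite) over the rows in the cluster.
    """
    mu, alpha, beta = layer_effects(data, rows, cols)
    kept = []
    for j in cols:
        before = sum(data[i][j] ** 2 for i in rows)
        after = sum((data[i][j] - mu - alpha[i] - beta[j]) ** 2 for i in rows)
        if before > 0 and 1.0 - after / before >= prune_fraction:
            kept.append(j)
    return kept


# Made-up residual matrix: columns 0 and 1 share a row pattern, column 2 does not.
data = [
    [3.0, 4.0, -1.0],
    [2.0, 3.0, 1.0],
    [0.0, 1.0, 1.0],
    [-1.0, 0.0, -1.0],
]
print(prune_columns(data, rows=[0, 1, 2, 3], cols=[0, 1, 2]))  # [0, 1]
```

In this toy example the third column does not follow the common row pattern, so the layer increases rather than reduces its residuals and it is pruned.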

Plaid and IDR were programmed in Matlab 7.1

Metabolomics data have a highly dynamic range. Metabolites can have very different and very large concentration ranges. Some metabolite values will be zero because their concentrations are too low to detect under some experimental conditions. This indicates that metabolomics data are not purely multiplicative in nature and can benefit from removing column means.

For illustrative purposes, some metabolite measurements are plotted together with their additive and multiplicative fits. In an additive fit, the range of each metabolite is determined by the α_{i} values and is therefore the same for every metabolite. It is clear that an additive model has great difficulty modeling data with highly varying ranges for the metabolites. This justifies scaling of the data. The offsets of these ranges are determined by the values of the β_{j}'s. The range of a multiplicative fit can be more dynamic since it is determined by the product of the α_{i}'s and β_{j}'s. Additive and multiplicative simple components clearly behave differently.
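The different behavior of additive and multiplicative simple components can be checked numerically. The sketch below is an illustration under assumptions, not the analysis used in the paper: the additive fit is μ + α_i + β_j estimated from row and column means, the multiplicative fit is a rank-one product α_i β_j estimated by alternating least squares, and the data are made-up concentrations with ranges that differ by a factor of 100. All function names are hypothetical.

```python
def additive_fit(data):
    """Fit mu + alpha_i + beta_j using row and column means."""
    n, m = len(data), len(data[0])
    mu = sum(map(sum, data)) / (n * m)
    alpha = [sum(row) / m - mu for row in data]
    beta = [sum(data[i][j] for i in range(n)) / n - mu for j in range(m)]
    return [[mu + alpha[i] + beta[j] for j in range(m)] for i in range(n)]


def multiplicative_fit(data, n_iter=100):
    """Fit alpha_i * beta_j by alternating least squares.

    Assumption: the data are positive, so the all-ones start is not
    orthogonal to the solution and no division by zero occurs.
    """
    n, m = len(data), len(data[0])
    beta = [1.0] * m
    for _ in range(n_iter):
        bb = sum(b * b for b in beta)
        alpha = [sum(data[i][j] * beta[j] for j in range(m)) / bb for i in range(n)]
        aa = sum(a * a for a in alpha)
        beta = [sum(data[i][j] * alpha[i] for i in range(n)) / aa for j in range(m)]
    return [[alpha[i] * beta[j] for j in range(m)] for i in range(n)]


def column_ranges(matrix):
    """Range (max - min) of every column, i.e. of every metabolite."""
    return [max(col) - min(col) for col in zip(*matrix)]


# Three objects (rows), three metabolites (columns) with very different ranges.
data = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
print(column_ranges(data))                      # [2.0, 20.0, 200.0]
print(column_ranges(additive_fit(data)))        # the same range, 74, for every column
print(column_ranges(multiplicative_fit(data)))  # close to [2, 20, 200]
```

The additive fit assigns every metabolite the same range because only the α_{i} values vary over the objects, whereas the multiplicative fit rescales that range per metabolite through β_{j}.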

Data are taken from the E. coli data set used in the remainder of this paper. The whiskers indicate the total concentration range for each of the 10 metabolites. Each metabolite is represented three times. The left black lines for each metabolite are the actual concentrations. The middle red lines indicate the fit with an additive model. The right blue lines indicate the fit with a multiplicative model.

Solid line represents IDR components, dotted line represents PCA components. See text for explanation.

The values of the loadings are indicated by a grayscale color as indicated by the colorbar. The grouping of metabolites is identical to the grouping of the plaid solution for clarity.

The optimum is chosen where the angle between the simple component and principal components is minimal. This is indicated by a dotted line and an asterisk.

Black squares indicate a +1, white indicates a zero, grey indicates a −1. The grouping of metabolites is identical to the grouping of the plaid solution for clarity.

Black squares indicate a +1, white indicates a zero. Results have been grouped as much as possible for clarity.

Although IDR and plaid have different underlying models, multiplicative or additive, there are similarities between the IDR components in

Positive correlations are indicated by a white square, negative correlations are indicated by a black square.

There are too many metabolites present in each IDR component to come to a meaningful analysis of the IDR results. However, the plaid clusters are relatively simple and contain biologically meaningful metabolite clusters. For instance, the first plaid cluster contains all intermediates of the Krebs cycle whose concentration is above the detection limit in this data set, i.e. fumarate, malate, 2-ketoglutarate, and citrate (

Another example is plaid cluster 4 that contains many intermediates of the phenylalanine biosynthesis pathway, i.e. erythrose-4-phosphate, 3-dehydroquinate, shikimate-3-phosphate, chorismate, phenylpyruvate, and phenylalanine itself, and several compounds which are side routes of this pathway, i.e. 3-phenyllactate, and tyrosine. Interestingly, prephenate, an intermediate at the splitting point of the phenylalanine and tyrosine biosynthesis routes, is not clustered in plaid cluster 4 but in plaid cluster 3. In contrast, when analyzing this data set by IDR, all the phenylalanine-related intermediates described above, including prephenate, end up in the same IDR component, i.e. IDR component 3 (

The most useful results are obtained with plaid which models (patches of) data with an additive model while IDR uses a multiplicative model. It is possible to mix both models to obtain a mixed model representation (see section on simple structures, model number 4). Mixed models might also help to further strengthen the plaid clusters. Additive plaid models can only contain positively correlated metabolite concentrations, while metabolites that are negatively correlated can still be part of the same biochemical process.
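The limitation of additive layers for negatively correlated metabolites can be illustrated with a small sketch. It assumes an additive layer μ + α_i + β_j fitted by row and column means and, as the multiplicative counterpart, a component with fixed loadings of +1 and −1. The function names and the two-metabolite data set are hypothetical and chosen only for illustration.

```python
def additive_layer(data):
    """Fit mu + alpha_i + beta_j using row and column means."""
    n, m = len(data), len(data[0])
    mu = sum(map(sum, data)) / (n * m)
    alpha = [sum(row) / m - mu for row in data]
    beta = [sum(data[i][j] for i in range(n)) / n - mu for j in range(m)]
    return [[mu + alpha[i] + beta[j] for j in range(m)] for i in range(n)]


def signed_component(data, signs):
    """Fit t_i * p_j with fixed loadings p_j (here +1 or -1) by least squares."""
    pp = sum(p * p for p in signs)
    scores = [sum(x * p for x, p in zip(row, signs)) / pp for row in data]
    return [[t * p for p in signs] for t in scores]


def rss(data, fit):
    """Residual sum of squares between the data and a fitted matrix."""
    return sum((x - f) ** 2 for dr, fr in zip(data, fit) for x, f in zip(dr, fr))


# Two perfectly negatively correlated (centered) metabolites over four objects.
data = [[2.0, -2.0], [1.0, -1.0], [-1.0, 1.0], [-2.0, 2.0]]
print(rss(data, additive_layer(data)))                 # 20.0: nothing is explained
print(rss(data, signed_component(data, [1.0, -1.0])))  # 0.0: exact fit
```

Because the two columns cancel in every row mean, all additive effects are zero and the additive layer explains nothing, while the signed multiplicative component reproduces the data exactly.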

The presented framework provides a good basis for simplivariate data analysis models. The two presented methods, IDR and Plaid, fit well in this framework. IDR selects too many metabolites, which makes it rather ineffective for creating more interpretable models. This selection is intrinsic to the method and cannot be tuned. Plaid, on the other hand, was shown to be very effective in creating clusters with distinct biochemical meanings. This shows that the concept of simplivariate models is valuable.

The Plaid models also have shortcomings, notably their inability to model metabolites that belong to the same process but have either positive or negative correlations. This can possibly be overcome by using simple components with a mixed-model structure. Moreover, the pruning mechanism in plaid, which prevents too many metabolites from being selected in a plaid cluster, remains a crude way of cleaning up a solution. It is inefficient to first create large plaid clusters (at a certain computational cost) and then shrink them after they are finished. Optimizing a plaid cluster more carefully should prevent this. This will be the subject of further research.

The framework allows any simple component structure to be included in the simplivariate model. When some of the metabolites are known to be linked in certain experiments by interlinked pathways and/or co-regulation, they can be forced into one simple component with a structure reflecting these pathways or this co-regulation. Metabolic network information can also be used to choose simple component structures. All these extensions are the subject of a follow-up paper.

Matrix

Sizes:

Group memberships: δ_{jk} = indicator for group membership of variable _{jk} = 1 if variable _{ik} = indicator for group membership of object _{ik} = 1 if object

PCA-scores: _{r} (_{ir}. (

PCA-loadings: _{r}, _{jr}.