Consistent Estimation of Gibbs Energy Using Component Contributions

Standard Gibbs energies of reactions are increasingly being used in metabolic modeling for applying thermodynamic constraints on reaction rates, metabolite concentrations and kinetic parameters. The increasing scope and diversity of metabolic models has led scientists to look for genome-scale solutions that can estimate the standard Gibbs energy of all the reactions in metabolism. Group contribution methods greatly increase coverage, albeit at the price of decreased precision. We present here a way to combine the estimations of group contribution with the more accurate reactant contributions by decomposing each reaction into two parts and applying one of the methods on each of them. This method gives priority to the reactant contributions over group contributions while guaranteeing that all estimations will be consistent, i.e. will not violate the first law of thermodynamics. We show that there is a significant increase in the accuracy of our estimations compared to standard group contribution. Specifically, our cross-validation results show an 80% reduction in the median absolute residual for reactions that can be derived by reactant contributions only. We provide the full framework and source code for deriving estimates of standard reaction Gibbs energy, as well as confidence intervals, and believe this will facilitate the wide use of thermodynamic data for a better understanding of metabolism.


Training data
The component contribution method is based on a machine learning algorithm, and completely relies on empirical data in order to infer the parameters needed for later estimation. This empirical data is called the training data, and the inference of parameters is called the training stage.
Nearly all Gibbs energy measurements, for compounds and reactions in aqueous solutions at near-room temperature, are derived from the equilibrium constants of enzyme-catalyzed reactions. Typically, an enzyme that specifically catalyzes a certain reaction is purified and added to a medium that contains the reaction substrates. After the reaction reaches equilibrium, all reactant concentrations are measured. The equilibrium constant K is defined as the ratio between the product of all product concentrations and the product of all substrate concentrations. Since there is no easy way to distinguish between pseudoisomers of the same compound, the concentration of every reactant is actually the sum of all its protonation states. Therefore, K is the apparent equilibrium constant, which is related to the standard transformed Gibbs energy of reaction (∆ r G • [1]). The problem with using this data as-is lies in the fact that K and ∆ r G • depend on the aqueous environment (e.g. pH, ionic strength, and pMg). The measurements listed in TECRDB span a wide range of pH and ionic strength values, and many of the reactions have only been measured in non-standard conditions.
In order to standardize the training data and eliminate the effects of pH and ionic strength, we apply an inverse Legendre transform [2] to convert ∆ r G • into ∆ r G • (i.e. from transformed Gibbs energies to chemical Gibbs energies). As with the forward Legendre transform, this process requires the pK a values for each of the reactants, provided by Caluclator Plugins from ChemAxon.
The TECRDB contains close to 4000 unique values of K (for about 400 reactions in different conditions). In addition, the standard chemical Gibbs energies of formation for another ≈ 200 compounds are listed in the literature [3,4]. All this training data is collected and stored in a 1362 by 4146 stoichiometric matrix (S ), along with the 4146 Gibbs energies (∆ r G • obs ) corresponding to the columns of S.

Errors associated with the inverse Legendre transform
It is difficult to verify that effectiveness of the inverse Legendre transform in normalizing the effects of pH and ionic strength. This procedure relies on the reported pH and I that appear in TECRDB, and which are not always correct or even missing altogether. In addition, we ignore the effects of temperature and metal ions in the solution. Perhaps the biggest uncertainty, however, comes from the proton dissociation constants (pK a values) which we obtain from a 3rd party software tool by ChemAxon. In theory, if the inverse transform were 100% correct (and all measurements were exact), then multiple measurements of K , for the same reaction at different conditions, would all yield the exact same value for ∆ r G • . To test that, we compared between the standard deviations of the observed Gibbs energies before and after the inverse transform, namely σ(∆ r G • obs ) versus σ(∆ r G • obs ). We only looked at reactions with more than 5 repeated measurements. We found that there was a reduction in the median standard deviation from 2.2 kJ/mol to 1.5 kJ/mol (see Figure S1). These results indicate that the condition dependency of K has been somewhat normalized but that there is still noise in the data, probably attributed both to measurement errors and inaccuracies in the inverse Legendre transform.

Weighing the training observations
Since we are combining multiple sources of standard Gibbs energies (formation energies, reduction potentials, and inverse-transformed reaction energies from hundreds of publications) we allow our method to factor in how reliable each observation is. Prior to applying the component contribution method, each row of the stoichiometric matrix S is scaled by the confidence of that observation and the corresponding observed values (∆ r G • obs ) are scaled accordingly. The results reported in this paper were derived when applying a fixed weight (e.g. 1) to all observations.

Group decomposition
In all group contribution methods, including the component contribution method, molecules are decomposed into non-overlapping structural groups. The contribution of each group to the overall Gibbs energy of the molecule is assumed to be a constant value, independent of molecular environment or the neighboring groups. The implementation by Jankowski et al. [5] added interaction factors, which are similar to groups except that they are allowed to overlap with other groups and with each other. In [6], we introduced the notion of pseudoisomeric groups, which are sensitive to the protonation state of the atoms in the group. For the purpose of group decomposition, and the derivation of the group incidence matrix G, we use here the exact same group definitions as in [6].  Figure S1. Cumulative distribution functions of the standard deviations for ∆ r G • and ∆ r G • . The biochemical Gibbs energies (∆ r G • ) directly derived from the apparent K in TECRDB has a higher median standard deviation, 2.2 kJ/mol, than the Gibbs energies after applying the inverse transform (∆ r G • ) -which is 1.5 kJ/mol. Only reactions with more than 5 measurements were considered for this comparison.

Full mathematical derivation of the component contribution method
In order to derive the main result for the component contribution method (Eq. 10 in the main text) we reiterate the set of definitions required. Let S ∈ R m×n be the stoichiometric matrix of measured reactions, G ∈ R m×g be the group incidence matrix, ∆ r G • obs ∈ R n×1 be the measured standard Gibbs energies of the reactions in S, and x ∈ R m be a stoichiometric vector for a new reaction. Define P R(S) , P N (S ) ∈ R m×m as the orthogonal projection matrices onto the range of S and the null space of S , respectively. Define x R ≡ P R(S) · x and x N ≡ P N (S ) · x. Equations 3 and 8 (from the main text) state that Therefore, the component contribution estimation for the standard Gibbs energy of x will be We can thus say the component contribution method is equivalent to a linear regression problem where the contribution coefficients are given by

Estimation of error in group model
The group model in Eq. 7 (main text) relies on the assumption that the standard Gibbs energy of a compound is a simple linear combination of the Gibbs energy contributions of substructures in that compound. Throughout the main text, we have referred to this as the assumption of group additivity. The error resulting from this assumption, which we will refer to as modeling error, can be estimated by decomposing the residual in group contribution fitted Gibbs energies for reactions in S, into two components; one corresponding to experimental error and the other to modeling error. The component corresponding to experimental error is exactly the residual in the reactant contribution fitted Gibbs energies for reactions in S, An estimate of the modeling error e m is therefore given by the difference between the two residuals To clarify we reiterate that ∆ r G • rc , given by is the closest point to ∆ r G • obs that is in the range of S and therefore consistent with the first law of thermodynamics (see main text section Reactant contribution method ). Since we assume that the first law holds, we must assume that any component of ∆ r G • obs that is orthogonal to R S is due to experimental error. The residual e rc is the orthogonal projection of ∆ r G • obs onto N (S) which is the orthogonal complement of R S . This can be seen by inserting Eq. 5 into Eq. 3; e rc ⊥ R S is therefore an estimate of experimental error in ∆ r G • obs . Group contribution fitted Gibbs energies for reactions in S are given by where P R(S G) = S G S G + is the orthogonal projector onto the range of S G. The residual of the fit is with e m = P R(S ) − P R(S G) · ∆ r G • obs . The modeling error e m is therefore the part of ∆ r G • obs that is consistent with the first law (i.e., is in R S ) but is not explained by the group model (is not in R S G ).
Note that e m ⊥ e rc since (see Section 4.1 for a proof that P N (S) · P N (G S) = P N (S) ). Therefore, It is important to emphasize that the residual e rc is only an estimate of experimental error for several reasons. Systems of biochemical reactions may deviate slightly from the assumptions underlying linear regression, but we assume such deviations are small. More importantly, some error may be introduced by the inverse Legendre transform of the experimental data (see Section 1). Since any such error would contribute equally to e gc , however, this would not affect our estimate of e m . Even if the inverse transform introduced no error, it is possible that the orthogonal projection of ∆ r G • obs did not give the true ∆ r G • for reactions in S. The true ∆ r G • may be some other point in R S that is further away from ∆ r G • obs . Our estimate of e m would then be offset by the same distance. Lastly we note that, although it is unlikely that error due to the assumption of group additivity can be avoided in a linear model such as the group model, it is possible that a different choice of groups would lead to a reduction in e m .

Error in current group model
We estimated the root-mean-square modeling error for the group model used in the current publication as ||e m || 2 n − rank(S G) = 6.8 kJ/mol.
The modeling error was approximately normally distributed ( Figure S2a) with mean ± stdev = -0.4 ± 6.6. The magnitude of the error was only weakly correlated with the length of group vectors for reactions in S (Pearson's correlation coefficient = 1.9, Figure S2b).

Reaction type statistics
The uncertainty in our estimations depends on the quality of the training data (i.e. the measurement error), the amount of examples for each parameter and the crudeness of the assumptions made throughout the evaluation. The hierarchical nature of the component contribution method can be understood by defining four sets of reactions (where S is the stoichiometric matrix for the training data and G is the group incidence matrix): A: Reaction x is part of the training set (i.e. appears in TECRDB). This is equivalent to saying x is one of the columns in S.
B: Reaction x can be constructed by a linear combination of reactions in the training set (i.e. x is in the range of S). This is equivalent to saying x ⊥ N (S ).
C: The group decomposition of reaction x can be constructed by a linear combination of reactions decompositions in the training set. This is equivalent to saying G x ⊥ N (S G). D: x is a reaction.
These four sets have the following layered relationship: A ⊆ B ⊆ C ⊆ D. The first and last relations are trivial. To prove that B ⊆ C, we only need to see that ∀x ∈ B, ∃v such that Sv = x. Then to show that G x ⊥ N (S G), it is enough to see that ∀y ∈ N S G ⇒ G x y = v S G y = v S G y = 0. This hierarchy of reactions is ordered by increasing uncertainty levels. Estimates for reactions in A have the lowest uncertainty since these reactions have been directly measured. Estimates for reactions in B have a slightly higher uncertainty, since some deduction is made in order to get the value of ∆ r G • , namely an inverse Legendre transform (which depends on pK a data) and the projection of the observations onto the column-space of S. CC will apply the RC method exclusively for all reactions in this set. Estimates for reactions in C generally have a higher uncertainty than estimates for reactions in B, as we need to assume that the group decomposition is the only parameter affecting ∆ r G • -i.e. that each group has a defined ∆ gr G • regardless of its surroundings within the molecule. Note, that every reaction in B is also in C and CC will always use the more precise method wherever possible. Furthermore, if x is in C but not completely orthogonal to N (S ), CC will use GC for the projection of x onto N (S ) and RC for the rest (the part which is orthogonal to this null-space, i.e. contained in B). Reactions that are not in C cannot be evaluated at all, and thus the uncertainty in their ∆ r G • is ∞. A summary of the four sets of reactions are given in Table S1, along with the fractions of reactions in iAF1260 and Recon 1 that belong to each set. Table S1. Statistics for the four sets of reactions.

Fraction of reactions Set
Definition 6 Prediction of flux distributions

iAF1260
Flux distributions in iAF1260 were predicted using loopless flux balance analysis (ll-FBA) [7]. ll-FBA was used in preference to standard FBA [8] to avoid artificially high fluxes through reactions in thermodynamically infeasible loops. Thermodynamics-based metabolic flux analysis (TMFA) [9] was not used to avoid biasing solutions with our estimated Gibbs energies. A total of 312 flux distributions were predicted for iAF1260, corresponding to optimal growth on 174 carbon sources, 78 nitrogen sources, 49 phosphate sources, and 11 sulfur sources that the model was previously shown to grow on [10]. Constraints on exchange, sink and demand reactions in each simulation were the same as in [10]. All simulations were done in Matlab (R2009b, The MathWorks, Natick, MA) with the Gurobi solver (version 5.0.1, Gurobi Optimization, Inc., Houston, TX).

Recon 1
As growth is not the primary objective of mammalian cells, we did not simulate optimal growth on different nutrient sources for Recon 1. Instead, we predicted optimal flux distributions for 288 metabolic objectives designed to validate Recon 1 [11]. The size of Recon 1 does not allow efficient application of loopless FBA. We therefore applied standard FBA. In an attempt to find loopless flux distributions we searched among all alternative optimal distributions for one with a minimum taxicab norm. As fluxes through reactions in thermodynamically infeasible loops are usually close to the maximum allowed value, flux distributions including such loops are expected to have a greater taxicab norm. Minimizing the taxicab norm will only yield a loopless flux distribution if one exists at the optimum. As a further step to avoid loops, we therefore eliminated all flux distributions where flux through at least one reaction was at the maximum allowed value. This step eliminated all except 97 predicted flux distributions.

Calculation of confidence and prediction intervals
In this section we summarize the statistical theory underlying calculation of confidence intervals for the true values of standard reaction Gibbs energies, and prediction intervals for observations of standard reaction Gibbs energies. The summary is based on the textbook by Kutner et al. [12]. We focus the summary on reactant contribution, as it is the simplest linear regression model we work with. Results for group contribution are always analogous to those for reactant contribution. We presented the main results for component contribution in the main text.

Assumptions
The regression model for reactant contribution is where ∆ r G • obs ∈ R n×1 is a vector of observed standard reaction Gibbs energies, S ∈ R m×n is the stoichiometric matrix for reactions in ∆ r G • obs , ∆ f G • ∈ R m×1 is the vector of standard Gibbs energies of formation for metabolites in S, and ε rc ∈ R n×1 is a vector of random errors. We assume that the error ε rc is normally distributed, with expected value E (ε rc ) = 0, and covariance matrices σ 2 (ε rc ) = σ 2 rc I n , where σ 2 rc is a constant and I n is the n × n identity matrix. Since all element of both S and ∆ f G • are constants, we have that ∆ r G • obs is also normally distributed with and σ 2 (∆ r G • obs ) = σ 2 ( rc ) = σ 2 rc I n . These assumptions about the distributions of errors and observed standard reaction Gibbs energies, apply to any reaction vector x ∈ R n×1 , whether it is a column of S or a new reaction. Observed standard Gibbs energies for x are assumed to be distributed as ∆ r G • obs,x ∼ N x · ∆ f G • , σ 2 rc , and the errors in those observations are assumed to be distributed as ε rc,x ∼ N 0, σ 2 rc . According to the first law of thermodynamics, the true standard Gibbs energies ∆ r G • of reactions in S are given by Comparison with Eq. 11 shows that E (∆ r G • obs ) = ∆ r G • . The same applies to an arbitrary reaction vector x i.e., that E ∆ r G • obs,x = ∆ r G • x . An estimate of E ∆ r G • obs,x is therefore also an estimate of ∆ r G • x .

Estimation of the distribution of ∆ r G • obs
We use the method of least-squares to estimate E (∆ r G • obs ) and σ 2 rc . The least-squares fit of the reactant contribution model (Eq. 11) to ∆ r G • obs gives an estimate of E (∆ r G • obs ): The variance σ 2 rc is estimated as where e rc = ∆ r G • obs − ∆ r G • rc is the residual of the fit in Eq. 14. The reactant contribution estimate of E ∆ r G • obs,x for an arbitrary reaction vector x is s 2 rc from Eq. 15 gives an estimate of the variance in ∆ r G • obs,x , since we assumed that the variance was the same for all x (see Subsection 7.1).

Confidence intervals for
rc,x in Eq. 16 is an estimate of E ∆ r G • obs,x , and thus also of ∆ r G • x ; the true standard Gibbs energy for reaction vector x. However, it is only a point estimate, which is dependent on the particular sample of observations ∆ r G • obs in our training set. If we were to repeat the measurements of standard Gibbs energies for all reactions in S, we would get a different estimate of E ∆ r G • obs,x . If we repeated the same measurements an infinite number of times, we could construct the sampling distribution of ∆ r G • rc,x . The sampling distribution of ∆ r G • rc,x will be normal with mean E ∆ r G • rc,x = E ∆ r G • obs,x = ∆ r G • x , and variance σ 2 rc,x = σ 2 rc · x SS + x (see [12] for proof). We estimate E ∆ r G • rc,x as ∆ r G • rc,x , and σ 2 rc,x as The estimated standard deviation s rc,x = s 2 rc,x is sometimes referred to as the standard error of ∆ r G • rc,x , and we adopt this terminology here. The estimated parameters of the sampling distribution of ∆ r G • rc,x can be used to calculate confidence intervals for ∆ r G • x . At a specified confidence level γ ∈ [0%, 100%], the confidence interval for ∆ r G • x is where t γ,ν is the value of a t-distribution with ν = n − rank (S) degrees of freedom, at a cumulative probability of (100% + γ) /2. Since ν is large for our reactant contribution model, we can approximate t γ,ν as z γ ; the value of the standard normal distribution at a cumulative probability of (100% + γ) /2. The γ confidence interval for ∆ r G • x is thus approximated as

Prediction intervals for
We seek to validate the reactant contribution method against experimental data. The only experimental data available to us is ∆ r G • obs ; which we assume to be a vector of independent observations of standard reaction Gibbs energies. The appropriate way to validate the reactant contribution method is thus to test its ability to predict independent observations of standard reaction Gibbs energies. We assume that independent observations of the standard Gibbs energy for reaction vector x, are normally distributed with mean E ∆ r G • obs,x and variance σ 2 rc . In Subsection 7.2 we estimated E ∆ r G • obs,x as ∆ r G • rc,x , and σ 2 rc as s 2 rc . The estimated standard deviation of ∆ r G • obs,x is s rc = s 2 rc . Based on these estimates, we could mistakenly predict that approximately 68.4% of ∆ r G • obs,x would fall within the interval ∆ r G • rc,x ±s rc , approximately 95% would fall within the interval ∆ r G • rc,x ±1.96×s rc , and so on. In other words, we could assume that the γ prediction interval for ∆ r G • obs,x were ∆ r G • rc,x ± z γ s rc . The reason this is incorrect is that ∆ r G • rc,x is only a point estimate of E ∆ r G • obs,x . Prediction intervals for ∆ r G • obs,x must account for the variance σ 2 rc,x , of the sampling distribution for ∆ r G • rc,x . The correct way to calculate the γ prediction interval for ∆ r G • obs,x is therefore as ∆ r G • rc,x ± z γ s 2 rc + s 2 rc,x .
8 Symbols Table S2. Descriptions of used symbols.

Symbol
Description R the gas constant (8.31 J mol −1 K −1 ) T temperature (in K) K apparent reaction equilibrium constant Q reaction quotient ∆ f G • standard Gibbs energy of formation (in kJ/mol) ∆ r G • standard Gibbs energy of reaction (in kJ/mol) ∆ r G • standard transformed Gibbs energy of reaction (in kJ/mol) S the stoichiometric matrix of measured reactions G the group incidence matrix ∆ r G • obs observed standard Gibbs energy of measured reactions in S ε rc deviation of ∆ r G • obs from the unknown true Gibbs energies ε gc deviation of ∆ r G • obs from the group contribution assumption e rc / e gc residual values for the linear model used in RC/GC ∆ f G • rc RC estimates of standard Gibbs energies of formation for compounds in S ∆ g G • gc standard Gibbs energy contributions of the groups in G ∆ r G • rc / ∆ r G • gc / ∆ r G • cc RC/GC/CC fitted standard Gibbs energies for reactions in S ∆ r G • rc,x /∆ r G • gc,x /∆ r G • cc,x RC/GC/CC estimation for the standard Gibbs energy of reaction x P R(S) orthogonal projection matrix on the range of S P N (S ) orthogonal projection matrix on the null space of S s 2 rc / s 2 gc estimated variance of the error term ε rc / ε gc V rc / V gc the covariance matrix for RC/GC estimates s rc,x / s gc,x / s cc,x the standard error of ∆ r G • rc,x / ∆ r G • gc,x / ∆ r G • cc,x