Bayesian Weighting of Statistical Potentials in NMR Structure Calculation

The use of statistical potentials in NMR structure calculation improves the accuracy of the final structure but also raises issues of double counting and possible bias. Because statistical potentials are averaged over a large set of structures, they may not reflect the preferences of a particular structure or data set. We propose a Bayesian method to incorporate a knowledge-based backbone dihedral angle potential into an NMR structure calculation. To avoid bias exerted through the backbone potential, we adjust its weight by inferring it from the experimental data. We demonstrate that an optimally weighted potential leads to an improvement in the accuracy and quality of the final structure, especially with sparse and noisy data. Our findings suggest that no universally optimal weight exists, and that the weight should be determined based on the experimental data. Other knowledge-based potentials can be incorporated using the same approach.


Backbone dihedral angle distributions
Empirical energy functions for backbone dihedral angles, "Ramachandran potentials", have been used in biomolecular structure calculation since almost two decades. Several statistical models have been proposed ranging from two-dimensional histograms (Gong et al., 2007) to continuous representations based on linear interpolation, cubic splines and Gaussian mixtures. These models ignore two fundamental aspects of statistical distributions over angular variables, namely their periodicity and their smoothness, which can result in artifacts in the refinement (Kuszewski and Clore, 2000). An alternative is provided by statistical density estimation (Mardia et al., 2007;Boomsma et al., 2008;Ting et al., 2010).
We follow a non-parametric approach based on the principle of Maximum Entropy (Jaynes, 1957). As proposed by Pertsemlidis et al. (2005), we use a Fourier basis to represent the joint distribution of ϕ and ψ backbone dihedral angles. This representation is inherently smooth and periodic and has the advantage that it can easily cope with multi-modality (as opposed to the unimodal von Mises or Kent distribution which need to be combined into mixtures (Mardia et al., 2007;Boomsma et al., 2008;Ting et al., 2010) in order to represent the multi-modal Ramachandran plot). The functional form of the distribution of backbone dihedral angles is given by: where the Ramachandran potential E(ϕ, ψ) is given by: {a ij cos(iϕ) cos(jψ) + b ij cos(iϕ) sin(jψ) + c ij sin(iϕ) cos(jψ) + d ij sin(iϕ) sin(jψ)} (2) and Z (a, b, c, d) normalizes the dihedral angle distribution; k is the order of the Fourier expansion. We fit the expansion coefficients a, b, c, d to pairs of ϕ and ψ angles, extracted from a set of 850 proteins (Dunbrack, 2002) by Maximum Likelihood. During the optimization procedure, we need to evaluate the normalization constant Z (a, b, c, d). Because there is no closed form expression for this integral we evaluate it numerically using the two-dimensional trapezoidal rule. To circumvent overfitting of data and to encourage sparse parameter sets, we introduce a Gaussian prior probability with unknown precision λ over the expansion coefficients a, b, c, d: The precision of the prior λ is not known and is estimated simultaneously with the expansion coefficients. We use an iterative scheme in which we cycle through conditional updates of the expansion coefficients and of the precision. For fixed precision, the log-posterior probability of the expansion coefficients is a convex function, which we optimize using the Powell minimizer (Press et al., 1989). The update of the precision for given a, b, c, d can be calculated analytically. Bayesian information criterion Figure S1: BIC analysis to determine the optimal order of the Fourier expansion. The BIC curves of the backbone dihedral angle distributions of all twenty amino acids (indicated as one letter codes in the legend) is shown with increasing Fourier order.
To avoid over-fitting of dihedral angle distributions, we truncate the Fourier series at small orders k. To determine the optimal order of the Fourier expansion we apply the Bayesian information criterion (BIC) (Schwarz, 1978). BIC is a general quantity to select the optimal number of parameters m given data of size n and is defined as where L max is the maximum of the likelihood of the data set for m parameters (here m = 4k(k − 1)). The model with the lowest BIC is considered the most probable one and describes the data best. We fitted distributions with Fourier order ranging from 1 to 10 and computed the resulting BIC ( Figure  S1). We observe that for all amino acids, BIC plateaus at an order of four to five, in accordance with Pertsemlidis et al. (2005). Based on the BIC analysis we fix the maximum order of the Fourier series to five, which amounts to 80 expansion coefficients and results in a good fit that can be evaluated very efficiently and adds negligible computational cost. Figure S2 shows the estimated dihedral angle distributions of all amino acids. Each distribution is a linear combination of 80 two-dimensional planar waves with different combinations of five frequencies. The resulting estimates capture features such as the α-helix peak at ϕ = −60 o , ψ = −40 o and regions corresponding to parallel and anti-parallel β-sheets. Also rare secondary structure elements including left-handed helices are well represented by the Maximum Entropy distribution. The backbone dihedral angle distributions of Glycine and Proline deviate from the standard Ramachandran plot and are also fitted well by the Fourier model.
For comparison, we also generated backbone conformations from the physical force field alone and applied the same fitting procedure as for the data collected from the database. The resulting distributions are shown in Figure S3. Although the distributions have similar outlines they differ greatly in detail. Generally, the ϕ/ψ distributions resulting from the force field are broader and populate dihedral angles corresponding to the canonical secondary structure different from what is observed in high-resolution crystal structures. These differences are the effect of interactions such as hydrogen bonds that are not included in the physical potential function. 2 Dependence beetween quality scores, potential functions The addition of a new potential function in structure calculation can affect all aspects of the generated structures. We assess the impact of our new backbone potential on various quality criteria in Figure  S5. Figure S5 shows the mean and standard deviation of the quality scores calculated from an ensemble of 100 structures dependent on w rama . None of the these scores is able to reliably select a w rama that corresponds to an accurate ensemble. Taking the variation within an ensemble into account, the NQACHK score for the optimal weight structure of alpha-spectrin SH3 is well within the range of normal fluctuation and does not argue against our algorithm.
In Figure S6 we show the correlation of these quality scores for the Ubiquitin ensemble. Some of the scores seem to be highly correlated (e.g. RAMCHK and BBCCHK) and it is not clear whether it is possible to maximize all scores simultaneously. Rather we have to find a comprise between the different quality criteria, and this is exactly what our weighting scheme achieves. Furthermore, none of the scores is well correlated with the overall RMSD, supporting the notion that quality scores are not a good indicator of accuracy. 3 Impact on structure ensembles from sparse and noisy NMR data Restraint Error σ Figure S8: Impact on the estimation of model parameters. The red dots and error bars indicate the estimated error σ of the distance restraints for the sparse distance data (left) and the noisy restraints measured with solid-state NMR (right). The model evidence is shown as grey distribution.