Fig 1.
Overview of ensemble comparison methods implemented in ENCORE.
ENCORE implements three different methods for ensemble comparison. (A) As input, ENCORE takes two or more conformational ensembles in a number of formats, including PDB files and various trajectory formats. (B) In the harmonic ensemble similarity (HES) method, each ensemble is represented as a high-dimensional Gaussian distribution, N(μ,Σ), whose mean (μ) and covariance matrix (Σ) ENCORE estimates from the conformational ensembles. A similarity score is then calculated as a symmetrized Kullback-Leibler divergence between each pair of probability distributions. (C) In the calculation of the two other similarity measures, the first step is to calculate the pairwise RMSD between all structures in all ensembles. In ENCORE, this step can be parallelized over multiple computing cores, and the results can be stored on disk for later analyses. (D) In the clustering-based ensemble similarity (CES) method, the matrix of pairwise RMSDs is used as input to the Affinity Propagation algorithm to cluster the structures from all ensembles together. The cluster populations from each ensemble are then used as the basis for calculating the Jensen-Shannon divergence (DJS) as a measure of ensemble similarity. (E) In the dimensionality-reduction-based ensemble similarity (DRES) method, the matrix of pairwise RMSDs is used as input to the Stochastic Proximity Embedding algorithm to project the high-dimensional conformational ensemble into a low-dimensional space. Using a kernel density estimation method in this lower-dimensional space, ENCORE creates a probability distribution for each ensemble, and these distributions are used as the basis for calculating the Jensen-Shannon divergence between the ensembles. See ref. [6] for additional details on the three ensemble similarity methods.
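The final step of the CES method in panel (D), turning cluster populations into a Jensen-Shannon divergence, can be illustrated with a minimal sketch. This is not ENCORE's own code: the function names, the integer cluster labels and the use of the natural logarithm are assumptions made for illustration, and the labels are assumed to come from a single clustering of all ensembles together.

```python
import math
from collections import Counter


def cluster_populations(labels, n_clusters):
    """Fraction of one ensemble's frames assigned to each cluster 0..n_clusters-1."""
    counts = Counter(labels)
    return [counts.get(k, 0) / len(labels) for k in range(n_clusters)]


def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    # The mixture m is nonzero wherever p or q is nonzero, so each KL term is finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical cluster assignments for two small ensembles sharing three clusters.
labels_a = [0, 0, 1, 1]
labels_b = [1, 1, 2, 2]
p = cluster_populations(labels_a, 3)
q = cluster_populations(labels_b, 3)
d_js = js_divergence(p, q)
```

With the natural logarithm, the divergence is zero for identical cluster populations and reaches ln 2 when the two ensembles share no clusters.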
Fig 2.
Comparing molecular simulations using ENCORE.
We used ENCORE to calculate the similarity between seven molecular dynamics simulations of (A, B) protein G and (C) ubiquitin. (A) The plots show the pairwise similarity between the seven ensembles computed using the three different ensemble comparison methods. (B) Using the similarities calculated by the CES method as an example, a tree-preserving embedding method was used to represent the ensembles in two dimensions. In this plot, the distance between pairs of ensembles mimics (to the extent possible in two dimensions) the similarity between the ensembles. In agreement with the pairwise similarities, three pairs of ensembles (CHARMM22*/CHARMM27, ff99SB-ILDN/ff99SB*-ILDN, and ff03/ff03*) are located relatively close to one another, in line with the similar origins of each pair of force fields. (C) We performed similar calculations on seven ubiquitin simulations, again using the CES method as an example and projecting the similarities into two dimensions. A similar organization of the ensembles is found for both proteins, as is also evident from directly comparing the matrices of ensemble similarities. Note that in the projections, the axes have no direct physical meaning beyond their scale, which is determined so that the distances in the projections are close to the calculated DJS. Note also that since these distances are invariant to rotations, translations and inversions of the projections, it is the relative positions in the two plots that should be compared.
Table 1.
Calculating uncertainties of similarity scores using a bootstrap procedure.
The uncertainties were calculated for representative entries from Fig 2 as standard deviations over 100 bootstrapped subsamples of the ensembles.
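A bootstrap estimate of this kind can be sketched as follows. The sketch assumes that each ensemble is resampled with replacement at its original size and that the score is recomputed for every resampled pair; the caption does not specify these details. The function name `bootstrap_std` and the `score` callable are hypothetical and stand in for any of the three similarity measures.

```python
import random
import statistics


def bootstrap_std(ensemble_a, ensemble_b, score, n_boot=100, seed=0):
    """Standard deviation of score(a, b) over n_boot resampled ensemble pairs.

    Assumption: frames are drawn with replacement, keeping each ensemble's size.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample_a = rng.choices(ensemble_a, k=len(ensemble_a))
        sample_b = rng.choices(ensemble_b, k=len(ensemble_b))
        scores.append(score(sample_a, sample_b))
    return statistics.stdev(scores)


def mean_gap(a, b):
    """Toy stand-in for a similarity score: absolute difference of the means."""
    return abs(statistics.fmean(a) - statistics.fmean(b))


# Hypothetical one-dimensional "ensembles", used only to exercise the function.
data_rng = random.Random(1)
ens_a = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ens_b = [data_rng.gauss(0.5, 1.0) for _ in range(200)]
uncertainty = bootstrap_std(ens_a, ens_b, mean_gap, n_boot=100)
```

In practice the toy `mean_gap` function would be replaced by the HES, CES or DRES calculation, which makes each bootstrap round as expensive as one full comparison.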
Fig 3.
Assessing the rate of convergence in molecular simulations.
We used ENCORE with the CES score to assess the rate of convergence in seven molecular dynamics simulations of (A) protein G and (B) ubiquitin. In each case, we compared simulations of increasing length to the full ensemble obtained after 10 μs of simulation. By definition, the divergence thus decreases to zero at 10 μs, but the rate at which low values are obtained indicates how quickly the simulations have reached a distribution of conformations that is similar to the full ensemble. For example, simulations of both proteins with Amber ff99SB-ILDN and ff99SB*-ILDN quickly drop to very low values, reflecting the fact that the ensembles obtained after a few microseconds are very similar to those obtained at the end of the simulation. In contrast, simulations with OPLS, for example, continue to explore new regions of conformational space during the entire simulation.
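The idea behind this convergence analysis can be sketched with cluster labels alone. The sketch below assumes that every frame of the full trajectory has already been assigned to a cluster, with labels in time order, and it compares the cluster populations of growing leading windows with those of the full trajectory. The function names are hypothetical, and ENCORE's own implementation may differ, for example in how the clustering is performed.

```python
import math
from collections import Counter


def _populations(labels, clusters):
    """Fraction of frames in each cluster, in the order given by `clusters`."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in clusters]


def _js(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def convergence_curve(labels, n_windows):
    """Divergence between each leading window and the full trajectory.

    Returns a list of (n_frames, divergence) pairs; the last entry covers the
    full trajectory and therefore has zero divergence.
    """
    if len(labels) < n_windows:
        raise ValueError("need at least one frame per window")
    clusters = sorted(set(labels))
    full = _populations(labels, clusters)
    curve = []
    for i in range(1, n_windows + 1):
        n_frames = len(labels) * i // n_windows
        window = _populations(labels[:n_frames], clusters)
        curve.append((n_frames, _js(window, full)))
    return curve


# Hypothetical trajectory that visits cluster 1 only in its second half.
toy_labels = [0] * 5 + [1] * 5
curve = convergence_curve(toy_labels, n_windows=2)
```

In this toy trajectory the first half never visits the second cluster, so the divergence is clearly nonzero for the first window and drops to zero once the whole trajectory is included.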
Fig 4.
Comparing ubiquitin ensembles from simulations and experiments.
We used ENCORE to compare 13 previously determined conformational ensembles of human ubiquitin: seven ensembles were obtained by molecular dynamics simulations with different force fields, five ensembles were generated via replica-averaged simulations that used experimental NMR data as restraints, and a single ensemble was obtained as a collection of 46 different crystal structures of ubiquitin. (A) Using CES we calculated the pairwise similarity of all 13 ensembles and (B) projected the results into two dimensions. Note how the molecular simulations (yellow labels) result in a broader range of conformational ensembles, whereas the ensembles restrained via different kinds of experimental NMR data (blue labels) are all more similar to one another. This observation supports the view that experimental restraints, when used in replica-averaged simulations, can be thought of as a system-specific correction to the energy function, which guides the simulations towards the correct conformational ensemble. Finally, note how the NMR-restrained simulations are also relatively similar to the collection of ubiquitin X-ray structures. This observation reiterates the notion that ubiquitin in solution samples a conformational ensemble that is similar to the variability observed in different ubiquitin structures, and also that such ensembles can be derived relatively robustly by combining NMR data and molecular simulations. Importantly, the five NMR ensembles were obtained using different procedures, force fields, and sources and types of experimental data.
Fig 5.
Computational scaling of ENCORE calculations.
We determined how the runtime of ENCORE scales with (A) the number of computer cores when executing ENCORE in parallel and (B) the number of frames for a fixed number of computer cores. In (A) the black line illustrates how the pairwise RMSD calculation can be sped up by distributing the calculations over multiple cores. As the clustering and dimensionality reduction methods that have so far been implemented have not been parallelized, they appear as horizontal lines that will eventually (above 8–16 cores) limit the overall calculations. In (B) we show how the runtime of the RMSD calculations, clustering and dimensionality reduction increases as the number of frames is increased. We used 16 processor cores for these calculations.
Fig 6.
Effect of sparsifying the simulation data.
We evaluated the robustness of the calculated similarity scores when decreasing the ensemble size. In particular, we took 8192 (2^13) frames separated by 1 ns from a simulation of GB3 using Amber ff03 as a reference and created subensembles of various sizes by iteratively removing every second frame. We subsequently calculated the three different similarity scores between the full ensemble and the various subensembles, which contained between 128 and 4096 frames. The results show that even when only every 16th frame is retained, the pairwise similarity is very high (divergence close to zero), demonstrating both the robustness of the calculations and that such sparsification is likely an efficient way of reducing the computational cost in practice.
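The sparsification procedure itself is simple enough to sketch directly. The function name `halving_subensembles` is hypothetical, and integer frame indices stand in for the actual structures.

```python
def halving_subensembles(frames, min_size):
    """Repeatedly drop every second frame, collecting each thinned subensemble.

    Stops before a subensemble would contain fewer than min_size frames.
    """
    subensembles = []
    current = list(frames)
    while len(current) // 2 >= min_size:
        current = current[::2]  # keep frames 0, 2, 4, ... of the previous level
        subensembles.append(current)
    return subensembles


# Frame indices standing in for the 8192 reference frames.
reference = list(range(8192))
subs = halving_subensembles(reference, min_size=128)
sizes = [len(s) for s in subs]
```

Four rounds of halving are equivalent to keeping every 16th frame, which corresponds to the 512-frame subensemble.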