
Fig 1.

MODELLER’s statistical approach to homology modeling: The unknown distance d between two atoms in residues i and j of the query protein (Q) is described by a probability distribution Prob(d) that is peaked around the distance dt between the corresponding atoms in residues i′ and j′ of the template protein (T).

This distribution Prob(d) is a probabilistic distance restraint for the distance d. To model a protein, tens to hundreds of thousands of such distance restraints between pairs of atoms in the query protein are derived. The product of all these restraint functions, which is called the likelihood function in statistics, quantifies how well a model structure satisfies all restraints at the same time. Therefore, the model structure that maximises the likelihood function represents the best solution.
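
The idea that the best model maximises the product of all restraint functions can be sketched as follows. This is only a minimal illustration: the Gaussian restraint shape, the atom labels, and the function names `gaussian_log_restraint` and `log_likelihood` are assumptions made for the example and are not taken from Modeller. Summing log values is used in place of multiplying probabilities, which is equivalent and numerically safer.

```python
import math

def gaussian_log_restraint(d, d_template, sigma):
    """Log of a Gaussian density in d, peaked at the template distance.
    The Gaussian shape is an assumption made for this illustration."""
    z = (d - d_template) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(model_distances, restraints):
    """Sum of log restraint values, i.e. the log of the product of all restraints."""
    return sum(
        gaussian_log_restraint(model_distances[pair], d_t, sigma)
        for pair, (d_t, sigma) in restraints.items()
    )

# Hypothetical restraints: atom pair -> (template distance, width), in Angstrom.
restraints = {("CA1", "CA5"): (6.0, 0.5), ("CA1", "CA9"): (11.0, 1.0)}

# Two hypothetical candidate models with their distances for the same atom pairs.
close_model = {("CA1", "CA5"): 6.1, ("CA1", "CA9"): 10.8}
far_model = {("CA1", "CA5"): 8.0, ("CA1", "CA9"): 14.0}

# The model whose distances lie nearer the template distances scores higher.
print(log_likelihood(close_model, restraints) > log_likelihood(far_model, restraints))
```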

Fig 2.

Empirical log distance distributions between pairs of atoms are well modelled by a two-component Gaussian mixture composed of a signal component and a background component.

The background component originates from pairs of residues with an alignment error. The plots show the empirical distribution of log d − log dt = log dij − log di′j′ for thousands of sampled pairs of residues (i, i′), (j, j′) from real, error-containing pairwise sequence alignments generated with HHalign [15]. The two-component Gaussian mixture distribution predicted by the mixture density network in Fig 3B is plotted in red. From (A) to (C), the reliability of the alignments at (i, i′) and (j, j′) (as measured by pp and sim values) decreases. Consequently, the weight of the background component increases at the expense of the signal component. (D) Same as (C) but showing the distribution of N − O distances instead of Cα − Cα distances.

Fig 3.

(A) Illustration of the two-component Gaussian mixture distribution in Eq (1). (B) Mixture density network to predict the parameters (w, μ, σ, μbg, σbg) of the Gaussian mixture distribution given the three variables θ = (log dt, pp, sim) (dt: distance in template, pp: posterior probability for both aligned residue pairs to be correctly aligned, sim: sequence similarity). Since the background component does not depend on dt, the nodes for μbg and σbg are only connected to the two lowest hidden nodes that are not connected to log dt.
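
A two-component Gaussian mixture over log d with the parameters named in the caption could look like the sketch below. Eq (1) is not reproduced on this page, so the exact form is an assumption: w is taken to be the weight of the signal component and 1 − w the weight of the background component. The parameter values are invented for illustration and are not outputs of the mixture density network.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(log_d, w, mu, sigma, mu_bg, sigma_bg):
    """Assumed form: w * signal + (1 - w) * background, both Gaussian in log d."""
    signal = normal_pdf(log_d, mu, sigma)
    background = normal_pdf(log_d, mu_bg, sigma_bg)
    return w * signal + (1.0 - w) * background

# Invented parameters: a narrow signal peak at the template distance (6 Angstrom)
# and a broad background that does not depend on the template distance.
log_dt = math.log(6.0)
reliable = mixture_density(log_dt, 0.9, log_dt, 0.1, 2.5, 0.6)
unreliable = mixture_density(log_dt, 0.3, log_dt, 0.1, 2.5, 0.6)

# A less reliable alignment (smaller w) flattens the peak at the template distance.
print(reliable > unreliable)
```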

Fig 4.

Comparison of how restraints from multiple templates are combined in Modeller (top row) and in our new approach (bottom row).

(A) In Modeller, two restraint functions (green and blue) are additively mixed with mixing weights that have to be learned on a set of triples of aligned protein structures. (B) Our new restraints are multiplied instead of being added. The background component ensures that the restraint function becomes constant, and the restraint thus becomes inactive (i.e. ignored), when the distance d is far from the distance in the template. (C) Modeller’s additive mixing leads to a total restraint function that is wider than any of the single-template restraints, not narrower as it should be. (D) The multiplication of restraint functions according to probability theory leads to the desired behaviour of the total restraint function becoming more sharply peaked with each restraint. Note that our new restraints are expressed as odds instead of densities (see also Eq 6).
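
The contrast between panels (C) and (D) can be checked numerically with two simple restraints. The sketch below uses plain single Gaussians with invented centres and widths, not the actual Modeller restraints or the odds-based restraints of Eq 6, and measures the spread of the combined function on a grid.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def grid_std(f, lo=-10.0, hi=25.0, n=7001):
    """Standard deviation of a non-negative function f, normalised on a regular grid."""
    step = (hi - lo) / (n - 1)
    xs = [lo + k * step for k in range(n)]
    ws = [f(x) for x in xs]
    total = sum(ws)
    mean = sum(x * w for x, w in zip(xs, ws)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, ws)) / total
    return math.sqrt(var)

# Two hypothetical single-template restraints on the same distance, each of width 1.
r1 = lambda d: normal_pdf(d, 5.0, 1.0)
r2 = lambda d: normal_pdf(d, 6.0, 1.0)

additive = lambda d: 0.5 * r1(d) + 0.5 * r2(d)   # mixing, as in panel (C)
multiplicative = lambda d: r1(d) * r2(d)         # product, as in panel (D)

print(grid_std(additive))        # wider than either single restraint
print(grid_std(multiplicative))  # narrower than either single restraint
```

With these numbers the additive mixture has a spread above 1 and the product has a spread below 1, which mirrors the widening and narrowing described in the caption.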

Fig 5.

Iterative scheme for computing weights for templates by transforming the phylogenetic tree connecting them and the query protein into an equivalent tree with star-like topology with the query in the center.

(A) Templates t1 and t2 are closely related and should be down-weighted with respect to t3. (B) Any tree with a structure at an internal node with unknown distance dh to which all templates are connected in a star-like topology (top) can be transformed into an equivalent tree (bottom) with star-like topology, where equivalence means that the restraint on the distance d0 of the top node is the same for both trees. τ1, … τK indicate evolutionary distances. (C) Iterative restructuring of a phylogenetic tree. In each step, the basic transformation from Fig 5B is applied to the subtree colored in blue. Weights and edge lengths get updated until all templates are directly connected to the query.

Fig 6.

Selection of multiple templates.

One set holds the accepted templates and a second set holds the template candidates. For each candidate template, its score is calculated according to Eq (14), and the template with the highest score (t4) is added to the accepted set. This process is iterated until no template with a positive score remains or the accepted set contains more than 8 templates.
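
The greedy loop described in this caption can be sketched as below. The scoring function of Eq (14) is not given on this page, so `score` is a caller-supplied placeholder and `toy_score` is an invented stand-in; `select_templates` and `max_templates` are likewise hypothetical names. The stopping rule follows the caption literally: iteration stops once the accepted set holds more than `max_templates` templates.

```python
from typing import Callable, List, Sequence

def select_templates(candidates: Sequence[str],
                     score: Callable[[str, List[str]], float],
                     max_templates: int = 8) -> List[str]:
    """Greedily move the best-scoring candidate into the accepted list."""
    accepted: List[str] = []
    remaining = list(candidates)
    # Read literally from the caption: continue while the accepted set
    # does not yet contain more than max_templates templates.
    while remaining and len(accepted) <= max_templates:
        best = max(remaining, key=lambda t: score(t, accepted))
        if score(best, accepted) <= 0:
            break  # no candidate with a positive score is left
        accepted.append(best)
        remaining.remove(best)
    return accepted

# Invented stand-in for Eq (14): a base quality per template, reduced as more
# templates are accepted. This is NOT the score used in the paper.
base_quality = {"t1": 0.9, "t2": 0.8, "t3": 0.5, "t4": 1.2}

def toy_score(template: str, accepted: List[str]) -> float:
    return base_quality[template] - 0.3 * len(accepted)

print(select_templates(["t1", "t2", "t3", "t4"], toy_score))
```

Recomputing every score after each acceptance matters because a candidate's value may depend on which templates are already in the accepted set.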

Table 1.

Average model quality scores for different variations of template selection strategies and restraints used with Modeller on a test set of 1000 single- and multi-domain proteins in the pdb20 database.

The GDC-all score is similar to GDT-ha but also includes side-chain atoms in its assessment. Percent improvements are with respect to the first line. P-values are calculated based on a paired t-test with respect to the GDT-ha score in the previous line.

Fig 7.

(A) Our two-component mixture restraints improve GDT-ha model quality over Modeller’s default restraints in multi-template modelling by 2.5% on average. (B) Our multi-template selection strategy improves GDT-ha scores over the simple multi-template selection strategy by 3.9% on average. (C) Multi-template modelling improves GDT-ha scores over single-template modelling (using Modeller restraints) by 4.3% on average. (D) The overall improvement from the new restraints, template weights, and the new multiple-template selection over the single-template baseline (s.1st.old in Table 1) is 11.1%.

Table 2.

Multi-template homology modeling and the new restraints improve models within core regions independent of increased query sequence coverage.

Mean GDT-ha scores on query protein core regions, defined as the residues that are covered by the first template. Percent improvements are with respect to the previous line.

Table 3.

The probabilistic multi-template modeling approach is less negatively affected by bad templates.

Mean GDT-ha scores of 1000 models built with template sets containing 0, 1 and 2 bad templates (TMscore < 0.3) along with two good templates (TMscore > 0.5).

Fig 8.

Cumulative Z-score of all server predictions in the template-based modeling category of the CASP9 and CASP10 community-wide assessment of techniques for protein structure prediction [1, 3].

HHpred servers are red, other servers using our HHsuite software are shown in green.
