Automatic Prediction of Protein 3D Structures by Probabilistic Multi-template Homology Modeling

doi:10.1371/journal.pcbi.1004343

Fig 1.

MODELLER’s statistical approach to homology modeling: The unknown distance d between two atoms in residues i and j of the query protein (Q) is described by a probability distribution Prob(d) that is peaked around the distance d_t between the corresponding atoms in residues i′ and j′ of the template protein (T).

This distribution Prob(d) is a probabilistic distance restraint for the distance d. To model a protein, tens to hundreds of thousands of such distance restraints between pairs of atoms in the query protein are derived. The product of all these restraint functions, which is called the likelihood function in statistics, quantifies how well a model structure satisfies all restraints at the same time. Therefore, the model structure that maximises the likelihood function represents the best solution.
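As a minimal sketch of this idea, the Python snippet below scores a model by the sum of log restraint densities, which is the logarithm of the product of all restraints. It assumes, purely for illustration, that each restraint is a single Gaussian centred on the template distance with one fixed width `sigma`. The function names and the toy distances are hypothetical and are not taken from MODELLER.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(model_distances, template_distances, sigma=1.0):
    """Sum of log restraint densities, i.e. the log of the product of all restraints.

    Assumption: one Gaussian restraint per atom pair, peaked at the template
    distance d_t, all sharing the same illustrative width sigma.
    """
    return sum(
        log_gaussian(d, d_t, sigma)
        for d, d_t in zip(model_distances, template_distances)
    )

# Toy distances in angstroms (made up for illustration).
template = [3.8, 5.5, 10.2]
close_model = [3.9, 5.4, 10.0]
distant_model = [6.0, 9.0, 4.0]

print(log_likelihood(close_model, template))    # higher (less negative)
print(log_likelihood(distant_model, template))  # lower
```

Working with the sum of logarithms rather than the raw product avoids numerical underflow when many thousands of restraints are combined.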

Fig 2.

Empirical log distance distributions between pairs of atoms are well modelled by a two-component Gaussian mixture composed of a signal component and a background component.

The background component originates from pairs of residues with an alignment error. The plots show the empirical distribution of log d − log d_t = log d_ij − log d_i′j′ for thousands of sampled pairs of residues (i, i′), (j, j′) from real, error-containing pairwise sequence alignments generated with HHalign [15]. The two-component Gaussian mixture distribution predicted by the mixture density network in Fig 3B is plotted in red. From (A) to (C), the reliability of the alignments at (i, i′) and (j, j′) (as measured by pp and sim values) decreases. Consequently, the weight of the background component increases at the expense of the signal component. (D) Same as (C) but showing the distribution of N − O distances instead of C_α − C_α distances.

Fig 3.

(A) Illustration of the two-component Gaussian mixture distribution in Eq (1). (B) Mixture density network to predict the parameters (w, μ, σ, μ_bg, σ_bg) of the Gaussian mixture distribution given the three variables θ = (log d_t, pp, sim) (d_t: distance in template, pp: posterior probability for both aligned residue pairs to be correctly aligned, sim: sequence similarity). Since the background component does not depend on d_t, the nodes for μ_bg and σ_bg are only connected to the two lowest hidden nodes that are not connected to log d_t.
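The sketch below evaluates a two-component Gaussian mixture with the parameters named in the caption (w, μ, σ, μ_bg, σ_bg). It assumes the standard mixture form, in which the signal component has weight w and the background component has weight 1 − w; Eq (1) itself is not reproduced here, so this form is an assumption. The variable x stands for log d − log d_t, and all parameter values are illustrative, not fitted.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, w, mu, sigma, mu_bg, sigma_bg):
    """Signal component weighted by w plus background component weighted by 1 - w."""
    return w * normal_pdf(x, mu, sigma) + (1.0 - w) * normal_pdf(x, mu_bg, sigma_bg)

# Illustrative parameter values only (assumptions, not fitted to any data).
reliable = dict(w=0.9, mu=0.0, sigma=0.1, mu_bg=0.5, sigma_bg=1.0)
unreliable = dict(w=0.3, mu=0.0, sigma=0.1, mu_bg=0.5, sigma_bg=1.0)

# A less reliable alignment (smaller w) lowers the signal peak at x = 0.
peak_reliable = mixture_pdf(0.0, **reliable)
peak_unreliable = mixture_pdf(0.0, **unreliable)

# Riemann-sum check that the mixture is a normalised density.
step = 0.001
area = sum(mixture_pdf(-10.0 + k * step, **reliable) for k in range(20001)) * step
print(peak_reliable, peak_unreliable, area)
```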

Fig 4.

Comparison of how restraints from multiple templates are combined in Modeller (top row) and in our new approach (bottom row).

(A) In Modeller, two restraint functions (green and blue) are additively mixed with mixing weights that have to be learned on a set of triples of aligned protein structures. (B) Our new restraints are multiplied instead of being added. The background component ensures that the restraint function becomes constant and the restraint thus becomes inactive (i.e. ignored) when the distance d is far from the distance in the template. (C) Modeller’s additive mixing leads to a total restraint function that is wider than any of the single-template restraints, not narrower as it should be. (D) The multiplication of restraint functions according to probability theory leads to the desired behaviour of the total restraint function becoming more sharply peaked with each restraint. Note that our new restraints are expressed as odds instead of densities (see also Eq 6).
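The difference between panels (C) and (D) can be checked numerically in a simplified setting. The sketch below compares the standard deviation of an additive mixture of two Gaussians with that of their normalised product. It deliberately ignores the background component and the odds formulation, and the function names and numbers are hypothetical.

```python
import math

def product_of_gaussians(mu1, s1, mu2, s2):
    """Mean and standard deviation of the normalised product of two Gaussians."""
    precision = 1.0 / s1 ** 2 + 1.0 / s2 ** 2
    mean = (mu1 / s1 ** 2 + mu2 / s2 ** 2) / precision
    return mean, math.sqrt(1.0 / precision)

def mixture_of_gaussians(mu1, s1, mu2, s2, w1=0.5):
    """Mean and standard deviation of the additive mixture w1*N1 + (1 - w1)*N2."""
    w2 = 1.0 - w1
    mean = w1 * mu1 + w2 * mu2
    second_moment = w1 * (s1 ** 2 + mu1 ** 2) + w2 * (s2 ** 2 + mu2 ** 2)
    return mean, math.sqrt(second_moment - mean ** 2)

# Two single-template restraints with slightly different template distances (toy values).
prod_mean, prod_sd = product_of_gaussians(5.0, 1.0, 5.6, 1.0)
mix_mean, mix_sd = mixture_of_gaussians(5.0, 1.0, 5.6, 1.0)

print(prod_sd)  # narrower than either input restraint (sd 1.0)
print(mix_sd)   # wider than either input restraint
```

With equal input widths the product has a standard deviation of 1/√2 of each input, whereas the mixture is never narrower than its components.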

Fig 5.

Iterative scheme for computing weights for templates by transforming the phylogenetic tree connecting them and the query protein into an equivalent tree with star-like topology with the query in the center.

(A) Templates t₁ and t₂ are closely related and should be down-weighted with respect to t₃. (B) Any tree with a structure at an internal node with unknown distance d_h to which all templates are connected in a star-like topology (top) can be transformed into an equivalent tree (bottom) with star-like topology, where equivalence means that the restraint on the distance d₀ of the top node is the same for both trees. τ₁, …, τ_K indicate evolutionary distances. (C) Iterative restructuring of a phylogenetic tree. In each step, the basic transformation from Fig 5B is applied to the subtree colored in blue. Weights and edge lengths get updated until all templates are directly connected to the query.

Fig 6.

Selection of multiple templates.

The figure distinguishes the set of accepted templates from the set of template candidates. For each template in the candidate set, its score is calculated according to Eq (14), and the candidate with the highest score (t₄ in the figure) is added to the accepted set. This process is iterated until no candidate has a positive score or the accepted set contains more than 8 templates.
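The greedy loop described in this caption can be sketched as follows. The scoring function of Eq (14) is not reproduced here, so the sketch takes any callable `score(template, accepted)` as a stand-in; `toy_score` and the template names are hypothetical. The stopping rule follows the caption literally: iteration stops once the accepted set holds more than `max_templates` templates.

```python
def select_templates(candidates, score, max_templates=8):
    """Greedy selection: repeatedly accept the best-scoring candidate.

    `score(template, accepted)` is a placeholder for Eq (14); it may depend on
    the templates accepted so far.
    """
    accepted = []
    remaining = list(candidates)
    # Literal reading of the caption: stop once more than max_templates are accepted.
    while remaining and len(accepted) <= max_templates:
        best = max(remaining, key=lambda t: score(t, accepted))
        if score(best, accepted) <= 0:
            break  # no candidate with a positive score is left
        accepted.append(best)
        remaining.remove(best)
    return accepted

# Hypothetical toy score: a fixed quality minus a penalty per accepted template.
toy_quality = {"t1": 0.9, "t2": 0.5, "t3": -0.2, "t4": 0.7}

def toy_score(template, accepted):
    return toy_quality[template] - 0.1 * len(accepted)

selected = select_templates(toy_quality, toy_score)
print(selected)  # ['t1', 't4', 't2']
```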

Table 1.

Average model quality scores for different variations of template selection strategies and restraints used with Modeller on a test set of 1000 single- and multi-domain proteins in the pdb20 database.

The GDC-all score is similar to GDT-ha but also includes side-chain atoms in its assessment. Percent improvements are with respect to the first line. P-values are calculated based on a paired t-test with respect to the GDT-ha score in the previous line.
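For readers unfamiliar with the paired t-test mentioned here, the sketch below computes the paired t statistic for two sets of per-target scores using only the Python standard library. The scores are made-up toy values, not data from Table 1, and the function name is hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-target differences (scores_b - scores_a)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Toy per-target scores for two methods on the same four targets.
baseline = [50.0, 60.0, 70.0, 80.0]
improved = [52.0, 63.0, 71.0, 84.0]

t_stat = paired_t_statistic(baseline, improved)
print(t_stat)
```

The p-value then follows from the Student t distribution with n − 1 degrees of freedom, for example via `scipy.stats.ttest_rel` if SciPy is available.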

Fig 7.

(A) Our two-component mixture restraints improve GDT-ha model quality over Modeller’s default restraints in multi-template modelling by 2.5% on average. (B) Our multi-template selection strategy improves GDT-ha scores over the simple multi-template selection strategy by 3.9% on average. (C) Multi-template modeling improves GDT-ha scores over single-template modelling (using Modeller restraints) by 4.3% on average. (D) The overall improvement from the new restraints, template weights, and the new multiple-template selection over the baseline single-template version (s.1st.old in Table 1) is 11.1%.

Table 2.

Multi-template homology modeling and the new restraints improve models within core regions independent of increased query sequence coverage.

Mean GDT-ha scores on query protein core regions, defined as the residues that are covered by the first template. Percent improvements are with respect to the previous line.

Table 3.

The probabilistic multi-template modeling approach is less negatively affected by bad templates.

Mean GDT-ha scores of 1000 models built with template sets containing 0, 1 and 2 bad templates (TMscore < 0.3) along with two good templates (TMscore > 0.5).
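One way to see why a background component can limit the damage from a bad template is the toy calculation below. It multiplies three restraints on one distance, two from consistent templates and one from an outlier, and finds the most probable distance on a grid. This is a simplified illustration under stated assumptions: restraints are densities over d rather than odds over log d, and all widths, weights and distances are made up.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def log_gaussian_restraint(d, d_t, sigma=0.5):
    """Single Gaussian restraint peaked at the template distance d_t."""
    return -0.5 * ((d - d_t) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_mixture_restraint(d, d_t, w=0.8, sigma=0.5, mu_bg=8.0, sigma_bg=6.0):
    """Signal component at d_t plus a broad background that does not depend on d_t."""
    signal = w * normal_pdf(d, d_t, sigma)
    background = (1.0 - w) * normal_pdf(d, mu_bg, sigma_bg)
    return math.log(signal + background)

def best_distance(log_restraint, template_distances):
    """Grid search for the distance maximising the product of restraints."""
    grid = [k * 0.01 for k in range(2001)]  # 0 to 20 in steps of 0.01
    return max(grid, key=lambda d: sum(log_restraint(d, d_t) for d_t in template_distances))

# Two consistent templates and one outlier (toy values).
templates = [5.0, 5.2, 12.0]
gaussian_optimum = best_distance(log_gaussian_restraint, templates)
mixture_optimum = best_distance(log_mixture_restraint, templates)

print(gaussian_optimum)  # pulled toward the outlier
print(mixture_optimum)   # stays near the consistent pair
```

In this toy setting the pure Gaussian product settles at the average of all three template distances, while the mixture restraints keep the optimum close to the two consistent templates because the outlier's restraint is nearly flat there.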

Fig 8.

Cumulative Z-score of all server predictions in the template-based modeling category of the CASP9 and CASP10 community-wide assessment of techniques for protein structure prediction [1, 3].

HHpred servers are shown in red; other servers using our HHsuite software are shown in green.
