Remote homology search with hidden Potts models

doi:10.1371/journal.pcbi.1008085

Fig 1.

Hidden Potts model architecture.

Squares are conserved match states and diamonds are insert states. No delete states exist. Silent begin and end states are represented by circles. An HPM is a hybrid between a Potts model and a pHMM: correlated character generation (including deletion “characters” rather than delete states) in match columns (consensus sites in an MSA) comes from a Potts distribution (dotted orange arrows), while transition probabilities linking states and site-independent character emissions in unaligned insert columns come from a pHMM (solid and dashed black arrows, respectively).

Fig 2.

Importance sampling alignment algorithm schematic.

Toy example of aligning an RNA sequence CUCUGGAAG to models of its sequence/structure consensus, where the true structural alignment has two nested base pairs (brackets in structure line) and one pseudoknot (‘Aa’). Suboptimal alignments of the sequence are sampled probabilistically using a pHMM. A pHMM does not capture residue correlations due to base-pairing, so only some proposed alignments satisfy the expected consensus nested (cyan) or pseudoknotted (pink) base pairs. The proposed alignments are re-scored and re-ranked under the HPM, which does capture correlation structure; the correct alignment with the highest probability under the HPM is identified (green), and the sequence’s total probability is obtained by importance-weighted summation over the sampled alignments.
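The importance-weighted summation described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes each sampled alignment already has two log scores: its joint log probability under the HPM (the target) and its log probability under the pHMM alignment sampler (the proposal, conditioned on the sequence). The function names `logsumexp` and `importance_estimate` are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def importance_estimate(log_target_joint, log_proposal):
    """Estimate a sequence's total log probability from sampled alignments.

    log_target_joint[i]: log P_target(sequence, alignment_i), e.g. under an HPM.
    log_proposal[i]:     log Q(alignment_i | sequence), e.g. under a pHMM sampler.
    Both inputs are assumptions of this sketch, supplied by the caller.

    Returns (estimated log P_target(sequence), index of best-scoring alignment).
    """
    if not log_target_joint or len(log_target_joint) != len(log_proposal):
        raise ValueError("score lists must be non-empty and of equal length")
    # Importance weight of each sample: target joint divided by proposal.
    log_weights = [t - q for t, q in zip(log_target_joint, log_proposal)]
    # The average of the weights estimates the sum over all alignments.
    log_total = logsumexp(log_weights) - math.log(len(log_weights))
    # Re-rank the samples by their score under the target model.
    best = max(range(len(log_target_joint)), key=lambda i: log_target_joint[i])
    return log_total, best
```

When the proposal is exactly proportional to the target, every weight equals the true total, so the estimate is exact for any number of samples. The further the pHMM proposal is from the HPM posterior over alignments, the more samples are needed for a stable estimate.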

Fig 3.

Additional information contained in pairwise covariation in RNA MSAs.

For each of the 127 Rfam 14.1 seed MSAs with more than 100 sequences [36], we infer a predicted consensus structure (including nested and non-nested base pairs) using CaCoFold and R-scape [41, 42]. (A) Primary information content (X axis) versus the sum of primary and secondary information content (Y axis). (B) Sum of primary and secondary information content (X axis) versus the information content from all three levels of structure (Y axis). (C) Secondary information content (X axis) versus the sum of information content from secondary and tertiary structure (Y axis). Not included in (C) is ROOL (RF03087), with 305.3 bits of primary information, 174.5 bits of secondary information, and 10.7 bits of tertiary information.

Table 1.

Benchmark dataset statistics and training alignment information content.

Table 2.

Remote homology alignment benchmark results.

Fig 4.

Remote homology scoring benchmark results.

Receiver operating characteristic (ROC) plots for the tRNA (A) and Twister type P1 ribozyme (B) benchmarks. The X axis is the fraction of decoys that score higher than a certain threshold (false positive rate), and the Y axis is the fraction of homologous test sequences that score higher than the same threshold (true positive rate, or sensitivity). In the SAM riboswitch benchmark, each model perfectly discriminates homologs from decoys.
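The axis definitions above translate directly into code. The following is a minimal sketch, assuming two plain lists of scores (one for homologous test sequences, one for decoys) and counting a sequence as a hit when its score is at or above the threshold. The function names `roc_points` and `auc` are hypothetical and are not taken from the benchmark code.

```python
def roc_points(homolog_scores, decoy_scores):
    """Return (false positive rate, true positive rate) pairs for a ROC plot.

    Thresholds are swept from the highest observed score to the lowest.
    A sequence counts as a hit when its score is at or above the threshold
    (an assumption of this sketch).
    """
    if not homolog_scores or not decoy_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(homolog_scores) | set(decoy_scores), reverse=True)
    points = [(0.0, 0.0)]  # a threshold above every score yields no hits
    for t in thresholds:
        tpr = sum(s >= t for s in homolog_scores) / len(homolog_scores)
        fpr = sum(s >= t for s in decoy_scores) / len(decoy_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

Perfect discrimination, as reported for the SAM riboswitch benchmark, corresponds to every homolog outscoring every decoy, which gives an area of 1.0 under this construction.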

Fig 5.

Hidden Potts model emission probabilities do not match training alignment statistics.

(A) Rfam consensus structure for the class I SAM riboswitch. Base pairs supported by statistically significant covariation in analysis by R-scape in the RF00162 seed alignment are shaded green [41]. (B) Observed pairwise nucleotide frequencies for one base pair in the P3 stem (sites 52 and 62) in our RF00162 benchmark training alignment (192 sequences). (C) Pairwise nucleotide probabilities at sites 52 and 62 under the RF00162 training HPM. (D) Pairwise nucleotide probabilities at sites 52 and 62 under the RF00162 training Infernal pSCFG. Infernal’s informative priors lead to a UG base pair being given significant probability despite being rarely seen in the training data. However, as U is much more common than G at site 52 and vice versa at site 62, Infernal assigns a higher probability to UG than GU at these sites.

Fig 6.

Gremlin-trained Potts models accurately predict 3D RNA contacts.

Top predicted 3D contacts from Gremlin-trained Potts models (cyan) against annotated nested base pairs (gray) and annotated tertiary contacts/pseudoknotted positions (pink) from the tRNA (A), Twister type P1 ribozyme (B), and SAM riboswitch (C) training alignments. For tRNA, annotated tertiary contacts are from a yeast tRNA-phe crystal structure [48]. A non-structural covariation between the tRNA discriminator base and the middle nucleotide in the anticodon (caused by aminoacyl-tRNA synthetase recognition preferences, not conserved base-pairing) is noted in orange [49].

Fig 7.

Pseudolikelihood maximization does not fully recapitulate the pairwise correlation structure of input alignments.

(A) Average value of the Potts Hamiltonian at intermediate iterations of the Markov chain Monte Carlo sequence emission algorithm across 1000 synthetic sequences. Error bars represent standard deviation. We accept sequences after 5000 iterations. (B-C) Comparison of single-site (B) and pairwise (C) Potts model terms of a ground truth HPM (“training HPM”) versus those of an HPM trained on aligned sequences emitted from the ground truth HPM (“synthetic HPM”). (D-F) Comparisons of marginal single-site probabilities (D), marginal pairwise probabilities (E), and mutual information (F) for the posterior sequence distributions sampled from the training HPM (statistics estimated from MSA2) against the synthetic HPM (statistics estimated from MSA3).
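As an illustration of the pairwise statistics compared in panels (E) and (F), the sketch below computes the mutual information between two alignment columns from their observed pairwise frequencies. It assumes each column is given as a string with one character per sequence, and it uses raw frequencies without pseudocounts or sequence weighting, which may differ from how the statistics in the figure were estimated. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i, col_j: equal-length strings, one character per aligned sequence.
    Uses raw observed frequencies (no pseudocounts, no sequence weights).
    """
    if len(col_i) != len(col_j) or len(col_i) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    pair_counts = Counter(zip(col_i, col_j))  # joint counts f_ij(a, b)
    counts_i = Counter(col_i)                 # marginal counts f_i(a)
    counts_j = Counter(col_j)                 # marginal counts f_j(b)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Two columns that covary perfectly between two equally frequent pairings (for example G:C and C:G) carry 1 bit of mutual information, while columns whose characters vary independently carry none.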
