Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties

doi:10.1371/journal.pcbi.1004936

Fig 1.

GISMO block-based and hidden Markov models.

A. Schematic of a hypothetical GISMO phase 1 block-based alignment, which is initialized to consist of many short, ungapped aligned blocks. B. Architecture of the GISMO phase 2 HMM. Transitions between states that are drawn as red arrows emit residues. Transition probabilities are inferred from the sequence data. Note that the HMM is local with respect to each sequence but global with respect to the model.

Fig 2.

Comparison of GISMO to five other MSA programs.

As described in the text, for each analysis the CDD test sets were first ordered by the property specified on the x-axis and then split into four equal-sized partitions. The x-coordinate of each data point is the average of that property over the test sets assigned to a partition; similarly, the GISMO ΔSP-scores for each program are averages taken over these partitions. A. GISMO ΔSP-scores as a function of the number of sequences. For comparison, an additional, leftmost set of data points (shown with back-glow) corresponds to 162 out of 218 Balibase 3 test sets; for the remaining 56 Balibase sets, GISMO failed to find a statistically significant alignment, presumably due to sparse data: some of these sets have as few as 4 sequences. B. GISMO ΔSP-scores as a function of the number of truncated sequences, as defined in the text. C. GISMO ΔSP-scores as a function of the ratio between the domain length and the mean sequence length. For sequence sets with low ratios, the shared domain is more challenging to align because the search space is larger. D. GISMO ΔSP-scores as a function of average relative entropy (with respect to a standard background amino acid distribution and expressed in nats, with 1 nat = 1/ln(2) bits) over all column positions in each benchmark MSA; sequence diversity can be understood as inversely related to relative entropy. For sequence sets with low relative entropy, the shared domain is more difficult to align because conservation is weaker.
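
The ordering, partitioning and averaging described in this caption can be sketched as follows. This is only an illustration, not the authors' code: the function name `partition_averages`, the dictionary keys, and the reading of ΔSP as "GISMO SP-score minus the other program's SP-score" are assumptions, and any test sets left over after an even split are simply placed in the last partition.

```python
from statistics import mean

def partition_averages(test_sets, prop, n_parts=4):
    """Order test sets by `prop`, split them into `n_parts` partitions,
    and return (mean property, mean delta-SP) for each partition.

    Each test set is a dict with the key `prop` and a "delta_sp" key
    (hypothetical names; delta_sp is ASSUMED to be GISMO's SP-score
    minus the SP-score of the program being compared).
    """
    ordered = sorted(test_sets, key=lambda t: t[prop])
    size = len(ordered) // n_parts
    points = []
    for i in range(n_parts):
        start = i * size
        # The last partition absorbs any remainder (an assumption).
        end = (i + 1) * size if i < n_parts - 1 else len(ordered)
        chunk = ordered[start:end]
        x = mean(t[prop] for t in chunk)
        y = mean(t["delta_sp"] for t in chunk)
        points.append((x, y))
    return points

# Made-up data: eight test sets described by their number of sequences.
example = [{"n_seqs": n, "delta_sp": 0.25 * n} for n in [8, 3, 1, 6, 2, 7, 5, 4]]
for x, y in partition_averages(example, "n_seqs"):
    print(f"mean n_seqs = {x:.1f}, mean delta-SP = {y:.3f}")
```

Each returned pair corresponds to one plotted data point: the x-coordinate is the partition average of the property, and the y-coordinate is the partition average of the score difference.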

Fig 3.

Variability in SP-scores among six GISMO runs and among the six programs GISMO, MAFFT, CLUSTAL-Ω, MUSCLE, Dialign and Kalign.

SP-scores are based upon the CDD MSAs as benchmarks and vary from 0 (no correctly aligned sequence pairs) to 1 (all pairs aligned correctly). A. The sorted SP-scores for a single GISMO run (red line with yellow back-glow) compared with the sorted scores for the five other programs. B. Run-to-run variability in SP-scores over six GISMO runs. Test set data points are sorted along the x-axis by the SP-score obtained for each set on the first run (red data points) of six. C. SP-scores for the six programs analyzed, sorted by the GISMO score on each test set. GISMO SP-scores (for a single run) are shown in red. Each red data point and the five black data points (one point for each program) plotted in the same column correspond to the same test set. D. SP-scores for the six programs, sorted by the CLUSTAL-Ω score on each test set. Data points for GISMO and for CLUSTAL-Ω are shown in red and green, respectively.
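
For readers unfamiliar with the measure, a common sum-of-pairs definition is the fraction of residue pairs aligned in the benchmark MSA that are also aligned in the test MSA. The sketch below assumes that definition; the exact variant used in the paper may differ, and the function names `aligned_pairs` and `sp_score` are hypothetical.

```python
from itertools import combinations

def aligned_pairs(msa):
    """Return the set of residue pairs that share a column.

    `msa` is a list of equal-length gapped strings ('-' marks a gap),
    given in the same sequence order for every alignment compared.
    A residue is identified by (sequence index, ungapped position).
    """
    positions = [0] * len(msa)
    pairs = set()
    for col in range(len(msa[0])):
        present = []
        for seq_idx, row in enumerate(msa):
            if row[col] != "-":
                present.append((seq_idx, positions[seq_idx]))
                positions[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_score(test_msa, benchmark_msa):
    """Fraction of benchmark residue pairs reproduced by the test MSA."""
    ref = aligned_pairs(benchmark_msa)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test_msa)) / len(ref)

benchmark = ["AC-G", "ACTG"]
print(sp_score(["AC-G", "ACTG"], benchmark))        # identical alignment
print(sp_score(["ACG-", "ACTG"], benchmark))        # one benchmark pair lost
print(sp_score(["----ACG", "ACTG---"], benchmark))  # no shared columns
```

Under this definition a score of 1 means every benchmark pair is recovered and a score of 0 means none is, which matches the range stated in the caption.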

Fig 4.

Log-log plots of program runtimes as a function of the total input length.

Each data point corresponds to one MSA generated by the program indicated. Time complexities estimated from the trendline slopes were: GISMO, t ∝ N^0.96; CLUSTAL-Ω, t ∝ N^1.6; Kalign, t ∝ N^1.6; MUSCLE, t ∝ N^2.1; and Dialign, t ∝ N^2.2, where N is the total number of residues in the aligned sequences. A trendline is not shown for MAFFT because, with the --auto option, it uses one of several different algorithms depending on the input sequence set; this produces a discontinuity in the data points.
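
An exponent of this kind is the slope of a straight line fitted to log(t) against log(N), since t ∝ N^k implies log t = k·log N + c. The sketch below shows one way to obtain such a slope by ordinary least squares; the fitting procedure actually used for the figure is not stated here, so this is an assumed method, and the function name and the data are made up for illustration.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares fit of log10(t) = k * log10(N) + c.

    Returns (k, c); k estimates the exponent in t ∝ N^k.
    """
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, mean_y - k * mean_x

# Synthetic runtimes that follow t = 2e-6 * N^1.6 exactly (not real data).
sizes = [1e3, 1e4, 1e5, 1e6]
times = [2e-6 * n ** 1.6 for n in sizes]
k, c = loglog_slope(sizes, times)
print(f"estimated exponent k = {k:.2f}")
```

A single fit like this is only meaningful when one algorithm is used across the whole range of input sizes, which is why a program that switches algorithms with input size does not yield a single trendline.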

Fig 5.

GISMO acetylase domain alignment.

Representative proteins of known structure are shown; no two of them share more than 27% sequence identity over the domain footprint. The full alignment consists of 2,125 sequences.
