The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment

doi:10.1371/journal.pcbi.1000852

Table 1.

Dirichlet mixture priors for protein sequence comparison.

Table 2.

Relative entropies for DNA sequence comparison.

Table 3.

The recognition of motif boundaries.

Table 4.

Multiple alignment accuracy.

Figure 1.

Distributions of bit scores from Api-AP2 domains and negative controls.

The histograms in A and B represent data for both positive and negative cases reported by Program 1 at different intermediate stages of a run. The input file contained 107 amino acid sequences: 54 T. gondii proteins with Api-AP2 domain candidates, and 53 random sequences obtained by shuffling the concatenated sequence of 53 of the 54 Api-AP2 proteins and cutting this shuffled string into the original lengths (method of [119]). A Dirichlet mixture prior was specified. A: Results after the initial Gibbs sampling stage. The ungapped local alignment with optimal aggregate BILD score had width 53. For each sequence, we plot the incremental BILD score, resulting from the addition of a segment from that sequence to the alignment of all the other segments, minus the log of the effective length of that sequence. Scores from the real and random sequences are shown in red and blue, respectively. If a prior probability for the existence of a domain in each sequence were specified, segments with scores below a calculated threshold would be rejected. Here, however, the Gibbs sampling step includes one ungapped segment from each of the 107 input sequences in the initial pattern it constructs. B: Results after the iterative gapped alignment stage. In each gapped alignment iteration of Program 1, the evolving length-53 pattern is aligned to each input sequence, perhaps multiple times, using a greedy application of the Erickson-Sellers algorithm. Incremental BILD scores are calculated from the current multiple alignment, excluding the sequence to which it is being realigned. Deletions of length k are assigned a score of −8.5 − k bits, and insertions of length k a score of −9.25 − 0.25k bits. The cost for the existence of a pattern is based on assuming a mean of one instance per sequence, but with uniform probability at all positions of all sequences. In addition, the score for each aligned letter is adjusted slightly to reflect a small cost for not having a gap.
At each iteration, the program reports segments with scores down to −25 bits, but only segments with positive score are included in the next iteration. We show the data reported for the highest-scoring alignment; at this stage, at least one positively scoring segment derives from each of the 54 real sequences but only 2 segments (each with negative score) derive from the 53 random sequences. In total, 88 positive-scoring instances of the pattern are found, at least one from each of the real sequences, but none from the random sequences. In addition, 19 instances of the pattern with negative score are found, 2 of which derive from the random sequences. For an aligned segment, a log-odds bit score of 0 indicates an equal probability of being generated by the model implied by the other sequences, or at random by background amino acid frequencies. In B, the bars are colored according to the presence (cyan) or absence (brown) of strong sequence matches to the 3 beta-strands and the alpha-helix of the core Api-AP2 structure; the positions of these elements are shown in Figure 3. To qualify for a cyan bar, a sequence was required to contain either identities or high-structural-propensity substitutions that match the strongly conserved amino acids (with column BILD score of 1.5 bits per residue or more) in the helix and at least 2 of the 3 beta-strands. The fairly clean separation, near 0 bits, of the cyan bars from the others indicates that a positive score is a good criterion for nominating a segment as an Api-AP2 candidate.
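
The negative controls described in this caption (shuffle the concatenated sequences, then cut the shuffled string back into the original lengths) can be sketched as follows. This is a minimal illustration, not the authors' code or the exact procedure of [119]; the function name `shuffled_controls` and the `seed` parameter are assumptions introduced here.

```python
import random


def shuffled_controls(sequences, seed=0):
    """Build length-matched random controls from a list of sequences.

    Hypothetical helper: pools all residues, shuffles them, and cuts the
    shuffled string into pieces with the same lengths as the inputs.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    pooled = list("".join(sequences))  # concatenate all residues
    rng.shuffle(pooled)                # uniform permutation of the pooled residues

    controls = []
    start = 0
    for seq in sequences:
        end = start + len(seq)
        controls.append("".join(pooled[start:end]))
        start = end
    return controls
```

The controls keep the overall residue composition and the length distribution of the input set, so any pattern scored in them reflects chance alignment rather than conserved domain structure.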

Figure 2.

Near-identical Api-AP2 profiles from two parasites with very different background frequencies.

For P. falciparum (A, B) and T. gondii (C, D), the logos [120] (http://weblogo.berkeley.edu/) represent the letters aligned in the columns of the core Api-AP2 patterns (A, C). In the letter clouds (http://www.wordle.net/advanced) (B, D), the area occupied by each letter indicates the background frequency of an amino acid in the input sequence set (compare Fig. 2.1 of [49]). Colors represent various amino acid classes. For both organisms, Program 1 or 2, run with alternative Dirichlet mixture priors, converged on essentially the same 53- to 54-column core models that correspond to these logos. Api-AP2 models and logos almost identical to these were also obtained from the other apicomplexan parasites Cryptosporidium hominis, Babesia bovis, and Theileria parva, and from the basal alveolate Perkinsus marinus, whereas the distantly related plant AP2 domains and HNH homing endonuclease/integrase domains gave distinct characteristic patterns similar in parts to Api-AP2 (data not shown). Thus, the core structural features of the Api-AP2 domain have been strongly conserved in long-diverged members of the Alveolata, following an ancestral gene expansion, whereas the background amino acid content of these organisms is strikingly different due to genome-wide drift.

Figure 3.

Large insertions in the central loop region of Api-AP2 domains.

As a consequence of asymmetric gap costs, Programs 1 and 2 reported several positive Api-AP2 candidates that have long insertions but, in the other parts of the domain, show high-scoring matches to the canonical pattern. Here, the sequence of T. gondii protein TGME49_06420, which has a 45 amino acid insertion in the central loop region, is shown aligned with the two most closely matching domains of typical length. Program 2, run with a Dirichlet mixture prior and default parameters, assigned the insertion to the central loop location shown, which avoided the more conserved columns of the secondary structural elements indicated above the sequences. In contrast, Program 1 placed the same inserted residues in three separate locations, two of which would disrupt secondary structure. Moreover, an established HMM search method [80] (http://hmmer.janelia.org/) found only the right-end alignment of this TGME49_06420 domain, and with a negative score well below the rejection threshold. Structural assignments E (beta-strand) and H (alpha-helix) are based on homologous experimental structures [121], [122] (PDB codes 2gcc, 3gcc, 3igm).
