Figure 1.
ProtPal's estimates of insertion and deletion rates are the most accurate of any program tested, as measured by the RMSE of values aggregated over all substitution/indel rate categories.
Quantiles containing 90% of the data are shown as a bolded portion of the x-axis, and RMSE is shown to the right of each distribution, the latter computed as described in 1 Equation 1. No aligner approaches the accuracy of the rates estimated with the true alignment, though ProtPal, PRANK, and ProbCons are the top three, with ProtPal the most accurate overall. Many aligners, particularly MUSCLE, CLUSTALW, and MAFFT, significantly underestimate insertion rates and overestimate deletion rates. ProtPal and PRANK perform their own ancestral reconstruction, whereas the other alignment programs were augmented with most-recent-common-ancestor (MRCA) parsimony as described in [55].
Figure 2.
Rate estimation accuracy is highly dependent on the simulated indel rate.
For instance, PRANK is more accurate at lower indel rates, whereas ProbCons is more accurate at higher rates. ProtPal is more accurate than PRANK at all but one rate (0.005) and at least as accurate as ProbCons at all but one rate (0.08). The drift towards underestimation exhibited by most programs indicates that they infer proportionally fewer indels as rates increase, likely due to various forms of gap attraction. Color-coded 90% quantiles and RMSEs are shown underneath and to the right of each group of distributions, respectively. RMSE is computed as described in 1 Equation 1.
Figure 3.
Gap attraction, the canceling of nearby complementary indels, can affect insertion and deletion rates in various ways depending on the phylogenetic relationship of the sequences involved.
All programs are, to some extent, sensitive to situations A and B, whereas phylogenetic aligners can avoid situation C. An insertion at a leaf requires gaps at all other leaves, an understandably costly alignment move when gaps are added without regard to the phylogeny, because each insertion is then penalized multiple times. Such a penalty would cause most non-phylogenetic aligners to prefer the “Inferred alignment” in case C, where there are fewer total gaps. Aligners treating indels as phylogenetic events would penalize each of the implied multiple deletions and only penalize each insertion once, thus preferring the “True alignment” in case C.
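The difference between counting gap characters and counting phylogenetic events can be illustrated with a small sketch. The four-leaf tree, the leaf names, and the two alignment columns below are hypothetical and are not taken from the figure. Fitch parsimony on residue presence/absence is used here only as a simple stand-in for "treating indels as phylogenetic events"; it counts state changes and does not distinguish insertions from deletions.

```python
# Hypothetical four-leaf tree written as nested pairs: ((A, B), (C, D)).
TREE = (("A", "B"), ("C", "D"))


def gap_characters(present):
    """Count gap characters in one alignment column, ignoring the phylogeny."""
    return sum(1 for has_residue in present.values() if not has_residue)


def fitch_events(tree, present):
    """Minimum number of presence/absence changes on the tree (Fitch parsimony)."""
    def visit(node):
        if isinstance(node, str):  # leaf: its state is observed
            return {present[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:  # children agree on some state: no extra event needed
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Column 1: a residue inserted at leaf A only (gaps at B, C, D).
insertion_at_a = {"A": True, "B": False, "C": False, "D": False}
# Column 2: a residue lost independently at leaves B and D.
two_deletions = {"A": True, "B": False, "C": True, "D": False}

for name, column in [("insertion at A", insertion_at_a),
                     ("deletions at B and D", two_deletions)]:
    print(name, gap_characters(column), fitch_events(TREE, column))
```

Under these assumptions, the single insertion costs three gap characters but only one event, while the two independent deletions cost two under either scheme. A gap-counting objective therefore penalizes the insertion more heavily than a phylogenetic one does.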
Figure 4.
Insertion and deletion rates in Amniota show similar distributions, with 95% of genes having rates less than approximately 0.1 indels per synonymous substitution.
Insertion and deletion rates were estimated using reconstructions done with ProtPal from a set of approximately 7,500 protein-coding genes from the OPTIC amniote database [30]. Indel rates were normalized by the synonymous substitution rate of each gene as computed with PAML [53] so that the plotted rate represents the number of expected indels per synonymous substitution. Since these rates are conditioned on the MAP reconstructed history, there are many alignments whose inferred indel rates are zero (197, 174, and 54 for insertions, deletions, and both, respectively).
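As a minimal sketch of the normalization described above, the following divides each gene's indel rate by its synonymous substitution rate and reports the fraction of genes below a threshold. The function names and all numeric values are hypothetical placeholders, not data from the OPTIC database or output from PAML.

```python
def normalize_rates(indel_rates, syn_rates):
    """Express each indel rate as expected indels per synonymous substitution.

    Genes with a non-positive synonymous rate are skipped (an assumption made
    here to avoid division by zero).
    """
    return [rate / ds for rate, ds in zip(indel_rates, syn_rates) if ds > 0]


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


# Made-up per-gene values, for illustration only.
example_indel_rates = [0.01, 0.0, 0.05, 0.002]
example_syn_rates = [0.5, 0.4, 0.25, 0.1]

normalized = normalize_rates(example_indel_rates, example_syn_rates)
zero_rate_genes = sum(1 for v in normalized if v == 0.0)

print(normalized)
print(fraction_below(normalized, 0.1), zero_rate_genes)
```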
Figure 5.
Each path through this state graph represents a possible evolutionary history relating sequences GL and GIV.
By using stochastic traceback algorithms (sampling paths in proportion to their posterior probability; blue highlighted states and transitions), it is possible to select a high-probability subset of the full state graph. By constructing such a subset at each internal node, it is possible to maintain a bound on the state space size during progressive tree traversal while still retaining an ensemble of possible histories.
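The idea of stochastic traceback can be sketched on a toy acyclic state graph. This is an illustrative sketch, not ProtPal's implementation: the state names, transition weights, and function names are all hypothetical. Each path is sampled with probability proportional to the product of its transition weights, and the union of the sampled transitions forms the retained subgraph.

```python
import random

# Hypothetical toy state graph: state -> list of (next_state, weight).
TOY_GRAPH = {
    "start": [("match", 0.7), ("insert", 0.2), ("delete", 0.1)],
    "insert": [("match", 0.5), ("end", 0.5)],
    "match": [("end", 1.0)],
    "delete": [("end", 1.0)],
}


def backward_sums(transitions, end):
    """Total weight of all paths from each state to `end` (graph assumed acyclic)."""
    memo = {end: 1.0}

    def total(state):
        if state not in memo:
            memo[state] = sum(w * total(nxt)
                              for nxt, w in transitions.get(state, []))
        return memo[state]

    for state in transitions:
        total(state)
    return memo


def sample_path(transitions, start, end, back, rng):
    """Sample one start-to-end path with probability proportional to its weight."""
    path = [start]
    state = start
    while state != end:
        successors = transitions[state]
        # Weight each step by the total weight of the paths it can still reach.
        weights = [w * back[nxt] for nxt, w in successors]
        state = rng.choices([nxt for nxt, _ in successors], weights=weights)[0]
        path.append(state)
    return path


def sampled_subgraph(transitions, start, end, n_samples, seed=0):
    """Union of the transitions visited by `n_samples` sampled paths."""
    rng = random.Random(seed)
    back = backward_sums(transitions, end)
    kept = set()
    for _ in range(n_samples):
        path = sample_path(transitions, start, end, back, rng)
        kept.update(zip(path, path[1:]))
    return kept


print(sorted(sampled_subgraph(TOY_GRAPH, "start", "end", n_samples=5)))
```

With few samples, low-weight transitions tend to be absent from the retained set, which is how the sampled subgraph stays small while still representing more than one history.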