Accurate Reconstruction of Insertion-Deletion Histories by Statistical Phylogenetics

The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.


Simulation parameters and setup
Data generation Our simulation study is comprised of alignments simulated using 5 different indel rates (0.005, 0.01, 0.02, 0.04, and 0.08 indels per unit time), each with 3 different substitution rates (0.5, 1, and 2 expected substitutions per unit time) and 100 replicates. Time is defined such that a sequence evolving for time t with substitution rate r is expected to accumulate rt subsitutions per site. We employed an independent third-party simulation program, indel-seq-gen, specifically designed to generate realistic protein evolutionary histories [1]. indel-seq-gen is capable of modeling an empirically-fitted indel length distribution, rate variation among sites, and a neighbor-aware distribution over inserted sequences allowing for small local duplications. Since the indel and substitution model used by indel-seq-gen are separate from (and richer than) those used by ProtPal, ProtPal has no unfair advantage in this test.
indel-seq-gen v2.0.6 was run with the following command: cat guidetree.tree| indel-seq-gen -m JTT -u xia --num gamma cats 3 -a 0.372 --branch scale r/b --outfile simulated alignment.fa --quiet --outfile format f -s 10000 --write anc The above command uses the "JTT" substition model, the "xia" indel fill model (based on neighbor effects, estimated from E coli k-12 proteins [1]), and 3 gamma-distributed rate categories with shape 0.372. Branch lengths are scaled by the substitution rate for simulation rate r, normalized by the inverse of indelseq-gen's underlying substitution rate (b = 1.2) so as to adhere to the above definition of evolutionary "time". Similarly, indel rates, which are set in the guide tree file guidetree.tree, are scaled by b r so that tλ * insertions/deletions are expected over time t for rate λ * .
The root mean squared error (RMSE) for each error distribution was computed as follows: The true tree was made available to all programs which can utilize a tree (ProtPal, PRANK, MUSCLE), representing the use case in which the true tree is known (e.g. via the species tree) but the true alignment is unknown.
We ran simulations on three different phylogenies: a tree of twelve sequenced Drosophila genomes [2] and trees from the mammalian and amniotic clades of the OPTIC database. We here report results for the Drosophila tree, which we empirically observe to show trends consistent with, but more pronounced than, those of the mammalian and amniotic trees. The clearer trends may be due to the Drosophila tree being larger than the other trees (12 taxa), or having a diverse range of branch lengths (0.001 -0.59 expected substitutions/site, at the genome-wide average rate). The simulation data, reconstructions, and analysis scripts are available from http://biowiki.org/~oscar/ protpal_simulation_benchmark.bz2.
Alignment We investigated several multiple alignment tools [3][4][5][6][7][8] in combination with alignment-conditioned reconstruction methods. Programs were run with their default settings, with the exception of PRANK and MUSCLE.
To specify ancestral inference, the guide tree, and "insertions opening forever", PRANK used the extra options "-writeanc -t <treefile> +F". PRANK's -F option allows insertions to match characters at alignments closer to the root.
This can be a useful heuristic safeguard when an incorrect tree may produce errors in subtree alignments that cannot be corrected at internal nodes closer to the root. Since the true guide tree is provided to PRANK, it is safe to treat insertions in a strict phylogenetic manner via the +F option. For computational efficiency, ProtPal was provided with a CLUSTALW guide alignment. Any alignment of the sequences can be used as a guide, and we chose CLUSTALW for its general poor performance, so that ProtPal would gain no unfair advantage by the information contained in the guide alignment. MUSCLE was provided the guide tree with the additional option "-usetree <treefile>".
where the latter step assumes a flat prior, P (θ ) = const.
This statistic is not without its problems. For one thing, we use an initial guess of θ to estimateĤ. Furthermore, for an unbiased estimate, we should sum over all histories, rather than conditioning on the MAP reconstructed history. This summing over histories would, however, require multiple expensive calculations of P (S|T, θ), where conditioning onĤ requires only one such calculation. We further justify our benchmark of parameter estimates conditioned on a MAP-reconstructed history by noting that this the de facto method employed by large-scale genomics studies focusing on indels [10-13].
As well as imputing indel rates from reconstructed histories, we also tried using the lambda.pl program in the DAWG package [14], which estimates indel rates from MSAs directly (without attempting reconstruction).
Estimating substitution rates Substitution rates were estimated for each inferred alignment using XRate's built-in EM algorithm and the following simple rate matrix. Given an equilibrium distribution over amino acid characters, with π i defining the proportion of character i, the rate of character i mutating to j is set to rπ j where r is the only free rate parameter. XRate's estimate of r is taken to be the average substitution rate of the MSA.
By using indel-seq-gen's branch-scale option and changing the indel rate parameters accordingly, we are able to modulate the substitution and indel rates independently in the data generation step. This true substitution rate and the rate inferred by XRate are then directly comparable.

Additional OPTIC Figures
In addition to estimating indel rates for all genes in the OPTIC set, we performed various other analyses which were left out of the main text for reasons of space limitations. We provide figures those displaying results here. Ancestral Branch of Origin Program Figure 1: ProtPal correctly reconstructs the age of more extant residues than any other program tested. The y-axis shows the proportion of extant residues whose point of origin on the phylogenetic tree was correctly pinpointed by the reconstruction. The branch of origin was found by taking the tree node closest to the root containing a non-gap reconstructed character. All programs except FSA are in the 92%-94% range, owing to the fact that many columns (especially at low indel rates) are devoid of indels, making inference of origin trivial (as these columns' origin is pre-root).  Figure 2: Cross-comparison of AMA scores and rate estimation accuracy reveals that using a single metric to assess alignment accuracy can be unreliable. AMA scores were computed for each programs alignment of only leaf sequences using cmpalign from the DART package [15]. AMA scores are comparable across programs until higher indel rates, where FSA performs best-contrasting with Figures 1 and 2 (main text). MUSCLE's accurate deletion rate measurements at high rates and the low corresponding AMA scores suggest a "cancellation of biases".   Figure 3: Most programs are relatively robust to variations in the simulated substitution rate, as evidenced by the benchmark data grouped according to substitution rate. Accuracy of rate estimation is plotted as |true − inf erred| on the y-axis, with bars grouped by program for each indel rate and 3-tuple of substitution rates. Higher substitution rates often lead to higher error, presumably because they obscure homologies, making it more difficult to distinguish substitutions from indels. FSA appears more sensitive to increased substitutions than other programs -at indel rate 0.02, FSA's insertion rates are as accurate as ProtPal's at 0.5 and 1.0 substitutions per site, whereas at the highest substitution rate (2.0), its error exceeds that of CLUSTALW.  Branch -speci c Indel Rates -OPTIC Figure 5: Reconstruction allows for estimation of branch-specific indel rates, revealing possibly interesting signals of evolution. Indel rates were averaged over all alignments, using the species tree shown in Figure 6. The human branch (Euarchontoglires -H.sapiens) appears to have experienced unusually many insertions. The Amniota -Australophenids (pink) branch has a higher deletion than insertion rate, though it is difficult to distinguish an insertion on this branch from a deletion on the Amniota -G.gallus (navy) branch. All other branches are comparable between insertions and deletions. Each bar is colored according to branches in Figure 6.  Figure 6: The phylogenetic tree used for analysis of OPTIC data, colored to inform the branch-specific Figure 5.  Figure 7: Distributions over amino acids are highly non-uniform, and differ between insertions, deletions, and the overall distribution seen in OPTIC. Inserted, deleted, and all sequences were separately pooled across all OPTIC genes reconstructed and amino acid distributions were computed for each.  Figure 9: Indels are highly non-uniform in their distribution across genes: we see a 6-fold enrichment for insertions within the N-terminal 1% of the protein sequence, and a 1.4-fold increase within the C-terminal 1%. There is an 14-fold enrichment in deletions within the N-terminal 1% of the protein sequence, and a 1.8-fold increase within the C-terminal 1%. Indel locations are normalized by gene length to enable combining data across all OPTIC genes analyzed. This may be a mix of genuine biology (e.g. indels occur more often near the ends of genes) and artifacts (annotation errors are more likely to occur at the ends of genes).