Evolutionary Distances in the Twilight Zone—A Rational Kernel Approach

Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.


Method comparison
The generated sequences were passed on to the different methods used to reconstruct phylogenetic trees. Multiple alignments were performed with Muscle 3.7 [9] and ProbCons 1.12 [10], distance matrices were successively computed with DNADIST from the Phylip 3.68 package [11]. We used Jukes-Cantor and PMB (Probability Matrix from Blocks, [12]) distance measures for reconstructing trees to be as similar as possible to the sequence generating process. Fast maximumlikelihood tree reconstruction was performed using RAxML 7.0.4 [13], using the GTRGAMMA and PROTGAMMABLOSUM62 substitution models for nucleotide and protein trees respectively. Other related substitution models as well as estimation of base frequencies from the alignment in RAxML were tested but did not make a significant difference in terms of topological reconstruction accuracy (data not shown). Pairwise global and semi-global alignments were computed using Needle and Stretcher from the EMBOSS 6.0.1 package [14]. The scoring matrices used were EDNAFULL for DNA, and Blosum62 for protein sequences as contained in the EMBOSS package. We used affine gap cost models with gap open costs of 10 and gap extend costs of 1. In general, the same or the most similar available scoring matrices and gap costs were used for all external programs and our own transducer-based approach to make results as comparable as possible. All trees from distance matrices were reconstructed using BioNJ [15]. To determine accuracy of the tree reconstruction method, quartet distances between the reconstructed and the original trees were computed using QDist 2.0.2 [16]. Computations were carried out in parallel on a 160 core computer cluster (80 Xeon 5140 CPUs, 448 GB RAM in total) running SUSE Linux Enterprise Server 10 (x86-64) as operating system. The total calculation time was about 7350 CPU hours.
Finite-state transducers A weighted finite-state transducer over a semiring 1 K is a 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where Σ is the finite input alphabet and ∆ is the finite output alphabet. Q is a finite set of states of which I ⊆ Q is the set of initial states and F ⊆ Q is the set of final states.
is a finite set of transitions, λ : I → K the initial weight function and ρ : F → K the final weight function. Like other finite-state machines, finite-state transducers can be depicted graphically (for an example, see main text).
Using the ⊕-sum and ⊗-product of the semiring K, a transducer assigns an output weight to any pair of input-output strings (x, y) via i.e. the ⊕-sum over all possible paths transforming string x into string y, thereby going from the initial to the final states.
When the semiring K is the tropical semiring the final output weight of the transducer is the minimum over all possible paths transforming x to y (the Viterbi approximation).
For any transducer T , its inverse T −1 is defined by the transducer obtained by exchanging input and output symbol at every transition and exchanging the input and output alphabets.
Two transducers T 1 = (Σ * , ∆ * , Q 1 , I 1 , F 1 , E 1 , λ 1 , ρ 1 ) and T 2 = (∆ * , Ω * , Q 2 , I 2 , F 2 , E 2 , λ 2 , ρ 2 ) can be composed to form a new, usually more complex, transducer, where the composition is defined as: The resulting transducer is of the form T = (Σ * , Ω * , Q, I, F, E, λ, ρ), where the states of T can be identified with pairs of a state of T 1 and a state of T 2 , Q ⊆ Q 1 × Q 2 . In the same sense, initial states are states which are initial in both T 1 and T 2 . Final states are defined equivalently. Transitions are obtained by matching transitions of T 1 with transitions of T 2 in which output symbols of the T 1 transitions have to be input symbols of the T 2 transitions. So, for two transitions e 1 = (q 1 , a, b, w 1 , q 2 ) ∈ E 1 and e 2 = (q 1 , b, c, w 2 , q 2 ) ∈ E 2 , the composed transition is e = ((q 1 , q 1 ), a, c, w 1 ⊗ w 2 , (q 2 , q 2 )). Note that the ⊗-product depends on the semiring used.
Rational kernels A function K over two non-empty sets X and Y defining the map K : X × Y → R is called a kernel. A kernel K over Σ * × ∆ * is a rational kernel, if there exists a weighted transducer T = (Σ, ∆, Q, I, F, E, λ, ρ) over the semiring K and a function ψ : K → R such that for all x ∈ Σ * and y ∈ ∆ * K(x, y) = ψ( T (x, y)) defines the kernel function. A rational kernel K is further symmetric positive semi-definite (PSD), iff it can be decomposed into another transducer S and its inverse S −1 via T = S • S −1 [18].
The problem of scoring two sequences with a transducer (or a rational kernel) is equivalent to solving the single-source shortest distance problem of the transducer composed with its input and output strings in form of finite-state acceptors [19].
In this paper we used the natural link function ψ = exp(−t T (x, y)) for some normalizing constant 0 < t < 1.
Implementation All scripting, parsing of file formats and data handling implemented in Python 2.6 using BioPython 1.54 [20]. All statistical analyses, projection to next positive semi-definite and plotting done in R 2.11.0 [21]. All finite-state transducers implemented using the OpenFST library [22].