
Insertions and deletions as phylogenetic signal in an alignment-free context

  • Niklas Birth,

    Roles Investigation, Methodology, Software, Validation, Writing – review & editing

    Affiliation Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany

  • Thomas Dencker,

    Roles Methodology, Resources, Software, Writing – original draft

    Affiliation Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany

  • Burkhard Morgenstern

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing

    bmorgen@gwdg.de

    Affiliations Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany; Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany; Campus-Institute Data Science (CIDAS), Göttingen, Germany

Abstract

Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.

Author summary

Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods. Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods, but rather a possible additional source of information for tree inference.

This is a PLOS Computational Biology Methods paper.

1 Introduction

Most phylogenetic studies are based on multiple sequence alignments (MSAs), either of partial or complete genomes or of individual genes or proteins. If MSAs of multiple genes or proteins are used, there are two possibilities to infer a phylogenetic tree: (1) the alignments can be concatenated to form a so-called superalignment or supermatrix. Tree building methods such as Maximum-Likelihood [1, 2], Bayesian Approaches [3] or Maximum-Parsimony [4–6] can then be applied to these superalignments. (2) One can calculate a separate tree for each gene or protein family and then use a supertree approach [7] to amalgamate these different trees into one final tree, with methods such as ASTRAL [8] or MRP [9].

Multiple sequence alignments usually contain gaps representing insertions or deletions (indels) that are assumed to have happened since the aligned sequences evolved from their last common ancestor. Gaps, however, are usually not used for phylogeny reconstruction. Most of the above tree-reconstruction methods are based on substitution models for nucleotide or amino-acid residues. Here, alignment columns with gaps are either completely ignored, or gaps are treated as ‘missing information’, for example in the frequently used tool PAUP* [6]. Some models have been proposed that can include gaps in a Maximum-Likelihood setting, such as TKF91 [10] and TKF92 [11], see also [12–14]. Unfortunately, these models do not scale well to genomic data. Thus, indels are rarely used as a source of information for phylogenetic analysis.

In those studies that actually make use of indels, this additional information is usually encoded in some simple manner. The most straightforward way of doing this is to treat the gap character as a fifth character for DNA comparison, or as a 21st character in protein comparison, respectively. This means that the lengths of gaps are not explicitly considered, so a gap of length > 1 is considered to represent independent insertion or deletion events. Some more issues with this approach are discussed in [15]; these authors introduced the ‘simple encoding’ of indel data as an alternative. For every indel in the multiple sequence alignment, an additional column is appended. This column contains a present/absent encoding for an indel event which is defined as a gap with given start and end positions. If a shorter gap is fully contained in a longer gap in another sequence, it is considered as missing information. Such a simple binary encoding is an effective way of using the length of the indels to gain additional information and can be used in a maximum-parsimony framework. A disadvantage of these approaches is their relatively long runtime. The above authors also proposed a more complex encoding of gaps [15] which they further refined in a subsequent paper [16]. The commonly used approaches to encode gaps for phylogeny reconstruction are compared in [17].

The ‘simple encoding’ of gaps has been used in many studies; one recent study obtained additional information on the phylogeny of Neoaves which was hypothesized to have a ‘hard polytomy’ [18]. Despite such successes, indel information is still largely ignored in phylogeny reconstruction. Oftentimes, it is unclear whether using indels is worth the large overhead and increased runtime. On the other hand, it has also been shown that gaps can contain substantial phylogenetic information [19].

All of the above-mentioned approaches to using indel information for phylogeny reconstruction require MSAs of the compared sequences. Nowadays, the amount of available molecular data is rapidly increasing, due to the progress in next-generation sequencing technologies. If the size of the analyzed sequences increases, calculating multiple sequence alignments quickly becomes too time consuming. Thus, in order to provide faster and more convenient methods for phylogenetic reconstruction, many alignment-free approaches have been proposed in recent years. Most of these approaches calculate pairwise distances between sequences, based on sequence features such as k-mer frequencies [20–22] or the number [23] or length [24–26] of word matches. Distance methods such as Neighbor-Joining [27] or BIONJ [28] can then reconstruct phylogenetic trees from the calculated distances. For an overview, the reader is referred to recent reviews of alignment-free methods [29–31].

Some recently proposed alignment-free methods use inexact word matches between pairs of sequences [32–34], where mismatches are allowed to some degree. Such word matches can be considered as pairwise, gap-free ‘mini-alignments’. So, strictly speaking, these methods are not ‘alignment-free’. In the literature, they are still called ‘alignment-free’, as they circumvent the need to calculate full sequence alignments of the compared sequences. The advantage of such ‘mini-alignments’ is that inexact word matches can be found almost as efficiently as exact word matches, by adapting standard word-matching algorithms.

A number of these methods use so-called spaced-words [22, 35, 36]. A spaced-word is a word that, in addition to nucleotide or amino-acid symbols, contains wildcard characters at certain positions that are specified by a pre-defined binary pattern P representing ‘match positions’ and ‘don’t-care positions’, see Fig 1 for an example. If the same ‘spaced word’ occurs in two different sequences, this is called a Spaced-word Match or SpaM, for short. One way of using spaced-word matches–or other types of inexact word matches–in alignment-free sequence comparison is to use them as a proxy for full alignments, to estimate the number of mismatches per position in the (unknown) full sequence alignment. This idea has been implemented in the software Filtered Spaced Word Matches (FSWM) [34]; it has also been applied to protein sequences [37], and to unassembled reads [38].

Fig 1. Binary pattern P = ‘110101’ representing match positions (‘1’) and don’t-care positions (‘0’), and a spaced word ‘A G * C * A’ with respect to P, occurring in sequences S1 and S2.

The occurrence of the same spaced word in two different sequences is called a Spaced-word Match (SpaM). A SpaM w.r.t. P is, thus, a local gap-free alignment where matching residues are aligned at the match positions of P, while mismatches are possible at the don’t-care positions. In the above toy example, we find at the don’t care positions one mismatch (A-G) and one match (T-T). A score can be calculated for each SpaM based on the residues aligned to each other at the don’t-care positions. If the number of don’t-care positions in the underlying pattern P is sufficiently large, ‘homologous’ SpaMs can be reliably distinguished from background by their scores [34].

https://doi.org/10.1371/journal.pcbi.1010303.g001

In such approaches, it is crucial to use only those SpaMs that align homologous segments of the compared sequences and to discard random SpaMs. FSWM and related programs filter out non-homologous SpaMs by comparing the residues aligned to each other at the don’t-care positions of the SpaMs. As shown in Fig 1, a score can be calculated based on these residue pairs, and all SpaMs with a score below a certain threshold are discarded. As we have shown in previous papers, this approach can reliably distinguish between homologous and background SpaMs [34]. Other approaches have been proposed recently that use the number of SpaMs to estimate phylogenetic distances between DNA sequences [36, 39]; see [40] for a review of the various SpaM-based methods.

Multi-SpaM [41] is a recent extension of the SpaM approach to multiple sequence comparison. For a set of four or more input sequences, and for a binary pattern P, Multi-SpaM finds occurrences of the same spaced word with respect to P in four different input sequences. Such a spaced-word match is called a quartet P-block, or quartet block, for short. A quartet block, thus, consists of four occurrences of the same spaced word, with respect to a specific pattern P, as in Fig 2. For each such block, Multi-SpaM identifies an optimal quartet tree topology based on the nucleotides aligned to each other at the don’t-care positions of P, using the program RAxML [1]. Finally, the quartet trees calculated in this way are used to find a supertree of the full set of input sequences. To this end, Multi-SpaM uses the program Quartet MaxCut [42].

Fig 2. Quartet block with respect to the binary pattern 110101 representing match positions (‘1’) and don’t-care positions (‘0’).

The quartet block shown involves sequences S2, S3, S5, S7; the spaced word ‘A G * C * A’ occurs in all four sequences. A quartet block can be seen as a local, gap-free four-way alignment with matching residues at the match positions and possible mismatches at the don’t-care positions of the underlying binary pattern. Note that this is a toy example; in practice, we are using binary patterns of length 110 with 10 match and 100 don’t-care positions.

https://doi.org/10.1371/journal.pcbi.1010303.g002

In the present paper, we use pairs of quartet blocks involving the same four sequences. We consider the distances between two blocks in the four sequences, to obtain hints about potential insertions and deletions that may have occurred between two quartet blocks. If these distances are different for two of the sequences, this would indicate that an insertion or deletion has happened since these sequences evolved from their last common ancestor. The distances between two quartet blocks can therefore support one of three possible quartet topologies for the four involved sequences. If, for example, in a pair of quartet blocks involving sequences Si, Sj, Sk, Sl, the distance between these blocks is equal in Si and Sj as well as in Sk and Sl but the distance in Si and Sj is different from the one in Sk and Sl, this would support a quartet tree where Si and Sj are neighbours, as well as Sk and Sl; an example is shown in Fig 3.

Fig 3. Two quartet blocks B1 and B2 (in green and purple) with respect to binary patterns 1101 and 10111, and with the matching spaced words ‘A G * C’ and ‘C * G T A’, respectively, involving the same four sequences S2, S4, S5, S8.

The distances between B1 and B2 in these sequences are D2 = D5 = 2 and D4 = D8 = 3. In the sense of maximum parsimony, these distances would support the quartet topology S2S5|S4S8, since this topology would require only one insertion/deletion (indel) event to explain the distances Di while the alternative two quartet topologies for the involved sequences would require two indel events. With our terminology, we say that this topology is strongly supported by the four distance values Di.

https://doi.org/10.1371/journal.pcbi.1010303.g003

To evaluate the phylogenetic signal that is contained in such pairs of quartet blocks, we first evaluate the inferred quartet topologies directly, by comparing them to trusted reference trees. Next, we use two different methods to infer a phylogenetic tree for the full set of input sequences, based on the distances between quartet blocks. (A) We calculate super trees based on the inferred quartet trees using the software Quartet MaxCut. (B) We use distances between pairs of blocks as characters in a maximum-parsimony setting, to find a tree that minimizes the number of insertions and deletions that have to be assumed, given the different distances between the quartet blocks. We evaluate these approaches on data sets that are commonly used as benchmark data in alignment-free sequence comparison. Our evaluation shows that the majority of the inferred quartet trees is correct and should therefore be useful additional information for phylogeny reconstruction. Moreover, the quality of the trees that we can infer from our quartet block pairs alone is roughly comparable to the quality of trees obtained with existing alignment-free methods.

The goal of our study is to show that insertions and deletions can be used as phylogenetic signal in an alignment-free context. Note that the information from putative indels is complementary to the information used in standard phylogeny approaches where aligned residues are used to infer substitutions that may have happened in the evolution of the sequences. Consequently, our approach is not competing with these existing methods but may be used as additional evidence that might support or call into question phylogenies inferred by more traditional approaches.

2 Methods

2.1 Spaced words, quartet blocks and distances between quartet blocks

We are using standard notation from stringology as defined, for example, in [43]. For a sequence S over some alphabet, S(i) denotes the i-th symbol of S. In order to investigate the information that can be obtained from putative indels in an alignment-free context, we use the P-blocks generated by the program Multi-SpaM [41]. At the start of every run, a binary pattern P ∈ {0, 1}^ℓ is specified for some integer ℓ. Here, a “1” in P denotes a match position, a “0” stands for a don’t-care position. The number of match positions in P is called its weight and is denoted by w. By default, we are using parameter values ℓ = 110 and w = 10, so by default the pattern P has 100 don’t-care positions.

A spaced word W with respect to a pattern P is a word over the alphabet {A, C, G, T} ∪ {*} of the same length as P, and with W(i) = * if and only if i is a don’t-care position of P, i.e. if P(i) = 0. If S is a sequence of length N over the nucleotide alphabet {A, C, G, T}, and W is a spaced word of length ℓ, we say that W occurs in S at some position i ∈ {1, …, N − ℓ + 1}, if S(i + j − 1) = W(j) for every match position j in P. For two sequences S and S′ and positions i and i′ in S and S′, respectively, we say that there is a spaced-word match (SpaM) between S and S′ at (i, i′), if the same spaced word W occurs at i in S and at i′ in S′. A SpaM can be considered as a local pairwise alignment without gaps. Given a nucleotide substitution matrix, the score of a spaced-word match is defined as the sum of the substitution scores of the nucleotides aligned to each other at the don’t-care positions of the underlying pattern P. In FSWM and Multi-SpaM, we are using a substitution matrix described in [44]. In FSWM, only SpaMs with positive scores are used. It has been shown that this SpaM-filtering step can effectively eliminate most random spaced-word matches [34].
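To make these definitions concrete, the following minimal Python sketch extracts spaced words, finds spaced-word matches between two sequences and scores them at the don’t-care positions. It is an illustration only and not the FSWM or Multi-SpaM code: the function names are hypothetical, positions are 0-based (the text uses 1-based positions), and a toy match/mismatch score of +1/−1 is assumed in place of the substitution matrix from [44].

```python
from collections import defaultdict


def spaced_word(seq, start, pattern):
    """Spaced word of `seq` at 0-based `start` w.r.t. a binary pattern string.

    Returns None if the pattern does not fit into the sequence at `start`.
    """
    if start < 0 or start + len(pattern) > len(seq):
        return None
    return "".join(seq[start + j] if bit == "1" else "*"
                   for j, bit in enumerate(pattern))


def spam_score(s1, i1, s2, i2, pattern, match=1, mismatch=-1):
    """Sum of scores over the don't-care positions (toy scores, an assumption)."""
    score = 0
    for j, bit in enumerate(pattern):
        if bit == "0":
            score += match if s1[i1 + j] == s2[i2 + j] else mismatch
    return score


def find_spams(s1, s2, pattern):
    """All spaced-word matches between s1 and s2 as (pos1, pos2, score)."""
    # Index every spaced word of s1 by its start position.
    index = defaultdict(list)
    for i in range(len(s1) - len(pattern) + 1):
        index[spaced_word(s1, i, pattern)].append(i)
    hits = []
    for k in range(len(s2) - len(pattern) + 1):
        for i in index.get(spaced_word(s2, k, pattern), []):
            hits.append((i, k, spam_score(s1, i, s2, k, pattern)))
    return hits
```

With the pattern ‘110101’ from Fig 1, a filter in the spirit of FSWM would then keep only the hits whose score is positive.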

For a set of ≥4 input sequences and a binary pattern P of length ℓ, the program Multi-SpaM is based on quartet (P)-blocks, where a quartet block is defined as four occurrences of some spaced word W in four different sequences, see Fig 2 for an example. A quartet block B can, thus, be considered as a local gap-free four-way alignment, aligning length-ℓ segments of four sequences; we say that B ‘involves’ these four sequences. To exclude spurious random quartet blocks, Multi-SpaM removes quartet blocks with a low degree of similarity between the aligned segments. Technically, a quartet block is required to contain one occurrence of the spaced word W, such that the other three occurrences of W have positive similarity scores with this first occurrence. For a given nucleotide substitution matrix, the similarity score of two spaced words (with respect to the same pattern P) is defined as the sum of the substitution scores of the nucleotides aligned to each other at the don’t-care positions of P.
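The filtering criterion for quartet blocks can be sketched as follows. This is a hedged illustration rather than the Multi-SpaM implementation: the function names are hypothetical, the four aligned segments are passed in directly as strings of the same length as the pattern, and toy scores of +1/−1 are assumed instead of the substitution matrix from [44].

```python
def pair_score(a, b, pattern, match=1, mismatch=-1):
    """Similarity score of two segments at the don't-care positions (toy scores)."""
    return sum(match if x == y else mismatch
               for x, y, bit in zip(a, b, pattern) if bit == "0")


def is_quartet_block(segments, pattern, match=1, mismatch=-1):
    """True if four segments form an acceptable quartet block.

    The segments must share the same spaced word, and at least one segment
    must have a positive score with each of the other three.
    """
    if len(segments) != 4 or any(len(s) != len(pattern) for s in segments):
        return False
    # All four segments must agree at every match position.
    for j, bit in enumerate(pattern):
        if bit == "1" and len({s[j] for s in segments}) != 1:
            return False
    # Try each segment as the 'first occurrence' of the spaced word.
    for c, center in enumerate(segments):
        if all(pair_score(center, other, pattern, match, mismatch) > 0
               for o, other in enumerate(segments) if o != c):
            return True
    return False
```

Note that the criterion does not require all six pairwise scores to be positive; one segment that is sufficiently similar to the other three is enough.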

2.2 Phylogeny inference using distances between quartet blocks

In this paper, we are considering pairs of quartet blocks involving the same four sequences, and we are using the distances between the two blocks in these sequences as phylogenetic signal. The first block in a block pair is called the reference block. To find reference blocks, we use the program Multi-SpaM. This program identifies quartet blocks with respect to a binary pattern P1, as explained in Section 2.1. Here, a score is calculated for each quartet block, based on the don’t-care positions, to exclude random spaced-word matches, as detailed above. For each reference block with a score above the threshold, our new approach then searches for a second quartet block, involving the same four sequences, possibly with a different pattern P2, and within a window of L nucleotides in each sequence, to the right of the reference block. By default, we are using a window size of L = 500. For the second block, we do not calculate a score, since the probability of finding a quartet block within such a window by chance is very small.

Let us consider two quartet blocks, a reference block B1 and a corresponding second block B2 as described above, with respect to patterns P1 and P2, respectively, involving the same four sequences Si, Sj, Sk, Sl. By definition, B1 is strictly to the left of B2, in the sense that the last position of B1 is smaller than the first position of B2 in all four sequences. Next, let Dι be the distance between B1 and B2 in sequence Sι, ι ∈ {i, j, k, l}. More formally, if in sequence Sι block B1 starts at position k1 and block B2 starts at position k2, then we define Dι to be k2 − k1 − ℓ1, where ℓ1 is the length of the pattern P1. In other words, Dι is the length of the segment between B1 and B2 in Sι, see Figs 3 and 4 for examples. As explained, we can assume that the blocks B1 and B2 are representing true homologies, i.e. for each of them the respective segments go back to a common ancestor in evolution. Then, if we find for two sequences, say Si and Sj, that their distances Di and Dj between B1 and B2 are different from each other, this would imply that at least one insertion or deletion must have happened since Si and Sj have evolved from their last common ancestor. If, by contrast, Di = Dj holds, no such insertion or deletion needs to be assumed.
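The distance between two blocks is simple to compute once the block start positions are known. The sketch below assumes, for illustration, that each block is given as a dictionary mapping a sequence identifier to the start position of the block in that sequence; the function name and this data layout are hypothetical and not taken from Gap-SpaM.

```python
def block_distances(b1_starts, b2_starts, len_p1):
    """Distance between B1 and B2 in each sequence.

    b1_starts, b2_starts: dict mapping sequence id -> start position of the block.
    len_p1: length of the pattern P1 underlying the reference block B1.
    The distance is the start of B2 minus the start of B1 minus len_p1,
    i.e. the length of the segment between the two blocks.
    """
    if b1_starts.keys() != b2_starts.keys():
        raise ValueError("both blocks must involve the same sequences")
    distances = {}
    for seq_id, k1 in b1_starts.items():
        d = b2_starts[seq_id] - k1 - len_p1
        if d < 0:
            raise ValueError("B1 must be strictly to the left of B2")
        distances[seq_id] = d
    return distances
```

For block positions chosen to mimic Fig 3 (pattern 1101 for B1, so a pattern length of 4), this yields distances of 2 for S2 and S5 and of 3 for S4 and S8.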

Fig 4. Two quartet blocks, similar to those in Fig 3, but involving S1, S4, S5, S6, and with distances D1 = D4 = 2, D5 = 3 and D6 = 4.

Here, we say that the distances weakly support the topology S1S4|S5S6, since only D1 and D4 are equal, while D5 and D6 are different from each other and from D1 and D4.

https://doi.org/10.1371/journal.pcbi.1010303.g004

There are three possible fully resolved (i.e. binary) quartet topologies for the four sequences Si, …, Sl that we denote by SiSj|SkSl etc. In the sense of the parsimony paradigm, we can consider the distance between two blocks as a character and Dι as the corresponding character state associated with sequence Sι. If two distances, say Di and Dj, are equal, and the other two distances, Dk and Dl, are also equal to each other but different from Di and Dj, this would support the tree topology SiSj|SkSl: with this topology, one would have to assume only one insertion or deletion to explain the character states, while for SiSk|SjSl or SiSl|SjSk, two insertions or deletions would have to be assumed. In this situation, i.e. if we have Di = Dj ≠ Dk = Dl, we say that the pair (B1, B2) strongly supports topology SiSj|SkSl.

Next, we consider the situation where two of the distances are equal, say Di = Dj, while Dk and Dl are different from each other, and also different from Di and Dj. From a parsimony point of view, all three topologies would be equally good in this case, since each of them would require two insertions or deletions. It may still seem more plausible, however, to prefer the topology SiSj|SkSl over the two alternative topologies. In fact, if we used a simple probabilistic model where an insertion/deletion event has a fixed probability p, with 0 < p < 0.5, along each branch of the topology, then it is easy to see that the topology SiSj|SkSl would have a higher likelihood than the two alternative topologies. In this situation, we say that the pair (B1, B2) weakly supports the topology SiSj|SkSl. Finally, we call a pair of quartet blocks informative if it supports, strongly or weakly, one of the three quartet topologies for the involved four sequences.
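The distinction between strong support, weak support and uninformative block pairs depends only on how the four distances group into equal values. A minimal sketch, with a hypothetical function name and return format: two groups of size two give strong support, one group of size two plus two singletons gives weak support, and every other configuration (including three equal distances and one deviating distance, which matches neither definition above) is treated here as uninformative.

```python
def quartet_support(distances):
    """Classify the support of a block pair from its four distances.

    distances: dict mapping exactly four sequence ids to their distance.
    Returns (kind, topology) with kind in {"strong", "weak", None}; the
    topology is a sorted pair of sorted id pairs, or None if uninformative.
    """
    if len(distances) != 4:
        raise ValueError("expected exactly four sequences")
    # Group the sequence ids by their distance value.
    groups = {}
    for seq_id, d in distances.items():
        groups.setdefault(d, []).append(seq_id)
    sizes = sorted(len(g) for g in groups.values())
    if sizes == [2, 2]:
        kind = "strong"
        pairs = list(groups.values())
    elif sizes == [1, 1, 2]:
        kind = "weak"
        pair = next(g for g in groups.values() if len(g) == 2)
        rest = [seq_id for seq_id in distances if seq_id not in pair]
        pairs = [pair, rest]
    else:
        return None, None
    topology = tuple(sorted(tuple(sorted(p)) for p in pairs))
    return kind, topology
```

Applied to the distances of Fig 3 and Fig 4, this returns strong support for S2S5|S4S8 and weak support for S1S4|S5S6, respectively.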

For a set of input sequences S1, …, SN, N ≥ 4, we implemented two different ways of inferring phylogenetic trees from quartet-block pairs. With the first method, we calculate the quartet topology for each quartet-block pair that supports one of the three possible quartet topologies. We then calculate a supertree from these topologies. Here, we use the program Quartet MaxCut [42, 45] that we already used in our previous software Multi-SpaM where we inferred quartet topologies from the nucleotides aligned at the don’t-care positions of quartet blocks.

Our second method uses the distances between quartet blocks as input for Maximum-Parsimony [4, 5]. To this end, we generate a character matrix as follows: the rows of the matrix correspond, as usual, to the input sequences, and each informative quartet block pair corresponds to one column. The distances between the two quartet blocks are encoded by characters ‘0’, ‘1’ and ‘2’, such that equal distances in an informative quartet-block pair are encoded by the same character (this encoding is necessary, since some parsimony programs accept only simple characters as input, so we cannot use the distances themselves as characters in the matrix). For sequences not involved in a quartet-block pair, the corresponding entry in the matrix is empty and is considered as ‘missing information’. In Fig 3, for example, the entries for S2, S4, S5, S8 would be ‘0’, ‘1’, ‘0’, ‘1’, respectively; in Fig 4, the entries for S1, S4, S5, S6 would be ‘0’, ‘0’, ‘1’, ‘2’.
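The encoding of distances as characters can be sketched as follows. The function names are hypothetical, and the symbol used for ‘missing information’ is an assumption: it is ‘?’ here, whereas Fig 5 shows dashes, and the symbol expected by a given parsimony program should be checked in its documentation.

```python
def encode_column(distances, all_ids, missing="?"):
    """One character-matrix column for an informative block pair.

    Equal distances get the same symbol '0', '1', '2', assigned in the
    order in which the distances are first seen along `all_ids`.
    """
    codes = {}
    column = []
    for seq_id in all_ids:
        if seq_id not in distances:
            column.append(missing)  # sequence not involved in this block pair
            continue
        d = distances[seq_id]
        if d not in codes:
            codes[d] = str(len(codes))
        column.append(codes[d])
    return column


def character_matrix(block_pairs, all_ids, missing="?"):
    """Rows of the character matrix: sequence id -> string of characters."""
    columns = [encode_column(d, all_ids, missing) for d in block_pairs]
    return {seq_id: "".join(col[row] for col in columns)
            for row, seq_id in enumerate(all_ids)}
```

For the block pairs of Fig 3 and Fig 4 and sequences S1 to S8, the first column reads ‘0’, ‘1’, ‘0’, ‘1’ for S2, S4, S5, S8 and the second column reads ‘0’, ‘0’, ‘1’, ‘2’ for S1, S4, S5, S6, in agreement with the entries given above.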

Fig 5 shows an informative block pair, a character matrix encoding the distances Di for this block pair in the first column and the distances for three additional hypothetical block pairs in columns 2 to 4, together with a tree topology inferred from this matrix with maximum parsimony. Here, we used the program pars from the PHYLIP package [46]. Note that all four block pairs in the matrix strongly support one of the three possible quartet topologies, since a block pair that only weakly supports a topology would not be informative in the sense of the parsimony principle. Therefore, in each of the four block pairs, we have only two different distances, and we need only two characters, ‘0’ and ‘1’.

Fig 5.

(A) Single block pair in a set of 6 sequences and distances Di, (B) character matrix encoding distances Di from four different quartet-block pairs and (C) tree topology, calculated from this matrix with maximum parsimony. Each column in the matrix represents one informative block pair. For the four sequences involved in a block pair, the distances Di are represented by characters ‘0’ and ‘1’, such that equal distances are represented by the same character. The characters themselves are arbitrary; the matrix only encodes whether the distances Di between two blocks are equal or different in the four involved sequences. Dashes in a column represent ‘missing information’, for sequences that are not involved in the respective quartet-block pair. The quartet-block pair in (A) would be represented by the first column of the matrix (B), as we have D1 = D2 ≠ D3 = D6. Thus, for S1 and S2 we have the same (arbitrary) symbol ‘1’, while for S3 and S6 we have the symbol ‘0’. Since S4 and S5 are not involved in this quartet-block pair, they have dashes in the first column, representing ‘missing information’. The matrix represents four quartet-block pairs that strongly support one quartet topology, namely column 1 supporting S1S2|S3S6, column 2 supporting S3S6|S4S5, column 3 supporting S1S6|S4S5 and column 4 supporting S1S4|S2S5.

https://doi.org/10.1371/journal.pcbi.1010303.g005

In order to find suitable quartet-block pairs for the two described approaches, we are using our software Multi-SpaM. This program samples up to 1 million quartet blocks. We use the quartet blocks generated by Multi-SpaM as reference blocks, and for each reference block B1, we search in a window of L nucleotides to the right of B1 for a second block B2 involving the same four sequences (default: L = 500). We use the first block that we find in this window, provided that the involved spaced-word matches are unique within the window. If the pair (B1, B2) supports a topology of the involved four sequences, either strongly or weakly, we use this block pair; otherwise the pair (B1, B2) is discarded.
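The search for a second block can be sketched as below. This is an illustration under several assumptions that are not spelled out in the text: positions are 0-based, the second block must lie completely inside the window, ‘first’ is interpreted as leftmost in the first of the four sequences, and the function name and data layout are hypothetical. It should not be read as the Gap-SpaM implementation.

```python
def find_second_block(seqs, b1_starts, len_p1, pattern2, window=500):
    """Search for a second quartet block to the right of a reference block.

    seqs: dict mapping four sequence ids to their sequences.
    b1_starts: dict mapping the same ids to the 0-based start of B1.
    Returns a dict id -> start of B2, or None if no suitable block is found.
    """
    def words_in_window(seq_id):
        seq = seqs[seq_id]
        lo = b1_starts[seq_id] + len_p1            # first position after B1
        hi = min(lo + window, len(seq)) - len(pattern2) + 1
        table = {}
        for pos in range(lo, hi):
            segment = seq[pos:pos + len(pattern2)]
            word = "".join(c if bit == "1" else "*"
                           for c, bit in zip(segment, pattern2))
            table.setdefault(word, []).append(pos)
        return table

    ids = list(seqs)
    tables = {seq_id: words_in_window(seq_id) for seq_id in ids}
    # Scan candidate words from left to right in the first sequence.
    candidates = sorted(tables[ids[0]].items(), key=lambda kv: kv[1][0])
    for word, _positions in candidates:
        # The word must occur exactly once in the window of every sequence.
        if all(len(tables[seq_id].get(word, [])) == 1 for seq_id in ids):
            return {seq_id: tables[seq_id][word][0] for seq_id in ids}
    return None
```

The returned start positions, together with the start positions of B1 and the length of P1, are all that is needed to compute the four distances for the block pair.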

3 Test results

In order to evaluate the above described approaches to phylogeny reconstruction, we used five sets of genome sequences from AF-Project [47] that are frequently used as benchmark data for alignment-free methods. In addition, we used a set of Wolbachia genomes [48], and sets of mitochondrial genomes from Piroplasmida [49] and from Termites [50]. These data sets are summarized in Table 1; for each data set, the number of genome sequences and their average length is given, together with the average pairwise phylogenetic distance in the set. As distance measure, we used the number of substitutions per position, estimated with our program FSWM. For these sets of genomes, trusted phylogenetic trees are available that can be used as reference trees; these genomes have also been used as benchmark data to evaluate Multi-SpaM [41].

Table 1. Benchmark data sets used to evaluate our approach with number of sequences and average sequence length.

The last column contains the average phylogenetic distance in the respective data set, measured as substitutions per position, estimated by the program FSWM.

https://doi.org/10.1371/journal.pcbi.1010303.t001

Note that our indel-based approach is not meant to be an alternative to existing phylogeny approaches that are based on substitutions. Since we are using a complementary source of information, we are not competing with those existing methods, but we wanted to know if our approach might be useful as an additional input for tree reconstruction. The comparison with alternative alignment-free phylogeny methods in this section is not done to find out which approach performs better—we rather wanted to find out if or to what extent our indel-based approach can provide relevant information for phylogeny inference at all.

3.1 Quartet trees from quartet-block distances

First, we tested how many informative quartet block pairs we could find, i.e. how many of the identified quartet-block pairs would either strongly or weakly support one of the three possible quartet topologies for the corresponding four sequences.

As explained above, for each set of genome sequences, we first sampled up to 1,000,000 quartet blocks with Multi-SpaM [41]; we call these blocks the ‘reference blocks’. For each of these blocks, we then searched for a second block in a window of 500 nt to the right of the reference block. For the second block, we used a pattern P = 1111111, i.e. we generated blocks of exact word matches of length seven. If no second block could be found in the window, the reference block was discarded. Table 2 shows the percentage of informative quartet block pairs, among the quartet block pairs that we used. To evaluate the correctness of the obtained quartet topologies, we compared them to the topologies of the respective quartet sub-trees of the reference trees using the Robinson-Foulds (RF) distance [51] between the two quartet topologies. If the RF distance is zero, the inferred quartet topology is in accordance with the reference tree. To compare the obtained quartet topologies to the full reference trees, we used Sarah Lutteropp’s program Quartet Check that is available through GitHub [52]. We slightly modified the original code to adapt it to our purposes; the modified code used in our study is also available through GitHub [53].
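Checking a quartet topology against a reference tree amounts to extracting the quartet sub-tree induced by the four sequences. The following sketch illustrates the idea; it is not the Quartet Check program. It assumes, for illustration, that the reference tree is given as nested Python tuples with leaf names as strings, and the function names are hypothetical. For two fully resolved quartet topologies, the RF distance is zero exactly if the two topologies are identical.

```python
def leaf_sets(tree, out):
    """Append the leaf set below every node of a nested-tuple tree to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def induced_quartet(tree, quartet):
    """Quartet topology induced by `tree` on four leaves, or None if unresolved.

    The result is a sorted pair of sorted leaf pairs, e.g. (("S1","S2"),("S3","S6")).
    """
    clades = []
    leaf_sets(tree, clades)
    q = frozenset(quartet)
    for clade in clades:
        inside = clade & q
        # A clade that separates two quartet members from the other two
        # determines the induced quartet topology.
        if len(inside) == 2:
            pairs = (tuple(sorted(inside)), tuple(sorted(q - inside)))
            return tuple(sorted(pairs))
    return None


def quartet_is_correct(tree, topology):
    """True if an inferred quartet topology agrees with the reference tree."""
    leaves = [leaf for pair in topology for leaf in pair]
    inferred = tuple(sorted(tuple(sorted(pair)) for pair in topology))
    return induced_quartet(tree, leaves) == inferred
```

The fraction of informative block pairs for which `quartet_is_correct` holds then corresponds to the percentage of correct quartet topologies.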

Table 2. Test results on different sets of genomes.

As benchmark data, we used five sets of genome sequences from AF-Project [47] and sets of genomes from Wolbachia [48], Piroplasmida [49] and Termites [50]. For each data set, we generated up to 1,000,000 pairs of quartet blocks as described in the main text. The table shows the number of informative block pairs (‘# inf bp’), i.e. the number of block pairs for which we obtained either strong or weak support for one of the three possible quartet topologies of the involved sequences. In addition we show the percentage of correct quartet topologies (with respect to the respective reference tree), out of all informative block pairs, as well as the ‘coverage’ by quartet blocks, i.e. the percentage of sequence quartets for which we found at least one informative block pair. Standard deviations are shown in parentheses. For each data set, we obtained 1,000,000 block pairs, except for the three sets of mitochondrial genomes, where it was not possible to find this number of block pairs.

https://doi.org/10.1371/journal.pcbi.1010303.t002

We want to use the quartet trees that we obtain from informative quartet block pairs to generate a tree of the full set of input sequences. Therefore, it is not sufficient for us to have a high percentage of correct quartet trees; we also want to know how many of the sequence quartets are covered by these quartet trees. Generally, the results of super-tree methods depend on the coverage of the used quartet topologies [54, 55]. For a set of N input sequences, there are $\binom{N}{4}$ possible ‘sequence quartets’, i.e. sets of four sequences. Ideally, for every such set, we should have at least one quartet tree, in order to find the correct super-tree. Table 2 reports the quartet coverage, i.e. the percentage of all sequence quartets for which we obtained at least one quartet tree.
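As a small illustration of the coverage measure, the following sketch counts how many of the possible sets of four sequences are hit by at least one quartet tree. The function name and the input format (one tuple of four sequence identifiers per quartet tree) are hypothetical.

```python
from math import comb


def quartet_coverage(n_sequences, quartet_trees):
    """Percentage of all sets of four sequences covered by >= 1 quartet tree.

    `quartet_trees` is an iterable of 4-tuples of sequence identifiers; the
    order of the identifiers and repeated quartets do not matter.
    """
    covered = {frozenset(taxa) for taxa in quartet_trees}
    return 100.0 * len(covered) / comb(n_sequences, 4)


# Five sequences give comb(5, 4) = 5 possible quartets; two of them are covered.
trees = [(0, 1, 2, 3), (3, 2, 1, 0), (0, 1, 2, 4)]
print(quartet_coverage(5, trees))
```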

Note that Multi-SpaM uses randomly sampled quartet blocks, so the program can return different results for the same set of input sequences. We therefore performed 10 program runs on each set of sequences and report the average correctness and coverage of these test runs.

3.2 Full phylogeny reconstruction

Finally, we applied our quartet-block pairs to reconstruct full tree topologies for the above sets of benchmark sequences. Here, we used two different approaches, namely Quartet MaxCut and Maximum-Parsimony, as described above. As is common practice in the field, we evaluated the quality of the reconstructed phylogenies by comparing them to the respective reference trees from AFproject, using the normalized Robinson-Foulds (RF) distances between the inferred and the reference topologies. For a data set with N taxa, the normalized RF distances are obtained from the RF distances by dividing them by 2N − 6, i.e. by the maximum possible distance for trees with N leaves. The results of other alignment-free methods on these data are reported in [41, 47].
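The normalization can be illustrated with a short sketch. It assumes that each unrooted tree is given as a set of its non-trivial splits in a canonical encoding (here: the side of each internal edge that does not contain a fixed taxon); this encoding and the function name are assumptions for illustration only.

```python
def normalized_rf(splits1, splits2, n_taxa):
    """Normalized RF distance between two unrooted trees on the same taxa.

    The RF distance is the number of splits found in exactly one of the two
    trees. A binary tree with N leaves has N - 3 internal edges, so the
    distance is at most 2N - 6.
    """
    if n_taxa < 4:
        raise ValueError("need at least four taxa")
    rf = len(set(splits1) ^ set(splits2))
    return rf / (2 * n_taxa - 6)


# Trees ((A,B),C,(D,E)) and ((A,C),B,(D,E)); splits are encoded as the side
# that does not contain taxon C.
tree1 = {frozenset("AB"), frozenset("DE")}
tree2 = {frozenset("BDE"), frozenset("DE")}
print(normalized_rf(tree1, tree2, 5))
```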

We applied the program Quartet MaxCut first to the quartet topologies derived from the set of all informative quartet-block pairs. As a comparison, we then inferred topologies using only those quartet-block pairs that strongly support one of the three possible topologies for the four involved sequences. The results of these test runs are shown in Table 3. Next, we used the program PAUP* [6] to calculate the most parsimonious tree, using the distances between quartet blocks as characters, as explained above. Here, we used the TBR [56] heuristic. In some cases, this resulted in multiple optimal, i.e. most parsimonious trees. In these cases, we somewhat arbitrarily picked the first of these trees in the PAUP* output. The results of these test runs are also shown in Table 3, together with the results from Multi-SpaM.

Table 3. Average normalized Robinson-Foulds (RF) distances between trees, reconstructed with various alignment-free methods, for the genome sets listed in Table 2.

For each data set, we performed 10 program runs with our indel-based approach. The average over these program runs is shown in the table; standard deviations are shown in parentheses.

https://doi.org/10.1371/journal.pcbi.1010303.t003

3.3 Runtime

The program runtime of our approach on the data sets that we used in our evaluation is shown in Table 4. Test runs were performed on an Intel Xeon Processor E7-4850 with 2.00 GHz (4 processors with 10 cores each/20 threads) and 1 TB RAM. Here, the runtime is shown separately for identifying the reference blocks by running the corresponding sub-routine of Multi-SpaM (first column) and for finding a second block for each reference block (third column). The time to calculate the resulting tree from the distances between these block pairs with parsimony or Quartet MaxCut was negligible. The total runtime of our approach is, thus, roughly the sum of the values in the first and the third column. As a comparison, we report the runtime of the well-known program RAxML [1] that Multi-SpaM uses on the don’t-care positions of the reference blocks (second column). Thus, the total runtime of Multi-SpaM is obtained as the sum of the values of the first two columns.

Table 4. Program runtime for the approach described in this paper, in comparison to Multi-SpaM.

Column reference blocks contains the time to calculate the set of reference quartet blocks with the program Multi-SpaM. Column gaps contains the remaining runtime of our method, i.e. the time to find the respective second block for each reference block. As a comparison, column RAxML contains the runtime for running RAxML on the don’t-care positions of the reference blocks.

https://doi.org/10.1371/journal.pcbi.1010303.t004

4 Discussion

Sequence-based phylogeny reconstruction usually relies on nucleotide or amino-acid residues aligned to each other in multiple alignments. Information about insertions and deletions (indels) is neglected in most studies, despite evidence that this information may be useful for phylogeny inference. There are several difficulties when indels are to be used as phylogenetic signal: it is difficult to derive probabilistic models for insertions and deletions, and there are computational issues if gaps of different lengths are spanning multiple columns in multiple alignments. Finally, gapped regions in sequence alignments are often considered less reliable than un-gapped regions, so the precise number and length of insertions and deletions that have happened may not be easy to infer from multiple alignments.

In recent years, many fast alignment-free methods have been proposed to tackle the ever increasing amount of sequence data. Most of these methods are based on counting or comparing words, and gaps are usually not allowed within these words. It is therefore not straight-forward to adapt standard alignment-free methods to use indels as phylogenetic information.

In the present paper, we proposed to use pairs of blocks of sequence segments, based on our previously proposed alignment-free spaced-word approach. Within such blocks, no gaps are allowed. These blocks can be used to obtain information about possible insertions and deletions between two blocks since the compared sequences have evolved from a common ancestor. To this end, we consider the distances between these blocks in the respective sequences. If these distances are different for two sequences, this indicates that there has been an insertion or deletion since they evolved from their last common ancestor. If the two distances are the same, no indel event needs to be assumed. This information can be used to infer a tree topology for the sequences involved in a pair of blocks. To our knowledge, this is the first attempt to use insertions and deletions as phylogenetic signal in an alignment-free context.
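The comparison of block distances described here can be sketched as follows. The positions are made-up example values, and the function names as well as the choice to measure distances from block start to block start are assumptions; since the blocks are gap-free, only the equality or inequality of the distances matters.

```python
from itertools import combinations


def block_distances(first_block_pos, second_block_pos):
    """Distance between the two blocks in each of the involved sequences."""
    return {seq: second_block_pos[seq] - first_block_pos[seq]
            for seq in first_block_pos}


def indel_pairs(distances):
    """Pairs of sequences whose block distances differ (putative indel)."""
    return [(s, t) for s, t in combinations(sorted(distances), 2)
            if distances[s] != distances[t]]


# Hypothetical start positions of a block pair in four sequences a, b, c, d.
first = {"a": 100, "b": 2040, "c": 57, "d": 900}
second = {"a": 350, "b": 2290, "c": 310, "d": 1153}
dist = block_distances(first, second)
print(dist)
print(indel_pairs(dist))
```

In this example, a and b share one distance and c and d share another, so an indel has to be assumed between every sequence of the first pair and every sequence of the second pair, but not within either pair.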

In this study, we restricted ourselves, for simplicity, to quartet blocks, i.e. to blocks involving four input sequences each; we used pairs of blocks involving the same four sequences. We did not consider the length of hypothetical insertions and deletions, but only asked whether or not such an event has to be assumed between two sequences in the region bounded by the two blocks. Since indels are relatively rare events, compared to substitutions, the maximum parsimony paradigm seems to be suitable in this situation. In the sense of parsimony, however, only those block pairs are informative that, in our notation, strongly support one of three possible quartet topologies, in the sense of the definition that we introduced in this paper. Indeed, if two distances between two blocks are equal, and the third and fourth distance are different from them (and also different from each other), then each of the three possible quartet topologies would require two insertion or deletion events. That is, all three topologies would be equally good from a parsimony viewpoint.
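This parsimony argument can be checked with a small sketch that counts the minimum number of changes on each of the three quartet topologies, using Fitch's rule [5]. Treating each distinct distance as an unordered character state is an assumption of this sketch, as is the reading of ‘strong support’ as two pairs of equal distances with different values; the function names are hypothetical.

```python
TOPOLOGIES = [(("a", "b"), ("c", "d")),
              (("a", "c"), ("b", "d")),
              (("a", "d"), ("b", "c"))]


def fitch_cost(cherry1, cherry2, state):
    """Minimum number of state changes on the quartet tree cherry1|cherry2."""
    cost = 0
    node_sets = []
    for x, y in (cherry1, cherry2):
        if state[x] == state[y]:
            node_sets.append({state[x]})            # no change needed in this cherry
        else:
            node_sets.append({state[x], state[y]})  # one change within the cherry
            cost += 1
    if not (node_sets[0] & node_sets[1]):           # one change on the internal edge
        cost += 1
    return cost


def parsimony_costs(state):
    """Cost of every quartet topology, keyed e.g. by 'ab|cd'."""
    return {"".join(c1) + "|" + "".join(c2): fitch_cost(c1, c2, state)
            for c1, c2 in TOPOLOGIES}


weak = {"a": 250, "b": 250, "c": 253, "d": 261}    # one pair of equal distances
strong = {"a": 250, "b": 250, "c": 253, "d": 253}  # two pairs of equal distances
print(parsimony_costs(weak))
print(parsimony_costs(strong))
```

For the first pattern, all three topologies need two events, whereas for the second pattern the topology ab|cd needs one event and the two alternatives need two.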

Intuitively, however, one may want to use the information from such quartet block pairs that, in our terminology, weakly support one of the possible topologies. It is easy to see that, with a simple probabilistic model under which an insertion between two blocks occurs with a probability p < 0.5, independently of the length of the insertion and the distance between the blocks, a weakly supported topology would have a higher likelihood than the two alternative topologies, although all three topologies are considered equally good from a parsimony point of view. So it might be interesting to apply such a simple probabilistic model to our approach, instead of maximum parsimony. Also, while we restricted ourselves to quartet blocks in this study, it might be worthwhile to use block pairs involving more than four sequences.
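One possible formalization of such a model is sketched below; it is an illustration under assumptions that go beyond the text and is not a method used in this study. We assume that an indel occurs independently on each of the five branches of a quartet tree with probability p, and that every indel produces a distance not seen elsewhere, so that two sequences show the same distance exactly if no indel lies on the path between them. The likelihood of an observed pattern is then obtained by enumerating all 32 indel configurations.

```python
from itertools import combinations, product

BRANCHES = ("a", "b", "c", "d", "inner")  # four leaf branches and the internal edge


def no_indel_on_path(x, y, indel, cherry1):
    """True if no indel lies on the tree path between leaves x and y."""
    path = [x, y]
    if (x in cherry1) != (y in cherry1):   # leaves sit in different cherries
        path.append("inner")
    return not any(indel[branch] for branch in path)


def likelihood(cherry1, observed, p):
    """P(observed equality pattern | quartet tree in which cherry1 is a cherry)."""
    total = 0.0
    for events in product((False, True), repeat=len(BRANCHES)):
        indel = dict(zip(BRANCHES, events))
        consistent = all(
            (observed[x] == observed[y]) == no_indel_on_path(x, y, indel, cherry1)
            for x, y in combinations("abcd", 2)
        )
        if consistent:
            k = sum(events)                # number of branches carrying an indel
            total += p ** k * (1 - p) ** (len(BRANCHES) - k)
    return total


observed = {"a": 250, "b": 250, "c": 253, "d": 261}  # weakly supports ab|cd
for cherry in (("a", "b"), ("a", "c"), ("a", "d")):
    print(cherry, likelihood(cherry, observed, p=0.1))
```

Under these assumptions and with p = 0.1, the weakly supported topology ab|cd receives a higher likelihood than the two alternatives, which are equally likely.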

Our approach has only a few parameters that can be adjusted by the user, essentially concerning the underlying binary pattern P and the threshold that is used to separate random quartet blocks from quartet blocks that represent true homologies. In our study, we used patterns with a length of 110 and with 10 match positions, i.e. with 100 don’t-care positions. Our results in the present and in previous studies indicate that, with our default parameter values, homologous spaced-word matches can be reliably distinguished from random background matches [34, 41]. Adapting these parameter values mainly affects the number of quartet-block pairs. So this mainly comes down to a trade-off between program runtime and the amount of information that one obtains from the block pairs, i.e. the strength of the signal.

Using standard benchmark data, we could show that phylogenetic signal from putative insertions and deletions between quartet blocks is mostly in accordance with the reference phylogenies that we used as standard of truth. Interestingly, the quality of the tree topologies that we constructed from our ‘informative’ pairs of quartet blocks—i.e. from indel information alone—is roughly comparable to the quality of topologies obtained with existing alignment-free methods.

As mentioned above, our approach is not competing with existing phylogeny approaches. In fact, we did not expect to obtain trees with our approach that are of comparable quality to trees obtained with standard methods. Our goal was to find out if information about putative insertions and deletions can provide useful phylogenetic information at all in an alignment-free setting. Since the phylogenetic signal from indels is complementary to the information that is used by those existing approaches, any such information might be useful as additional evidence, no matter if substitution-based or indel-based trees are superior. It is all the more surprising that our rather simplistic approach is already able to infer trees that are roughly comparable to trees obtained with established alignment-free approaches.

There is a certain limitation of our approach if it is used as a stand-alone approach to infer trees without additional information, as we did in our evaluation: to infer trees from putative indels alone, we need a large enough number of ‘informative’ block pairs to obtain quartet trees, or as an input for maximum parsimony. The number of informative block pairs that we can obtain depends, however, on the sequence length and on the degree of similarity among the compared sequences. If sequences are too short or too distantly related, our approach cannot find sufficiently many reference quartet blocks with a score above the employed threshold value.

In our test runs, we used three data sets with a low degree of similarity: the plants data set and the mitochondrial genomes from Piroplasmida and Termites, see Table 1. The plant genomes that we used were large enough to obtain a sufficient number of informative block pairs and, as a result, a phylogenetic tree that is of comparable quality to the trees produced with FSWM and Multi-SpaM. The trees we obtained from the Piroplasmida and Termites mitochondrial DNA were of poor quality, though. Note, however, that even for these two data sets, a rather large fraction of the ‘strongly informative’ block pairs were in accordance with the reference phylogeny, namely 63.78 and 47.98 percent, respectively, see Table 2. This indicates that, while these block pairs are not sufficient to infer the correct tree topology when used as the sole source of information, they may still be useful as additional input information when combined with other approaches to phylogeny reconstruction. Therefore, it seems worthwhile to investigate how our indel-based approach can be used together with other alignment-free approaches.

References

  1. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. pmid:24451623
  2. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59:307–321. pmid:20525638
  3. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. pmid:12912839
  4. Farris JS. Methods for Computing Wagner Trees. Systematic Biology. 1970;19:83–92.
  5. Fitch W. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Zoology. 1971;20:406–416.
  6. Swofford D. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4.0b10. Sinauer Associates, Sunderland, Massachusetts. 2003.
  7. Bininda-Emonds ORP. The evolution of supertrees. Trends in Ecology and Evolution. 2004;19:315–322. pmid:16701277
  8. Zhang C, Rabiee M, Sayyari E, Mirarab S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics. 2018;19(Suppl 6):153. pmid:29745866
  9. Ragan MA. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol. 1992;1:53–58. pmid:1342924
  10. Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991;33:114–124. pmid:1920447
  11. Thorne JL, Kishino H, Felsenstein J. Inching toward reality: An improved likelihood model of sequence evolution. Journal of Molecular Evolution. 1992;34:3–16. pmid:1556741
  12. Holmes IH. Solving the master equation for Indels. BMC Bioinformatics. 2017;18:255. pmid:28494756
  13. Alekseyenko AV, Lee CJ, Suchard MA. Wagner and Dollo: a stochastic duet by composing two parsimonious solos. Systematic Biology. 2008;57:772–784. pmid:18853363
  14. Miklós I, Lunter GA, Holmes I. A “Long Indel” Model For Evolutionary Sequence Alignment. Molecular Biology and Evolution. 2004;21:529–540. pmid:14694074
  15. Simmons MP, Ochoterena H. Gaps as characters in sequence-based phylogenetic analyses. Syst Biol. 2000;49:369–381. pmid:12118412
  16. Müller K. Incorporating information from length-mutational events into phylogenetic analysis. Mol Phylogenet Evol. 2006;38(3):667–676. pmid:16129628
  17. Ogden TH, Rosenberg MS. How should gaps be treated in parsimony? A comparison of approaches using simulation. Mol Phylogenet Evol. 2007;42:817–826. pmid:17011794
  18. Houde P, Braun EL, Narula N, Minjares U, Mirarab S. Phylogenetic Signal of Indels and the Neoavian Radiation. Diversity. 2019;11.
  19. Dessimoz C, Gil M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 2010;11:R37. pmid:20370897
  20. Sims GE, Jun SR, Wu GA, Kim SH. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences. 2009;106:2677–2682. pmid:19188606
  21. Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Research. 2004;32(suppl 2):W45–W47. pmid:15215347
  22. Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B. Fast Alignment-Free sequence comparison using spaced-word frequencies. Bioinformatics. 2014;30:1991–1999. pmid:24700317
  23. Lippert RA, Huang H, Waterman MS. Distributional regimes for the number of k-word matches between two random sequences. Proceedings of the National Academy of Sciences. 2002;99:13980–13989. pmid:12374863
  24. Leimeister CA, Morgenstern B. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014;30:2000–2008. pmid:24828656
  25. Ulitsky I, Burstein D, Tuller T, Chor B. The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology. 2006;13:336–350. pmid:16597244
  26. Morgenstern B, Schöbel S, Leimeister CA. Phylogeny reconstruction based on the length distribution of k-mismatch common substrings. Algorithms for Molecular Biology. 2017;12:27. pmid:29238399
  27. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution. 1987;4:406–425. pmid:3447015
  28. Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution. 1997;14:685–695. pmid:9254330
  29. Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioinformatics. 2003;19:513–523. pmid:12611807
  30. Haubold B. Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics. 2014;15:407–418. pmid:24291823
  31. Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, et al. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Briefings in Bioinformatics. 2019;22:426–435. pmid:28673025
  32. Yi H, Jin L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Research. 2013;41:e75. pmid:23335788
  33. Haubold B, Klötzl F, Pfaffelhuber P. andi: Fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics. 2015;31:1169–1175. pmid:25504847
  34. Leimeister CA, Sohrabi-Jahromi S, Morgenstern B. Fast and Accurate Phylogeny Reconstruction using Filtered Spaced-Word Matches. Bioinformatics. 2017;33:971–979. pmid:28073754
  35. Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, et al. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Research. 2014;42:W7–W11. pmid:24829447
  36. Morgenstern B, Zhu B, Horwege S, Leimeister CA. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms for Molecular Biology. 2015;10:5. pmid:25685176
  37. Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B. Prot-SpaM: Fast alignment-free phylogeny reconstruction based on whole-proteome sequences. GigaScience. 2019;8:giy148. pmid:30535314
  38. Lau AK, Dörrer S, Leimeister CA, Bleidorn C, Morgenstern B. Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage. BMC Bioinformatics. 2019;20:638. pmid:31842735
  39. Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLOS ONE. 2020;15:e0228070. pmid:32040534
  40. Morgenstern B. Sequence Comparison without Alignment: The SpaM approaches. In: Katoh K, editor. Multiple Sequence Alignment. Methods in Molecular Biology. Springer; 2020. p. 121–134.
  41. Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. Multi-SpaM: a Maximum-Likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genomics and Bioinformatics. 2020;2:lqz013. pmid:33575565
  42. Snir S, Rao S. Quartet MaxCut: A fast algorithm for amalgamating quartet trees. Molecular Phylogenetics and Evolution. 2012;62:1–8. pmid:21762785
  43. Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge, UK: Cambridge University Press; 1997.
  44. Chiaromonte F, Yap VB, Miller W. Scoring Pairwise Genomic Sequence Alignments. In: Altman RB, Dunker AK, Hunter L, Klein TE, editors. Pacific Symposium on Biocomputing. Lihue, Hawaii; 2002. p. 115–126.
  45. Snir S, Rao S. Quartets MaxCut: A Divide and Conquer Quartets Algorithm. IEEE/ACM Trans Comput Biology Bioinform. 2010;7:704–718. pmid:21030737
  46. Felsenstein J. PHYLIP—Phylogeny Inference Package (Version 3.2). Cladistics. 1989;5:164–166.
  47. Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, et al. Benchmarking of alignment-free sequence comparison methods. Genome Biology. 2019;20:144. pmid:31345254
  48. Gerth M, Bleidorn C. Comparative genomics provides a timeframe for Wolbachia evolution and exposes a recent biotin synthesis operon transfer. Nature Microbiology. 2017;2:16241.
  49. Schreeg ME, Marr HS, Tarigo JL, Cohn LA, Bird DM, Scholl EH, et al. Mitochondrial Genome Sequences and Structures Aid in the Resolution of Piroplasmida phylogeny. PLOS ONE. 2016;11:e0165702. pmid:27832128
  50. Cameron SL, Lo N, Bourguignon T, Svenson GJ, Evans TA. A mitochondrial genome phylogeny of termites (Blattodea: Termitoidae): Robust support for interfamilial relationships and molecular synapomorphies define major clades. Molecular Phylogenetics and Evolution. 2012;65:163–173. pmid:22683563
  51. Robinson DF, Foulds L. Comparison of phylogenetic trees. Mathematical Biosciences. 1981;53:131–147.
  52. Lutteropp S. Quartet Check; 2021. https://github.com/lutteropp/quartet_check.
  53. Birth N. Single Quartet Check; 2021. https://github.com/njbirth/single_quartet_check.
  54. Avni E, Yona Z, Cohen R, Snir S. The Performance of Two Supertree Schemes Compared Using Synthetic and Real Data Quartet Input. J Mol Evol. 2018;86:150–165. pmid:29460038
  55. Swenson MS, Suri R, Linder CR, Warnow T. An experimental study of Quartets MaxCut and other supertree methods. Algorithms Mol Biol. 2011;6:7. pmid:21504600
  56. Swofford DL, Olsen GJ. Phylogeny reconstruction. In: Hillis DM, Moritz C, editors. Molecular Systematics. Sinauer Associates; 1990. p. 407–511.