Insertions and deletions as phylogenetic signal in an alignment-free context

Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.

Author summary
Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods. Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather a possible additional source of information for tree inference.
This is a PLOS Computational Biology Methods paper.

Introduction
Most phylogenetic studies are based on multiple sequence alignments (MSAs), either of partial or complete genomes or of individual genes or proteins. If MSAs of multiple genes or proteins are used, there are two possibilities to infer a phylogenetic tree: (1) the alignments can be concatenated to form a so-called superalignment or supermatrix. Tree building methods such as Maximum-Likelihood [1,2], Bayesian Approaches [3] or Maximum-Parsimony [4][5][6] can then be applied to these superalignments. (2) One can calculate a separate tree for each gene or protein family and then use a supertree approach [7] to amalgamate these different trees into one final tree, with methods such as ASTRAL [8] or MRP [9]. Multiple sequence alignments usually contain gaps representing insertions or deletions (indels) that are assumed to have happened since the aligned sequences evolved from their last common ancestor. Gaps, however, are usually not used for phylogeny reconstruction. Most of the above tree-reconstruction methods are based on substitution models for nucleotide or amino-acid residues. Here, alignment columns with gaps are either completely ignored, or gaps are treated as 'missing information', for example in the frequently used tool PAUP* [6]. Some models have been proposed that can include gaps in a Maximum-Likelihood setting, such as TKF91 [10] and TKF92 [11], see also [12][13][14]. Unfortunately, these models do not scale well to genomic data. Thus, indels are rarely used as a source of information for the phylogenetic analysis.
In those studies that actually make use of indels, this additional information is usually encoded in some simple manner. The most straightforward way of doing this is to treat the gap character as a fifth character for DNA comparison, or as a 21st character in protein comparison, respectively. This means that the lengths of gaps are not explicitly considered, so a gap of length ℓ > 1 is considered to represent ℓ independent insertion or deletion events. Some more issues with this approach are discussed in [15]; these authors introduced the 'simple encoding' of indel data as an alternative. For every indel in the multiple sequence alignment, an additional column is appended. This column contains a present/absent encoding for an indel event which is defined as a gap with given start and end positions. If a longer gap is fully contained in a shorter gap in another sequence, it is considered as missing information. Such a simple binary encoding is an effective way of using the length of the indels to gain additional information and can be used in some maximum-parsimony framework. A disadvantage of these approaches is their relatively long runtime. The above authors also proposed a more complex encoding of gaps [15] which they further refined in a subsequent paper [16]. The commonly used approaches to encode gaps for phylogeny reconstruction are compared in [17].
The 'simple encoding' of gaps has been used in many studies; one recent study obtained additional information on the phylogeny of Neoaves which was hypothesized to have a 'hard polytomy' [18]. Despite such successes, indel information is still largely ignored in phylogeny reconstruction. Oftentimes, it is unclear whether using indels is worth the large overhead and increased runtime. On the other hand, it has also been shown that gaps can contain substantial phylogenetic information [19].
All of the above mentioned approaches to use indel information for phylogeny reconstruction require MSAs of the compared sequences. Nowadays, the amount of the available molecular data is rapidly increasing, due to the progress in next-generation sequencing technologies. If the size of the analyzed sequences increases, calculating multiple sequence alignments quickly becomes too time-consuming. Thus, in order to provide faster and more convenient methods for phylogenetic reconstruction, many alignment-free approaches have been proposed in recent years. Most of these approaches calculate pairwise distances between sequences, based on sequence features such as k-mer frequencies [20][21][22] or the number [23] or length [24][25][26] of word matches. Distance methods such as Neighbor-Joining [27] or BIONJ [28] can then reconstruct phylogenetic trees from the calculated distances. For an overview, the reader is referred to recent reviews of alignment-free methods [29][30][31].
Some recently proposed alignment-free methods use inexact word matches between pairs of sequences [32][33][34], where mismatches are allowed to some degree. Such word matches can be considered as pairwise, gap-free 'mini-alignments'. So, strictly speaking, these methods are not 'alignment-free'. In the literature, they are still called 'alignment-free', as they circumvent the need to calculate full sequence alignments of the compared sequences. The advantage of such 'mini-alignments' is that inexact word matches can be found almost as efficiently as exact word matches, by adapting standard word-matching algorithms.
A number of these methods use so-called spaced words [22,35,36]. A spaced word is a word that, in addition to nucleotide or amino-acid symbols, contains wildcard characters at certain positions that are specified by a pre-defined binary pattern P representing 'match positions' and 'don't-care positions', see Fig 1 for an example. If the same spaced word occurs in two different sequences, this is called a Spaced-word Match or SpaM, for short. One way of using spaced-word matches (or other types of inexact word matches) in alignment-free sequence comparison is to use them as a proxy for full alignments, to estimate the number of mismatches per position in the (unknown) full sequence alignment. This idea has been implemented in the software Filtered Spaced Word Matches (FSWM) [34]; it has also been applied to protein sequences [37], and to unassembled reads [38].
In such approaches, it is crucial to use only those SpaMs that align homologous segments of the compared sequences and to discard random SpaMs. FSWM and related programs filter out non-homologous SpaMs by comparing the residues aligned to each other at the don't-care positions of the SpaMs. As shown in Fig 1, a score can be calculated based on these residue pairs, and all SpaMs with a score below a certain threshold are discarded. As we have shown in previous papers, this approach can reliably distinguish between homologous and background SpaMs [34]. Other approaches have been proposed recently, that use the number of SpaMs to estimate phylogenetic distances between DNA sequences [36,39], see [40] for a review of the various SpaM-based methods.
Multi-SpaM [41] is a recent extension of the SpaM approach to multiple sequence comparison. For a set of four or more input sequences, and for a binary pattern P, Multi-SpaM finds occurrences of the same spaced word with respect to P in four different input sequences. Such a spaced-word match is called a quartet P-block, or quartet block, for short. A quartet block, thus, consists of four occurrences of the same spaced word, with respect to a specific pattern P, as in Fig 2. For each such block, Multi-SpaM identifies an optimal quartet tree topology based on the nucleotides aligned to each other at the don't-care positions of P, using the program RAxML [1]. Finally, the quartet trees calculated in this way are used to find a supertree of the full set of input sequences. To this end, Multi-SpaM uses the program Quartet MaxCut [42].
In the present paper, we use pairs of quartet blocks involving the same four sequences. We consider the distances between two blocks in the four sequences, to obtain hints about potential insertions and deletions that may have occurred between two quartet blocks. If these distances are different for two of the sequences, this would indicate that an insertion or deletion has happened since these sequences evolved from their last common ancestor. The distances between two quartet blocks can therefore support one of three possible quartet topologies for the four involved sequences. If, for example, in a pair of quartet blocks involving sequences S_i, S_j, S_k, S_l, the distance between these blocks is equal in S_i and S_j as well as in S_k and S_l, but the distance in S_i and S_j is different from the one in S_k and S_l, this would support a quartet tree where S_i and S_j are neighbours, as well as S_k and S_l; an example is shown in Fig 3. To evaluate the phylogenetic signal that is contained in such pairs of quartet blocks, we first evaluate the inferred quartet topologies directly, by comparing them to trusted reference trees. Next, we use two different methods to infer a phylogenetic tree for the full set of input sequences, based on the distances between quartet blocks. (A) We calculate supertrees based on the inferred quartet trees using the software Quartet MaxCut. (B) We use distances between pairs of blocks as characters in a maximum-parsimony setting, to find a tree that minimizes the number of insertions and deletions that have to be assumed, given the different distances between the quartet blocks. We evaluate these approaches on data sets that are commonly used as benchmark data in alignment-free sequence comparison. Our evaluation shows that the majority of the inferred quartet trees is correct and should therefore be useful additional information for phylogeny reconstruction. Moreover, the quality of the trees that we can infer from our quartet block pairs alone is roughly comparable to the quality of trees obtained with existing alignment-free methods.
Fig 1. The occurrence of the same spaced word in two different sequences is called a Spaced-word Match (SpaM). A SpaM w.r.t. P is, thus, a local gap-free alignment where matching residues are aligned at the match positions of P, while mismatches are possible at the don't-care positions. In the above toy example, we find at the don't-care positions one mismatch (A-G) and one match (T-T). A score can be calculated for each SpaM based on the residues aligned to each other at the don't-care positions. If the number of don't-care positions in the underlying pattern P is sufficiently large, 'homologous' SpaMs can be reliably distinguished from background by their scores [34]. https://doi.org/10.1371/journal.pcbi.1010303.g001
The goal of our study is to show that insertions and deletions can be used as phylogenetic signal in an alignment-free context. Note that the information from putative indels is complementary to the information used in standard phylogeny approaches where aligned residues are used to infer substitutions that may have happened in the evolution of the sequences. Consequently, our approach is not competing with these existing methods but may be used as additional evidence that might support or call into question phylogenies inferred by more traditional approaches.

Spaced words, quartet blocks and distances between quartet blocks
We are using standard notation from stringology as defined, for example, in [43]. For a sequence S over some alphabet, S(i) denotes the i-th symbol of S. In order to investigate the information that can be obtained from putative indels in an alignment-free context, we use the P-blocks generated by the program Multi-SpaM [41]. At the start of every run, a binary pattern P ∈ {0, 1}^ℓ is specified for some integer ℓ. Here, a "1" in P denotes a match position, a "0" stands for a don't-care position. The number of match positions in P is called its weight and is denoted by w. By default, we are using parameter values ℓ = 110 and w = 10, so by default the pattern P has 100 don't-care positions.
A spaced word W with respect to a pattern P is a word over the alphabet {A, C, G, T} ∪ {*} of the same length as P, and with W(i) = * if and only if i is a don't-care position of P, i.e. if P(i) = 0. If S is a sequence of length N over the nucleotide alphabet {A, C, G, T}, and W is a spaced word, we say that W occurs in S at some position i ∈ {1, ..., N − ℓ + 1}, if S(i + j − 1) = W(j) for every match position j in P. For two sequences S and S' and positions i and i' in S and S', respectively, we say that there is a spaced-word match (SpaM) between S and S' at (i, i'), if the same spaced word W occurs at i in S and at i' in S'. A SpaM can be considered as a local pairwise alignment without gaps. Given a nucleotide substitution matrix, the score of a spaced-word match is defined as the sum of the substitution scores of the nucleotides aligned to each other at the don't-care positions of the underlying pattern P. In FSWM and Multi-SpaM, we are using a substitution matrix described in [44]. In FSWM, only SpaMs with positive scores are used. It has been shown that this SpaM-filtering step can effectively eliminate most random spaced-word matches [34].
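The definitions above can be illustrated with a minimal Python sketch. It is not the FSWM or Multi-SpaM implementation: positions are 0-based here (the text uses 1-based positions), the function names are hypothetical, and the substitution matrix of [44] is replaced by an assumed toy scoring of +1 for a match and -1 for a mismatch.

```python
from typing import Optional


def spaced_word(seq: str, pos: int, pattern: str) -> Optional[str]:
    """Spaced word of `seq` at 0-based position `pos` w.r.t. `pattern`.

    Nucleotides are kept at match positions ('1') and replaced by the
    wildcard '*' at don't-care positions ('0'). Returns None if the
    pattern does not fit into the sequence at this position.
    """
    if pos < 0 or pos + len(pattern) > len(seq):
        return None
    return "".join(
        seq[pos + j] if p == "1" else "*" for j, p in enumerate(pattern)
    )


def spam_score(s1: str, i1: int, s2: str, i2: int, pattern: str,
               match: int = 1, mismatch: int = -1) -> Optional[int]:
    """Score of the spaced-word match between s1 at i1 and s2 at i2.

    Returns None if the two spaced words differ, i.e. if there is no SpaM.
    The +1/-1 scoring is an assumption made for illustration only.
    """
    w1 = spaced_word(s1, i1, pattern)
    w2 = spaced_word(s2, i2, pattern)
    if w1 is None or w1 != w2:
        return None
    score = 0
    for j, p in enumerate(pattern):
        if p == "0":  # only don't-care positions contribute to the score
            score += match if s1[i1 + j] == s2[i2 + j] else mismatch
    return score
```

With a real substitution matrix, the filtering rule described above would keep a SpaM only if this score is positive.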
For a set of ≥ 4 input sequences and a binary pattern P of length ℓ, the program Multi-SpaM is based on quartet (P)-blocks, where a quartet block is defined as four occurrences of some spaced word W in four different sequences, see Fig 2 for an example. A quartet block B can, thus, be considered as a local gap-free four-way alignment, aligning length-ℓ segments of four sequences; we say that B 'involves' these four sequences. To exclude spurious random quartet blocks, Multi-SpaM removes quartet blocks with a low degree of similarity between the aligned segments. Technically, a quartet block is required to contain one occurrence of the spaced word W, such that the other three occurrences of W have positive similarity scores with this first occurrence. For a given nucleotide substitution matrix, the similarity score of two spaced words (with respect to the same pattern P) is defined as the sum of the substitution scores of the nucleotides aligned to each other at the don't-care positions of P.
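The filtering criterion for quartet blocks can be sketched as follows. This is an illustration under assumptions, not the Multi-SpaM code: the four segments are assumed to already share the same spaced word, the function names are hypothetical, and a toy scoring (+1 match, -1 mismatch) stands in for the substitution matrix.

```python
def dont_care_score(seg1, seg2, pattern, match=1, mismatch=-1):
    """Sum of toy substitution scores over the don't-care positions ('0')."""
    return sum(
        (match if a == b else mismatch)
        for a, b, p in zip(seg1, seg2, pattern)
        if p == "0"
    )


def is_valid_quartet_block(segments, pattern):
    """True if some segment scores positively against the other three.

    `segments` holds the four length-l segments of the candidate block.
    """
    for i, center in enumerate(segments):
        others = [s for j, s in enumerate(segments) if j != i]
        if all(dont_care_score(center, o, pattern) > 0 for o in others):
            return True
    return False
```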

Phylogeny inference using distances between quartet blocks
In this paper, we are considering pairs of quartet blocks involving the same four sequences, and we are using the distances between the two blocks in these sequences as phylogenetic signal. The first block in a block pair is called the reference block. To find reference blocks, we use the program Multi-SpaM. This program identifies quartet blocks with respect to a binary pattern P_1, as explained in the first section of this paper. Here, a score is calculated for each quartet block, based on the don't-care positions, to exclude random spaced-word matches, as detailed above. For each reference block with a score above the threshold, our new approach then searches for a second quartet block, involving the same four sequences, possibly with a different pattern P_2, and within a window of L nucleotides in each sequence, to the right of the reference block. By default, we are using a window size of L = 500. For the second block, we do not calculate a score, since the probability of finding a quartet block within such a window by chance is very small.
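The window search can be sketched in Python as shown here. Several details are assumptions for illustration and are not taken from the Gap-SpaM source: the second pattern consists of match positions only (an exact word of length k), positions are 0-based, "first block" is read as the leftmost word in the window of the first sequence, and a word is accepted only if it occurs exactly once in each of the four windows. The function name is hypothetical.

```python
from collections import defaultdict


def find_second_block(seqs, ref_starts, ref_len, window=500, k=7):
    """Search for an exact k-mer block to the right of a reference block.

    seqs:       the four sequences involved in the reference block
    ref_starts: 0-based start of the reference block in each sequence
    ref_len:    length of the reference pattern
    Returns the four 0-based start positions of the second block, or None.
    """
    windows = []
    for seq, start in zip(seqs, ref_starts):
        lo = start + ref_len  # first position after the reference block
        windows.append((lo, seq[lo:lo + window]))

    # Index all k-mers of every window by their offsets within the window.
    indexes = []
    for _, win in windows:
        idx = defaultdict(list)
        for off in range(len(win) - k + 1):
            idx[win[off:off + k]].append(off)
        indexes.append(idx)

    # Scan the first window from left to right; accept the first word that
    # occurs exactly once in each of the four windows.
    _, first_win = windows[0]
    for off in range(len(first_win) - k + 1):
        word = first_win[off:off + k]
        if all(len(idx.get(word, [])) == 1 for idx in indexes):
            return [lo + idx[word][0] for (lo, _), idx in zip(windows, indexes)]
    return None
```

Because each window starts directly after the reference block, the offset of the accepted word within a window is the length of the segment between the two blocks in that sequence.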
Let us consider two quartet blocks, a reference block B_1 and a corresponding second block B_2 as described above, with respect to patterns P_1 and P_2, respectively, involving the same four sequences S_i, S_j, S_k, S_l. By definition, B_1 is strictly to the left of B_2, in the sense that the last position of B_1 is smaller than the first position of B_2 in all four sequences. Next, let D_ι be the distance between B_1 and B_2 in sequence S_ι, ι ∈ {i, j, k, l}. More formally, if in sequence S_ι block B_1 starts at position k_1 and block B_2 starts at position k_2, then we define D_ι to be k_2 − k_1 − ℓ_1, where ℓ_1 is the length of the pattern P_1. In other words, D_ι is the length of the segment between B_1 and B_2 in S_ι, see Figs 3 and 4 for examples. As explained, we can assume that the blocks B_1 and B_2 are representing true homologies, i.e. for each of them the respective segments go back to a common ancestor in evolution. Then, if we find for two sequences, say S_i and S_j, that their distances D_i and D_j between B_1 and B_2 are different from each other, this would imply that at least one insertion or deletion must have happened since S_i and S_j have evolved from their last common ancestor. If, by contrast, D_i = D_j holds, no such insertion or deletion needs to be assumed.
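As a small worked example of the distance definition, the following Python sketch computes k_2 − k_1 − ℓ_1 for each of the four sequences. The function name and the example coordinates are hypothetical.

```python
def block_distances(ref_starts, second_starts, ref_len):
    """Length of the segment between reference and second block per sequence.

    ref_starts[s] and second_starts[s] are the start positions of the two
    blocks in sequence s; ref_len is the length of the reference pattern.
    """
    distances = []
    for k1, k2 in zip(ref_starts, second_starts):
        d = k2 - k1 - ref_len
        if d < 0:
            # The second block must lie strictly to the right of the first.
            raise ValueError("second block overlaps or precedes reference block")
        distances.append(d)
    return distances


# Hypothetical coordinates with the default pattern length of 110.
example = block_distances([100, 200, 50, 10], [250, 350, 203, 163], 110)
```

In this example the first two sequences share one distance and the last two share another, so at least one indel has to be assumed between the two groups.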
There are three possible fully resolved (i.e. binary) quartet topologies for the four sequences S_i, ..., S_l that we denote by S_i S_j | S_k S_l etc. In the sense of the parsimony paradigm, we can consider the distance between two blocks as a character and D_ι as the corresponding character state associated with sequence S_ι. If two distances, say D_i and D_j, are equal, and the other two distances, D_k and D_l, are also equal to each other, but different from D_i and D_j, this would support the tree topology S_i S_j | S_k S_l: with this topology, one would have to assume only one insertion or deletion to explain the character states, while for S_i S_k | S_j S_l or S_i S_l | S_j S_k, two insertions or deletions would have to be assumed. In this situation, i.e. if we have D_i = D_j ≠ D_k = D_l, we say that the pair (B_1, B_2) strongly supports topology S_i S_j | S_k S_l.
Next, we consider the situation where two of the distances are equal, say D_i = D_j, while D_k and D_l are different from each other, and also different from D_i and D_j. From a parsimony point of view, all three topologies would be equally good in this case, since each of them would require two insertions or deletions. It may still seem more plausible, however, to prefer the topology S_i S_j | S_k S_l over the two alternative topologies. In fact, if we use a simple probabilistic model where an insertion/deletion event has a fixed probability p, with 0 < p < 0.5, along each branch of the topology, then it is easy to see that the topology S_i S_j | S_k S_l would have a higher likelihood than the two alternative topologies. In this situation, we say that the pair (B_1, B_2) weakly supports the topology S_i S_j | S_k S_l. Finally, we call a pair of quartet blocks informative if it supports, strongly or weakly, one of the three quartet topologies for the involved four sequences.
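The likelihood argument for weak support can be checked numerically with a short Python sketch. The text does not spell out the details of the probabilistic model, so the version here rests on two assumptions made for illustration: each of the five branches of a quartet tree independently carries an indel event with probability p, and every event produces a distance that is seen nowhere else, so that two sequences have equal distances exactly when no event lies on the path between them. The function name is hypothetical, and no leaf may be called "internal".

```python
from itertools import combinations, product


def topology_likelihood(topology, distances, p):
    """Probability of the observed equal/unequal pattern under one topology.

    topology:  ((a, b), (c, d)), the two cherries of the quartet tree
    distances: dict mapping each of the four leaves to its block distance
    p:         assumed probability of an indel event on any single branch
    """
    (a, b), (c, d) = topology
    side = {a: 0, b: 0, c: 1, d: 1}
    branches = [a, b, c, d, "internal"]
    total = 0.0
    # Enumerate all 2^5 ways of placing events on the five branches.
    for events in product([False, True], repeat=len(branches)):
        hit = dict(zip(branches, events))
        consistent = True
        for x, y in combinations([a, b, c, d], 2):
            path = [x, y] if side[x] == side[y] else [x, y, "internal"]
            changed = any(hit[br] for br in path)
            # Leaves must differ exactly when an event separates them.
            if changed == (distances[x] == distances[y]):
                consistent = False
                break
        if consistent:
            n = sum(events)
            total += p ** n * (1 - p) ** (len(branches) - n)
    return total
```

Under these assumptions, for D_i = D_j with D_k and D_l distinct from everything else, the enumeration gives p^2 (1 − p)^2 (3 − 2p) for S_i S_j | S_k S_l and p^2 (1 − p)^3 for each of the two alternatives, so the first topology has the higher likelihood.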
For a set of input sequences S_1, ..., S_N, N ≥ 4, we implemented two different ways of inferring phylogenetic trees from quartet-block pairs. With the first method, we calculate the quartet topology for each quartet-block pair that supports one of the three possible quartet topologies. We then calculate a supertree from these topologies. Here, we use the program Quartet MaxCut [42,45] that we already used in our previous software Multi-SpaM where we inferred quartet topologies from the nucleotides aligned at the don't-care positions of quartet blocks.
Our second method uses the distances between quartet blocks as input for Maximum-Parsimony [4,5]. To this end, we generate a character matrix as follows: the rows of the matrix correspond, as usual, to the input sequences, and each informative quartet block pair corresponds to one column. The distances between the two quartet blocks are encoded by characters '0', '1' and '2', such that equal distances in an informative quartet-block pair are encoded by the same character (this encoding is necessary, since some parsimony programs accept only simple characters as input, so we cannot use the distances themselves as characters in the matrix). For sequences not involved in a quartet-block pair, the corresponding entry in the matrix is empty and is considered as 'missing information'. In Fig 3, [...] parsimony. Here, we used the program pars from the PHYLIP package [46]. Note that all four block pairs in the matrix strongly support one of the three possible quartet topologies, since a block pair that only weakly supports a topology would not be informative in the sense of the parsimony principle. Therefore, in each of the four block pairs, we have only two different distances, and we need only two characters, '0' and '1'.
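The construction of the character matrix can be sketched in a few lines of Python. The function name, the dictionary-based input and the use of '-' for missing entries are assumptions made for illustration; the symbols are assigned in order of first appearance, which is one arbitrary choice among several.

```python
def character_matrix(n_seqs, block_pairs, missing="-"):
    """Encode block-pair distances as one matrix column per block pair.

    block_pairs: list of dicts, each mapping the (0-based) indices of the
                 four involved sequences to their block distances
    Returns one string per input sequence (one row of the matrix).
    """
    rows = [[] for _ in range(n_seqs)]
    for distances in block_pairs:
        # Equal distances get the same symbol: '0', '1', '2', ...
        symbols = {}
        for s in sorted(distances):
            symbols.setdefault(distances[s], str(len(symbols)))
        for s in range(n_seqs):
            rows[s].append(symbols[distances[s]] if s in distances else missing)
    return ["".join(row) for row in rows]


# Two hypothetical block pairs on six sequences.
matrix = character_matrix(6, [{0: 12, 1: 12, 2: 30, 5: 30},
                              {2: 5, 5: 5, 3: 9, 4: 9}])
```

The resulting rows would still have to be written in the input format of the chosen parsimony program, including its own symbol for missing data.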
In order to find suitable quartet-block pairs for the two described approaches, we are using our software Multi-SpaM. This program samples up to 1 million quartet blocks. We use the quartet blocks generated by Multi-SpaM as reference blocks, and for each reference block B_1, we search in a window of L nucleotides to the right of B_1 for a second block B_2 involving the same four sequences (default: L = 500). We use the first block that we find in this window, provided that the involved spaced-word matches are unique within the window. If the pair (B_1, B_2) supports a topology of the involved four sequences, either strongly or weakly, we use this block pair; otherwise the pair (B_1, B_2) is discarded.
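The decision whether a block pair is kept or discarded follows directly from the four distances and can be sketched as follows. The function name and the representation of a topology as a pair of frozensets are hypothetical choices for illustration.

```python
def supported_topology(distances):
    """Classify a block pair by the topology it supports, if any.

    distances: dict mapping the four sequence names to their block distances
    Returns (topology, 'strong' | 'weak'), or None for uninformative pairs.
    The topology is a frozenset holding the two cherries as frozensets.
    """
    if len(distances) != 4:
        raise ValueError("expected distances for exactly four sequences")
    groups = {}
    for name, d in distances.items():
        groups.setdefault(d, set()).add(name)
    sizes = sorted(len(g) for g in groups.values())
    if sizes == [2, 2]:        # D_i = D_j != D_k = D_l
        strength = "strong"
    elif sizes == [1, 1, 2]:   # D_i = D_j, D_k and D_l distinct from all
        strength = "weak"
    else:                      # all equal, three equal, or all different
        return None
    pair = next(g for g in groups.values() if len(g) == 2)
    rest = set(distances) - pair
    return frozenset({frozenset(pair), frozenset(rest)}), strength
```

A caller would keep the block pair whenever the result is not None, and keep only the 'strong' cases when building a parsimony matrix.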

Test results
In order to evaluate the above described approaches to phylogeny reconstruction, we used five sets of genome sequences from AF-Project [47] that are frequently used as benchmark data for alignment-free methods. In addition, we used a set of Wolbachia genomes [48], and sets of mitochondrial genomes from Piroplasmida [49] and from Termites [50]. These data sets are summarized in Table 1; for each data set, the number of genome sequences and their average length is given, together with the average pairwise phylogenetic distance in the set. As distance measure, we used the number of substitutions per position, estimated with our program FSWM. For these sets of genomes, trusted phylogenetic trees are available that can be used as reference trees; these genomes have also been used as benchmark data to evaluate Multi-SpaM [41].
Fig 5. [...] Thus, for S_1 and S_2 we have the same (arbitrary) symbol '1', while for S_3 and S_6 we have the symbol '0'. Since S_4 and S_5 are not involved in this quartet-block pair, they have dashes in the first column, representing 'missing information'. The matrix represents four quartet-block pairs that strongly support one quartet topology, namely column 1 supporting S_1 S_2 | S_3 S_6, column 2 supporting S_3 S_6 | S_4 S_5, column 3 supporting S_1 S_6 | S_4 S_5 and column 4 supporting S_1 S_4 | S_2 S_5. https://doi.org/10.1371/journal.pcbi.1010303.g005
Note that our indel-based approach is not meant to be an alternative to existing phylogeny approaches that are based on substitutions. Since we are using a complementary source of information, we are not competing with those existing methods, but we wanted to know if our approach might be useful as an additional input for tree reconstruction. The comparison with alternative alignment-free phylogeny methods in this section is not done to find out which approach performs better-we rather wanted to find out if or to what extent our indel-based approach can provide relevant information for phylogeny inference at all.

Quartet trees from quartet-block distances
First, we tested how many informative quartet block pairs we could find, i.e. how many of the identified quartet-block pairs would either strongly or weakly support one of the three possible quartet topologies for the corresponding four sequences.
As explained above, for each set of genome sequences, we first sampled up to 1,000,000 quartet blocks with Multi-SpaM [41]; we call these blocks the 'reference blocks'. For each of these blocks, we then searched for a second block in a window of 500 nt to the right of the reference block. For the second block, we used a pattern P = 1111111, i.e. we generated blocks of exact word matches of length seven. If no second block could be found in the window, the reference block was discarded. Table 2 shows the percentage of informative quartet block pairs, among the quartet block pairs that we used. To evaluate the correctness of the obtained quartet topologies, we compared them to the topologies of the respective quartet sub-trees of the reference trees using the Robinson-Foulds (RF) distance [51] between the two quartet topologies. If the RF distance is zero, the inferred quartet topology is in accordance with the reference tree. To compare the obtained quartet topologies to the full reference trees, we used Sarah Lutteropp's program Quartet Check that is available through GitHub [52]. We slightly modified the original code to adapt it to our purposes; the modified code used in our study is also available through GitHub [53].
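The comparison of an inferred quartet topology with a reference tree can be illustrated with a small Python sketch. It is not the Quartet Check program: the reference tree is given as nested tuples of leaf names rather than in Newick format, the function names are hypothetical, and a topology is represented as a frozenset of its two cherries.

```python
def clades(tree):
    """Return (leaf set, list of all clade leaf sets) of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def induced_quartet(tree, quartet):
    """Quartet topology that `tree` induces on four leaves (None if unresolved)."""
    quartet = frozenset(quartet)
    _, all_clades = clades(tree)
    for clade in all_clades:
        inside = clade & quartet
        if len(inside) == 2:  # this clade separates two leaves from the other two
            return frozenset({inside, quartet - inside})
    return None


def quartet_rf(topology, tree):
    """RF distance between a resolved quartet topology and the reference tree."""
    ref = induced_quartet(tree, set().union(*topology))
    if ref is None:
        return 1  # reference is unresolved for these four leaves
    return 0 if topology == ref else 2
```

An RF distance of zero means that the inferred quartet topology agrees with the reference tree; for two resolved quartet topologies, the only other possible value is two.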
We want to use the quartet trees that we obtain from informative quartet block pairs, to generate a tree of the full set of input sequences. Therefore, it is not sufficient for us to have a high percentage of correct quartet trees, but we also want to know how many of the sequence quartets are covered by these quartet trees. Generally, the results of supertree methods depend on the coverage of the used quartet topologies [54,55]. For a set of N input sequences, there are $\binom{N}{4}$ possible 'sequence quartets', i.e. sets of four sequences. Ideally, for every such set, we should have at least one quartet tree, in order to find the correct supertree. Table 2 reports the quartet coverage, i.e. the percentage of all sequence quartets, for which we obtained at least one quartet tree.
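The quartet coverage defined above can be computed as in the following Python sketch; the function name and the input representation are hypothetical.

```python
from math import comb


def quartet_coverage(n_seqs, informative_quartets):
    """Percentage of all sequence quartets hit by at least one block pair.

    informative_quartets: iterable of 4-tuples of sequence indices, one per
    informative block pair; order within a tuple does not matter.
    """
    covered = set()
    for quartet in informative_quartets:
        members = frozenset(quartet)
        if len(members) != 4:
            raise ValueError("a quartet must contain four distinct sequences")
        covered.add(members)
    return 100.0 * len(covered) / comb(n_seqs, 4)
```

For example, with N = 6 sequences there are 15 possible sequence quartets, so block pairs covering two distinct quartets give a coverage of about 13.3%.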
Note that Multi-SpaM uses randomly sampled quartet blocks, the program can thus return different results for the same set of input sequences. We therefore performed 10 program runs on each set of sequences and report the average correctness and coverage of these test runs.

Full phylogeny reconstruction
Finally, we applied our quartet-block pairs to reconstruct full tree topologies for the above sets of benchmark sequences. Here, we used two different approaches, namely Quartet MaxCut and Maximum-Parsimony, as described above. As is common practice in the field, we evaluated the quality of the reconstructed phylogenies by comparing them to the respective reference trees from AFproject using the normalized Robinson-Foulds (RF) distances between the inferred and the reference topologies. For a data set with N taxa, the normalized RF distances are obtained from the RF distances by dividing them by 2N − 6, i.e. by the maximum possible distance for trees with N leaves. The results of other alignment-free methods on these data are reported in [41,47].
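The normalization amounts to a single division, shown here as a minimal Python sketch with a hypothetical function name. The RF distance itself is assumed to have been computed beforehand by a tree-comparison tool.

```python
def normalized_rf(rf_distance, n_taxa):
    """Normalize an RF distance by the maximum 2N - 6 for N-leaf trees."""
    if n_taxa < 4:
        raise ValueError("normalization requires at least four taxa")
    return rf_distance / (2 * n_taxa - 6)
```

A value of 0 then corresponds to identical topologies and a value of 1 to two fully resolved trees that share no non-trivial split.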
We applied the program Quartet MaxCut first to the quartet topologies derived from the set of all informative quartet-block pairs. As a comparison, we then inferred topologies using only those quartet-block pairs that strongly support one of the three possible topologies for the four involved sequences. The results of these test runs are shown in Table 3. Next, we used the program PAUP* [6] to calculate the most parsimonious tree, using the distances between quartet blocks as characters, as explained above. Here, we used the TBR [56] heuristic. In some cases, this resulted in multiple optimal, i.e. most parsimonious trees. In these cases, we somewhat arbitrarily picked the first of these trees in the PAUP* output. The results of these test runs are also shown in Table 3, together with the results from Multi-SpaM.
Table 2. Test results on different sets of genomes. As benchmark data, we used five sets of genome sequences from AF-Project [47] and sets of genomes from Wolbachia [48], Piroplasmida [49] and Termites [50]. For each data set, we generated up to 1,000,000 pairs of quartet blocks as described in the main text. The table shows the number of informative block pairs ('# inf bp'), i.e. the number of block pairs for which we obtained either strong or weak support for one of the three possible quartet topologies of the involved sequences. In addition we show the percentage of correct quartet topologies (with respect to the respective reference tree), out of all informative block pairs, as well as the 'coverage' by quartet blocks, i.e. the percentage of sequence quartets for which we found at least one informative block pair. Standard deviations are shown in parentheses. For each data set, we obtained 1,000,000 block pairs, except for the three sets of mitochondrial genomes, where it was not possible to find this number of block pairs.

Runtime
The program runtime of our approach on the data sets that we used in our evaluation is shown in Table 4. Test runs were performed on an Intel Xeon Processor E7-4850 with 2.00 GHz (4 processors with 10 cores each/20 threads) and 1 TB RAM. Here, the runtime is shown separately for identifying the reference blocks by running the corresponding sub-routine of Multi-SpaM (first column) and for finding a second block for each reference block (third column). The time to calculate the resulting tree from the distances between these block pairs with parsimony or Quartet MaxCut was negligible. The total runtime of our approach is, thus, roughly the sum of the values in the first and the third column. As a comparison, we report the runtime of the well-known program RAxML [1] that Multi-SpaM uses on the don't-care positions of the reference blocks. Thus, the total runtime of Multi-SpaM is obtained as the sum of the values of the first two columns.

Discussion
Sequence-based phylogeny reconstruction usually relies on nucleotide or amino-acid residues aligned to each other in multiple alignments. Information about insertions and deletions (indels) is neglected in most studies, despite evidence that this information may be useful for phylogeny inference. There are several difficulties when indels are to be used as phylogenetic signal: it is difficult to derive probabilistic models for insertions and deletions, and there are computational issues if gaps of different lengths span multiple columns in multiple alignments. Finally, gapped regions in sequence alignments are often considered less reliable than un-gapped regions, so the precise number and length of insertions and deletions that have happened may not be easy to infer from multiple alignments. In recent years, many fast alignment-free methods have been proposed to tackle the ever-increasing amount of sequence data. Most of these methods are based on counting or comparing words, and gaps are usually not allowed within these words. It is therefore not straightforward to adapt standard alignment-free methods to use indels as phylogenetic information.

Table 3. Table 2. For each data set, we performed 10 program runs with our indel-based approach. The average over these program runs is shown in the table; standard deviations are shown in parentheses.

In the present paper, we proposed to use pairs of blocks of sequence segments, based on our previously proposed alignment-free spaced-word approach. Within such blocks, no gaps are allowed. These blocks can be used to obtain information about possible insertions and deletions between two blocks since the compared sequences have evolved from a common ancestor. To this end, we consider the distances between these blocks in the respective sequences. If these distances are different for two sequences, this indicates that there has been an insertion or deletion since they evolved from their last common ancestor. If the two distances are the same, no indel event needs to be assumed. This information can be used to infer a tree topology for the sequences involved in a pair of blocks. To our knowledge, this is the first attempt to use insertions and deletions as phylogenetic signal in an alignment-free context.
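The comparison of block distances described above can be illustrated with a minimal Python sketch. The representation of a block as a start position per sequence, the function names `block_distances` and `indel_pairs`, and the toy positions are assumptions made for illustration only; they are not taken from the Gap-SpaM implementation.

```python
from itertools import combinations

def block_distances(block1, block2):
    """Distance between two blocks in each sequence.

    block1, block2: dicts mapping a sequence name to the start position
    of that sequence's segment in the block (hypothetical representation).
    """
    return {seq: block2[seq] - block1[seq] for seq in block1}

def indel_pairs(distances):
    """Pairs of sequences whose block distances differ, i.e. pairs for
    which an insertion or deletion between the blocks has to be assumed."""
    return [(s, t) for s, t in combinations(sorted(distances), 2)
            if distances[s] != distances[t]]

# Toy example with made-up positions for four sequences A, B, C, D.
b1 = {"A": 100, "B": 250, "C": 40, "D": 900}
b2 = {"A": 600, "B": 750, "C": 560, "D": 1420}

d = block_distances(b1, b2)   # A and B: 500, C and D: 520
pairs = indel_pairs(d)        # every pair except (A, B) and (C, D)
```

In this toy example, the distances group A with B and C with D, so an indel only needs to be assumed between the two groups.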
In this study, we restricted ourselves, for simplicity, to quartet blocks, i.e. to blocks involving four input sequences each; we used pairs of blocks involving the same four sequences. We did not consider the length of hypothetical insertions and deletions, but only asked whether or not such an event has to be assumed between two sequences in the region bounded by the two blocks. Since indels are relatively rare events, compared to substitutions, the maximum-parsimony paradigm seems to be suitable in this situation. In the sense of parsimony, however, only those block pairs are informative that, in our notation, strongly support one of the three possible quartet topologies, in the sense of the definition that we introduced in this paper. Indeed, if two distances between two blocks are equal, and the third and fourth distances are different from them and also different from each other, then each of the three possible quartet topologies would require two insertion or deletion events. That is, all three topologies would be equally good from a parsimony viewpoint.
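The parsimony argument above can be checked with a small sketch that counts the minimum number of state changes on each of the three quartet topologies with the Fitch algorithm, treating the block distance in each sequence as an unordered character state. The function and variable names and the example distances are hypothetical; the strongly supporting pattern is assumed here to be two pairs of equal distances.

```python
def fitch_quartet_cost(states, topology):
    """Minimum number of state changes (Fitch) on an unrooted quartet tree.

    states:   dict leaf -> character state (here: distance between blocks)
    topology: ((w, x), (y, z)), meaning the split wx|yz
    """
    cost = 0
    node_sets = []
    for left, right in topology:
        a, b = {states[left]}, {states[right]}
        if a & b:
            node_sets.append(a & b)
        else:
            node_sets.append(a | b)
            cost += 1
    # The two internal nodes are joined by the central edge.
    if not (node_sets[0] & node_sets[1]):
        cost += 1
    return cost

TOPOLOGIES = {
    "ab|cd": (("a", "b"), ("c", "d")),
    "ac|bd": (("a", "c"), ("b", "d")),
    "ad|bc": (("a", "d"), ("b", "c")),
}

def costs(states):
    """Parsimony cost of a distance pattern on all three topologies."""
    return {name: fitch_quartet_cost(states, t)
            for name, t in TOPOLOGIES.items()}

# Two pairs of equal distances: one topology is strictly cheaper.
strong = costs({"a": 500, "b": 500, "c": 520, "d": 520})
# Two equal distances, the other two different from them and from
# each other: all three topologies need two events.
weak = costs({"a": 500, "b": 500, "c": 520, "d": 530})
```

For the first pattern, the topology ab|cd needs one event and the other two need two; for the second pattern, all three topologies need two events, which is why such block pairs are uninformative under parsimony.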
Intuitively, however, one may want to use the information from such quartet-block pairs that, in our terminology, weakly support one of the possible topologies. It is easy to see that, with a simple probabilistic model under which an insertion between two blocks occurs with a probability p < 0.5, independently of the length of the insertion and the distance between the blocks, a weakly supported topology would have a higher likelihood than the two alternative topologies, although all three topologies are considered equally good from a parsimony point of view. So it might be interesting to apply such a simple probabilistic model to our approach, instead of maximum parsimony. Also, while we restricted ourselves to quartet blocks in this study, it might be worthwhile to use block pairs involving more than four sequences.
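One way to make this likelihood argument concrete is sketched below. The model is an assumption chosen for illustration and is not necessarily the one intended above: each of the five edges of the quartet tree carries an indel independently with probability p, and every indel produces a new distance, so two sequences show the same distance exactly if no indel lies on the path between them. The names `path_edges` and `pattern_likelihood` and the value p = 0.1 are hypothetical.

```python
from itertools import combinations, product

LEAVES = "abcd"

def path_edges(x, y, cherries):
    """Edges on the path between leaves x and y; 'i' is the internal edge."""
    same_cherry = any({x, y} == set(ch) for ch in cherries)
    return {x, y} if same_cherry else {x, y, "i"}

def pattern_likelihood(pattern, cherries, p):
    """Probability of the equal/unequal pattern of distances at the leaves.

    Assumed model: each edge carries an indel independently with
    probability p, and two leaves agree iff no indel lies on their path.
    """
    edges = list(LEAVES) + ["i"]
    total = 0.0
    # Enumerate all 2^5 assignments of "indel" / "no indel" to the edges.
    for flags in product([False, True], repeat=len(edges)):
        hit = {e for e, f in zip(edges, flags) if f}
        consistent = all(
            (pattern[x] == pattern[y]) == (not (path_edges(x, y, cherries) & hit))
            for x, y in combinations(LEAVES, 2)
        )
        if consistent:
            k = len(hit)
            total += p ** k * (1 - p) ** (len(edges) - k)
    return total

p = 0.1
weak = {"a": 500, "b": 500, "c": 520, "d": 530}  # weakly supports ab|cd
lik = {t: pattern_likelihood(weak, t.split("|"), p)
       for t in ("ab|cd", "ac|bd", "ad|bc")}
```

Under these assumptions, the pattern has likelihood p^2 (1-p)^2 (3-2p) on the topology ab|cd and p^2 (1-p)^3 on each of the two alternatives, so the weakly supported topology is favoured even though all three have the same parsimony cost.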
Our approach has only a few parameters that can be adjusted by the user, essentially concerning the underlying binary pattern P and the threshold that is used to separate random quartet blocks from quartet blocks that represent true homologies. In our study, we used patterns with a length of 110 and with 10 match positions, i.e. with 100 don't-care positions. Our results in the present and in previous studies indicate that, with our default parameter values, homologous spaced-word matches can be reliably distinguished from random background matches [34,41]. Adapting these parameter values mainly affects the number of quartet-block pairs. So this mainly comes down to a trade-off between program runtime and the amount of information that one obtains from the block pairs, i.e. the strength of the signal.
Using standard benchmark data, we could show that the phylogenetic signal from putative insertions and deletions between quartet blocks is mostly in accordance with the reference phylogenies that we used as standard of truth. Interestingly, the quality of the tree topologies that we constructed from our 'informative' pairs of quartet blocks, i.e. from indel information alone, is roughly comparable to the quality of topologies obtained with existing alignment-free methods.
As mentioned above, our approach is not competing with existing phylogeny approaches. In fact, we did not expect to obtain trees with our approach that are of comparable quality to trees obtained with standard methods. Our goal was to find out if information about putative insertions and deletions can provide useful phylogenetic information at all in an alignment-free setting. Since the phylogenetic signal from indels is complementary to the information that is used by those existing approaches, any such information might be useful as additional evidence, no matter if substitution-based or indel-based trees are superior. It is all the more surprising that our rather simplistic approach is already able to infer trees that are roughly comparable to trees obtained with established alignment-free approaches.
There is a certain limitation of our approach if it is used as a stand-alone approach to infer trees without additional information, as we did in our evaluation: to infer trees from putative indels alone, we need a large enough number of 'informative' block pairs to obtain quartet trees, or as an input for maximum parsimony. The number of informative block pairs that we can obtain depends, however, on the sequence length and on the degree of similarity among the compared sequences. If sequences are too short or too distantly related, our approach cannot find sufficiently many reference quartet blocks with a score above the employed threshold value.
In our test runs, we used three data sets with a low degree of similarity: the plants data set and the mitochondrial genomes from Piroplasmida and Termites, see Table 1. The plant genomes that we used were large enough to obtain a sufficient number of informative block pairs and, as a result, a phylogenetic tree that is of comparable quality to the trees produced with FSWM and Multi-SpaM. The trees we obtained from the Piroplasmida and Termites mitochondrial DNA were of poor quality, though. Note, however, that even for these two data sets, a rather large fraction of the 'strongly informative' block pairs were in accordance with the reference phylogeny, namely 63.78 and 47.98 percent, respectively, see Table 2. This indicates that, while these block pairs are not sufficient to infer the correct tree topology when used as the sole source of information, they may still be useful as additional input information when combined with other approaches to phylogeny reconstruction. Therefore, it seems worthwhile to investigate how our indel-based approach can be used together with other alignment-free approaches.