Two Simple and Efficient Algorithms to Compute the SP-Score Objective Function of a Multiple Sequence Alignment

Background Multiple sequence alignment (MSA) is a crucial step in many molecular analyses and many MSA tools have been developed. Most of them use a greedy approach to construct a first alignment that is then refined by optimizing the sum of pair score (SP-score). The SP-score estimation is thus a bottleneck for most MSA tools since it is repeatedly required and is time consuming. Results Given an alignment of n sequences and L sites, I introduce here optimized solutions reaching O(nL) time complexity for affine gap cost, instead of O(n2L), which are easy to implement.


Introduction
A wide range of molecular analyses rely on multiple sequence alignments (MSA), e.g., prediction of tridimensional structures [1], phylogenetic inference [2] or detection of positive selection [3]. In all these studies, the initial MSA can strongly impact conclusions and biological interpretations [4,5]. As a consequence, MSA is a richly developed area of bioinformatics and computational biology.
Most MSA software use a greedy approach to construct a first alignment that is then refined by optimizing the sum of pair score (SP-score) [6,7] using a 2-cut strategy. This strategy consists in partitioning the current solution into two sub-alignments that are subsequently realigned; the resulting MSA replaces the previous one if its SP score is improved [7,8,9,10]. The SP-score estimation is thus a bottleneck for most MSA tools since it is repeatedly required and is time consuming as existing solutions have a time complexity of O(n 2 L) for an alignment of n sequences and L sites. The practical importance of having an efficient solution to compute SPscore lies within the following 'Muscle software paper' [8] quotation: "Notice that computation of the SP score dominates the time complexity of refinement and of MUSCLE overall[. . .]. It is natural to seek an O(nL) expression for [SP-score estimation. . .], but to the best of our knowledge no solution is known." The main result of this paper is an optimized algorithmic solution to estimate SP-score for affine gap costs in O(nL). I also introduce a more versatile solution able to handle more general gap cost penalty functions in O(nL + n 2 G max ), with G max L being the maximum number of gap intervals within one aligned sequence.

Definitions and notations
A multiple sequence alignment (MSA) for a set of sequences {s 1 . . .s n }, defined with alphabet S, is a set of n sequences {S 1 . . .S n } which are defined on an enriched alphabet S [{'-'} so that all S i have the same length L and, 8i, removing '-' from S i leads to s i . The aim of MSA tools is to position gaps (stretches of '-') so that characters at the same position (constituting a site) are (most likely) homologous. This is usually achieved through a heuristic optimization of the sum of pair score (SP-score). The SP-score of an MSA is obtained by considering all pairwise alignments it induced. Given two sequences S i and S j of MSA A the corresponding pairwise restriction (A |{S i ,S j }) is the alignment made up of two sequences S 0 i and S 0 j obtained by removing the '-' of S i (resp. S j ) whenever S j (resp. S i ) also has a gap at this position/site. The score of this pairwise alignment is the sum of the substitution/match scores induced by sites without gaps, plus the sum of scores associated to the gap intervals (maximal sub-sequences of consecutive gap symbols) observed in S 0 i and S 0 j . The SP-score of an alignment is classically obtained in O(n 2 L) by summing up the score of its n 2 À Á induced pairwise alignments (Algorithm 1 and 2), and can be decomposed into two terms: SPs, the contribution of substitutions/matches and, SPg the contribution of induced gap intervals (denoted IG').
In molecular biology |S| is a constant so typically small (4 for nucleotides and 20 for amino acids) that L|S| 2 and n|S| 2 , compared to nL, can safely be ignored in asymptotic complexity analysis. Under this assumption, all solutions described here have an O(nL) space complexity.
Algorithm. 1. Basic O(n 2 L) computation of the SP-score of an alignment A of n sites and L sequences given subst(.,.) and g_cost(.) functions. The subst(.,.) function provides, in O(1), the elementary score for two non gap characters on the same site, e.g. using BLOSUM matrix [11]. The g_cost(.) function provides, also in O(1), the cost of a gap interval based on its position and size. The compute_gap_intervals(.) subroutine (Algorithm 2) returns in O(|S|) the list of the gap intervals of its input sequence S. Algorithm 1: compute_SP_score Input: -The n aligned sequences S i of alignment A -a function subst(x,y) returning in O(1) the score for two non gap characters x and y on the same site of A -a function g_cost(IG') returning in O(1) the cost of a gap interval IG' Output: the SP score of this alignment SP s = 0; SP g = 0  The SPs part of the SP-score can be computed in O(nL) by using a table of size L|S| containing for each site the number of each (non gap) symbol (e.g. [8]). This strategy does not work for SPg except for the basic, but unrealistic, constant gap cost where g_cost(IG') = IG'[length].gapcost. I introduce here a more efficient solution to the SP-score computation problem accounting for most gap function penalties (including affine, log, log-affine penalties). The main idea is to pre-compute the list of gap intervals of each sequence S i , ordered by gap start position, this can easily be done in O(L) using Algorithm 2. The compute_gap_intervals(S 0 i ) and compute_-gap_intervals(S 0 j ) of Algorithm 1, observed in A|{S i ,S j }, can then be efficiently deduced by processing the gaps of compute_gap_intervals(S i ) [ compute_gap_intervals(S j ) according to order of opening (as done to merge two sorted lists in linear time, e.g. during a merge step of the 'merge sort' algorithm) while maintaining the number of gaps facing gaps encountered so far (i.e. the shift between S and S' site coordinate systems for current position). The resulting SPscore algorithm (Algorithm 3) has a complexity of O(nL + n 2 G max ), with G max dL/2e the maximum number of gap intervals within one aligned sequence, instead of O(n 2 L). Note that the difference with the naïve algorithm is especially important when sequences contain few long gap stretches but that in the worst case, where most sequences have a number of gap intervals close to L, this algorithm has the same O(n 2 L) complexity as the naïve solution.
Optimal algorithm to compute SP-score using affine gap cost penalties Affine gap penalties (where g_cost(IG') = gap O + IG' [length].gap ext ) are frequently used. For such gap penalties, the total of gap extension penalties (SP ge ) can also be efficiently computed in O(nL), by counting the number of gaps per site. However, gap opening cannot be counted exactly based on local site information (e.g [8]) only approximated. Though pessimistic gap count approximation [9] is often used during the dynamic programming steps producing new candidate alignments, the exact SP score is generally preferred to decide which alignments are better than the current one. Algorithm 4 provides a simple and exact solution to compute the SP-score under affine gap penalties in O(nL), which is also the time complexity for just reading an alignment of n sequences of L sites.
The key idea is to note that a gap IG i will add a number of gap opening penalties equal to n minus the number of interval IG j so that IG i IG j . In order to find out how many gaps encompass IG i , sites are processed from left to right while maintaining an array indicating, for each left site, the number of gap stretches already opened at this position and not yet closed. For all gaps IG i closing at the current position i, the value stored at index IG i [ deb] of this array provides, in O(1), the number of gap stretches encompassing IG i ; before considering site i+1, this array is maintained updated by decreasing by 1 all values stored at indices between IG i [ deb] and i for all IG i ending at i-hence updates for all sites overall require O(nL). External and internal gaps are often penalized with different affine functions. The proposed O(nL) solution can handle this refinement by: firstly, using different characters (e.g. '-' and '_') to represent the two different gap types while computing SP ge ; and secondly, testing each gap interval type in Algorithm 3 (using gap interval start/end positions) to select the adequate gap opening cost.
Algorithm. 3. Given the gap interval lists LG i , LG j of sequences S i 2 A and S j 2 A; this algorithm returns in O(|LG i | + |LG j |) the restricted gap interval lists LG 0 i , LG 0 j that would be observed in A|{S i ,S j } without actually building this restricted alignment.
Algorithm 3: compute_pairwise_restricted_gap_intervals Input:-LG i , LG j the ordered lists of gap intervals for S i 2 A and S j 2 A Output:

Conclusion
This paper introduces an optimized algorithmic solution to estimate SP-score for affine gap costs in O(nL) and a more versatile solution able to handle more gap cost penalty functions in O(nL + n 2 G max ), with G max dL/2e being the maximum number of gap intervals per sequence. These optimizations will obviously be part of the next release of MACSE [10], the MSA software we developed to align nucleic sequences with respect to their amino acid translation while allowing them to contain frameshifts and/or stop codons (http://bioweb.supagro.inra.fr/macse/). Moreover, once stated those two algorithms are quite straightforward and can easily be included in the numerous existing MSA software relying on SP-score. Algorithm. 4. Efficient O(nL) computation of the contribution of gap opening cost (SP go ) for an alignment A of n sites and L sequences.