An Integer Programming Formulation of the Minimum Common String Partition Problem

We consider the problem of finding a minimum common string partition (MCSP) of two strings, which is an NP-hard problem. The MCSP problem is closely related to genome comparison and rearrangement, an important field in Computational Biology. In this paper, we map the MCSP problem into a graph applying a prior technique and using this graph, we develop an Integer Linear Programming (ILP) formulation for the problem. We implement the ILP formulation and compare the results with the state-of-the-art algorithms from the literature. The experimental results are found to be promising.


Introduction
String comparison is one of the major problems in computer science with extensive applications in different areas that includes genome sequencing, text processing and compression.In this paper, we address the problem of finding a minimum common partition (MCSP) of two strings.MCSP is closely related to genome arrangement, an important field in Computational Biology.More detailed study of the application of MCSP can be found at [10], [3] and [4].
In the MCSP problem, we are given two related strings (X, Y ).Two strings are related if every letter appears the same number of times in each of them.Clearly, two strings have a common partition if and only if they are related.So, the length of the two strings are also the same (say, n).A partition of a string X is a sequence P = (B 1 , B 2 , • • •, B c ) of strings whose concatenation is equal to X, that is B 1 B 2 • • • B c = X.The strings B i are called the blocks of P .Given a partition P of a string X and a partition Q of a string Y , we say that the pair π =< P, Q > is a common partition of X and Y if Q is a permutation of P .The minimum common string partition problem is to find a common partition of X, Y with the minimum number of blocks, that is to minimize c.For example, if (X, Y ) = ("ababcab","abcabab"), then one of the minimum common partition sets is π ={"ab","abc","ab"} and the minimum common partition size is 3.The restricted version of MCSP where each letter occurs at most k times in each input string, is denoted by k-MCSP.
In this paper, we present an Integer Programming formulation for the MCSP problem.In particular, we use a graph mapping that was designed in our prior work [7] to solve the MCSP problem using the Ant Colony Optimization technique [6].Here we exploit this graph to present an Integer Programming (IP) formulation for the problem.Then we implement the MIP formulation, conduct extensive experiments and compare the results with the state-of-the-art algorithms in the literature.As will be reported in a later section, the results clearly indicate that the IP formulation is accurate, effective and provides excellent results.
The rest of the paper is organized as follows.In Section 2 we present a brief literature review.Section 3 presents the notations and definitions needed to understand the concept presented in the paper later.In Section 4 we present the Integer Programming formulation for the MCSP problem.We present our experimental results in Section 5. Finally, we briefly conclude in Section 6.

Related Works
The 1-MCSP is essentially the breakpoint distance problem [13] between two permutations which is to count the number of ordered pairs of symbols that are adjacent in the first string but not in the other; this problem is obviously solvable in polynomial time [10].The 2-MCSP is known to be NP-hard and moreover APX-hard in [10].The authors in [10] also presented several approximation algorithms.Chen et al. [3] studied the problem called the Signed Reversal Distance with Duplicates (SRDD), which is a generalization of MCSP.Furthermore, they gave a 1.5-approximation algorithm for 2-MCSP.In [5], the author analyzed the fixed-parameter tractability of MCSP considering different parametrs.In [11], the authors investigated k-MCSP along with two other variants, namely, M CSP c , where the alphabet size is at most c and x-balanced MCSP, which requires that the length of the blocks must be within the range (n/d − x, n/d + x), where d is the number of blocks in the optimal common partition and x is a constant integer.They showed that M CSP c is NP-hard when c ≥ 2. As for k-MCSP, they presented an FPT algorithm which runs in O * ((d!) 2k ) time.
Chrobak et al. [4] analyzed a natural greedy heuristic for MCSP: iteratively, at each step, it extracts a longest common substring from the input strings.They showed that for 2-MCSP, the approximation ratio (for the greedy heuristic) is exactly 3.They also proved that for 4-MCSP the ratio would be log n and for the general MCSP, between Ω(n 0.43 ) and O(n 0.67 ).
In our prior work [7] we have developed a meta-heuristc algorithm, namely, MAX-MIN ant system to solve the MCSP problem.In particular, in [7] we have mapped MCSP into a graph, namely, the common substring graph.MAX-MIN Ant System has been implemented over this graph.Very recently in [2], the authors have proposed an iterative probabilistic tree search algorithm for solving this problem.The algorithm is an iterative probabilistic variant of the greedy algorithm [4].The authors have tested their approach with the dataset introduced in [7].

Preliminaries
In this section, we present some definitions and notations that are used throughout the paper.Two strings (X, Y ), each of length n, over an alphabet are called related if every letter appears the same number of times in each of them.A block B = ([id, i, j]), 0 ≤ i ≤ j < n, of a string S is a data structure having three fields: id is an identifier of S and the starting and ending positions of the block in S are represented by i and j, respectively.Naturally, the length of a block [id, i, j] is (j − i + 1).We use substring([id, i, j]) to denote a substring of S induced by the block [id, i, j].
We use the notion of a common substring graph as introduced in [7].The following definitions are borrowed from [7].A common substring graph, G cs (V, E, id(X)) of two strings (X, Y ) as follows.Here V is the vertex set of the graph and E is the edge set.Vertices are the positions of string X, i.e., for each Two vertices v i ≤ v j are connected with and edge, i.e, (v i , v j ) ∈ E, if the substring induced by the block [id(X), v i , v j ] matches some substring of Y .More formally, if S y denotes the set of all substrings of Y , we have: In other words, each edge in the edge set corresponds to a block satisfying the above condition.For convenience, we will denote the edges as edge blocks and use the list of edge blocks (instead of edges) to define the edgeset E.

IP Formulation
Given two related strings X and Y each of length n, we create two graphs namely, are the vertex sets and E 1 and E 2 denote two edge block sets from the two graphs respectively.We define two sets of binary variables, namely, x t1 and y t2 where t1 ∈ E 1 and t2 ∈ E 2 .We also write δ k (v) − and δ k (v) + for the sets of incoming and outgoing edge blocks from E k where v ∈ V k and k ∈ {1, 2}.With the above settings, we develop the IP formulation for the MCSP as follows: such that, x t1 ≤ b∈matchList(E2,t1) b1∈matchList(E1,t1) x t1 ∈ {0, 1}, y t2 ∈ {0, 1}

Explanation of the Formulation
Objective function: Eq. ( 1) is the objective function that is to be minimized.The function simply calculates the size of the partition.Equality constraint: Eq. ( 2) states that two partitions on the two substring graphs must be equal in size.In other words, the number of blocks in the factorization of the first string X is equal to the number of blocks in the factorization of the second string Y .
Factorization constraint: Eqs. ( 3)-( 4) together imply that a unit flow enters at the source (the vertex labeled with 0) and arrives at the sink (the vertex labeled with n − 1) for string X.So, the string is factorized.For string Y the factorization is achieved in a similar fashion by Eqs.( 5)-( 6).These constraints ensure that the srtings get factorized by non-overlapping blocks.
One to one match constraint: We have two sets of blocks after the factorization.We must ensure that there is a one to one matching between the two sets of blocks.By matching we mean that, for each selected blocks (those with x t = 1 where t ∈ E 1 ) of the first edge block set E 1 , there must be one and only one corresponding selected block (with y t = 1 where t ∈ E 2 ) with the same substring in the second edge block set E 2 and vice versa.Eqs. ( 7)-( 8) achieve the matching between the two sets of blocks.Further, Eq. 9 is needed for the one to one matching.

Experiments
We have conducted our experiments in a computer with Intel Core 2 Quad CPU 2.33 GHz.The available RAM was 4.00 GB.The operating system was Windows 7. The programming environment was Matlab.We have used SCIP (version 3.1.0)stand alone solver [1] to solve the IP formulation (referred to as the IP algorithm henceforth).

Datasets
We have used the dataset used in our previous work [7].Notably, the same dataset has also been adopted to conduct experimental analysis and comparison by other researchers [2].The data sets are briefly described below.There are two types of data: randomly generated DNA sequences and real gene sequences.
Random DNA sequences: In [7], we have generated 30 random DNA sequences each of length at most 600 using [12].The fraction of bases A, T , G and C is assumed to be 0.25 each.For each DNA sequence we shuffle it to create a new DNA sequence.The shuffling is done using the online toolbox [12].The original random DNA sequence and its shuffled pair constitute a single input (X, Y ) in our experiment.This dataset is divided into 3 groups.The first 10 (Group1) have lengths less than or equal to 200 bps (base-pairs), the next 10 (Group2) have lengths within [201,400] and the rest 10 (Group3) have lengths within [401, 600] bps.
Real Gene Sequences: We have collected the real gene sequence data used in [7] and collected from the NCBI GenBank3 .This data correspond to the first 15 gene sequences of Bacterial Sequencing (part 14) whose lengths are within [200,600].We will denote it by "Real".

Implementation
SCIP [1] (version 3.1.0)is used to solve the IP formulation.We have used the stand alone solver.From the interface of the stand alone solver we have recorded the values of primal solution and the relative gap (gap = |primal − dual|/M IN (|dual|, |primal|)) periodically.We have enforced a time limit of 15 minutes, 30 minutes and 60 minutes for the Group1, Group2 and Group3 dataset respectively.For the Real dataset, we have given a time according to the length of the sequence.If the length is not more than 200 bps, we have assigned 15 minutes.On the other hand if the sequence length is in between 200 and 400 bps, we have given 30 minutes.The rest of the instances are assigned 60 minutes time.All other parameters are left default.

Results and Analysis
In an updated and extended version [8] 4 of our earlier work [7], MAX-MIN ACO (referred to as MMAS henceforth) has been compared with the greedy algorithm (referred to as Greedy henceforth) of [4].In [2], the authors have compared their two versions of iterative probabilistic tree search (referred to TS1 and TS2 henceforth) with Greedy and MMAS.Here, we compare the IP algorithm with the above four algorithms, namely, MMAS [9,8,7], Greedy [4] and both of TS1 and TS2 [2].
Tables ( 1)-( 4) present the results for the Group1, Group2, Group3 and Real dataset respectively.We have taken the results of MMAS and Greedy from [8].The results of TS1 and TS2 are taken from [2].As has been reported in [8], in MMAS, for a particular DNA sequence, the experiment was run for 15 times and the average result is reported.The first column under any group reports the partition size computed by the greedy approach, the second column is the average partition size found by MMAS and the third column represents the average time in seconds for the MMAS solution.The fourth and fifth column report the partition size of TS1 and TS2 respectively.The sixth and seventh column represent the average time of TS1 and TS2 respectively.The eighth and ninth column report the partition size by IP algorithm and the time to achieve this result.We report the feasible primal solution value here.The relative gap of dual and primal solution is reported in the tenth column.
Tables ( 5)-( 8) report the improvement achieved by IP algorithm over Greedy [4], MMAS [7], TS1, TS2 [2] for the Group1, Group2, Group3 and Real dataset, respectively.Some statistical test results are also reported in these tables.The first four columns of each table here represent the differences in the partition size between IP algorithm and the other four approaches (Greedy, MMAS, TS1 and TS2 respectively).A negative (positive) result indicates that the IP algorithm is better (worse) than the other algorithm by that amount.The fifth to seventh columns of each table report the result of student's t-test between the MMAS and IP algorithm.For each instance we have a vector of 15 sample observations from MMAS.The result of IP algorithm is replicated 15 times.The t-test is performed on these two vectors.The fifth column reports the t-stat value.The sixth column represent the p-value of the test.A positive t-stat value with a low p-value indicates improvement.The seventh column reports a test decision for the null hypothesis that the data in two vectors (MMAS and IP results) come from independent random samples from normal distributions with equal means Table 1.Comparison among Greedy [4], MMAS [7], TS1, TS2 [2] and IP on random DNA sequences (Group 1, 200 bps).
Greedy MMAS(Avg.)Avg.Time(MMAS) TS1(Avg.)TS2(Avg.)Avg.Time(TreeSearch1) Avg. and equal but unknown variances, using the two-sample t-test.We use +,-,≈ to indicate a better (a positive t-stat value), worse (a negative t-stat value ) or equal (higher p-value than the significance level) result than MMAS.Here the significance level is 5%.
Similarly, The next three columns report the t-test results between TS1 and IP algorithm.Finally, the last three columns represent the statistics result between TS2 and IP algorithm.The two pair sampled t-stat value for TS1 and TS2 is obtained by the average and standard deviation of the 10 independent runs of TS1 and TS2 reported in [2].
IP algorithm is not a stochastic algorithm.So it has always a standard deviation of zero.As a result the application of t-test is dependent on the stochastic nature of the other algorithms.If the comparing algorithm also has a zero standard deviation than we will not get any t-test result.Those cases are shown in the tables as "NA".As it can be seen from the tables all of the differences are negative.That means our result is better than all other approaches.The result is also justified statistically.All of the p-values are zero and all of the t-stat values are positive.Figure 1 shows the average percentage of improvement of IP Algorithm over the other four approaches.The percentage of improvement is between 4% to 16%.

Running Time
In the previous section, we have shown that the IP algorithm provides much better partition size.In this section we will explore the runtime of the IP algorithm and compare it with the other four approaches.The runtime of TS1 and TS2 is taken from [1].The runtime of MMAS is taken from [8].The Greedy algorithm is very fast.It gives the output within 2 minutes.So, in the analysis, we   Although the reported time (in tables (1)-( 4)) of IP algorithm is higher than Greedy, TS1 and TS2 approaches in some instances but from the Figures (2)-( 5), it can be easily observed that the IP algorithm reaches to better solutions much earlier.From the figures it is clear that the IP algorithm is better than Greedy at any stage of time.Even if we stop the IP algorithm at or earlier than the average runtime of MMAS, TS1 or TS2, the IP algorithm provides better solutions.

Conclusion
In this paper, we have presented an Integer Programming formulation for the MCSP problem.We have conducted extensive experiments and compare the results with the state of the art algorithms in the literature.The results clearly indicate that the IP formulation is accurate, effective and provides excellent results.

Table 3 .
Comparison among Greedy