A k-mismatch string matching for generalized edit distance using diagonal skipping method

This paper proposes an approximate string matching with k-mismatches when calculating the generalized edit distance. When the edit distance is generalized, more sophisticated string matching can be provided. However, the execution time increases because of the bundle of complex computations for calculating complicated edit distances. The computational costs for finding which steps or edit distances are over k-mismatches cannot be significant in the generalized edit distance metric. Therefore, we can reduce the execution time by determining steps over k-mismatches and then skipping them. The diagonal step calculations using the pruning register skips unnecessary distance calculations over k-mismatches. The overhead of control statements and reordered memory accesses can be amortized by skipping multiple steps. Even though the proposed skipping method requires additional overhead, the proposed scheme’s practical embodiments show that the execution time of string matching is reduced significantly when k is small.


Introduction
In the field of computer science, information retrieval is a fundamental problem. Notably, string matching is essential to digital information retrieval. String matching searches the sequence of characters or pattern to determine whether the pattern matches with an input sequence or not. In exact string matching, when a pattern is the same as an input sequence, it determines that the pattern is matched with the input sequence. On the other hand, approximate string matching evaluates the similarity between the input sequence and pattern based on its metric. With sophisticated data analysis and various applications, the approximate string matching can get more attention in the big-data era [1][2][3][4][5][6].
The similarity between two strings can be quantified by the minimum number of basic operations that make an input sequence equal to the target pattern. Traditionally, approximate string matching assumes that the insertion, deletion, replacement, and transposition of characters in a string make the difference [1,7]. They are used as basic operators to calculate the distance between the input sequence and target pattern. In the Levenshtein distance calculation, string matching can be simplified because each basic operator has the unified cost of one. When estimating the distance between two strings, the Hamming distance [8] calculation counts '1' bits after applying bitwise exclusive-OR. On the other hand, the semantics or relationship between subsequences makes approximate string matching more sophisticated. For example, a human can feel that pattern "catch" is more similar to input sequence "cotch" than to input sequence "ctch", although the Levenshtein distance from pattern "catch" is one for both input sequences. Therefore, a more complicated edit distance metric can be adopted, categorized into the generalized edit distance [9,10] or the normalized edit distance [11,12]. However, when complex functions generalize the edit distance, significant computational resources are required. When calculating the edit distance between input sequence and pattern, the edit distance between each input subsequence and subpattern is needed, which is called a step. Moreover, in the traditional sequential dynamic programming [13], all steps should be calculated in order, which is very time-consuming due to the data dependency in calculating steps. Several mathematical approaches can show better computational complexity [14,15]. However, the overhead of control statements and reordered memory accesses is not considered for practical applications.
Parallel string matching methods have been researched using the parallelism available in GPUs (Graphics Processing Units) [7][16][17][18][19][20][21][22][23][24] and FPGAs (Field Programmable Gate Arrays) [25][26][27][28][29], where the parallel programming requires multiple computational resources. However, sequential string matching based on a single processing unit is still an attractive and fundamental topic in many practical applications. Our study reduces the execution time of sequential approximate string matching when performed by a single processing unit.
Naively, in a step calculation, if its data-dependent previous steps are over k-mismatches, the evaluation of operators with these previous steps over k can be skipped. However, in the Levenshtein distance metric, the overhead ratio for finding whether data-dependent steps are over k-mismatches is relatively high. In previous theoretical approaches [14,15], this overhead is not considered, so their implementations cannot have better performance than that of the dynamic programming-based method [13] in the Levenshtein distance metric. With the generalized edit distance, more sophisticated string matching can be provided, but the execution time increases because of the bundle of complex computations for performing complex edit calculations. Therefore, if the step calculations that are expected to be over k can be skipped, the total execution time can be significantly reduced, which motivates our research.
This paper proposes an approximate string matching with k-mismatches for the edit distance metric. Our research is motivated by the observation that when previous steps are over k, this information can be used to skip unnecessary step calculations. This paper focuses on the practical embodiment of our method and its evaluation. Without finding which data-dependent previous steps are over k, the diagonal step calculations using the pruning register can skip unnecessary step calculations over k-mismatches. Each bit in the pruning register contains the information of step calculations to be skipped. Even though there is an additional overhead of control statements and reordered memory accesses, skipping multiple steps at a time can reduce execution time significantly. For realistic experiments, generalized edit distance metrics are assumed based on the similarity in shapes and keyboard character positions. The proposed string matching and other dynamic programming methods are coded and then evaluated using the generalized edit distance metrics. Despite additional overhead in the diagonal step calculations and pruning register accesses, experiments show that the proposed skipping method can reduce the execution time of approximate string matching when k is small.

Preliminaries
The similarity is quantified in approximate string matching. The distance between the input sequence and pattern refers to the calculation result based on the distance metric adopted in string matching.
In [1], for strings $X_i = x_1, x_2, \ldots, x_{i-1}, x_i$ and $Y_j = y_1, y_2, \ldots, y_{j-1}, y_j$ where characters $x_a, y_b \in C$ for $1 \le a \le i$ and $1 \le b \le j$, the distance between $X_i$ and $Y_j$, denoted as $D(X_i, Y_j)$, is the minimum number of edit operations to make $X_i$ and $Y_j$ the same. The distance $D(X_i, Y_j)$ should satisfy: $D(X_i, Y_j) \ge 0$, $D(X_i, Y_j) = 0$ if and only if $X_i = Y_j$, and $D(X_i, Y_j) = D(Y_j, X_i)$. Besides, for a given string $Z_k = z_1, z_2, \ldots, z_{k-1}, z_k$ where $z_c \in C$ for $1 \le c \le k$, the edit distance satisfies the condition of $D(X_i, Y_j) \le D(X_i, Z_k) + D(Z_k, Y_j)$. The Levenshtein distance [30] is the most popular edit distance metric in string matching, so the term edit distance has sometimes been used interchangeably with the Levenshtein distance. However, because the edit distance can cover metrics other than the Levenshtein distance, this paper uses the term Levenshtein distance metric only for the metric that adopts simple operators with the cost of one.
We define input subsequence $X_\alpha$ of input sequence $X_i$ and subpattern $Y_\beta$ of pattern $Y_j$ for $1 \le \alpha \le i$ and $1 \le \beta \le j$, as follows: Definition 1 For strings $X_\alpha = x_1, x_2, \ldots, x_{\alpha-1}, x_\alpha$ and $Y_\beta = y_1, y_2, \ldots, y_{\beta-1}, y_\beta$, when subscripts $\alpha \le i$ and $\beta \le j$, $X_\alpha$ and $Y_\beta$ are the input subsequence and subpattern of input sequence $X_i$ and pattern $Y_j$, respectively.
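For input subsequence $X_\alpha$ and subpattern $Y_\beta$, when the initial edit distance $D(X_0, Y_0)$ is 0, the minimum edit distance $D(X_\alpha, Y_\beta)$ is formulated as follows:
$$D(X_\alpha, Y_\beta) = \min \begin{cases} D(X_{\alpha-1}, Y_{\beta-1}) + \mathrm{substitution}(x_\alpha, y_\beta) \\ D(X_{\alpha-1}, Y_\beta) + \mathrm{deletion}(x_\alpha) \\ D(X_\alpha, Y_{\beta-1}) + \mathrm{insertion}(y_\beta) \end{cases} \quad (1)$$
Black arrows 1, 2, and 3 in Fig 1(a) mean substitution, deletion, and insertion operators, which correspond to cost functions substitution($x_\alpha$, $y_\beta$), deletion($x_\alpha$), and insertion($y_\beta$) in Eq (1), respectively. The function substitution($x_\alpha$, $y_\beta$) means the cost of substituting $x_\alpha$ of $X_\alpha$ into $y_\beta$ of $Y_\beta$. The function deletion($x_\alpha$) provides the cost of deleting $x_\alpha$ from $X_\alpha$. On the other hand, the function insertion($y_\beta$) means the cost of inserting $y_\beta$ to the end of $Y_{\beta-1}$.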
For example, let's assume that $X_3$ = "bat" and $Y_3$ = "bad" with $\alpha = \beta = 3$. In this case, by substituting $x_3$ ('t') with $y_3$ ('d') in $X_3$ in substitution($x_3$, $y_3$), the converted $X_3$ can be the same as $Y_3$, which adds the cost of substitution('t', 'd') to D("ba", "ba"). When D("ba", "bad") is given, character $x_3$ ('t') is removed from "bat", and the given D("ba", "bad") is required to convert "ba" to "bad". Therefore, the cost of deletion('t') is added to calculate D("bat", "bad"). When D("bat", "ba") is given, after attaching $y_3$ = 'd' to "ba", D("bat", "bad") can be calculated, which means that the cost of insertion('d') should be added. Therefore, the minimum edit distance for $X_3$ = "bat" and $Y_3$ = "bad", formulated as D("bat", "bad"), is calculated on Eq (1) as:
$$D(\text{"bat"}, \text{"bad"}) = \min \begin{cases} D(\text{"ba"}, \text{"ba"}) + \mathrm{substitution}(\text{'t'}, \text{'d'}) \\ D(\text{"ba"}, \text{"bad"}) + \mathrm{deletion}(\text{'t'}) \\ D(\text{"bat"}, \text{"ba"}) + \mathrm{insertion}(\text{'d'}) \end{cases}$$
The Levenshtein distance metric simplifies the cost of each operator into 1 or 0, which makes the Levenshtein distance calculation very simple. Fig 1(b) illustrates an example of the Levenshtein distance matrix. In Fig 1(b), input subsequence "ca" can be the same as subpattern "cat" after attaching "t" to the end of input subsequence "ca". When using the insertion operator, an input subsequence can be equal to the subpattern. Substitution and deletion operators are also applied to the input sequence to match with the pattern. In Fig 1(b), the rightmost bottom cell is numbered as 4, which is the final Levenshtein distance between input sequence "ccatese" and pattern "catch".
We denote the edit distance between input subsequence and subpattern as step. In a traversal, the steps included in the traversal are calculated in order. The traversal method determines the order of calculating steps.
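The recurrence of Eq (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `edit_distance_matrix` and the helpers `lev_sub` and `lev_indel` are hypothetical names introduced here, the three cost functions are passed in as parameters, and the unit costs shown correspond to the Levenshtein case described above.

```python
from typing import Callable, List


def edit_distance_matrix(
    x: str,
    y: str,
    substitution: Callable[[str, str], float],
    deletion: Callable[[str], float],
    insertion: Callable[[str], float],
) -> List[List[float]]:
    """Return the matrix of steps, where D[a][b] holds D(X_a, Y_b)."""
    i, j = len(x), len(y)
    D = [[0.0] * (j + 1) for _ in range(i + 1)]
    for a in range(1, i + 1):  # first column: deletions only
        D[a][0] = D[a - 1][0] + deletion(x[a - 1])
    for b in range(1, j + 1):  # first row: insertions only
        D[0][b] = D[0][b - 1] + insertion(y[b - 1])
    for b in range(1, j + 1):  # one valid order: column by column
        for a in range(1, i + 1):  # top to bottom inside a column
            D[a][b] = min(
                D[a - 1][b - 1] + substitution(x[a - 1], y[b - 1]),
                D[a - 1][b] + deletion(x[a - 1]),
                D[a][b - 1] + insertion(y[b - 1]),
            )
    return D


def lev_sub(a: str, b: str) -> float:
    """Levenshtein substitution cost: 0 for equal characters, 1 otherwise."""
    return 0.0 if a == b else 1.0


def lev_indel(_c: str) -> float:
    """Levenshtein insertion or deletion cost: always 1."""
    return 1.0


print(edit_distance_matrix("bat", "bad", lev_sub, lev_indel, lev_indel)[-1][-1])  # 1.0
print(edit_distance_matrix("ccatese", "catch", lev_sub, lev_indel, lev_indel)[-1][-1])  # 4.0
```

Passing the cost functions as arguments keeps the same loop usable for the generalized metrics, at the price of one function call per operator evaluation.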

Generalized edit distance
Unlike the Levenshtein distance, the generalized edit distance adopts more sophisticated cost functions in Eq (1). Depending on the operator type, cost functions can output values other than 1 or 0. If the cost of one deletion operation is twice that of one substitution operation, it follows as: $\mathrm{deletion}(x_\alpha) = 2 \times \mathrm{substitution}(x_\alpha, y_\beta)$. For example, the similarity in character shapes can be used in another generalized distance metric. Because 'h' is similar to 'b' in shape, a human can feel that input sequence "catcb" is more similar to pattern "catch" than input sequence "catco". In this case, the similarity can be estimated by the substitution operator in the generalized edit distance metric. In another example, misspelling can happen depending on the character positions on a keyboard. On the US computer keyboard, 'q' has a high possibility of being mistyped as 'w' because the key of 'w' is located next to that of 'q'. However, the shape of 'q' is totally different from that of 'w', so that other functions are needed to quantify the difference between key positions. Besides, the generalized edit distance can consider the pattern length. Intuitively, we feel that the difference between "ca" and "cat" is expected to be greater than that between "catasrophe" and "catastrophe" even though the difference in both cases is caused by one deleted character 't'. Therefore, the costs of the insertion and deletion operations can be inversely proportional to the pattern length. In this case, condition $D(X_\alpha, Y_\beta) = D(Y_\beta, X_\alpha)$ cannot be met when the costs for insertion and deletion operations are different from each other. In conclusion, the generalized edit distance metric requires more complicated operations. Besides, these generalized edit distance metrics can adopt fractions to represent the distance.
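The loss of symmetry mentioned above can be checked with a small sketch. The cost values are assumptions chosen only for illustration (deletion costs 2, insertion and mismatching substitution cost 1), and `generalized_distance` is a hypothetical helper name, not a function from the paper's code.

```python
from typing import Callable


def generalized_distance(
    x: str,
    y: str,
    sub_cost: Callable[[str, str], float],
    del_cost: Callable[[str], float],
    ins_cost: Callable[[str], float],
) -> float:
    """Edit distance from x to y with caller-supplied operator costs."""
    prev = [0.0]
    for cy in y:  # row for the empty input subsequence: insertions only
        prev.append(prev[-1] + ins_cost(cy))
    for cx in x:
        cur = [prev[0] + del_cost(cx)]  # column 0: deletions only
        for b, cy in enumerate(y, start=1):
            cur.append(
                min(
                    prev[b - 1] + sub_cost(cx, cy),
                    prev[b] + del_cost(cx),
                    cur[b - 1] + ins_cost(cy),
                )
            )
        prev = cur
    return prev[-1]


def sub(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0


def dele(_c: str) -> float:
    return 2.0  # assumed: a deletion costs twice a substitution


def ins(_c: str) -> float:
    return 1.0


print(generalized_distance("cat", "ca", sub, dele, ins))  # 2.0
print(generalized_distance("ca", "cat", sub, dele, ins))  # 1.0
```

With these assumed costs, turning "cat" into "ca" needs one deletion while turning "ca" into "cat" needs one insertion, so the two directions give different distances.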

k-mismatch string matching
A k-mismatch approximate string matching is defined as: Definition 2 In k-mismatch string matching, for input sequence $X_i$ and pattern $Y_j$, when $D(X_i, Y_j) \le k$, $X_i$ is matched with $Y_j$. Term k denotes the threshold for determining whether $X_i$ is matched with $Y_j$ or not. Because the cost of any operation is a positive value, Eq (1) can be modified for k-mismatch string matching with input subsequence $X_\alpha$ and subpattern $Y_\beta$, which gives Eq (4). In Eq (4), when the edit distance of a data-dependent previous step ($D(X_{\alpha-1}, Y_{\beta-1})$, $D(X_{\alpha-1}, Y_\beta)$, or $D(X_\alpha, Y_{\beta-1})$) is over k, the evaluation of its related operator can be skipped. However, an additional overhead is required to find whether its data-dependent previous steps are over k or not.
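The idea of evaluating an operator only when its data-dependent previous step is within k can be sketched as follows. This is a minimal illustration assuming non-negative costs; the names `k_bounded_matrix` and `matches_within_k` are hypothetical, and the per-operator `if` tests are exactly the conditional overhead that the text refers to.

```python
from typing import Callable, List

INF = float("inf")


def unit_sub(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0


def unit_indel(_c: str) -> float:
    return 1.0


def k_bounded_matrix(
    x: str,
    y: str,
    k: float,
    substitution: Callable[[str, str], float] = unit_sub,
    deletion: Callable[[str], float] = unit_indel,
    insertion: Callable[[str], float] = unit_indel,
) -> List[List[float]]:
    """Fill the step matrix, skipping operators whose previous step is over k."""
    i, j = len(x), len(y)
    D = [[INF] * (j + 1) for _ in range(i + 1)]
    D[0][0] = 0.0
    for b in range(j + 1):
        for a in range(i + 1):
            if a == 0 and b == 0:
                continue
            best = INF
            # A previous step over k cannot lead to a step within k,
            # because every operator cost is non-negative.
            if a > 0 and b > 0 and D[a - 1][b - 1] <= k:
                best = min(best, D[a - 1][b - 1] + substitution(x[a - 1], y[b - 1]))
            if a > 0 and D[a - 1][b] <= k:
                best = min(best, D[a - 1][b] + deletion(x[a - 1]))
            if b > 0 and D[a][b - 1] <= k:
                best = min(best, D[a][b - 1] + insertion(y[b - 1]))
            D[a][b] = best
    return D


def matches_within_k(x: str, y: str, k: float) -> bool:
    """True when the unit-cost distance between x and y is at most k."""
    return k_bounded_matrix(x, y, k)[len(x)][len(y)] <= k


print(matches_within_k("ccatese", "catch", 4))  # True
print(matches_within_k("ccatese", "catch", 3))  # False
```

Steps whose true distance is at most k keep their exact value in this sketch; steps over k may hold a larger value or infinity, which is enough for the match decision.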

Motivations
From Eq (4), when it is known in advance that the edit distance of a data-dependent previous step ($D(X_{\alpha-1}, Y_{\beta-1})$, $D(X_{\alpha-1}, Y_\beta)$, or $D(X_\alpha, Y_{\beta-1})$) is over k, we can determine whether its related operation is needed or not. Our motivation starts from the fact that unnecessary step calculations over k-mismatches can be skipped depending on data-dependent previous steps. In the existing dynamic programming-based method, the distance matrix is filled by calculating edit distances between input subsequences and subpatterns, so that the data-dependent previous steps are accessed in the edit distance matrix. Fig 3 is the conceptual figure that illustrates the overhead ratios of conditional statements for finding data-dependent previous steps according to the computational overhead for performing operations (substitution, insertion, and deletion). In a simple edit distance metric such as

PLOS ONE
the Levenshtein distance, each operation only compares characters and produces a binary output, so that several operations are executed in the processor pipeline [31]. If there is any conditional jump, the speculatively issued instructions of many operations can be cancelled, which degrades the performance. In other words, it is expected that the conditional jump for skipping evaluations cannot reduce the total execution time. The overhead ratio of conditional statements for finding whether a data-dependent previous step is over k is too high. Therefore, in a simple edit distance metric, there could be no benefit from skipping evaluations in range (a) of Fig 3. On the other hand, when the edit distance metric requires more computational resources to evaluate complicated operators, the skipping method can be useful. As the computational resources for performing each operator increase, the overhead ratio of conditional statements becomes very small. In range (c), it can be better to skip each operator evaluation when its data-dependent previous steps are found to be over k. In range (b), the overhead of conditional statements relative to operator evaluations is not negligible. If the remaining iterations can be skipped in the loop for calculating each step, many operator evaluations can be avoided. The implementation of dynamic programming [13] using a nested loop cannot provide such a function.
Therefore, we propose a new string matching method for skipping the remaining iterations for the distance metric. In the following, the problem definition is discussed in detail, and the proposed diagonal skipping method is explained. Fig 4 shows examples of calculating steps and their traversals considering data dependency between steps, in which Fig 4(a) shows simple vertical traversals. In vertical traversal, steps on the next column depend on those on the previous column for substitution and insertion operators. Therefore, after calculating steps on a column, the neighbouring steps on the right column can be calculated. Besides, for the deletion operator, the traversal should proceed from top to bottom. This dynamic programming considers the data dependency between neighbouring steps that follows from Eq (1) [13]. After calculating steps in a traversal, the vertical traversal is performed on the right column. Therefore, when n and m denote the input sequence and pattern lengths, the computational complexity is O(mn).
Several approximate string matching algorithms have been studied to reduce the dynamic programming's computational complexity [14,15]. In general, several previous works about k-mismatch string matching enhance the throughput of string matching for a long input string such as network traffic data [2] and DNA sequences [4,5]. Therefore, multiple occurrences of the pattern are searched in the long input string, where string matching with input subsequences can be considered. For example, when input sequence and pattern are "baseball player" and "catastrophe", the Levenshtein distance between subsequence "baseball" and pattern "catastrophe" is 9. Considering the distance of 9, if k < 9, input subsequence "baseball" cannot be matched. Then, another string matching with another subsequence "player" begins. In this case, the calculation of the edit distance matrix cannot be avoided. This paper proposes a new method that reduces the execution time of obtaining the edit distance matrix for k-mismatches.

Diagonal traversal and skipping method
Our method adopts the diagonal traversal to skip unnecessary step calculations over k-mismatches. Unlike the vertical traversal performed on each column, the diagonal traversal calculates steps across columns. Even though the work in [14] proposes the diagonal evaluation based on a reordered data structure, the step calculations are not skipped for k-mismatch string matching.
In Fig 4(b), diagonal traversals are illustrated, where an arrow illustrates the order of steps calculated in each traversal. Fig 4(b) describes that the upper right step is calculated before the lower left step in a diagonal traversal. It is denoted that t is the index of a traversal, and the traversals indicated by arrows traversal(t − 2), traversal(t − 1), and traversal(t) are performed in order. The calculations of steps on a diagonal traversal traversal(t) do not have data dependency with each other. Each step calculation of traversal(t) has data dependency with three steps of traversal(t − 1) and traversal(t − 2). For substitution operation, a step can be calculated after obtaining the step in traversal(t − 2). For insertion and deletion operations, two steps can be calculated depending on steps of traversal(t − 1). These calculations require the values of two steps for the substitution and insertion operations on the left column and one step for deletion operation on the same column.
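The traversal order described above can be enumerated with a short sketch. It assumes that a step is identified by the pair (α, β) and that traversal(t) contains the steps with α + β = t, which matches the dependency on traversal(t − 1) and traversal(t − 2); the function name `diagonal_traversals` is hypothetical.

```python
from typing import Iterator, List, Tuple


def diagonal_traversals(i: int, j: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield each diagonal traversal as a list of steps (alpha, beta).

    Steps satisfy 1 <= alpha <= i, 1 <= beta <= j and alpha + beta == t,
    listed from the upper right step to the lower left step.
    """
    for t in range(2, i + j + 1):
        first_col = min(j, t - 1)  # rightmost column on this diagonal
        last_col = max(1, t - i)   # leftmost column on this diagonal
        yield [(t - b, b) for b in range(first_col, last_col - 1, -1)]


for index, steps in enumerate(diagonal_traversals(3, 2), start=2):
    print(index, steps)
# 2 [(1, 1)]
# 3 [(1, 2), (2, 1)]
# 4 [(2, 2), (3, 1)]
# 5 [(3, 2)]
```

For a step (α, β) on traversal(t), the substitution source (α − 1, β − 1) has index sum t − 2, while the deletion source (α − 1, β) and the insertion source (α, β − 1) have index sum t − 1, so no step depends on another step of the same traversal.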
When the previous diagonal traversals traversal(t − 2) and traversal(t − 1) finish the calculations of all data-dependent previous steps, traversal(t) can use the calculation results to skip unnecessary operator evaluations. Our proposed method adopts a so-called pruning register to avoid multiple iterations in a loop without accessing each element in the edit distance matrix. Each pruning bit in the pruning register is assigned to a column of the edit distance matrix. When the pruning bit for its column is set as '1', there is no need to calculate the remaining steps in the column. In this case, the steps to be calculated have distances over k. The pseudocode of the proposed string matching is given as Algorithm 1 (Diagonal Skipping). In the pseudocode, the procedure Diagonal_Skipping has two arguments: input sequence and pattern. Terms i, j, and k denote the input sequence length, pattern length, and mismatch threshold, respectively. Firstly, several elements in the two-dimensional (i + 1) × (j + 1) array D and the pruning register pruning_reg are initialized. In this initialization, D[0, ..., i][0] is initialized using only deletion operators. On the other hand, D[0][1, ..., j] is initialized using only insertion operators. These steps can be simply calculated without considering the min() function in Eq (1). In the pruning register, the bit indicating the leftmost column (the 0-th column) is set as '1', and other bits are set as '0'.
Then, each step in a traversal is calculated in order. The direction of the arrow is from right top to left bottom, as shown in Fig 4. As shown in the Preliminaries section, $X_\alpha$ and $Y_\beta$ mean a subsequence of input sequence $X_i$ and a subpattern of pattern $Y_j$ for $1 \le \alpha \le i$ and $1 \le \beta \le j$. An element D[α][β] in the two-dimensional array contains edit distance $D(X_\alpha, Y_\beta)$. For each step D[α][β], if the pruning bit of the β-th column is '1', the next steps calculated in the traversal are over k. Therefore, the break statement means that this procedure skips the other iterations that calculate steps of the traversal; otherwise, each step in the traversal is calculated. When the pruning bit of the (β − 1)-th column is '0', data-dependent previous steps are accessed, and the cost function calculates the minimum distance. Except for the calculation of D[s][1], $1 \le s \le i$, when D[α][β] is over k and the pruning bit of the (β − 1)-th column is '1', the pruning bit of the β-th column becomes '1'. The number of diagonal traversals is proportional to the pattern length j. Therefore, the computational complexity can be O(jk), which means that the computations can be mainly limited by j and k. Fig 5 illustrates the string matching operation with input sequence "ccatese" and pattern "catch". Firstly, the bit for the leftmost column in the pruning register is initialized as '1', as shown in Fig 5(a). When the pruning bit pruning_reg(1) is '1', function cost(D[3][2], D[3][1]) is performed to calculate D[4][2], where the insertion operation is skipped. Fig 5(d) shows intermediate progress based on Algorithm 1. The proposed diagonal skipping method can determine the steps to be skipped just by accessing the pruning register instead of using all data-dependent previous steps. Also, the proposed method skips multiple step calculations at a time, which reduces the execution time.
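The combination of diagonal traversal and a per-column pruning bit can be sketched as below. This is not a transcription of Algorithm 1: it is a simplified variant written for illustration, and its handling of column 0 is an assumption that differs from the text (the bit of column 0 is set only once its deletion-only distance exceeds k, and row 0 and column 0 are computed inside the traversal). The name `diagonal_skipping` and the unit-cost defaults are hypothetical, and non-negative costs are assumed.

```python
from typing import Callable, Optional

INF = float("inf")


def unit_sub(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0


def unit_indel(_c: str) -> float:
    return 1.0


def diagonal_skipping(
    x: str,
    y: str,
    k: float,
    sub: Callable[[str, str], float] = unit_sub,
    dele: Callable[[str], float] = unit_indel,
    ins: Callable[[str], float] = unit_indel,
) -> Optional[float]:
    """Return D(x, y) if it is at most k, otherwise None."""
    i, j = len(x), len(y)
    D = [[INF] * (j + 1) for _ in range(i + 1)]  # untouched steps count as over k
    pruning_reg = 0  # bit b set: all remaining steps of column b are over k
    for t in range(i + j + 1):
        # Walk the diagonal a + b == t from upper right to lower left.
        for b in range(min(j, t), max(0, t - i) - 1, -1):
            if (pruning_reg >> b) & 1:
                break  # columns 0..b are pruned: skip the rest of this diagonal
            a = t - b
            if a == 0 and b == 0:
                best = 0.0
            else:
                best = INF
                if a > 0 and b > 0:
                    best = min(best, D[a - 1][b - 1] + sub(x[a - 1], y[b - 1]))
                if a > 0:
                    best = min(best, D[a - 1][b] + dele(x[a - 1]))
                if b > 0:
                    best = min(best, D[a][b - 1] + ins(y[b - 1]))
            D[a][b] = best
            # A column is pruned once its step is over k and the column on
            # its left is already pruned (column 0 needs only the first test).
            if best > k and (b == 0 or (pruning_reg >> (b - 1)) & 1):
                pruning_reg |= 1 << b
        if (pruning_reg >> j) & 1:
            return None  # the last column is pruned, so D(x, y) is over k
    return D[i][j] if D[i][j] <= k else None


print(diagonal_skipping("ccatese", "catch", 4))  # 4.0
print(diagonal_skipping("ccatese", "catch", 3))  # None
```

Because a bit is set only when the bit of the column to its left is already set, the pruned columns always form a prefix of the register, which is why a single `break` can skip all remaining steps of a traversal.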
In the proposed method, the number of traversals on arrows is proportional to the pattern length j, where several step calculations over k are not skipped. If all neighbouring data-dependent steps on the same column over k are checked before evaluating operators, the unnecessary operator evaluations can be skipped, which can make the complexity O(min(j, k)). However, unlike the proposed diagonal skipping method using only one pruning register, complicated conditional statements and additional memory accesses are required. This method can be valid when the computational overhead of operator evaluations is significant, which is described in the range (c) of Fig 3.

Experimental results and analysis
Based on realistic environments, we show the experimental results depending on different edit distance metrics. Firstly, when Levenshtein distance is calculated, it is expected that the skipping method is not effective due to the overhead of conditional statements. Then, when adopting the generalized edit distance metrics considering the visual similarity in shapes or keyboard character positions, the proposed skipping method can show better performance than the dynamic programming for small k-mismatches and the method using the reordered data structure. Besides, the overhead of conditional statements for finding data-dependent previous steps is discussed.

Experimental environments
In experiments, the proposed method was coded in C and compiled with GCC 5.4.0. For apples-to-apples comparisons, we implemented the dynamic programming and the skipping method that found neighbouring data-dependent previous steps to skip each step calculation over k. These implemented codes have been uploaded in [32], where the execution times of the proposed method and other counterparts were measured. The tests were performed on a single core of an Intel Xeon CPU E5-2630 v3 @ 2.40GHz machine with 16 Gigabyte main memory and the Ubuntu 16.04 operating system. The experiments randomly selected 100,000 pairs of input sequence and pattern from the English dictionary with 370,099 words [33], where the average and standard deviation of the input sequence and pattern lengths were 9.4 and 2.90, respectively.
We evaluated the proposed method based on three different distance metrics. Firstly, we calculated the Levenshtein distances to know the benefits of the proposed method in the simple edit distance metric. Secondly, for the evaluation using a computationally heavy edit distance metric, the similarity in shapes between two alphabet characters was quantified in a two-dimensional array D considering [34]. This array was used to calculate the cost in the substitution operator. For example, substitution('a', 'b') = 1/2.13 × C and substitution('o', 'e') = 1/4.13 × C, where C was the scaling factor for normalizing the substitution cost. In this example, the cost of substitution('a', 'b') is 1.94 times the cost of substitution('o', 'e'). For insertion and deletion operators, this experiment assumed a weighted cost depending on pattern length j, where we developed an exponential cost function using the average word length, given as Eq (5). In Eq (5), costs were normalized by exp(1). As j increased, the costs of insertion or deletion operations decreased exponentially, so that different weights were assigned depending on j.
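The ratio quoted above follows directly from the two substitution costs. The sketch below stores only the two denominators given in the text; the table name `SHAPE_SCORE`, the function name `shape_substitution`, the symmetric lookup, and the default scaling factor of 1.0 are assumptions made for illustration and do not reproduce the array built from [34].

```python
from typing import Dict, Tuple

# Only the two values quoted in the text; the full table is not reproduced here.
SHAPE_SCORE: Dict[Tuple[str, str], float] = {
    ("a", "b"): 2.13,
    ("o", "e"): 4.13,
}


def shape_substitution(a: str, b: str, c: float = 1.0) -> float:
    """Substitution cost c / score; c is the scaling factor C of the text."""
    if a == b:
        return 0.0
    score = SHAPE_SCORE.get((a, b), SHAPE_SCORE.get((b, a)))
    if score is None:
        raise KeyError(f"no shape score stored for {a!r} and {b!r}")
    return c / score


ratio = shape_substitution("a", "b") / shape_substitution("o", "e")
print(round(ratio, 2))  # 1.94
```

The scaling factor cancels in the ratio, so the value 1.94 does not depend on the choice of C.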
Finally, our experiments adopted a more complicated distance metric that considered character positions on a keyboard. In this metric, the Euclidean distance between characters was calculated to obtain each substitution operation's cost. The position of each character was stored in an array, which was used to calculate the Euclidean distance between characters. Based on the typo distance in [35], the function for calculating the cost of each substitution operation was implemented. Unlike the typo distance [35], which lacks the commutative property, our edit distance metric had the same cost for the deletion and insertion operations to meet the edit distance's characteristic. The features of the edit distance metrics above are summarized in Table 1. Fig 6 shows the summary of average execution times by sweeping k when using the Levenshtein distance metric. When the diagonal traversal did not adopt k-mismatches, the average execution time was longer than that of the dynamic programming using the vertical traversal because of the overhead from conditional statements and reordered memory accesses. In these experiments, the execution time of the proposed method increased with k. When k > 4, the execution time was over that of the vertical traversal, which means the proposed method did not have any benefits over the simple vertical traversal for large k. Therefore, for the Levenshtein distance metric, we concluded that the proposed method can help reduce the execution time when k was small. Notably, compared with the vertical and diagonal traversals, the execution times were decreased by 44.3% and 52.3% with k = 1.
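A substitution cost based on key positions, as in the keyboard metric described above, can be sketched as follows. The key coordinates are assumptions: a US QWERTY letter layout with an assumed horizontal stagger per row, measured in key widths. The names `KEY_POS` and `keyboard_substitution` are hypothetical, and the function is not the typo distance of [35].

```python
import math
from typing import Dict, Tuple

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSET = [0.0, 0.25, 0.75]  # assumed horizontal stagger of each row

# Position (x, y) of each letter key, stored once and reused for every cost.
KEY_POS: Dict[str, Tuple[float, float]] = {
    ch: (col + ROW_OFFSET[row_index], float(row_index))
    for row_index, row in enumerate(ROWS)
    for col, ch in enumerate(row)
}


def keyboard_substitution(a: str, b: str) -> float:
    """Euclidean distance between the keys of characters a and b."""
    (ax, ay), (bx, by) = KEY_POS[a], KEY_POS[b]
    return math.hypot(ax - bx, ay - by)


print(keyboard_substitution("q", "w"))  # 1.0
print(keyboard_substitution("q", "p"))  # 9.0
```

Under these assumed coordinates, the neighbouring keys 'q' and 'w' get a small cost while the keys at opposite ends of the top row get a large one, and each evaluation needs a square root, which is heavier than the character comparison of the Levenshtein metric.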

Experimental analysis
For the generalized edit distance using similarity in shapes, Fig 7 illustrates the average execution times by sweeping k. Like the case using the Levenshtein distance metric, the execution time increased with k. When k < 5, the average execution times of the proposed diagonal skipping method were shorter than that of the vertical traversal, which means that many step calculations can be skipped for small k in this generalized edit distance metric. The diagonal traversal only increased the average execution time by 11.5% over the vertical traversal. Notably, when k = 1, the execution times were reduced by 55.7% and 60.3% over the vertical and diagonal traversals. Compared with the evaluation using the Levenshtein distance metric, it was expected that the execution time can be further reduced because the overhead ratio of conditional statements and reordered memory accesses was smaller.
Fig 8 summarizes the average execution times based on the distance metric considering keyboard character positions. Like Figs 6 and 7, the execution times were evaluated by sweeping k. Diagonal skipping(II) adopted the skipping method using the pruning register and also avoided unnecessary operator evaluations after finding data-dependent previous steps over k. On the other hand, Diagonal skipping(I) used only the skipping method with the pruning register. By avoiding unnecessary operator evaluations over k, Diagonal skipping(II) can further reduce the execution time when k was small. As k increased, the difference of the average execution time between Diagonal skipping(I) and Diagonal skipping(II) was reduced because the number of operator evaluations avoided by Diagonal skipping(II) diminished. When k = 5, the difference in the average execution times was negligible. When k > 5, the average execution time of Diagonal skipping(II) was longer than that of Diagonal skipping(I). Besides, when k = 8, the average execution time of Diagonal skipping(II) was very close to those of the vertical and diagonal traversals. Like the Levenshtein distance and the generalized distance using similarity in shapes, many step calculations can be skipped for small k, and the number of skipped step calculations decreased with k. However, even when k = 8, the proposed method's average execution time was shorter than those of the vertical and diagonal traversals. In agreement with Table 1, the operators' computational costs used in this distance metric can [...]. Secondly, if the pruning bit of the β-th column was '1', the next steps in a traversal were skipped because they were over k. When k = 1, between 70.9% and 84.6% of the step calculations were skipped.
As k increased, the ratios decreased rapidly, where the rate of decrease can be different depending on the adopted edit distance metric. When k = 8, only between 3.7% and 11.0% of the step calculations can be skipped, where the overhead of conditional statements and reordered memory accesses increased the average execution time compared with the vertical and diagonal traversals.
The statistical analysis was performed to know the functional relationship between the execution time and input parameters. As shown in [36], the regression approach was adopted, and the input sequence and pattern lengths were used as input parameters. This evaluation was performed with k = 2 for all adopted edit distance metrics. In these regression analyses, the coefficient of determination ($R^2$) can be used to show how well the regression model fits the target data [37]. When using the Levenshtein distance metric, $R^2$ was just 0.267. On the other hand, the $R^2$ values of the generalized edit distance metrics using similarity in shapes (denoted as Shape) and keyboard position (denoted as Keyboard) were 0.575 and 0.818, respectively. These results showed that except for the input sequence and pattern lengths, other overheads could significantly affect the Levenshtein distance metric's execution time. Table 2 lists the results of the regression analysis for the adopted three edit distance metrics, where Coef., SE Coef., T, and P denote the coefficient, standard error of the coefficient, t-value, and p-value, respectively. Because the p-values were small, the input sequence and pattern lengths can be statistically significant. Large t-values in Table 2 show that even though the input sequence and pattern lengths were the same, the execution time can differ severely depending on the input sequence and pattern values. Moreover, the coefficient for the pattern lengths was larger than that of input sequences, which means that the pattern lengths were more critical in the execution time.
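For reference, the coefficient of determination used above can be computed from observed and predicted values as in the following sketch. The function name `r_squared` is hypothetical, and the numbers in the usage lines are toy values chosen for illustration, not measurements from the experiments.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((o - mean) ** 2 for o in observed)  # total variation
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained
    return 1.0 - ss_res / ss_tot


# Toy values only: a perfect fit and a fit no better than the mean.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```

A value close to 1 means that the regression on the input sequence and pattern lengths explains most of the variation in execution time, while a low value indicates that other factors dominate.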

Conclusion
This paper proposes k-mismatch approximate string matching for the generalized edit distance. When the generalized edit distance is involved, this paper shows that skipping step calculations can reduce the execution time. The proposed method adopts the pruning register to skip step calculations in the diagonal traversals. This paper introduces practical generalized edit distance metrics for realistic experimental environments. The Levenshtein metric and two generalized edit distance metrics based on similarity in shapes and keyboard character positions are applied to evaluate the effectiveness of the proposed method. In experiments, even though the overhead of conditional statements and reordered data accesses exists in the generalized edit distance metrics, the proposed method can reduce the execution time of k-mismatch string matching. Considering the experimental results with realistic edit distance metrics, the proposed skipping method helps reduce the execution time in k-mismatch approximate string matching.