Efficient string similarity join in multi-core and distributed systems

In the big data era, a significant challenge for string similarity join is to find all similar pairs efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into disjoint small subsets according to the joint frequency distribution and the interval distribution of the strings. Then the filter-verification strategy is adopted in the similarity computation for each subset, so that an effective pruning strategy reduces the number of candidate pairs before verification and improves performance. Finally, the string join operation is executed in parallel. The Para-Join algorithm, based on the multi-threading technique, implements the framework on a multi-core system, while the Pada-Join algorithm, based on the Spark platform, implements the framework on a cluster. We prove that Para-Join and Pada-Join not only avoid duplicate computation but also ensure the completeness of the result. Experimental results show that Para-Join achieves high efficiency and significantly outperforms state-of-the-art approaches, while Pada-Join scales to large datasets.


Introduction
String similarity join, which finds similar string pairs in a given string set or between two given string sets, is a fundamental operation in many fields, such as pattern matching, computational linguistics, bioinformatics, and database integration [1]. It is widely used for detection of duplicate web pages in web crawling [2], collaborative filtering [3], and entity resolution [4]. For example, given two string sets R = {Mi Li, Qi Wan, . . .} and S = {M. Li, Qin Wan, . . .}, we can find all similar pairs <r,s> ∈ R × S such as <Mi Li, M. Li> according to a certain similarity function.
For string similarity join, fundamental techniques include partitioning techniques (e.g. Pass-Join [5] and PartEnum [1]), prefix-filtering methods (e.g. TrieJoin [6] and PEARL [7]), and other methods (e.g. MTree [8], SSI [9], LSH [10], and FASTSS [11]). Research in this field has been carried out in various scientific disciplines and related methods often are tuned for specific ranges of allowed error thresholds or query lengths, specific hardware properties, specific alphabet sizes, or specific distributions of errors.
The big data era is the inevitable consequence of our ability to generate, collect, and store digital data at an unprecedented scale. When there are a large number of sources and a large volume of data, the traditional string join methods become inefficient and ineffective in practice. To address the volume dimension, new techniques have been proposed to enable parallel string join using MapReduce. These include techniques for adaptive blocking [10] and techniques that balance load among different nodes. However, MapReduce is not well suited to data join applications.
In this paper, we propose a parallel string join framework that addresses the efficiency problem by utilizing the multi-threading technique and the distributed computing technique separately. The availability of high-performance CPUs and large memory makes the framework very practical. The contributions of this paper are as follows:
1. We propose a parallel processing framework for string similarity join together with a pruning strategy to obtain high efficiency. The partition-based method and parallel processing techniques are used to improve computational performance.
2. We propose a parallel string join algorithm, Para-Join, which implements the framework on a multi-core system. The multi-threading technique is used to improve processing performance. We demonstrate that Para-Join can not only avoid duplicate computation but also ensure the completeness of the result.
3. We propose a parallel string join algorithm, Pada-Join, which implements the framework in a cluster environment on the Spark platform to obtain high efficiency.
4. We have implemented and tested the Para-Join and Pada-Join algorithms on real datasets. The experimental results show that our algorithms achieve high performance and outperform existing methods.
The rest of this paper is organized as follows: Section 2 gives the notation and a discussion of some existing techniques. In Section 3, we introduce our proposed parallel string similarity join framework and related strategies. Sections 4 and 5 give the details of the Para-Join algorithm and the Pada-Join algorithm, respectively. We present the experimental results in Section 6. Related work is introduced in Section 7 and we conclude the paper in Section 8.

Formal problem statement
Definition 1: String and string set. A string s is a finite sequence of symbols over an alphabet Σ. The length of s is denoted by |s| and the substring starting at position i with length n is denoted by s(i,n). All positions in a string are zero-based, i.e., the first character of s is s(0). A string set S is a collection of strings. The size of S is denoted by |S|.
Definition 2: String similarity. Given strings s and r, s is similar to r, denoted s ∼sim r, if and only if Sim(s,r) ≥ δ, where Sim(s,r) is a certain similarity function and δ is a threshold. If the edit distance is used as the similarity function, s is k-approximately similar to r, denoted Ed(s,r) ≤ k, if and only if s can be transformed into r by at most k edit operations. The edit operations include replacing one symbol in s, deleting one symbol from s, and inserting one symbol into s.
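The k-approximate similarity test of Definition 2 can be sketched as a banded dynamic program that only fills DP cells within distance k of the diagonal, so a pair is accepted or rejected in O(k·min(|s|,|r|)) time. This is an illustrative sketch, not the paper's implementation; the function name and the early-termination check are our own.

```python
def edit_distance_within(s, r, k):
    """Return True iff Ed(s, r) <= k, using the standard edit-distance DP
    restricted to a band of width 2k+1 around the main diagonal."""
    n, m = len(s), len(r)
    if abs(n - m) > k:                 # length filter: pair cannot be similar
        return False
    INF = k + 1                        # any value > k means "already too far"
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == r[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        if min(cur) > k:               # early termination: no cell can recover
            return False
        prev = cur
    return prev[m] <= k
```

For example, "Mi Li" and "M. Li" differ by one substitution, so they pass with k = 1.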
When the edit distance is used, the similarity constraint corresponds to

Ed(s,r) ≤ (1 − δ) × Max(|s|, |r|)    (2)

Definition 3: String similarity (self) join. Given two string sets S and R, a similarity function Sim(·) (or an edit distance function Ed(·)) and a similarity threshold δ (or a distance threshold τ), a similarity join finds all string pairs <s,r> (s∈S, r∈R) such that Sim(s,r) ≥ δ (or Ed(s,r) ≤ τ). When R is equal to S, it is called the string similarity self-join of S. In this paper we adopt the edit distance to measure similarity, so we use τ as the similarity threshold directly.
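As a baseline for Definition 3, a similarity self-join can be written directly as an all-pairs comparison; this O(|S|²) sketch (with illustrative names, not the paper's code) is exactly what the partitioning and filtering of the following sections are designed to avoid.

```python
from itertools import combinations

def edit_distance(s, r):
    """Classic Wagner-Fischer DP for Ed(s, r)."""
    prev = list(range(len(r) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, cr in enumerate(r, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != cr)))
        prev = cur
    return prev[-1]

def naive_self_join(S, tau):
    """All pairs <s, r> from S with Ed(s, r) <= tau (Definition 3)."""
    return [(s, r) for s, r in combinations(S, 2)
            if edit_distance(s, r) <= tau]
```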

Parallel processing framework
Applications continue to become more data-intensive. In our framework, an application is pulled apart across the threads of a multi-core system or the nodes of a distributed system. Although this complicates data placement and transport, it improves processing efficiency. Fig 1 shows the proposed parallel string join framework, which benefits from data parallelism, task parallelism, and resilience. The input includes a string set S and a similarity threshold τ. In the second phase, S is divided into several partitions, and a thread or task is created to process each partition separately. The filter-verification technique is used for string join in each thread or task. Both string partitioning and string matching can be done in parallel. The output is the union of all similar pairs in S and is stored on disk.
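The framework's "partition, then join each partition in its own thread" shape can be sketched as follows. This is a minimal sketch under assumptions of our own: the partition rule (length modulo the partition count) is a stand-in for the paper's frequency-based split, and cross-partition pairs, which the real framework handles via a separate pass, are omitted here.

```python
from concurrent.futures import ThreadPoolExecutor

def ed(s, r):
    # Wagner-Fischer edit distance (used as the verification step)
    prev = list(range(len(r) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, cr in enumerate(r, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != cr)))
        prev = cur
    return prev[-1]

def join_partition(part, tau):
    # filter (length filter) then verify (edit distance), as in Fig 1
    out = []
    for a in range(len(part)):
        for b in range(a + 1, len(part)):
            s, r = part[a], part[b]
            if abs(len(s) - len(r)) <= tau and ed(s, r) <= tau:
                out.append((s, r))
    return out

def framework_join(S, tau, n_parts=4):
    # illustrative split only; the paper partitions by frequency distribution
    parts = [[] for _ in range(n_parts)]
    for s in S:
        parts[len(s) % n_parts].append(s)
    with ThreadPoolExecutor(max_workers=n_parts) as ex:
        results = ex.map(lambda p: join_partition(p, tau), parts)
    return [pair for res in results for pair in res]
```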
In this framework, three issues need to be taken into consideration:
• How to split the dataset into subsets. Since the size of the dataset and the capability of the system are unknown in advance, it is hard to determine the number of partitions.
• How to calculate the similarity of two strings efficiently. The similarity computation is the core of the framework so we need a more efficient algorithm to deal with it.
• How to implement a parallel string join algorithm that obtains high efficiency without affecting the accuracy of string matching. The parallel algorithm must guarantee that accuracy is not affected while the multi-threading and multi-tasking techniques reduce the time and space cost.
Let f_c(s) and f_c(r) denote the joint-frequency vectors of s and r, and let dis(v(s),v(r)) denote the L1 distance between the interval vectors v(s) and v(r). If v(s) = v(r), strings s and r are candidates to be similar and are assigned to the same partition. According to this rule, we can split S into disjoint subsets S_1, S_2, ..., S_n with each S_i ⊆ S. Then we can get the following conclusion.
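The partition rule above — strings with equal vectors land in the same subset — can be sketched as a group-by on the vector. The reading of f_c(s) as "per token group, how many of the string's characters fall in that group" is our assumption for illustration; the names are not from the paper.

```python
from collections import Counter, defaultdict

def joint_frequency_vector(s, joint_tokens):
    """Assumed reading of f_c(s): for each joint-token group, count how
    many characters of s belong to that group."""
    c = Counter(s)
    return tuple(sum(c[t] for t in group) for group in joint_tokens)

def partition_by_vector(S, joint_tokens):
    """Strings with equal vectors go to the same (disjoint) subset."""
    parts = defaultdict(list)
    for s in S:
        parts[joint_frequency_vector(s, joint_tokens)].append(s)
    return list(parts.values())
```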
From the above computation results, we can see that strings s1 and s2 belong to the same partition while strings s3 and s4 belong to different partitions.

Parallel processing in multi-core systems
In order to make full use of the capability of the multi-core system, we design and implement a parallel string join algorithm called Para-Join.

String join algorithm Para-Join
Algorithm 1 shows the pseudo-code of Para-Join. Given a string set S, it is first split into subsets in parallel by the frequency-distribution-based function fqSplit(·). Then a parallel cycling alternation method processes the subsets. The major flow is as follows. First, the set S is split into n subsets and the corresponding threads are created. In each thread, function para-RR(·) is invoked to find the similar string pairs within S_j, and function para-RS(·) finds all similar pairs between S_i and S_j (i < j). Theorem 1 shows that our algorithm eliminates redundant computation and guarantees the completeness of the result.

Theorem 1. Para-Join can not only avoid repetitive computation but also ensure the completeness of the result.

Proof. Given a collection of strings S, we split it into n small subsets S_1, S_2, ..., S_n. According to Para-Join, for any subset S_j (j = 1, 2, ..., n), we find the similar pairs between S_i and S_j for every i < j. Because j ranges from 1 to n, for any S_i and S_j (i ≠ j) the cross-subset search is executed exactly once. For each S_j, the algorithm also searches the similar pairs within S_j itself. So Para-Join does not miss any similar pair, i.e., it ensures the completeness of the result. Furthermore, no pair of strings is compared more than once, so Para-Join also avoids repetitive computation.
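The loop structure behind Theorem 1 — thread j self-joins S_j and cross-joins S_i × S_j only for i < j — can be sketched as below. The similarity predicate is passed in as a parameter; everything else (names, thread pool) is illustrative, not the paper's code.

```python
from concurrent.futures import ThreadPoolExecutor

def para_rr(Sj, tau, sim):
    """Self-join inside one subset."""
    return [(s, r) for i, s in enumerate(Sj) for r in Sj[i + 1:] if sim(s, r, tau)]

def para_rs(Si, Sj, tau, sim):
    """Cross-join between two different subsets."""
    return [(s, r) for s in Si for r in Sj if sim(s, r, tau)]

def para_join(subsets, tau, sim):
    def work(j):
        out = para_rr(subsets[j], tau, sim)   # pairs within S_j
        for i in range(j):                    # pairs across S_i x S_j, i < j only
            out += para_rs(subsets[i], subsets[j], tau, sim)
        return out
    with ThreadPoolExecutor(max_workers=max(1, len(subsets))) as ex:
        return [p for res in ex.map(work, range(len(subsets))) for p in res]
```

Because each unordered subset pair (i, j) is visited by exactly one thread, no string pair is compared twice and none is missed.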

Data partition and similarity computation
The function fqSplit(·) is designed for the data partition. Given a collection of strings S, there exist many methods to split it into small subsets. In this paper, we propose a parallel strategy which achieves the data partition in a shorter period of time. The pseudo-code is illustrated in Algorithm 2. First, the frequency variance of each token in S is calculated. Then S is split into multiple subsets in parallel by the Z-Collapsing algorithm. Each subset S_i is called a joint-token. For each string, its joint frequency vector is calculated, and for each joint-token, the range of the frequency distribution, called the range-frequency, is also calculated. Finally, the function splits the string set S into subsets.
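The first step of fqSplit(·), computing the per-token frequency variance across the strings of S, can be sketched as follows; this is an assumed interpretation of the statistic, with illustrative names.

```python
from statistics import pvariance

def token_frequency_variance(S):
    """For each token (here: character), the population variance of its
    per-string occurrence counts across the string set S."""
    tokens = set().union(*(set(s) for s in S))
    return {t: pvariance([s.count(t) for s in S]) for t in sorted(tokens)}
```

Tokens with similar, low-variance distributions would then be collapsed into joint-tokens by the Z-Collapsing step.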
In Section 2 the position filtering and the extension-based verification methods have been explained in detail. We design the functions posFilter(·) and verification(·) to implement these two methods. If posFilter(s,r,τ) returns false, strings s and r are dissimilar. If posFilter(s,r,τ) returns true, pair <s,r> is added into the candidate set. Function verification(s,r,τ) returns the similarity of strings s and r. In this paper, we develop a pruning strategy that extends the position filtering to remove dissimilar pairs; with this pruning strategy, we obtain a smaller candidate set. Suppose s and r denote two different strings, and v(s) and v(r) denote their interval vectors, respectively. The process is described in two different cases.
• v(s) ≠ v(r). If function posFilter(s,r,τ) returns true and the inequality dis(v(s),v(r)) ≤ 2τ holds, pair <s,r> is added into the candidate set. If pair <s,r> is in the candidate set and verification(s,r,τ) ≤ τ holds, then s is similar to r.

• v(s) = v(r). If function posFilter(s,r,τ) returns true and the inequality dis(v(s),v(r)) ≤ 2τ holds, pair <s,r> is added into the candidate set. If pair <s,r> is in the candidate set and verification(s,r,τ) ≤ 2τ holds, then s is also similar to r.
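The filter-then-verify ordering described above can be sketched as a single pipeline: cheap checks (position filter, vector distance bound) reject pairs before the expensive verification runs. For simplicity this sketch collapses the two cases into the common τ bound on verification; the filter and verifier are passed in as parameters and all names are illustrative.

```python
def dis(v1, v2):
    """L1 distance between two interval vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

def prune_and_verify(s, r, v_s, v_r, tau, pos_filter, verification):
    """Filter-verification pipeline: position filter, then the 2*tau
    vector-distance prune, then the expensive verification step."""
    if not pos_filter(s, r, tau):
        return False                       # rejected by position filter
    if dis(v_s, v_r) > 2 * tau:
        return False                       # pruned: vectors too far apart
    return verification(s, r, tau) <= tau  # verify surviving candidates only
```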

Join operation
Two functions, para-RR(·) and para-RS(·), are designed to perform the join operation. The function para-RR(·) extends the partition-based algorithm and implements the self-join within a subset using the multi-threading technique [2,4]. There are three main steps in para-RR(·):

Step 1: S_i is sorted by string length in descending order.

Step 2: The inverted index L_l^i is built, where l is the string length and i is the index of the string segment.

Step 3: For any two strings, their similarity is calculated as described above. For example, given two strings s and r, para-RR(·) first computes their joint-frequency vectors f_c(s) and f_c(r). If the L1 distance between these vectors is larger than 2τ, the two strings are dissimilar. Otherwise, the pair <s,r> is checked by invoking function verification(·).
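The three steps of para-RR(·) can be sketched as below. In this sketch the inverted index of Step 2 is emulated by the descending-length sort: once a partner string violates the length filter, all later (shorter) strings do too, so the inner loop breaks early. Names and the vector/verifier parameters are illustrative.

```python
def l1(u, v):
    # L1 distance between two joint-frequency vectors
    return sum(abs(a - b) for a, b in zip(u, v))

def para_rr_sketch(Si, vec, tau, verify):
    # Step 1: sort by string length in descending order
    Si = sorted(Si, key=len, reverse=True)
    # Step 2 (emulated): the sorted order acts as a length index
    out = []
    for idx, s in enumerate(Si):
        for r in Si[idx + 1:]:
            if len(s) - len(r) > tau:        # length filter; later strings are shorter
                break
            # Step 3: joint-frequency vector filter, then verification
            if l1(vec(s), vec(r)) > 2 * tau:
                continue
            if verify(s, r, tau):
                out.append((s, r))
    return out
```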
The function para-RS(·) focuses on finding the similar pairs between two different subsets.
Given two different subsets S_i and S_j, let v(S_i) and v(S_j) denote their identifying vectors. If dis(v(S_i),v(S_j)) is larger than 2τ, the two subsets cannot contain any matching pair. Otherwise, para-RS(·) finds the similar pairs by employing the above pruning strategy. For example, given a string r in S_i, for any string t in S_j (l_min ≤ length(t) ≤ l_max), the function first checks whether r and t may be similar via posFilter(·), and then calculates the L1 distance between the joint-frequency vectors of r and t.
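The subset-level short-circuit is what distinguishes para-RS(·): if the two subsets' vectors are more than 2τ apart, the whole cross-product is skipped. A sketch, with a length filter standing in for posFilter(·) and illustrative names throughout:

```python
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def para_rs_sketch(Si, Sj, vi, vj, tau, vec, verify):
    """Cross-subset join with a subset-level prune before any
    per-string work is done."""
    if l1(vi, vj) > 2 * tau:
        return []                              # no pair in Si x Sj can match
    out = []
    for r in Si:
        for t in Sj:
            if abs(len(r) - len(t)) > tau:     # length filter (posFilter stand-in)
                continue
            if l1(vec(r), vec(t)) > 2 * tau:   # joint-frequency vector filter
                continue
            if verify(r, t, tau):
                out.append((r, t))
    return out
```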

Parallel processing in distributed systems
A major limitation of parallel processing with the multi-threading technique is the capacity of a single system, such as limited memory and a bounded number of cores. One solution is to add memory; the other is to run the framework in a distributed cluster environment.

String join algorithm Pada-Join
Hadoop, as a big data processing technology, has been around for over ten years and has proven to be a solution of choice for processing large datasets. MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one map phase and one reduce phase, and developers need to convert every use case into the MapReduce pattern to leverage this model.
Spark allows programmers to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Spark runs on top of existing HDFS infrastructure to provide enhanced and additional functionality.
We propose a parallel string join algorithm called Pada-Join based on Spark. Algorithm 3 shows the pseudocode of Pada-Join, where the bold functions or methods are provided by Spark.
The joint frequency vector f(r) for each string of the given dataset is generated in the filter stage. In order to get the joint frequency vector, we need to obtain the token set according to the token counting algorithm. Algorithm 4 shows the pseudocode of how to compute the token set, where the bold functions or methods are provided by Spark.
If two strings are similar, the distance between their joint-frequency vectors must be less than 2τ. The candidate pairs are produced by taking the Cartesian product of all distinct pairs across the distributed nodes. However, this operation takes a huge amount of memory to store all the pairs distributed across multiple machines. To minimize the number of pairs, the vectors are taken as keys and the string IDs as values, so that strings sharing the same joint-frequency vector are assigned to the same group (lines 5-6 of Algorithm 3). Lines 7-13 of Algorithm 3 illustrate the candidate generation stage. To reduce data communication and data shuffling among the nodes, we store the joint frequency vector groups <f(r), list(rid)> in memory by generating a broadcast variable <f(s),list(sid)>. Then the candidate groups that meet the filtering condition are matched.
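The group-by-vector candidate generation can be illustrated with a pure-Python simulation of the Spark stage (no Spark dependency; the real algorithm uses groupByKey and a broadcast variable). Strings with equal vectors collapse into one group, and only group pairs whose vectors are within L1 distance 2τ survive as candidates. All names are illustrative.

```python
from collections import defaultdict

def candidate_groups(S, vec, tau):
    """Simulated Spark stage: key each string id by its joint-frequency
    vector, group equal vectors, then pair up groups within 2*tau."""
    groups = defaultdict(list)
    for rid, s in enumerate(S):
        groups[vec(s)].append(rid)          # <f(r), list(rid)>
    keys = list(groups)
    cands = []
    for a in range(len(keys)):
        for b in range(a, len(keys)):       # includes a group with itself
            d = sum(abs(x - y) for x, y in zip(keys[a], keys[b]))
            if d <= 2 * tau:
                cands.append((groups[keys[a]], groups[keys[b]]))
    return cands
```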
In the verification phase, rid and sid need to be converted back into strings r and s and then verified. Line 14 does this by joining dataset S with <rid,list(sid)>; the variable sid, coming from the broadcast <f(s),list(sid)>, is generated from <f(r),list(rid)>. The candidate pair <s,r> is then obtained by joining the dataset S with <sid,r> (lines 18-19), and the pairs are matched by calculating their similarity. The output is the final result.
Because Pada-Join and Para-Join share the same algorithmic logic, Pada-Join can also avoid repetitive computation and ensure the completeness of the result.

Join operation in Spark
The following shows the computation flow of the join operation in Spark. 1) Obtaining the token set dynamically by partitioning. In Para-Join, the token set is obtained in advance; in Pada-Join, it is obtained dynamically. Fig 2 shows an instance of obtaining the token set.
After getting the token set, we need to split it into subsets. The partitioning rule is the same as in the Para-Join algorithm, i.e., calculating the frequency distribution and the frequency variance for each token, and then grouping the token set according to the Z-Collapsing algorithm.
2) Obtaining the candidate string pairs by filtering. String pairs that cannot possibly be similar are deleted. The method is the same as in the Para-Join algorithm. Fig 3 shows an instance of obtaining the candidate string pairs.
3) Obtaining the result by verification. The verification process is the same as in the Para-Join algorithm.

Experimental evaluation

Experimental environment
In this section, we evaluate the parallel string join algorithms on real datasets. Four datasets are used in the experiments. All the datasets can be downloaded from http://doi.org/10.5281/zenodo.293041; datasets III and IV can also be downloaded from http://dbgroup.cs.tsinghua.edu.cn/dd/codes/pivotal.tar.gz.
The first two datasets are relatively small and used to test the single-machine algorithms. The detailed information of these datasets is shown in Table 2.
Algorithms Pass-Join, Part-Join, and Para-Join are implemented in Java, and algorithm Pada-Join is implemented in Scala. These algorithms run on three different systems: a multi-core system, a cluster with 4 nodes, and a single machine with the same configuration as a cluster node. The operating system is Ubuntu 12.04 LTS and the JDK version is 1.7.0_71. The detailed information of the systems is shown in Table 3, where system II consists of 4 nodes, which are virtual machines. The virtual machines are created on physical hardware with an Intel i7-4770 3.40 GHz CPU (8 logical cores), 16 GB RAM, and the VMware hypervisor.
We evaluate our framework in two aspects: efficiency and scalability. For efficiency, we evaluate the running time of parallel processing with multiple threads and tasks against the existing algorithms.
For scalability, we evaluate the influence of the number of threads or tasks participating in the computation.

Efficiency analysis
In this section, we compare our algorithms with two existing algorithms, Pass-Join and Part-Join. For Para-Join, the number of threads is set to 8. The similarity threshold τ ranges from 1 to 8. The experimental results are shown in Figs 4 and 5. Because the similarity threshold has a strong influence on the running time, the results are shown in separate figures for varying thresholds.
When the similarity threshold τ is small, there is no big difference in running time among the algorithms. For example, when τ is 1, the running times of the three algorithms on dataset I are 22 s, 25 s, and 23 s, respectively. As the value of τ increases, Para-Join shows more of an advantage. For example, when τ is 8, the running time of Para-Join on dataset I is 49 s while those of the others are 136 s and 114 s, respectively. It maintains the same advantage on dataset II. The main reason is that our algorithm finds the similar pairs in the dataset concurrently by using the multi-threading technique.
When we test dataset II on system II and system III, the running time of Pada-Join is larger than those of the other algorithms. Fig 6 shows the results, from which we conclude that Pada-Join is not suitable for small datasets. When we test datasets III and IV on system III, a memory overflow error occurs for the single-machine algorithms, whereas Pada-Join completes the work successfully. We thus conclude that Para-Join, Pass-Join, and Part-Join are not suitable for big datasets.
For the Para-Join algorithm, the implementation loads the whole input into memory first and then processes it. For Pada-Join, the Spark framework divides the input into several blocks and stores them in HDFS (the Hadoop distributed file system). The size of a block is limited so that it can be loaded into memory; after Spark finishes processing one block, it loads another. The basic differences between Para-Join and Pada-Join are their implementations and the platforms they run on, which is why Para-Join is unable to handle the larger datasets.

Scalability analysis
We have designed two cases to evaluate the scalability of the algorithms Para-Join and Pada-Join.
Case 1. Under the same dataset, we compare the running times by changing the number of threads from 2 to 8 and changing the similarity threshold τ from 1 to 8. The experimental results are shown in Figs 7 and 8.
From the figures, we observe that the running time increases as the value of τ increases. The reason is that, for the same dataset, a larger τ produces more candidate pairs, resulting in more operations in the verification process. However, when the number of threads is large enough, the running time stops decreasing or even becomes larger, because a large number of threads increases the communication overhead.
Case 2. Under the same system configuration, we compare the running times on datasets III and IV by changing the similarity threshold τ from 1 to 8. These two datasets are so large that the other algorithms cannot handle them. As τ becomes larger, the computation becomes more expensive; because the number of candidate pairs grows with the size of the dataset, the running time also increases. The experimental results, shown in Fig 9, indicate that the algorithm performs well on larger datasets.

Related work
There have been many previous studies on efficient solutions to the string similarity join problem.

String similarity functions
String similarity functions are the key to all string similarity join algorithms; they are used to quantify the similarity of two strings. The existing string similarity functions can be roughly divided into two groups: character-based similarity and set-based similarity. Character-based similarity considers the characters of the strings to quantify similarity, e.g., edit distance, Hamming distance, and character n-gram similarity [6,13,14]. Set-based similarity quantifies the similarity based on token sets; these functions include Jaccard similarity, cosine similarity, and Dice similarity [13,14]. Besides the above similarity functions, there are also newer measures, such as the Jaro-Winkler measure and the Hidden Markov Model-based measure.
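The three set-based measures named above have short closed forms over token sets; a minimal sketch, assuming whitespace tokenization:

```python
def jaccard(s, r):
    """|A ∩ B| / |A ∪ B| over whitespace token sets."""
    a, b = set(s.split()), set(r.split())
    return len(a & b) / len(a | b)

def dice(s, r):
    """2|A ∩ B| / (|A| + |B|)."""
    a, b = set(s.split()), set(r.split())
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(s, r):
    """|A ∩ B| / sqrt(|A| * |B|) (binary cosine over token sets)."""
    a, b = set(s.split()), set(r.split())
    return len(a & b) / (len(a) * len(b)) ** 0.5
```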

String similarity join methods
The existing methods for string similarity join can be broadly separated into two categories: those based on the filtering-verification framework and those based on the trie tree. Most existing methods adopt the first approach; they include All-Pairs-Ed, ED-Join, AdaptJoin, Part-Enum, Pass-Join, and Part-Join [2,3,8,15-17]. All-Pairs-Ed is a q-gram-based method; ED-Join improves All-Pairs-Ed with location-based and content-based mismatch filters that decrease the number of grams; and AdaptJoin fundamentally improves prefix filtering for all similarity metrics. Trie-Join and Bed-Tree use a trie tree to do similarity join [12,11]. Along with these methods, many filtering techniques have been proposed, such as count filtering, length filtering, position filtering, prefix filtering, and content filtering [1,2,4,10,17,18]. Additionally, some parallel methods have been proposed for string similarity join, such as bit-parallel methods, MassJoin, and V-SMART-Join [7,18,19].

String similarity search
String similarity search is closely related to string similarity join. First, an index of the string collection is built. When a query is submitted, a large number of dissimilar strings are filtered out according to the given query string, and then the candidate strings are matched against it according to the similarity function [4,11,20].

Parallel processing techniques
There is a lot of work on implementing string join using the MapReduce framework. Vernica et al. proposed a similarity join method using MapReduce that utilizes prefix filtering to support set-based similarity functions [16]. They selected a subset of tokens as signatures and proved that two strings can be similar only if their signatures share common tokens. Afrati et al. proposed multiple algorithms to perform similarity joins in a single MapReduce stage [21] and analyzed the map, reduce, and communication costs. However, for long strings it is rather expensive to transfer the strings in a single MapReduce stage. Kim et al. addressed the top-k similarity join problem using MapReduce [22]. Deng et al. proposed MassJoin, which extends the existing partition-based signature scheme to support set-based similarity functions [11]. In this paper, we take both the multi-threading technology and the multi-tasking technology into consideration and compare them in the string join field.

Conclusions
In this paper, a parallel processing framework for string similarity join is proposed for high efficiency. Algorithm Para-Join, based on the framework, adopts the multi-threading technique and runs on a multi-core system. Algorithm Pada-Join, also based on the framework, adopts the distributed computing technique and runs on distributed systems. Several conclusions follow from the experimental results and analysis. For relatively small datasets, Para-Join provides very good scalability and outperforms state-of-the-art algorithms because it completes the string similarity join computation on one node and avoids the overhead of network communication. However, the availability of single-machine algorithms is limited by memory. For relatively big datasets, Pada-Join shows its advantages because of the good scalability of distributed systems. In the future, we will adopt larger datasets to test the Pada-Join algorithm and improve its performance.