PACC: Large scale connected component computation on Hadoop and Spark

A connected component in a graph is a set of nodes linked to each other by paths. The problem of finding connected components has been applied to diverse graph analysis tasks such as graph partitioning, graph compression, and pattern recognition. Several distributed algorithms have been proposed to find connected components in enormous graphs. Ironically, the distributed algorithms do not scale enough due to unnecessary data IO & processing, massive intermediate data, numerous rounds of computations, and load balancing issues. In this paper, we propose a fast and scalable distributed algorithm PACC (Partition-Aware Connected Components) for connected component computation based on three key techniques: two-step processing of partitioning & computation, edge filtering, and sketching. PACC considerably shrinks the size of intermediate data, the size of input graph, and the number of rounds without suffering from load balancing issues. PACC performs 2.9 to 10.7 times faster on real-world graphs compared to the state-of-the-art MapReduce and Spark algorithms.


Introduction
A connected component in a graph is a set of nodes linked to each other by paths. Finding connected components is a fundamental graph mining task having various applications including reachability [1,2], pattern recognition [3,4], graph partitioning [5,6], random walk [7], graph compression [8,9], etc. However, graphs of interest are enormous with billions of nodes and edges; e.g., 2.4 billion monthly active users form a huge friendship network in Facebook https://newsroom.fb.com/company-info/ and the Web is a gigantic network where more than 1 trillion Web pages are linked to each other by hyperlinks https://googleblog.blogspot. com/2008/07/we-knew-web-was-big.html. Then, how do we efficiently compute all connected components in such massive graphs?
Hadoop and Spark are the de facto standard distributed data processing frameworks composing their own ecosystem with diverse libraries. The ecosystem provides a consistent way to store, process, and analyze data so that multiple types of big-data analyses run on the same data at massive scale on commodity hardware. Several algorithms to find connected components on Hadoop and Spark have been proposed to take the ecosystem's advantages; however, ironically, they do not scale enough due to unnecessary data IO and processing, massive intermediate data, numerous rounds, and load balancing issues.
In this paper, we propose PACC (Partition-Aware Connected Components), a fast, scalable, and distributed algorithm for computing connected components. PACC achieves high performance and scalability by three techniques: two-step processing (partitioning and computation), edge filtering, and sketching. In the two-step processing, the partitioning step divides the graph into subgraphs gradually in several rounds so that the subgraphs can be processed independently of each other in the computation step. PACC distributes workloads evenly to machines, and avoids 'the curse of the last reducer' issue [10], which the most advanced MapReduce algorithm [11] suffers from. We found that, during the partitioning step, most edges arrive early in subgraphs where they should be located eventually. Our proposed edge filtering shrinks the size of intermediate data by excluding every edge that settles down in a subgraph in each iteration. Note that, the edge filtering is due to the two-step processing, and thus, it is not applicable to previous MapReduce algorithms. We also found that even if we replace a subgraph of the input graph with another subgraph that has the same connectivity, the connectivity of the input graph does not change. The sketching reduces the size of the input graph by performing a sequential connected component algorithm on each subgraph of the input graph. This paper makes the following contributions: • Algorithm. We propose PACC, a fast and scalable algorithm for connected component computation in an enormous graph. PACC is made up of three key techniques: two-step processing (partitioning and computation), edge filtering, and sketching. The techniques make PACC distribute workloads evenly, shrink the size of input and intermediate data, and reduce the round number.
• Theory. We theoretically prove various characteristics of PACC and the correctness. We guarantee the input size never increases in every step of PACC.
• Experiment. We evaluate the performance of PACC using real and synthetic graphs. Experimental results show that PACC is up to 10.7 times faster than state-of-the-art MapReduce and Spark algorithms.
This paper is an extended version of [12]; in this paper, we newly propose a sketching technique that improves the performance of the previously proposed method (namely PACC-ef) by reducing the input data size. We provide the detailed proofs of the correctness and the performance of PACC with sketching (Lemmas 4, 6, 8, and 9). The efficacy of sketching is measured also experimentally; PACC with sketching shows up to 3.3 times faster performance on the graphs of Twitter and YahooWeb than PACC without sketching, i.e., PACC-ef. We describe how to implement PACC with sketching on Hadoop and Spark, and measure the impact of the systems on the performance of PACC. The datasets used in this paper and the code of PACC can be found in https://github.com/kmudmlab/PACC. Frequently used symbols are summarized in Table 1.

Single-machine algorithms
Breadth-first-search and depth-first-search are well known graph traversal algorithms that compute connected components in linear time. Patwary et al. [13] propose a multi-core algorithm that is based on the disjoint-set data structure; each set corresponding to a connected component and two sets are unified if an edge links nodes in the two sets. However, these algorithms cannot handle graphs exceeding the size of the main memory as they store the entire graph on the memory. GraphChi [14] and DSP-CC [15] are external algorithms that use external memory such as hard disks to increase the size of processable data in a single machine. GraphChi has an implementation for connected component computation based on iterative message passing process. DSP-CC is an external algorithm for computing connected components based on Union-Find, showing notable speed on graphs with billion nodes and edges by exploiting solid-state drives (SSDs). PACC can use these single-machine algorithms as a module, in other words, PACC can be seen as a tool that enhances single-machine algorithms to be distributed algorithms running on Hadoop and Spark.

Distributed-memory algorithms
Many distributed connected component algorithms are proposed in a parallel random-access machine (PRAM) model, a theoretical parallel processing model. Bader and Cong [16] investigate practical algorithms implemented on symmetric multiprocessors (SMPs), and propose a SMP algorithm that first finds a shallow spanning tree of the input graph and computes (u, v) Edge between u and v.
Partition containing u.
Threshold for the number of input edges.
C Number of chunks.
Representative node of the connected component containing u in G i ; the minimum node among nodes connected to u by a path in G i .
connected components by conducting the depth-first search starting from the spanning tree in parallel. Several graph mining platforms that use distributed-memory such as Pregel [17], GraphLab [18], GraphX [19], PowerGraph [20], PowerLyra [21], Ligra [22], and Gemini [23] provide implementations for computing connected components that are based on iterative message passing process. To achieve high performance, they make several implicit assumptions: (1) the entire graph including replicated nodes fits into the distributed memory (Pregel, GraphLab, GraphX, PowerGraph, PowerLyra, Ligra, Gemini), (2) the set of nodes fits into the memory of a single machine (Gemini), (3) node ids are represented by 32-bit integers and consecutive (GraphLab, PowerGraph, PowerLyra, Ligra, Gemini), and (4) the input file should be in a system-specific format (Ligra, Gemini). These assumptions make the above distributedmemory algorithms not scalable, not seamlessly connected with other data processing, and require complex preprocessing.

MapReduce algorithms
MapReduce [24] is a widely-used distributed computing framework. Exploiting distributed storage, MapReduce provides fault-tolerance of data, high scalability in terms of data and cluster sizes, and an easy-to-use interface. Thanks to the advantages of MapReduce, diverse graph mining tasks such as graph visualization [25,26], subgraph enumeration [27], triangle counting [28,29], and radii/diameter calculation [30] have been researched on MapReduce.
Recently, several MapReduce algorithms have been proposed for computing connected components. A naïve method to compute connected components on MapReduce is to repeat the breadth-first search (BFS) for each connected component. However, this algorithm requires too many rounds; for each connected component, the algorithm takes as many rounds as the diameter of the component. In MapReduce, the number of required rounds significantly affects overall performance. Pegasus [30] and Zones [31,32] conduct the breadthfirst search from every node concurrently so that the number of required rounds is reduced to the largest diameter of connected components in a graph. However, the number of rounds is still large for several enormous datasets; for example, the diameter of the YahooWeb graph is above 30. Hash-Greater-to-Min [33] reduces and guarantees the number of rounds to be O(log |V|) on the number of nodes for computing connected components. Hash-to-Min in [33] does not ensure that the number of round is logarithmic, but is faster than Hash-Greaterto-Min in practice. However, Hash-Greater-to-Min and Hash-to-Min generate massive intermediate data that is more than twice the original graph size, resulting in a severe performance degradation. However, both algorithms spawn large amounts of additional data during the execution process, resulting in significant performance degradation.
Two MapReduce algorithms, two-phase and alternating, proposed in [11] resolve the intermediate data explosion problem of Hash-to-Min. The algorithms guarantee that the intermediate data size of each round is always less than or equal to the input data size. However, the algorithms suffers from the load-balancing problem, which is another performance bottleneck. The alternating algorithm is introduced in more detail in the next section because it is relevant to our work.
The same paper [11] also proposes optimized algorithms of two-phase and alternating. The optimized two-phase algorithm, namely two-phase-DHT, exploits a distributed hash-table (DHT) to decrease the round number. The experimental result of the paper shows that the two optimized algorithms outperform the two-phase and alternating algorithms, while the optimized alternating algorithm is faster than the two-phase-DHT algorithm on graphs with above billion nodes and edges.

Preliminaries
In this section, we introduce the definitions of a connected component and the computation of it. Then, we introduce the alternating algorithm, a MapReduce algorithm for computing connected components.

Problem definition
A connected component is a subgraph whose nodes are connected by paths, and the formal definition is as follows.
Definition 1 Given an undirected graph, a connected component is a node set where nodes have a path to each other and have no path to a node outside the set. The computation of connected components is to detect all connected components in a graph. Equivalent task to connected component computation is to find the minimum reachable node from each node; the minimum reachable node from a node u is the node with the smallest identification number among nodes connected by a path from u. Note that we assume each node has a unique number and the numbers are in a total order. Each minimum reachable node represent a connected component.

The alternating algorithm
The alternating algorithm is one of the two MapReduce algorithms proposed in [11]. The large-star and the small-star are the core operations of the algorithm. Each operation gets a graph as input and generates a new graph as output. The large-star operation outputs an edge (v, m(u)) for each u and for each v 2 Γ + (u). Similarly, the small-star operation outputs an edge

PLOS ONE
(v, m(u)) for each u and for each v 2 Γ − (u) \ {m(u)}. The algorithm repeats the two operations alternately until no more edge is added or deleted.
After several rounds of the alternating algorithm, the input graph is changed into a star graph. It implies that several nodes will have a numerous neighbors while most nodes will have few neighbors. This leads to 'the curse of the last reducer' issue [10] meaning that most computations are performed on a small number of reducer, causing an abnormal increase in the running time of a MapReduce algorithm. In the same paper, the authors try to handle the loadbalancing issue by the optimized alternating algorithm that makes several copies of highdegree nodes and spreads the high-degree nodes' neighbors to the copies. However, the optimized alternating algorithm can increase the number of edges during the process, and the load-balancing issue still remains as we show in our experiments. Note that our method ensures that the number of edges does not increase (see Lemma 5), while resolving the load balancing problem.

Proposed method
In this section, we propose PACC (Partition-Aware Connected Components), a fast, scalable and distributed algorithm for computing connected components. We summarize the challenges in developing a distributed algorithm for computing connected components efficient and scalable, and how PACC addresses the challenges as follows.
1. How can we resolve the load balancing problem, which the alternating algorithm suffers from? Two-step processing of partitioning and computation hinders edges from congregating around a few nodes; thereby, the workloads are balanced across machines.
2. How can we shrinks the input and shuffled data sizes? Filtering out edges that are inside a partition or no longer change rapidly decreases the number of edges every round. Consequently, the shuffled data size is also decreased. Moreover, finding connected components on edge disjoint subgraphs of the input graph (sketching) shrinks the number of edges in the initial graph.
3. How can we minimize the round number? As edge-filtering decreases the number of edges in each round, PACC reduces the round number by performing a single machine computation instead of multiple rounds of distributed computation when the edge number falls below a threshold.
In the following subsections, we first show the overview of PACC, and describe the three core techniques of PACC with theoretical analyses. Then, we discuss the issues of implementation on Hadoop and Spark.

Overview
Fig 2 illustrates the overview of PACC. PACC consists of two main steps (partitioning and computation) and two optimization techniques (edge-filtering and sketching). We define PACC without sketching as PACC-ef, and PACC-ef without edge-filtering as PACC-base, i.e., PACC-base consists only of the two main steps: partitioning and computation. In PACC-base, the partitioning is an iterative operation that transforms the input graph into another graph that has the same connectivity; the transformed graph is divided into ρ subgraphs so that the computation step correctly computes the connected components by processing each subgraph independently by a sequential algorithm. Note that subgraphs can be processed together in the same machine or independently in different machines because partitions are logical divisions of data. In PACC-ef, the edge-filtering eliminates edges that have settled on a subgraph during the partitioning step to shrink the intermediate data size for each iteration. It is worthwhile to mention that the edge-filtering is available due to the two-step processing as the partitioning step focuses only on partitioning rather than computing connected components; thus, the edge-filtering is not applicable to the alternating algorithm discussed in the preliminaries. In PACC, the sketching transforms the input graph into a smaller graph based on the fact that even if a portion of the graph is replaced by another graph that has the same connectivity, the transformed graph keeps the connectivity of the original graph. A similar idea is already applied to DSP-CC [15], an external algorithm, but the idea is not directly applicable to PACC because the idea is prone to load-balancing issues on distributed algorithm. The sketching of PACC carefully converts the input data to avoid load-balancing issues. The following sections describe the PACC-base, PACC-ef, and PACC in turn.

PACC-base: Two-step processing for load-balancing
This section introduces PACC-base, a basic version of PACC without sketching and edge-filtering. We first describe the two main steps of PACC-base, and show how to combine the two steps to compute connected components.
The partitioning step. The partitioning step of PACC-base resolves the load-balancing issue of the alternating algorithm. The key idea is to partition the nodes and to avoid connecting nodes in different partitions by edges. The main cause of the alternating algorithm's loadbalancing issue is that the edges are concentrated on a few nodes after several rounds. A round of the alternating algorithm transforms the graph in Fig    the load-balancing issue. PACC-base divides the nodes into ρ partitions using a hash function ξ: V ! {0, � � �, ρ − 1}. We use ξ(u) to denote the partition of a node u. For a subset S � V of nodes, we define the i-th partition of S by [S] The partitioning step alternately performs two distributed operations, PA-large-star and PA-small-star, which improve the two core operations of the alternating algorithm by taking into account the partition of nodes. While the large-star and the small-star link the neighbors of each node u to the minimum neighbor m(u) of u, the PA-large-star and the PA-small-star link the neighbors of each node u to the 'local' minimum neighbor The global minimum node m(7) of node 7 is node 1, and the local minimum nodes m 1 (7), m 2 (7) of node 7 are nodes 1 and 2, respectively. PA-large-star links node 9 to node 1, and nodes 8 and 10 to node 2 since large neighbors Γ + (7) of node 7 are nodes 8, 9, and 10. PA-small-star links node 7 to node 1, and node 4 to node 2 since small neighbors Γ − (7) of node 7 are nodes 1, 2, and 4. Node 2 is linked to node 1 as node 2 is a local minimum node and node 1 is the global minimum node. Algorithms 1 and 2 show how PA-large-star and PA-small-star is implemented in a distributed manner using MapReduce. Note that both PA-large-star and PA-small-star maintain the connectivity of the input graph (see Lemma 2).
Algorithm 1: PA-large-star Map : input hu;vi 1 emit hu;vi and hv;ui The computation step. The computation step divides the resulting graph of the partitioning step into ρ subgraphs and computes connected components in each subgraph independently. The i-th subgraph is the edge-induced subgraph on the edges incident to the nodes in the i-th partition. For example in Fig 3c, the first subgraph contains nodes 1, 7 and 9, and the second subgraph contains nodes 1, 2, 4, 8, and 10. The computation step independently computes connected components in each subgraph. The computation step is accomplished by a single round MapReduce task; the map step builds the subgraphs, and the reduce step computes connected components in the subgraphs. The map step sends each edge (u, v) to a partition ξ(u) where u > v (without lossing generality as the graph is undirected). Then, a reduce operation receives the edge set E p of the p-th subgraph, and computes connected components from the edge-induced subgraph on E p using a single machine algorithm (LocalCC). In our experiments, Union-Find with the path compression [34] is used for LocalCC. A pseudo code for the CC-Computation is in Algorithm 3.

Algorithm 3: CC-Computation
Independent execution of LocalCC on each subgraph ensures to compute connected components in the original input graph. For a connected component X, the local minimum node min([X] i ) in i-th partition is adjacent to the global minimum node min([X]) in the i-th subgraph, and the other nodes in the i-th partition is adjacent to min([X] i ). Lemma 1. In the resulting graph of PACC-base's partitioning step, the edge-induced subgraph on the edges incident to the nodes in a partition is a star graph where the center is the minimum node in the partition, and the star graph includes the minimum node in the connected component where the center node belongs to.
Putting it together. Algorithm 4 is the pseudo code of the complete PACC-base algorithm. PACC-base performs PA-large-star and PA-small-star alternately until convergence to partition the input graph. The algorithm converges when no edges are added or deleted in a round. Then, PACC-base computes the connected components from the resulting subgraphs using the CC-Computation operation (Algorithm 3).

PACC-ef: Edge-Filtering
We propose PACC-ef that improves PACC-base with edge-filtering. We first show our observation that the input edge size cannot be smaller than the number of non-root nodes while the number of edges changed by the two operations of the alternating algorithm decreases rapidly every round (see Fig 5). This suggests that most edges are just read and written without change.
The lower bound of the input size at each round on the alternating algorithm is proved in Theorem 1.
Theorem 1. In the alternating algorithm, the number of input edges in each round does not fall below |V| − |C| where |V| and |C| are the numbers of nodes and connected components, respectively.
Proof. The last round of the alternating algorithm outputs star graphs, each of that consists of the nodes in a connected component and the center node is the connected component's minimum node, i.e., the output of the final round contains |V| − |C| edges. Besides, the number of output edges monotonically decreases every round by Lemmas 2 and 3 in [11]. As the input of every round except the first round is the output of the previous round, the number of input edges in a round cannot be less than |V| − |C|.
The partitioning step of PACC-ef excludes numerous edges so that the size of input and shuffled data decrease significantly every round. The edge-filtering is possible because the partitioning step focus on partitioning instead of finding connected components. PACC-ef excludes an edge (u, v) in two cases, where u < v without losing generality: Edges of case 1 can be excluded as the goal of the partitioning step is to divide the graph into subgraphs that can be processed independently each other during the computation step. Thereby, each filtered edge in case 1 stays in the partition, and thus, the edges filtered in case 1 form a forest graph where no edge spans partitions. Meanwhile, edges of case 2 together form star graphs, in each of them the center is the minimum node. Such edges no longer change in the subsequent rounds by PA-large-star and PA-small-star. Such edges can be excluded in the next round safely.    foreach v 2 Γ + (u) do 4 emit hv;ui to lcc 5 else 6 foreach The MapReduce version pseudo code of PACC-ef is listed in Algorithm 5. In the partitioning step, PACC-ef alternately performs PA-large-star-opt, and PA-small-star-opt. The pseudo is 'in' edge, and thus, PA-large-staropt stores the edge into lin separately (line 9 of Algorithm 6). In the reduce function of PAsmall-star-opt when the input is hu; Γ(u)i, u is linked to m ξ(u) (u) if u 6 ¼ m ξ(u) (u). If u has no large neighbor, the edge (u, m ξ(u) (u)) is 'in' edge, and thus, PA-small-star-opt stores the edge into sin separately (line 5 of Algorithm 7).
Decreasing the amount of data to read and write, the edge-filtering quickly drops the number of edges so that the input graph becomes small enough to be processed on a single machine after a few rounds. PACC-ef computes connected component using a single machine with LocalCC when the input size is smaller than a threshold τ, instead of running several MapReduce rounds. It saves preparation time for multiple rounds.

PACC: Sketching for shirinking the input graph size
The size of input graphs greatly influences the running time of PACC. We propose a sketching method that shrinks the size of an input graph based on the fact that the connectivity of the input graph is preserved even if a part of the graph is replaced by a graph that has the same connectivity. Sketching generates a graph G@ with fewer edges than the input graph G by two consecutive steps. Then, we use G@ as a new input graph of PACC. We explain the method with an example in Fig 7. The first step of sketching reduces the size of the input graph G while keeping the connectivity of G by computing connected components in edge disjoint subgraphs of G. We denote the reduced graph by G 0 . Details of the first step are as follows. The input graph G is stored as multiple chunks in a distributed storage, where every edge is included in exactly one of the chunks.  Fig 7d shows the edge induced subgraph G 0 of E 0 . Note that the first step is performed in parallel on multiple machines without network communication. The first step of sketching is very similar to the first step of DSP-CC [15], which is an external algorithm. As a result of the first step, the input size is reduced but many edges are concentrated to a few nodes. The edge concentration is not a problem for DSP-CC because DSP-CC is a single machine algorithm, but it causes a load-balancing issue on PACC as PACC is a distributed algorithm. In Fig 7d, for example, most edges are concentrated on node 1 and node 5. The skewness of G 0 causes a load-balancing issue during the partitioning step of PACC as computation of PA-large-star is concentrated on nodes with a large number of neighbors.
The second step of sketching resolves the load-balancing issue by scattering edges concentrated on a small number of nodes while keeping the connectivity of G. The second step generates a new input graph G@ from G 0 generated by the first step. For each node u, the second step replaces a subset of edges from u's neighbors to u with edges between the neighbors.

Analyses
We analyse the correctness and performance of PACC. We first prove the correctness of PACC through Lemmas 2, 3, 4, and Theorem 2.

Lemma 2. PA-large-star and PA-small-star do not break the connectivity of the input graph.
Proof. As proved in Lemmas 1 and 3 of [11], the large-star and the small-star of the alternating algorithm do not break the original connectivity of the initial graph. We show that PAlarge-star and PA-small-star have the same connectivity as the large-star and the small-star, respectively. For a node u and a large neighbor v of u, the large-star links v to m(u). PA-largestar links v to m(u) if v = m ξ(v) (u), or connects v to m(u) through m ξ(v) (u) in the other cases. Thus, PA-large-star also keeps the connectivity of the input graph. For a node u and a small neighbor v of u, the small-star links u and v to m(u). Like PA-large-star, PA-small-star con- u)), or via m ξ(v) (u) in the other cases. Thus, PA-small-star also keeps the connectivity of the input graph.
Lemma 3. The 'in' set, which is the set of edges filtered out by case 1, is a forest (i.e., a set of trees).
Proof. For edge (u, v) where u < v in case 1, u is the only neighbor of v and the edge is filtered out by PACC-ef. Assume that v d was a large neighbor of v in a previous round, and excluded by case 1; v was the only neighbor of v d , meaning that no cycle exists. Thus, the lemma follows.

Lemma 4. The original input graph G and the graph G@ generated by the sketching method have the same connected components.
Proof. By the definition of G 0 i , G i and G 0 i have the same connectivity; that is, if two nodes are connected by a path in G i , they are also connected in G 0 i . As G is the union of G i and G 0 is the union of G 0 i for i 2 {0, � � �, C − 1}, G and G 0 also have the same connectivity. (u, i)-localization does not change the connectivity of G 0 ; after (u, i)-localization, m þ i ðuÞ is still directly linked to u, and v that is not m þ i ðuÞ is reachable via m þ i ðuÞ. As (u, i)-localization does not change the connectivity, localization also does not change the connectivity of G 0 . That is, G, G 0 , and G@ have the same connectivity.
Theorem 2. PACC correctly finds all connected components. Proof. As proved in Lemma 4, the sketching step of PACC does not change the connectivity of the original input graph. We then show the correctness of PACC by proving the correctness of PACC-ef, which is PACC without sketching.
The partitioning step of PACC-ef outputs three edge sets, 'in', 'cc', and 'out', and the union of the edge sets is the input of CC-Computation. Lemma 3 shows that the 'in' set is a forest where a local minimum node in a connected component is the root of a tree. Meanwhile, Lemma 1 shows that all local minimum nodes and the global minimum node of a connected component are linked each other as a star graph in 'cc' and 'out' of the final round. The two lemmas together show that the union of the three sets is a forest where each tree corresponds to a connected component, and the root is the global minimum node; thus, the forest and the original input graph have the same connectivity. CC-Computation independently computes connected components from ρ subgraphs of the forest such that the i-th subgraph is the edgeinduced subgraph on the edges incident to the nodes in the i-th partition. As all local minimum nodes in a connected component are directly linked to the global minimum node in the forest, the nodes of a connected component in i-th partition have paths to the global minimum node, which implies the result of CC-Computation is the connected components of the original graph.
The following lemmas present performance bounds of PACC.

Lemma 5. PA-large-star(-opt) and PA-small-star(-opt) never increases the edge number.
Proof. PA-large-star and PA-small-star does not duplicate any edges, and just change each edge with another one. For each edge (u, v) where u < v, PA-large-star and PA-small-star change the edge with (v, m(u)) or (v, m ξ(v) (u)).

Lemma 6. Localization does not increase the number of edges, that is, |E@|
Proof. Localization replaces each edge to another edge to scatter the edges concentrated on a few nodes. In other words, the edge number does not increase by localization.
Lemma 7. In PACC-ef, the number of input edges into CC-Computation is at most |V| − 1 where |V| is the size of node set.
Proof. The input of CC-Computation forms a forest as shown in Theorem 2. Thus, the maximum number of edges in the forest is |V| − 1. If the hash function partitions the nodes evenly, the expected edge number in a subgraph is (|V| − 1)/ρ.

Implementation
Although we describe PACC using MapReduce primitives for simplicity, PACC can be implemented for various distributed frameworks such as Pregel, Spark, and Pegasus. This section discusses the issues of implementing PACC in Hadoop and Spark, which are the representative distributed computing frameworks.
PACC on Hadoop. Algorithm 9: PA-large-star-opt combiner In PA-large-star-opt (Algorithm 6), the map function emits hu;vi and hv;ui for each edge (u, v) so that all neighbors of each node are gathered into a reduce operation. Then, for each node u, the reduce function links each large neighbor v 2 Γ + (u) to the global minimum node m(u) or the local minimum node m ξ(v) (u) of the partition ξ(v) of node v. We note that small neighbors of u, except the local minimum nodes, are not needed in the reduce function. From this, we reduce the amount of shuffled data by using a combine function that excludes small neighbors that cannot be a local minimum node; a combine function in Hadoop conducts a reduce operation in each machine locally, just before the shuffle step. The combine function for the PA-large-star-opt operation is listed in Algorithm 9. The combine function gets a node u and a subset Γ 0 (u) of its neighbors as input. For each node v 2 Γ 0 (u), if v is a large neighbor of u, we emit hu; vi (lines [4][5]; in other cases, we update m 0 xðvÞ ðuÞ with v (lines 6-7). For each partition i, we emit hu; m 0 i ðuÞi if m 0 i ðuÞ has been updated at least once (lines 8-10). Algorithm 10: PA-small-star-opt combiner PA-small-star-opt (Algorithm 7), similarly, does not need all large neighbors of each node; one large neighbor is enough to check whether the node has a large neighbor or not (lines 4 and 9 of Algorithm 7). Accordingly, we use a combine function that excludes all but one large neighbor of each node in order to reduce the amount of shuffled data. The combine function for the PA-small-star-opt operation is listed in Algorithm 10. The combine function gets a node u and a subset Γ 0 (u) of its neighbors as input. For each node v 2 Γ 0 (u), if v is a small neighbor of u, we emit hu; vi (lines 3-4); in other cases, we update γ with v. If γ has been updated at least once, we emit hu; γi (lines 7-8).
If a graph has a cycle, PACC (and PACC-ef) outputs duplicate edges. PACC can generate plenty of duplicate edges on most real-world graphs which have many cycles. We improve the performance of PACC by eliminating such duplicate edges. Fig 8 shows an example of PACC generating duplicate edges from a cycle. Both PA-large-star and PA-small-star output duplicate edges (1, 3) and (1, 2), respectively. Edge (1, 3) appears twice when performing PA-largestar on node 1 and 2, respectively. Similarly, edge (1, 2) appears twice when performing PAsmall-star on node 2 and 3, respectively. PACC removes duplicate edges in the reduce functions of PA-large-star-opt and PA-small-star-opt. Each reduce function gets a node u and the neighbors Γ(u) of u as input. If edge (u, v) is duplicated, Γ(u) contains multiple v. That is, removing duplicates in Γ(u) means removing duplicate edges. One way to remove duplicates in Γ(u) is using a hash table where one slot contains only one element. However, a hash table is effectively slow and requires O(Γ(u)) space which can exceed the main memory. Instead of using a hash table, we sort and scan Γ(u) on external memory. PACC on Spark. Each round of PACC-ef makes three output sets: 'out', 'in', and 'cc'. In Spark, one can run three filter operations to make three output sets from a single input set as a filter operation holds only results satisfying a condition and discards the others. This approach, however, has two inefficiency issues. One is that this approach reads the same input data multiple times as each filter operation requires processing entire data, and the other is that all intermediate data (i.e., 'in' and 'cc' sets) must remain as RDDs until the end of PACC occupying a large amount of memory. We resolve the two issues by storing the output sets except the 'out' set into a distributed storage system such as HDFS.
The two star operations of PACC process the neighbors of each node together. Although a groupByKey operation in Spark can gather the neighbors of each node, we use a combination of a partitionBy operation and external sorting on each partition instead of group-ByKey because it does not work well with massive data; groupByKey uses memory-based hash tables which assume that data are evenly distributed and fit into memory. Also, even if some data are evenly distributed in memory, hash tables slow down as the data size increases because hash tables frequently require random access accompanying high cache miss rate. Contrary to groupByKey, partitionBy does not use a large hash table because it gathers the neighbors of each node according to the partition, rather than the node. In other words, the neighbors of a node are still mixed up with other nodes' neighbors in a partition as a set of edges. We gather the neighbors by sorting the edges by the source nodes, then also by the destination nodes to remove duplicate edges. As the size of a partition cannot fit into memory, we use an external sort algorithm. The combination of partitionBy and external sorting significantly increases the scalability and efficiency of PACC on Spark.

Experiments
This section evaluates PACC. The experimental settings are designed to answer the following questions. The memory size for a container is 7GB so that one slave can run 4 mappers or 4 reducers concurrently. All running times shown in this paper include all the time to load data and write the results from/to HDFS, the distributed storage used in the experiment.

Results
Efficacy of Partitioning.    LocalCC after initialization that simply transforms the input files into binary files. On the other hand, the running time of PACC when τ � 2 × 10 9 is not that high compared to the optimal running time even if PACC also runs LocalCC right after sketching, thanks to the reduced input data size by sketching.
Efficacy of sketching. Sketching boosts the performance of PACC by reducing the input data size. Fig 12 shows how much sketching reduces the size of input data. The data reduction ratio is defined as |E|/|E@| where |E| is the number of edges in the original graph and |E@| is the number of edges in the graph generated by sketching. As the upper bound of the data reduction ratio is the average number |E|/|V| of neighbors, as shown in Lemma 8, we show the correlation between the average number of neighbors and the data reduction ratio of sketching. The dashed line depicts the upper bound. Various data reduction ratios are measured according to the input data; sketching has the best effect on Twitter (data reduction ratio = 11.1) and has the least effect on Friendster (data reduction ratio = 1.7) among real-world graphs. For relatively small datasets such as Patent, Skitter, and LiveJournal, the data reduction ratios are almost close to the upper bound; this is because most edges are included in few chunks. On the other hand, the data reduction ratios are arbitrary for large datasets: Friendster, Twitter, Sub-Domain, YahooWeb, and ClueWeb12.
Scalability. Fig 13a shows the running time of PACC, Alt, and Alt-opt on RMAT-S graphs with various sizes. PACC shows the best scalability, which is almost linear, on data size as well as the shortest running time for every RMAT-S graph. Sketching not only reduces the running time of PACC but also increases the scalability on data size; when sketching is included, PACC has a more gradual slope in Fig 13a. This is because sketching has a positive effect on load-balancing. Without sketching, a load balancing problem may occur for the first PA-large-star-opt Results on real-world graphs. Fig 14 shows the running time of PACC, Alt, and Alt-opt on real world graphs. We also show the running time of PowerGraph, a representative distributed-memory graph processing system, and Union-Find, a representative single machine algorithm, for reference. The proposed algorithm PACC presents the best performance on large graphs, showing up to 10.7 times faster performance than Alt-opt on YahooWeb. Only PACC and PACC-ef succeed in processing ClueWeb12, which is the largest graph used in this paper with 6 billion nodes and 72 billion edges. On small graphs such as Skitter and Patent, all the distributed algorithms on Hadoop have longer running time even than the single machine algorithm; it is because the preparation time for each iteration on Hadoop dominates the entire running time when the graph is small. By the same reason, PowerGraph is faster than all hadoop algorithms, but PowerGraph fails on large graphs (YahooWeb and ClueWeb12) because of an out-of-memory error. For each dataset, the differences in running time between PACC and PACC-ef, and between PACC-ef and PACC-base indicate the effects of sketching and edge-filtering, respectively. The result shows that the decrease in running time by sketching is highly correlated with the data reduction ratio in Fig 12; the running time gaps between PACC and PACC-ef are much bigger on LiveJournal, Twitter, and YahooWeb, whose data reduction ratios are high, than on Friendster and SubDomain, whose data reduction ratios are low. On small graphs, the effect of sketching is insignificant because of the preparation time of Hadoop as mentioned. Meanwhile, edge-filtering tends to be effective on sparse graphs with small average numbers of neighbors such as Skitter, Patent, and YahooWeb, as edge-filtering reduces the round number significantly; the round number required by each algorithm is presented in Fig 15. CC-computation and sketching of PACC are counted as separate rounds, respectively. The number of required round by PACC is smallest compared to the other tested algorithm due to the edge-filtering.
Hadoop vs Spark. We test PACC, Alt, and Alt-opt on Spark and compare the running times with those on Hadoop.  Missing algorithms for some datasets mean they failed to run on the datasets. Only PACC and PACC-ef succeed in processing ClueWeb12, which consists of 6 billion nodes and 72 billion edges. PACC outperforms Alt-opt showing from 2.9 (on Friendster) to 10.7 (on YahooWeb) times faster performance. Details on the datasets are given in Table 2. https://doi.org/10.1371/journal.pone.0229936.g014 On small graphs (Skitter, Patent, and LiveJournal), PowerGraph has the shortest running time. On large graphs (Friendster, Twitter, and SubDomain), PACC shows similar or better performance even than PowerGraph. Compared to Alt-opt, PACC is 2.3-8.8 times faster on all tested graphs. Alt and PowerGraph fail on YahooWeb because of out-of-memory errors. Fig 17 shows speedup over Hadoop on real world graphs. Regardless of algorithms and graphs, the running time on Spark is shorter than that on Hadoop in most cases. The reason is that Spark has shorter preparation time for each iteration than Hadoop does. As Hadoop requires constant preparation time for each iteration, the total preparation time is proportional to the number of rounds. Meanwhile, the total running time increases with the graph size. This means that the preparation time takes a large part of the overall running time if a method requires many rounds while the graph is small; in this case, Spark significantly reduces the total running time, as the preparation time of Spark is much shorter than that of Hadoop. Thus PACC-base,  Table 2. PACC requires the smallest number of rounds.
https://doi.org/10.1371/journal.pone.0229936.g015 Alt, and Alt-opt, which require many rounds, get a lot of benefit from Spark when the graph is relatively small (Skitter and Patent). Meanwhile, PACC and PACC-ef have few benefits since they require fewer rounds due to edge-filtering. Despite the relatively few benefits from Spark, PACC still shows the best performance on Spark as well as on Hadoop when the graph is large.

Conclusion
Connected component computation is a fundamental problem in the field of graph mining with various applications such as pattern recognition, graph partitioning, and graph compression. Meanwhile, the existing MapReduce algorithms struggle in the enormity of the graphs with billions of nodes and edges. In this paper, we propose PACC, a fast and scalable algorithm for connected component computation in such large graphs. PACC increases its scalability and performance by three techniques: two step processing of partitioning & computation, edge filtering, and sketching. The two step processing allows PACC to distribute workloads evenly, the edge filtering shrinks the size of intermediate data, and the sketching significantly reduces the input data size. We implement PACC on Hadoop and Spark. The running time of PACC on Spark is shorter than that on Hadoop in most cases; the result is consistent with existing researches that Spark is more suitable for iterative tasks than Hadoop. Regardless of the underlying platform, our experimental results show that PACC shows the best performance on real graphs, with 10.7× faster speed than the most recently proposed MapReduce algorithm.