CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment

doi:10.1371/journal.pone.0151064

Fig 1.

System architecture.

CLUSTOM-CLOUD consists of two units, Application and Cluster. Application is composed of Job Tracker and Data Manager. Job Tracker assigns a Task Tracker to each Cluster Node in Cluster and monitors its status. Each Task Tracker processes its distributed task in parallel using multiple threads. Data Manager manages the processed results and generates the clustering results. Cluster is a set of N nodes unified by the in-memory data grid (IMDG) and is composed of Cluster Nodes and Task Trackers. Part of the RAM in each Cluster Node is assigned to the IMDG data structure and a backup area.

Fig 2.

Schematic diagram of clustering workflow.

16S rRNA sequences in FASTA format are provided as input. Each input file, already screened for low-quality reads and chimeras, is pre-processed by removing duplicate sequences and transforming k-mers into numeric values. A fixed number of sequence pairs is distributed to each cluster node for k-mer (initial) and NW (refinement) distance calculation. Processed results are merged upon completion of each unit task. Clusters are determined based on criteria described previously [11] and in the text. Output files are then created and the data are cleared from memory.
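The duplicate-removal step of the pre-processing can be illustrated with a minimal Python sketch. It assumes that "duplicate" means an exact, case-insensitive sequence match, which the caption does not specify; the function names `parse_fasta` and `remove_duplicates` are hypothetical and are not taken from CLUSTOM-CLOUD.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(records):
    """Group read IDs by exact sequence (assumed definition of a duplicate).

    The returned dict keeps one entry per unique sequence, in first-seen
    order, so duplicates can be restored when output files are written.
    """
    groups = {}
    for header, seq in records:
        groups.setdefault(seq, []).append(header)
    return groups


example = ">r1\nACGT\nAC\n>r2\nGGCC\n>r3\nacgtac\n"
unique = remove_duplicates(parse_fasta(example))
```

Only the keys of `unique` would need to enter the distance calculation; the ID lists record which reads each representative stands for.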

Fig 3.

Representation of k-mer transformation method.

The diagram summarizes the layout of the k-mer transformation method. (A) All k-mer strings in the input sequence dataset are loaded into a hash map as keys, each paired with a non-redundant numeric value. (B) Every k-mer in each sequence is replaced with the numeric value stored under the corresponding key in the hash map.
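The two panels can be mirrored in a short Python sketch: first build a hash map from k-mer strings to unique integers, then rewrite each sequence as a list of those integers. The function names, the choice of consecutive integers as the numeric values, and the use of overlapping k-mers are assumptions for illustration only.

```python
def build_kmer_index(sequences, k):
    """(A) Map every distinct k-mer in the dataset to a unique integer."""
    index = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer not in index:
                index[kmer] = len(index)  # next unused integer
    return index


def transform(seq, k, index):
    """(B) Replace each k-mer of a sequence with its numeric value."""
    return [index[seq[i:i + k]] for i in range(len(seq) - k + 1)]


seqs = ["ACGTAC", "CGTACG"]
index = build_kmer_index(seqs, k=3)
encoded = [transform(s, 3, index) for s in seqs]
```

Comparing integers rather than strings is the likely source of the savings the method targets, since each k-mer string is hashed once during indexing instead of on every pairwise comparison.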

Fig 4.

Fine-grained task distribution in CLUSTOM-CLOUD.

The figure summarizes the workflow of distributed processing in CLUSTOM-CLOUD. (A) All possible sequence pairs that need to be compared for distance calculation are represented as a right-angled triangle; n represents the total number of sequences. (B) A chunk size, determined from the system granularity, fixes the number of sequence pairs (2 K in this example) distributed to each cluster node. (C) Each task (e.g., T_i) is assigned to nodes from top to bottom and left to right. (D) Each node takes and processes tasks in order of task priority. (E) The assigned task (T_i) is divided into smaller sub-tasks (t_j) that are processed in parallel by worker threads (w_k), whose number depends on the threads available on the cluster node.
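The chunking in panels (A), (B), and (E) can be sketched in Python as follows. This is an illustration under stated assumptions, not the CLUSTOM-CLOUD implementation: the pair enumeration order, the even split into sub-tasks, and the names `make_tasks`, `split_subtasks`, and `process_task` are all hypothetical, and the node assignment of panels (C) and (D) is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, islice


def make_tasks(n, chunk_size):
    """Cut the n*(n-1)/2 index pairs (i < j) into tasks of at most chunk_size pairs."""
    pairs = combinations(range(n), 2)
    while True:
        task = list(islice(pairs, chunk_size))
        if not task:
            return
        yield task


def split_subtasks(task, n_threads):
    """Divide one task T_i into up to n_threads sub-tasks t_j of near-equal size."""
    if not task:
        return []
    size = -(-len(task) // n_threads)  # ceiling division
    return [task[s:s + size] for s in range(0, len(task), size)]


def process_task(task, distance, n_threads):
    """Compute distance(i, j) for every pair in a task using worker threads w_k."""
    def run(sub):
        return [(i, j, distance(i, j)) for i, j in sub]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(run, split_subtasks(task, n_threads)))
    return [row for part in parts for row in part]


tasks = list(make_tasks(5, 4))  # 10 pairs split into tasks of 4, 4, and 2
result = process_task(tasks[0], lambda i, j: abs(i - j), n_threads=2)
```

Because every task holds the same number of pairs regardless of where it falls in the triangle, nodes receive comparable workloads. In CPython, threads would only speed up the distance function if it releases the global interpreter lock, so the thread pool here stands in for the multi-threading described in the caption.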

Table 1.

CLUSTOM-CLOUD running time for each step according to the complexity of the microbial diversity^a.

Fig 5.

Running time and memory usage evaluation of the k-mer transformation method.

Memory usage (A) and running time (B) were compared with and without the k-mer transformation method, at the k-mer distance calculation step only. Two sets of 100 K 16S sequences were independently and randomly extracted from each of the high-, intermediate-, and low-complexity sequence datasets. For each of the six resulting datasets, running time and memory usage were measured three times independently.

Fig 6.

Running time of the whole process according to the complexity of the microbial diversity.

Running time (A) and memory usage (B) of CLUSTOM-CLOUD were measured by analyzing 50 K, 100 K, 150 K, and 200 K sequences in high-, intermediate-, and low-complexity datasets (3% distance threshold). Measurements were repeated three times per dataset, and the average values are plotted.

Table 2.

Time and cost of running one million reads on CLUSTOM-CLOUD.

Fig 7.

Comparative accuracy test of existing clustering programs.

The clustering accuracy of CLUSTOM, CLUSTOM-CLOUD, DOTUR-AL-PSA, ESPRIT-Tree, mothur-AL-PSA, mothur-AL-MSA, UCLUST, and Swarm was evaluated using 16S rRNA pyrosequences of a mock community constructed by pooling DNA of 21 human-associated prokaryotic strains at even concentration (HMP-Mock-community). Precision, recall, and their F₂ values were used to compare the eight programs at the species (A) and genus (B) levels.
