Swarm: A federated cloud framework for large-scale variant analysis

doi:10.1371/journal.pcbi.1008977

Fig 1.

Swarm Framework: The Swarm architecture enables federated computation on genomic variants.

Swarm classifies variant queries into two main categories. “Stat Query” handles all queries that do not require moving data, and returns statistics such as counts of matched records and allele frequencies. “Data Query” handles queries that involve moving a set of records to another computing platform for further processing. The figure uses AWS Athena and GCP BigQuery as example platforms.
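The distinction between the two query categories can be illustrated with a minimal Python sketch. It is not Swarm's implementation: the in-memory record layout, the function names `stat_query` and `data_query`, and the placeholder identifier `rs_other` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical in-memory variant records: (rsID, diploid genotype) per sample.
# Genotypes follow the VCF convention: 0 = reference allele, 1 = alternate allele.
RECORDS = [
    ("rs671", (0, 1)),
    ("rs671", (0, 0)),
    ("rs671", (1, 1)),
    ("rs_other", (0, 0)),  # placeholder identifier, not a real rsID
]


def stat_query(records, rsid):
    """Return only summary statistics; no records leave the platform."""
    matched = [gt for rid, gt in records if rid == rsid]
    alleles = Counter(a for gt in matched for a in gt)
    total = sum(alleles.values())
    alt_freq = alleles[1] / total if total else None
    return {"matched_records": len(matched), "alt_allele_frequency": alt_freq}


def data_query(records, rsid):
    """Return the matching records themselves, ready to be moved elsewhere."""
    return [(rid, gt) for rid, gt in records if rid == rsid]
```

In this sketch a Stat Query returns a small dictionary of aggregates, whereas a Data Query returns the matched rows, which is the part that would incur data transfer between platforms.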


Table 1.

Variants used for testing Stat Queries.


Table 2.

Databases used for annotating the 1000 Genomes data sets in this study.


Fig 2.

Runtime and amount of data processed for computing allele frequency for an input set of rsIDs in BigQuery and Athena.

Average values and standard deviations were plotted. (A) depicts the average execution time in seconds. The light blue and light green bars represent configurations without any optimizations (i.e., the entire input data was used as is), and the dark blue and dark green bars represent configurations with optimizations (i.e., the input data was divided by partitioning or clustering). (B) shows the amount of data processed in megabytes; the y-axis is on a logarithmic scale. Significant differences between groups are indicated on top of the bars (two-sample t-test). Note that for each rsID experiment, differences in runtimes between any BigQuery and Athena runs in (A) were highly significant (P < 1e-5), and for (B), differences within the BigQuery or Athena runs were also highly significant (P < 1e-5).
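For readers who want to reproduce this kind of comparison on their own runtime measurements, the following sketch computes a two-sample t statistic using only the Python standard library. The caption does not say which variant of the test was used; Welch's unequal-variance form is assumed here, and the function name `welch_t` is illustrative. Converting the statistic to a P value requires the t distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples.

    `a` and `b` are sequences of repeated measurements (e.g. runtimes in
    seconds), each with at least two values and non-zero variance overall.
    """
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```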


Fig 3.

Runtime and the amount of data processed for annotating an input set of genes.

A, V and J stand for Annotation records, Variant records and Join table operations, respectively. Average values and standard deviations were plotted. (A) depicts the execution time in seconds for the two input genes. In this experiment, the annotation table was in BigQuery and the variant table in Athena. Swarm therefore first found all the annotation records in BigQuery that overlapped with the input gene regions, compressed them and moved them to Athena. Then, on the Athena side, Swarm decompressed the overlapping annotation data and created a temporary table, which was then joined with the existing variant table. The light blue and light green bars represent the configurations without any optimizations by partitioning or clustering, and the dark blue and dark green bars represent the configurations with optimizations. (B) shows the amount of data processed in megabytes; the y-axis is on a logarithmic scale. Significant differences between groups are indicated on top of the bars (two-sample t-test). Note that for (A), differences between any BigQuery and Athena groups were highly significant (P < 1e-5), and for (B), differences within the BigQuery or Athena groups were also highly significant (P < 1e-5).
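The sequence of steps described in this caption (select overlapping annotations, compress, move, decompress into a temporary table, join) can be sketched in plain Python. This is an illustration under stated assumptions, not Swarm's code: the two platforms are replaced by in-memory lists, the transfer is a `bytes` payload, and the record fields (`chrom`, `start`, `end`, `pos`, `label`) and function names are hypothetical.

```python
import gzip
import json


def overlaps(rec, start, end):
    """True if the record's interval intersects the closed interval [start, end]."""
    return rec["start"] <= end and rec["end"] >= start


def export_annotations(annotations, chrom, start, end):
    """Source side: select overlapping annotation records and compress them."""
    hits = [r for r in annotations if r["chrom"] == chrom and overlaps(r, start, end)]
    return gzip.compress(json.dumps(hits).encode("utf-8"))


def import_and_join(payload, variants):
    """Destination side: decompress into a temporary table, then join with variants."""
    temp_table = json.loads(gzip.decompress(payload).decode("utf-8"))
    joined = []
    for v in variants:
        for a in temp_table:
            # A variant joins an annotation when its position falls inside it.
            if v["chrom"] == a["chrom"] and a["start"] <= v["pos"] <= a["end"]:
                joined.append({**v, "annotation": a["label"]})
    return joined
```

Only the annotation records that overlap the input region are serialized and moved, which is why the amount of data transferred depends on the gene regions queried rather than on the size of the full annotation table.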


Table 3.

Average execution time for querying rs671 with the binID on the partitioned Parquet files of one half of the 1000 Genomes dataset using Apache Presto, with different numbers of worker nodes.

Each configuration additionally includes one master node.
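The role of a bin identifier in a partitioned lookup can be sketched as follows. The bin width, the way the binID is derived from the genomic position, and all names in the block are assumptions for illustration; the caption does not specify how Swarm defines its bins.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # assumed bin width in base pairs; not taken from the source


def bin_id(pos, bin_size=BIN_SIZE):
    """Map a genomic position to the index of the fixed-width bin containing it."""
    return pos // bin_size


def build_partitions(records):
    """Group records by (chromosome, binID), mimicking partitioned files."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[(rec["chrom"], bin_id(rec["pos"]))].append(rec)
    return partitions


def lookup(partitions, chrom, pos, rsid):
    """Scan only the partition holding `pos`; return (matches, rows_scanned)."""
    rows = partitions.get((chrom, bin_id(pos)), [])
    return [r for r in rows if r["rsid"] == rsid], len(rows)
```

Supplying the binID with the query lets the engine skip every partition except one, so the number of rows scanned is bounded by the size of a single bin rather than by the size of the dataset.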


Fig 4.

Execution time for searching rs671 with different numbers of worker nodes running Apache Presto on Dataproc.

(A) The average runtime using partitioning versus ignoring partitioning in Apache Presto. (B) The average runtime using preemptible (PVM) and non-preemptible (Non-PVM) instances. Average values and standard deviations were plotted. (C) The projected monthly cost of reserving the dedicated nodes on GCP. Costs are as of February 2021 (https://cloud.google.com/compute/all-pricing). Note that serverless systems such as BigQuery and Athena instead charge users based on the amount of data processed. In (A), differences between the paired groups with and without partitioning were highly significant (two-sample t-tests, P < 1e-5). In (B), differences between the paired Non-PVM and PVM groups were not significant, although the P values were marginal (close to 0.05).
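The trade-off between a fixed monthly cost for dedicated nodes and per-query charges on a serverless system reduces to simple arithmetic, sketched below. All parameters are placeholders to be filled in from current provider pricing; no prices from the source or from any provider are assumed, and the function names are illustrative.

```python
def serverless_monthly_cost(tb_per_query, queries_per_month, price_per_tb):
    """Monthly bill when charges depend only on the amount of data processed."""
    return tb_per_query * queries_per_month * price_per_tb


def break_even_queries(cluster_monthly_cost, tb_per_query, price_per_tb):
    """Number of queries per month at which both pricing models cost the same."""
    per_query = tb_per_query * price_per_tb
    if per_query <= 0:
        raise ValueError("per-query cost must be positive")
    return cluster_monthly_cost / per_query
```

Below the break-even query volume the pay-per-data model is cheaper under these assumptions; above it, reserving dedicated nodes costs less. Reducing the data scanned per query, for example through partitioning, raises the break-even point.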


Fig 5.

Searching rs671 in the 1000 Genomes dataset loaded in (A) MySQL and (B) Apache Presto with different settings, i.e., varying numbers of vCPUs and main memory sizes. Average values and standard deviations were plotted. For the Apache Presto runs in (B), runtimes with CSV input and Parquet input were compared, and significant P values are indicated (two-sample t-tests). In addition, ANOVA tests indicated that the number of worker nodes had a significant impact on the runtimes (P < 1e-5).
