Swarm: A federated cloud framework for large-scale variant analysis

Genomic data analysis across multiple cloud platforms is an ongoing challenge, especially when large amounts of data are involved. Here, we present Swarm, a framework for federated computation that promotes minimal data motion and facilitates crosstalk between genomic datasets stored on various cloud platforms. We demonstrate its utility via common queries of genomic variants across BigQuery on Google Cloud Platform (GCP), Athena on Amazon Web Services (AWS), Apache Presto, and MySQL. Compared to single-cloud platforms, the Swarm framework significantly reduced computational costs, run-time delays, and the risk of security breaches and privacy violations.

Great feedback - we now plot the standard deviation in the figures that report runtime, and we have added P-values to the experiments, indicating them in the figures where applicable.
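For instance, per-query run-to-run variability can be summarized as a mean and sample standard deviation before plotting error bars; a minimal sketch in Python (the runtime values below are hypothetical, not the paper's measurements):

```python
import statistics

def summarize_runtimes(runtimes_s):
    """Return (mean, sample standard deviation) for repeated runtimes in seconds."""
    return statistics.mean(runtimes_s), statistics.stdev(runtimes_s)

# Hypothetical repeated measurements of one query, in seconds
runs = [12.1, 11.8, 12.4, 12.0, 11.9]
mean_s, std_s = summarize_runtimes(runs)
print(f"{mean_s:.2f} +/- {std_s:.2f} s")  # values for the figure's error bars
```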
2. It was unclear how much time is needed to set up the Swarm environment proposed by the authors. The computational overhead of setting up the platform does not seem to be included in the current analyses. The authors could package the setup code in a single executable or via Docker to expedite and simplify the setup process.
Thanks for the feedback - we created a Dockerized version of the tool (GitHub link: https://github.com/StanfordBioinformatics/swarm/blob/master/Dockerfile).

3. Table 3 compares the average running time for querying an example single nucleotide polymorphism (SNP) using Apache Presto. The precision of the execution time is somewhat limited in the current table. The authors could use `time` or a related Unix command to obtain the exact running time of the query with different numbers of compute nodes. Metrics of variation across runs would be helpful here as well.
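One way to collect such per-run timings is a thin wall-clock wrapper around the query call; a sketch, where `fake_query` is a hypothetical stand-in for an actual Presto client invocation:

```python
import time
import statistics

def time_query(run_query, repeats=5):
    """Run a query callable several times; return per-run wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return samples

# Hypothetical stand-in for a Presto query against a given worker-node count
def fake_query():
    sum(range(100_000))

samples = time_query(fake_query, repeats=3)
print(f"mean={statistics.mean(samples):.6f}s over {len(samples)} runs")
```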
Thanks for the feedback - we reran the experiment and reported the runtimes with higher accuracy in Table 3.

4. The authors could discuss how their proposed framework could accommodate federated machine learning tasks. More and more users are developing machine learning approaches that aggregate the contributions of different genetic variants in relation to outcomes of interest (such as diseases, phenotypes, or other endpoints). It would be interesting to see whether the proposed system not only provides simple summary statistics or query results but also enables the transfer of gradients or other intermediates required for federated learning. This could greatly enhance the potential impact of the proposed cloud computing framework.
Excellent feedback - we added a new feature in Swarm for handling ad hoc computation. As a proof of concept, we also implemented a version of Swarm that supports a federated learning use case. Users can provide an ad hoc Docker image, execute a task on one platform (e.g., training a model), and move the trained model to a second platform. Swarm executes the first task on the first platform, transfers the output model/files to the second platform, and continues the computation by creating a new container there. For this proof of concept, we selected a basic polygenic risk score (PRS) analysis using PLINK, an open-source whole-genome association analysis toolset. Figure 4 shows that the non-PVM (non-preemptible) environment and the N-2 PVMs (preemptible) + 2 non-PVMs setup have similar average execution times, while the monthly cost of the non-PVM environment is higher, since non-PVMs generally cost more.

5. Did the authors experience any preemption when running the experiments with PVMs? If so, how does that affect the computation time and cost? What methods are implemented in Swarm to enable fast resumption of unfinished computation?
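The run-transfer-continue flow described in the response above can be sketched as follows; local temporary directories stand in for the two cloud platforms, and the function names are illustrative, not Swarm's actual API:

```python
import shutil
import tempfile
from pathlib import Path

def run_stage(workdir, input_file, output_name):
    """Stand-in for a containerized task on one platform: read the optional
    input artifact and write an output artifact."""
    out = Path(workdir) / output_name
    prev = Path(input_file).read_text() if input_file else ""
    out.write_text(prev + f"output of {output_name}\n")
    return out

def transfer(artifact, dest_dir):
    """Stand-in for copying the trained model between cloud platforms."""
    dest = Path(dest_dir) / Path(artifact).name
    shutil.copy(artifact, dest)
    return dest

platform_a = tempfile.mkdtemp(prefix="cloud_a_")  # e.g., GCP
platform_b = tempfile.mkdtemp(prefix="cloud_b_")  # e.g., AWS

model = run_stage(platform_a, None, "model.bin")         # train on platform A
moved = transfer(model, platform_b)                      # cross-cloud copy
scores = run_stage(platform_b, moved, "prs_scores.txt")  # continue on platform B
print(scores.name)
```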
Thank you for the feedback. Currently, we do not have a self-healing mechanism in Swarm; one simple approach is to check whether the output files already exist and avoid recomputation, but tracking potentially corrupted files would be a major challenge here. Google BigQuery, AWS Athena, and Apache Presto each have their own approaches to resilience, such as automatic retries to handle brief storage and network availability downtime. For more information on fault tolerance, Google BigQuery states: "In the event of a machine-level failure, BigQuery will continue running with no more than a few milliseconds delay. All queries should still succeed. In the event of a zonal failure, no data loss is expected. Soft zonal failure, such as resulting from a power outage, destroyed transformer, or network partition, is a well-tested path." (https://cloud.google.com/bigquery/docs/availability)

6. The authors specified the compute nodes used for the third experiment (n1-standard-4 on Google Cloud Platform). However, the computing environment for the other experiments was not specified. Different computational environments likely have different performance and cost.
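The existence-check resumption idea mentioned in the response to point 5 can be sketched as follows (with the caveat, noted above, that an existence check cannot detect a partially written or corrupted output):

```python
import tempfile
from pathlib import Path

def run_with_resume(output_path, compute):
    """Naive resumption: skip the task if its output already exists.
    Caveat: this cannot distinguish a complete output from a corrupted one."""
    output = Path(output_path)
    if output.exists():
        return "skipped"
    output.write_text(compute())
    return "computed"

out = Path(tempfile.mkdtemp()) / "result.txt"
first = run_with_resume(out, lambda: "chr1\t10177\trs123\n")
second = run_with_resume(out, lambda: "chr1\t10177\trs123\n")
print(first, second)  # the second call avoids recomputation
```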
Thank you for your accurate feedback. We updated all the figures and specified the instance types, except for the experiments involving BigQuery and Amazon Athena, as both are serverless.

7. Readers may be interested in a quick comparison between the proposed framework and some alternatives. For example, how do the computation time and cost of Swarm compare with a naïve implementation of an SQL database that requires moving all relevant data across platforms?
Great feedback - we added a new experiment using the MySQL RDBMS (Figure 5(a)). For this experiment, we loaded the entire 1000 Genomes dataset without the genotyping columns. The MySQL queries were executed on different n1-standard machine types on Google Cloud Platform. This evaluates execution time; egress of the raw data would cost about the same as on the serverless platforms, assuming the MySQL instance runs in a commercial cloud environment.
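A single-SNP lookup of the kind used in this experiment can be illustrated with a small schema and query; here SQLite stands in for MySQL, and the table layout and rsIDs are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "  rsid TEXT PRIMARY KEY, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [("rs123", "1", 10177, "A", "AC"), ("rs456", "2", 20301, "G", "T")],
)
# Single-SNP lookup by rsID, analogous to the MySQL experiment's queries
row = conn.execute(
    "SELECT chrom, pos FROM variants WHERE rsid = ?", ("rs123",)
).fetchone()
print(row)
```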
We also conducted a new experiment on the performance of Apache Presto when importing CSV (a traditional row-based storage format) versus Parquet (a columnar storage format). To compare the performance of Apache Presto on a CSV file and a Parquet file for an rsID search, we used the same 1000 Genomes dataset without genotyping information. As in the third experiment, the Presto queries were run with different numbers of n1-standard worker nodes on Google Cloud Platform.
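The row-versus-columnar distinction behind this experiment can be shown in miniature: a single-column predicate such as an rsID match needs only one column, which a columnar layout (as in Parquet) lets the engine scan in isolation. A toy sketch with in-memory layouts, not a benchmark:

```python
# Row layout (as in CSV): every record is stored together.
rows = [{"rsid": f"rs{i}", "pos": i, "chrom": "1"} for i in range(10_000)]

# Columnar layout (as in Parquet): each field is stored contiguously.
columns = {key: [r[key] for r in rows] for key in ("rsid", "pos", "chrom")}

def lookup_row_layout(target):
    # Touches every full record even though only one field is filtered on
    return [r["pos"] for r in rows if r["rsid"] == target]

def lookup_columnar_layout(target):
    # Scans only the rsid column, then fetches the matching positions
    hits = [i for i, v in enumerate(columns["rsid"]) if v == target]
    return [columns["pos"][i] for i in hits]

print(lookup_row_layout("rs42"), lookup_columnar_layout("rs42"))
```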
Thank you - we fixed it.
2. For some reason, the GitHub link provided in the manuscript does not work for me. I am not sure whether it requires specific permissions (e.g., within the authors' research group) to access the source code.
Thanks for the feedback - we created a Dockerized version of the tool and improved the documentation on how to configure Swarm (GitHub link: https://github.com/StanfordBioinformatics/swarm).

Review #2
Bahmani et al. describe Swarm, a federated framework for variant analysis. Swarm performs computational analyses on large genomics datasets hosted on different cloud platforms, enabling collaboration within or between organizations and institutions and facilitating multi-cloud solutions. As such, the method is fairly generic and could accelerate discoveries in small teams and larger collaborative studies, including between research and healthcare. It could also reduce the costs of scientific studies, given that data movement is expensive in the cloud world.
Frameworks for federated computation are likely to increase in importance, as are frameworks that promote minimal data motion and facilitate crosstalk between datasets stored on different cloud platforms. This is an important contribution, and I overall liked this manuscript, although I have a couple of remaining questions: