CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce

Background: Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce.
Results: We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards, i.e., Deploy, Operate, and Extend wizards. Deploy wizard is designed to aid the system administrator to deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and initiates the service automatically. Operate wizard allows the user to run a MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using Extend wizard.
Conclusions: CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the source code of CloudDOE to further incorporate more MapReduce bioinformatics tools into CloudDOE and support next-generation big data open source tools, e.g., Hadoop BigTop and Spark.
Availability: CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/.


Prerequisites of CloudDOE
CloudDOE is an open source project released under the Apache License 2.0. It installs Java runtime environment version 1.6 and deploys a Hadoop cloud with Apache Hadoop [1] version 0.20.203. Binaries and source code of CloudDOE are hosted on the project website (http://clouddoe.iis.sinica.edu.tw/) and on GitHub (https://github.com/moneycat/CloudDOE). Demo videos are also available on the website and are provided as supplementary materials of the CloudDOE paper. The following are steps to install and run CloudDOE.
(1) Make sure that Java runtime environment (JRE) version 1.6 or later is installed on the local computer on which you are going to run CloudDOE.
For interested developers, building CloudDOE from source code involves different requirements, as follows.
(1) Java development kit (JDK) and JavaFX runtime environment (jfxrt) are required. JDK version 1.7 or later is recommended for building CloudDOE, since jfxrt is included in the JDK release. Otherwise, users should download the correct version of the jfxrt package from Oracle and integrate the package into the development environment manually.
(2) Integrated development environment (IDE) tools, e.g., Eclipse, may be useful for building CloudDOE. Users can import the source code into the IDE, or use the native Java command-line tools, i.e., javac and jar, to build CloudDOE.

Machine information is necessary for deploying a Hadoop cloud, and can be obtained from cloud service providers or system administrators when requesting computing resources. Table S2 lists the information necessary for deploying a Hadoop cloud, together with the Hadoop role types that CloudDOE assigns to each machine. In Table S2, the firewall and network connection information is simulated from real Microsoft Azure [2] virtual machine data. This simulated data represents a general environment of a private (or hybrid) cloud, since some of the machines are behind a network address translation (NAT) layer, as shown in Figure S1. In Table S2, the external Internet protocol (IP) address of machine A is used for communication between CloudDOE and a Hadoop cloud, whereas communications within Hadoop services go through internal IP addresses. In this scenario, users must contact service providers or system administrators for assistance before deploying a Hadoop cloud, to ensure that the necessary service ports are not blocked by firewalls.
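Before contacting an administrator, it can help to check which service ports are reachable from the local computer. The following is a minimal sketch of such a check and is not part of CloudDOE. The port numbers in EXAMPLE_PORTS are assumptions: they are commonly used defaults for SSH and for the Hadoop 0.20/1.x web interfaces, not values taken from Table S2, and the function names are hypothetical.

```python
import socket

# Assumed example ports (not taken from Table S2): SSH and the commonly used
# default web interface ports of the NameNode and JobTracker in Hadoop 0.20/1.x.
EXAMPLE_PORTS = {22: "SSH", 50070: "NameNode web UI", 50030: "JobTracker web UI"}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port in `ports` to True (reachable) or False (not reachable)."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

Calling check_ports with the external IP address of the master and the keys of EXAMPLE_PORTS returns a dictionary of reachable and unreachable ports. A port reported as unreachable may be blocked by a firewall, or the service may simply not be running yet.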

Management of configuration changes
CloudDOE creates and modifies necessary files of a Hadoop cloud during deployment. Because these files may be accessible only to administrators, we recommend deploying the Hadoop cloud with an account that has super-user privileges, i.e., a user in the wheel group. Table S3 lists the files and directories that CloudDOE may affect. For each installation, CloudDOE renames the original configuration files with a timestamp and saves them in the same directory. The saved files can be used to restore the system to its state before deployment. An interested user can retrieve the deployment history through the console.
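The backup scheme described above can be sketched as follows. This is an illustrative sketch, not CloudDOE's actual code: the function names and the timestamp format (YYYYMMDDhhmmss appended after a dot) are assumptions, since the source only states that files are renamed with a timestamp and kept in the same directory.

```python
import shutil
import time
from pathlib import Path


def backup_with_timestamp(path, now=None):
    """Rename `path` to `<name>.<timestamp>` in the same directory.

    The timestamp format is an assumption made for this sketch.
    `now` is seconds since the epoch; None means the current time.
    """
    src = Path(path)
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    dst = src.with_name(f"{src.name}.{stamp}")
    src.rename(dst)
    return dst


def restore_backup(backup, original):
    """Copy a saved file back over the original location."""
    shutil.copy2(backup, original)
    return Path(original)
```

Because each installation produces a separately timestamped copy, earlier states remain available after repeated deployments.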

Methods to deploy a Hadoop cloud with different releases
CloudDOE downloads and installs Apache Hadoop version 0.20.203 by default. An interested user can download the development branches, or manually change the Hadoop release configuration, to deploy a cloud with a different Hadoop release. Table S4 lists the configurations, which are located in the file workspace/bin/32_install_hadoop.sh.
In the current CloudDOE release, the deployment function has been tested with the most common Apache Hadoop releases, i.e., versions 0.20.203 and 1.2.1. Support for deploying Hadoop version 2 is also released as a development branch. We hope that interested users and communities will join us to develop and examine the deployment procedure with different Hadoop distributions, e.g., Cloudera CDH [3], Hortonworks [4], and MapR Hadoop [5], and with different releases.

A Prototype for Extending the Functionalities of a Hadoop Cloud
The functionalities of a Hadoop cloud are greatly enhanced by installing additional tools or platforms. For example, Hadoop-related projects extend the abilities of a cloud by forming a distributed database [6,7] and supporting ad-hoc queries [8], or by providing a high-level platform and its corresponding language for creating MapReduce programs [9]. Although these functionalities add value for bioinformatics [10,11], installing them requires specific knowledge, similar to that needed for deploying a Hadoop cloud. To reduce the obstacles encountered in enhancement, we built Extend wizard, an extension management center of a Hadoop cloud. Extend wizard is designed as a dashboard of collected cloud computing plug-ins, e.g., tools and platforms. It provides an approach for users to install plug-ins onto a Hadoop cloud. We demonstrated the features of Extend wizard with Cloudgene [12], a web-based MapReduce execution platform for bioinformatics. In summary, users can browse and install plug-ins, and monitor the installation progress through the wizard. Furthermore, program providers and developers can incorporate their programs into our management center by writing suitable installation scripts. Figure S2 shows screenshots of Extend wizard. Because the demand for new MapReduce tools differs between laboratories, community efforts are required to maintain a repository of MapReduce tools and Hadoop-related add-ons with metadata and installation scripts. We will continue to provide fundamental support toward this aim in Extend wizard.

The extension management center of Extend Wizard
The extension management center embeds a web browser component to handle interactions between a user and the cloud. Thus, HTML and JavaScript libraries (i.e., jQuery and jQuery UI) are utilized to organize a web-based interactive dashboard. An installation request launched from JavaScript is translated into the proper installation information, including the location of installation scripts and technical parameters, and passed to server-side agents through Extend wizard. Therefore, installation procedures are controlled by the agents, and users can monitor installation progress with the same mechanisms used for deployment.

Interactions between CloudDOE Components
Figures S3 and S4 show the interactions between CloudDOE, a Hadoop cloud, and the Internet when manipulating a Hadoop cloud through CloudDOE. The procedures of deploying and extending a cloud with CloudDOE have similar steps, which are shown in Figure S3. Step 1 is the only step that requires users to provide information for installation, i.e., the IP address and user credentials of each computer. CloudDOE then uploads installation-related resources to the master automatically (Step 2), and initiates installation procedures (Step 3). Required software, i.e., Java and the Hadoop distribution, is requested and downloaded from the Internet (Step 4) during installation. Users can monitor installation progress through CloudDOE (Step 5). Figure S4 demonstrates the interactions of operating a MapReduce program with CloudDOE. There are 3 major steps for users to perform an execution after successfully connecting to the master of a Hadoop cloud, as follows.
Step 1 imports data to the Hadoop cloud for analysis. Steps 2 and 3 set up program parameters and submit the execution request to the master of the Hadoop cloud. Users can monitor the progress during execution (Step 4), and download results from the Hadoop cloud after successful execution (Step 5).
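For readers familiar with the Hadoop command line, the import, submit, and download steps roughly correspond to the standard `hadoop fs -put`, `hadoop jar`, and `hadoop fs -get` or `-getmerge` commands. The sketch below only builds these command lines and does not run them. The source does not state which commands CloudDOE issues internally, so this mapping is an assumption, and the function names and example paths are hypothetical.

```python
def import_command(local_path, hdfs_dir):
    """Step 1: copy local input data into the Hadoop file system."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def submit_command(jarfile, arguments):
    """Steps 2 and 3: run a MapReduce jar with its program arguments."""
    return ["hadoop", "jar", jarfile, *arguments]


def download_command(hdfs_src, local_dst, merge=False):
    """Step 5: fetch a result, optionally merging a directory into one file."""
    action = "-getmerge" if merge else "-get"
    return ["hadoop", "fs", action, hdfs_src, local_dst]
```

Each function returns an argument list that could be passed to `subprocess.run` on a machine where the `hadoop` command is available.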

Schema of the Program Integration Files
CloudDOE utilizes a structured extensible markup language (XML) file for program integration. A configuration file consists of 4 main blocks, i.e., program, parameters, logs, and downloads. Each main block contains multiple configuration fields or blocks, as shown in Figure S5 and described as follows.
(1) The program block contains general program information.
• The <name>, <author>, <version>, and <website> fields indicate the program name, author of the program, program release version, and website, respectively.
• The <jarfile> field specifies the file name of the target MapReduce file.
• The <streaming> field is an optional field for streaming mode. CloudDOE may execute the target program in streaming mode if the value is True.
• The <lastupd> field contains the last update timestamp of this configuration file.
• The <argformat> field is the most important field; it defines the format of the argument list used by the target program. Each argument in the format list is represented as a variable, e.g., $input, $output, $work, and $value, which is mapped to the <type> field of a parameter. CloudDOE fills in the variables and passes them to the target program.
(2) The parameters block contains multiple parameter blocks. A valid CloudDOE integration configuration file must contain at least one input, output, and work parameter for a target program. A parameter block consists of the following fields.
• The <label> field represents the display label of the parameter input area on the CloudDOE interface.
• The <editable> field controls whether the parameter input area can be edited by users.
(3) The logs block may contain multiple log blocks. A log is produced from the standard output of the program and is used to monitor the execution. It can also be downloaded for further analysis. A log block is composed of the following fields.
• The <name> field is the log file name generated by the target program. CloudDOE uses the value of the work parameter as the prefix of the log file, and thus can locate the log for further monitoring.
• The <type> field is used to determine the major log when there are multiple log files of the target program.
(4) The downloads block may contain multiple download blocks. A download block corresponds to a result produced by the target program. Each download block consists of the following fields.
• The <src> field is the name of the target result. The target result may be a file or a directory containing multiple files. CloudDOE uses the value of the output parameter as its parent directory.
• The <dst> field represents the name used to store the result on the local computer.
• The <merge> field is an optional field that indicates the download method. The target result will be merged into a single file if the value is True.
Since composing a valid CloudDOE program integration configuration for a MapReduce program may still be complicated for authors, we hope that interested users and communities will join us to improve the integration method.
Figure S5. Format of the program integration configuration file of CloudDOE. A configuration file is defined in extensible markup language (XML), and contains 4 main blocks, i.e., program, parameters, logs, and downloads. There are multiple configuration fields or blocks in each main block.
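To make the schema more concrete, the following sketch parses a small configuration file and fills in the argument format. It illustrates the ideas described above and is not CloudDOE's own parser. Several details are assumptions made for this example: the root element name <config>, the singular element names <parameter>, <log>, and <download>, the <value> field of a parameter, the log type value "major", the use of "/" to join the work and output values with file names, and all sample values. The authoritative layout is the one in Figure S5.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical configuration; element names beyond those named in the text
# (<config>, <parameter>, <log>, <download>, <value>) are assumptions.
EXAMPLE_CONFIG = """\
<config>
  <program>
    <name>ExampleTool</name>
    <jarfile>example-tool.jar</jarfile>
    <argformat>$input $output $work $value</argformat>
  </program>
  <parameters>
    <parameter><label>Input</label><type>input</type><value>reads.fa</value></parameter>
    <parameter><label>Output</label><type>output</type><value>out</value></parameter>
    <parameter><label>Work</label><type>work</type><value>work</value></parameter>
    <parameter><label>K</label><type>value</type><value>21</value></parameter>
  </parameters>
  <logs><log><name>run.log</name><type>major</type></log></logs>
  <downloads>
    <download><src>result</src><dst>result.txt</dst><merge>True</merge></download>
  </downloads>
</config>
"""

REQUIRED_TYPES = {"input", "output", "work"}


def load_parameters(root):
    """Map each parameter <type> to its value and check the required types."""
    params = {p.findtext("type"): p.findtext("value", default="")
              for p in root.findall("./parameters/parameter")}
    missing = REQUIRED_TYPES - params.keys()
    if missing:
        raise ValueError(f"missing required parameter types: {sorted(missing)}")
    return params


def build_arguments(root, params):
    """Substitute $input, $output, ... in <argformat> and split into a list."""
    argformat = root.findtext("./program/argformat")
    return Template(argformat).substitute(params).split()


def log_paths(root, params):
    """Prefix each log name with the value of the work parameter."""
    return [f"{params['work']}/{log.findtext('name')}"
            for log in root.findall("./logs/log")]


def download_plan(root, params):
    """Return (source under the output directory, local name, merge flag)."""
    plan = []
    for item in root.findall("./downloads/download"):
        merge = item.findtext("merge", default="False").strip().lower() == "true"
        plan.append((f"{params['output']}/{item.findtext('src')}",
                     item.findtext("dst"), merge))
    return plan
```

In this sketch each variable name in <argformat> equals a parameter type, so a configuration with several parameters of the same type would need a richer mapping than a single dictionary.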