Reproducible big data science: A case study in continuous FAIRness

doi:10.1371/journal.pone.0213013

Fig 1.

A high-level view of the TFBS identification workflow, showing the six principal datasets, labeled D1–D6, and the five computational phases, labeled 1–5.

Fig 2.

Network topology of the distributed environment used to generate the six principal datasets, labeled D1–D6, and the locations of the five computational phases, labeled 1–5.

Fig 3.

An example BDBag, with contents in the data folder, description in the metadata folder, and other elements providing data required to fetch remote elements (fetch.txt) and validate its components.
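The layout described in this caption can be illustrated with a minimal sketch. It uses only the Python standard library rather than the BDBag tooling itself, and it assumes the plain BagIt conventions that BDBags build on (a `bagit.txt` declaration, a checksum manifest, and `fetch.txt` lines of the form `url length path`). The function names `make_minimal_bag` and `validate_local`, the choice of SHA-256, and the omission of the metadata folder are simplifications made for illustration, not the authors' implementation.

```python
import hashlib
from pathlib import Path


def make_minimal_bag(root: Path, local_files: dict, remote_files: list) -> None:
    """Write a simplified bag: data/, bagit.txt, a manifest, and fetch.txt.

    local_files maps a file name to its bytes (hypothetical payload).
    remote_files is a list of (url, length, name, sha256) tuples describing
    payload that is not stored in the bag but can be fetched later.
    """
    data_dir = root / "data"
    data_dir.mkdir(parents=True)

    # Bag declaration, as in the BagIt convention.
    (root / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )

    manifest = []
    for name, payload in local_files.items():
        (data_dir / name).write_bytes(payload)
        manifest.append(f"{hashlib.sha256(payload).hexdigest()}  data/{name}")

    # Remote payload is listed in both the manifest (so it can be validated
    # once fetched) and fetch.txt (so a client knows where to get it).
    fetch = []
    for url, length, name, digest in remote_files:
        manifest.append(f"{digest}  data/{name}")
        fetch.append(f"{url} {length} data/{name}")

    (root / "manifest-sha256.txt").write_text("\n".join(manifest) + "\n")
    (root / "fetch.txt").write_text("\n".join(fetch) + "\n")


def validate_local(root: Path) -> bool:
    """Check every manifest entry that is present on disk against its checksum.

    Entries that have not been fetched yet are skipped, not treated as errors.
    """
    for line in (root / "manifest-sha256.txt").read_text().splitlines():
        digest, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if not target.exists():
            continue  # remote element not yet fetched
        if hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            return False
    return True
```

Keeping remote elements in the manifest as well as in `fetch.txt` is what lets a small bag stand in for a large dataset: the bag carries the checksums, while the bytes stay at their remote location until they are needed.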

Fig 4.

A minid landing page for a BDBag generated by the encode2bag tool, showing the associated metadata, including locations (in this case, just one).

Table 1.

Details of the per-tissue computations performed in the ensemble footprinting phase.

Data sizes are in GB. Times are in hours on a 32-core AWS node; they sum to 2,149.1 node hours or 68,771 core hours. DNase: DNase Hypersensitivity (DNase-seq) data from ENCODE. Align: Aligned sequence data. Foot: Footprint data and footprint inference computation. Numbers may not sum perfectly due to rounding.
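The relationship between the two totals quoted above can be checked directly. The short sketch below takes the node-hour total and the 32-core node size from the caption and converts to core hours; the variable names are illustrative only.

```python
# Totals quoted in the Table 1 caption.
NODE_HOURS = 2149.1      # total hours across all per-tissue computations
CORES_PER_NODE = 32      # each AWS node has 32 cores

# One node hour on a 32-core node corresponds to 32 core hours.
core_hours = NODE_HOURS * CORES_PER_NODE

print(f"{core_hours:,.0f} core hours")
```

The product is 68,771.2, which rounds to the 68,771 core hours stated in the caption.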

Fig 5.

The encode2bag portal.

The user has entered an ENCODE query for urinary bladder DNase-seq data and clicked “Create BDBag.” The portal generates a Minid for the BDBag and a Globus link for reliable, high-speed access.

Fig 6.

Our DNase-seq ensemble footprinting workflow, used to implement and of Fig 1.

The master workflow A takes a BDBag from as input. It executes from top to bottom, using subworkflows B and C to implement and then subworkflow D to implement . It produces as output BDBags containing aligned DNase-seq data and footprints, with the latter serving as input to .