Ten simple rules for writing Dockerfiles for reproducible data science

Daniel Nüst; Vanessa Sochat; Ben Marwick; Stephen J. Eglen; Tim Head; Tony Hirst; Benjamin D. Evans

doi:10.1371/journal.pcbi.1008316

Open Access

Daniel Nüst,

* E-mail: daniel.nuest@uni-muenster.de

Affiliation Institute for Geoinformatics, University of Münster, Münster, Germany

https://orcid.org/0000-0002-0024-5046

Vanessa Sochat,

Affiliation Stanford Research Computing Center, Stanford University, Stanford, California, United States of America

https://orcid.org/0000-0002-4387-3819

Ben Marwick,

Affiliation Department of Anthropology, University of Washington, Seattle, Washington, United States of America

https://orcid.org/0000-0001-7879-4531

Stephen J. Eglen,

Affiliation Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, Cambridgeshire, Great Britain

https://orcid.org/0000-0001-8607-8025

Tim Head,

Affiliation Wild Tree Tech, Zurich, Switzerland

Tony Hirst,

Affiliation Department of Computing and Communications, The Open University, Great Britain

https://orcid.org/0000-0001-6921-702X

Benjamin D. Evans

Affiliation School of Psychological Science, University of Bristol, Bristol, Great Britain

Published: November 10, 2020
https://doi.org/10.1371/journal.pcbi.1008316

Reader Comments

Some additional resources

Posted by yarikoptic on 18 Nov 2020 at 18:53 GMT

Great article! Some additional tools I would like to reference:

**"Rule 5: Specify software versions"**
- Pinning every version could be too tedious, and it is not feasible when you want to retroactively reproduce some past environment. On Debian (and NeuroDebian) systems, snapshots of the APT repositories allow you to "freeze" to a specific date. The nd_freeze tool (from the neurodebian-freeze package in (Neuro)Debian, shipped within all NeuroDebian containers) can be invoked as the first command in a Dockerfile; it switches the APT repositories to their state on that date. This mimics "real-life" scenarios, where we update from the state on one date to the state on another, and it can be used to reproduce a past environment. For an example, see https://github.com/ReproN... produced by neurodocker (see next, which supports nd_freeze).
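To illustrate the date-based "freeze" idea, here is a minimal Python sketch that builds an APT source line pointing at the public Debian snapshot archive for a given date. It is not the nd_freeze implementation and makes no claim about how that tool works; the function name `snapshot_source_line` and the default suite `buster` are assumptions chosen for the example.

```python
from datetime import date

# Public Debian snapshot archive; each timestamped path serves the
# repository as it looked at (or just before) that moment.
SNAPSHOT_BASE = "http://snapshot.debian.org/archive/debian"


def snapshot_source_line(day: date, suite: str = "buster",
                         components: str = "main") -> str:
    """Return a sources.list entry frozen to the given day.

    Hypothetical helper for illustration; the suite name is an assumption.
    """
    stamp = day.strftime("%Y%m%dT000000Z")
    # check-valid-until=no lets APT accept the snapshot's expired Release file.
    return (f"deb [check-valid-until=no] "
            f"{SNAPSHOT_BASE}/{stamp}/ {suite} {components}")


if __name__ == "__main__":
    print(snapshot_source_line(date(2020, 11, 18)))
```

The resulting line could be written to the APT sources list before any packages are installed, so that every later install resolves against the repository state of that date.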

**"Tools for container generation"** - https://github.com/ReproN... (despite the name, it is useful beyond the neuro domain) creates Dockerfiles or Singularity recipes that adhere to best practices from a single command-line invocation.

**"Use version control"** - DataLad, or git-annex directly, could be used to store not only the recipes but the containers themselves. The DataLad extension at https://github.com/datala... also helps to "register" and use those containers with DataLad. This way you can keep **everything** (code, Dockerfile recipes, the data files themselves, and container images) under version control.

**Rule 7: Mount datasets at run time** - totally agree. But it hinders reproducibility, since outside data resources might change, become unavailable, or have different paths on different systems. Have a look at the YODA principles for self-containing everything (input data, containers, etc.) within an "analysis dataset" itself, so that all the paths to be mounted always reside "within" the "analysis dataset": https://github.com/myyoda... . More on that can be found in the DataLad handbook: http://handbook.datalad.o... , and here is a sample DataLad dataset with containers for neuroimaging that enforces "total compartmentalization" for Singularity containers: https://github.com/ReproN...
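The point about keeping every mounted path "within" the analysis dataset can be sketched in Python as a small check performed before assembling a `docker run` command. This is an illustration only, not part of YODA or DataLad; the function name `docker_run_args`, the `inputs/raw` directory, and the image name `example/image:1.0` are hypothetical.

```python
from pathlib import Path


def docker_run_args(dataset_root: str, relative_input: str, image: str,
                    container_path: str = "/data") -> list[str]:
    """Build a `docker run` argument list with a read-only bind mount.

    Hypothetical helper: refuses any host path outside the dataset root,
    so the mounted data always lives inside the analysis dataset.
    """
    root = Path(dataset_root).resolve()
    host = (root / relative_input).resolve()
    if host != root and root not in host.parents:
        raise ValueError(f"{host} is outside the analysis dataset {root}")
    # ":ro" mounts the data read-only so the container cannot alter inputs.
    return ["docker", "run", "--rm",
            "--volume", f"{host}:{container_path}:ro", image]


if __name__ == "__main__":
    # Assumed layout: input data stored under inputs/raw in the dataset.
    print(" ".join(docker_run_args(".", "inputs/raw", "example/image:1.0")))
```

Because the host path is derived from the dataset root rather than typed in by hand, the same command works on any system where the dataset has been cloned.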

No competing interests declared.

RE: Some additional resources

DanielNüst replied to yarikoptic on 25 Nov 2020 at 16:45 GMT

Thank you very much for the comment! It would have been great to learn about these tools during the preprint stage, but I'm optimistic we'll find a way to make sure interested readers find them, both here and in the article's repository at https://github.com/nuest/...

- Pinning whole APT repositories and putting images into git annexes are quite useful ideas. I'm not sure they would fit the intended target audience of the article, but for advanced users they are one more layer of security.
- neurodocker is a tool to generate containers, but I'm a bit sceptical as to its accessibility as a CLI tool; it surely would help to apply good practices for Dockerfiles, though.
- Re. mounting: that is why we recommend version-controlling the Dockerfile and the mounted files in the same repository. I did not know YODA, a very good effort. Is it picked up broadly in your community? I think if people follow the YODA and/or DataLad tools and principles, they are already on a very good path and might not need the manually crafted Dockerfile we focus on in the article; therefore, thank you for pointing them out!

No competing interests declared.

Subject Areas: Computer software; Software tools; Reproducibility; Programming languages; Metadata; Habits; Source code; Computer and information sciences
