The PLOS ONE collection on machine learning in health and biomedicine: Towards open code and open data

Recent years have seen a surge of studies in machine learning in health and biomedicine, driven by digitalization of healthcare environments and increasingly accessible computer systems for conducting analyses. Many of us believe that these developments will lead to significant improvements in patient care. As in many academic disciplines, however, progress is hampered by a lack of code and data sharing. In bringing together this PLOS ONE collection on machine learning in health and biomedicine, we sought to focus on the importance of reproducibility, making it a requirement, as far as possible, for authors to share data and code alongside their papers.


Introduction
In its strategic plan for data science released in June of this year, the National Institutes of Health stated that we "stand at a unique moment of opportunity in biomedical research" [1]. "Advances in [data] storage, communications, and processing" offer an opportunity for technologies such as machine learning to yield "transformative changes for biomedical research over the coming decade". This promise, alongside more accessible data and open-source software, has created a surge of studies in machine learning in health and biomedicine. Our collection provides a glimpse of things to come.
At the time of authoring the introduction to the collection, we had received over 100 submissions, of which we selected a subset across topics including primary care, acute care, medical imaging, and global health. Papers already accepted cover a diverse range of topics, including areas such as early detection of glaucoma, prediction of survival using health records, localization of ossification in radiographs, and risk stratification in neuroblastoma using transcriptomics data [2][3][4][5]. In all cases, the authors have included a link to their code and instructions for accessing data. Many more papers are under review.
While we have no doubt that applications of machine learning in health and biomedicine will have a tremendous impact on the outcomes of many future patients, we share concerns that the rigor of research is often tainted by the environment that drives it [6,7]. To strengthen trust in research findings, greater transparency of data, protocols, and code is desirable [8,9]. In the Call for Papers for this collection, we emphasised that "Data underlying the study's findings will be a requirement of publication, per the PLOS data policy" and that authors would be "responsible for providing, upon submission, the source code needed to replicate their findings" [10]. Journals such as PLOS ONE have had firmly worded policies on the importance of data and code sharing for many years, so these requirements are nothing new. Nevertheless, our experience has shown that policies on data and code sharing are often weakly enforced. It is rare for a published study in any journal to be associated with publicly available code and data, and it is even rarer for the code and data to be sufficiently well-curated to enable a third party to reproduce the study. Statements such as "code and data are available from the authors upon request" are the norm, rather than the exception. Such statements are typically misleading [11,12].
We therefore present this collection with an emphasis on reproducibility, to set a precedent for publishers and researchers alike on how open publishing policies can be applied to studies in health and biomedicine. With support from PLOS ONE staff, we have worked hard to ensure that all of the papers in the collection are associated with public code. While some data could not be shared publicly, all papers should at the very least provide institutional contact information for data requests in line with PLOS ONE's data availability policy. Despite best efforts, not all of the code and data is as cleanly organized or presented as we would like, but we hope that readers will respond positively to the good intentions of all authors and help data and code sharing to become common practice.

Reproducibility
Computational reproducibility is the ability to repeat an analysis of a given data set and obtain sufficiently similar results ([13][14][15][16] and references therein). It requires having the complete software environment available, properly documented full source code, and the original data. Ideally, the reader should be able to inspect, modify and apply the code under modified parameter settings to reproduce the results and explore the robustness of the algorithm to the values of its parameters [17]. In recent years, platforms designed for software development, such as GitHub, GitLab, and Bitbucket, have been adopted by the scientific community as ways to distribute the code that accompanies scientific papers. Initially little more than web-based front ends for source control systems such as "git", they evolved into integrated solutions that can render markdown documents and Jupyter notebooks, which can be used to visually present the results together with the code used to obtain them [18]. A number of services have been created to facilitate computational reproducibility of code shared through such platforms. One such example is Code Ocean, which allows readers to directly interact with code by running it within a widget embedded within an article [19]. Another example is Binder, which enables an investigator to quickly reproduce a computational environment using data and code shared online [20,21].
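As a rough illustration of what "sufficiently similar results" can mean in practice, the following minimal Python sketch reruns a toy analysis with a fixed random seed, records the software environment, and compares the two results within a tolerance. The function names, the toy analysis, and the tolerance value are all assumptions made for illustration; they are not taken from any paper in the collection.

```python
import platform
import random
import sys


def run_analysis(seed, n_samples=1000):
    """Toy stand-in for a study's analysis: the mean of seeded random draws."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(draws) / n_samples


def describe_environment():
    """Collect interpreter and platform details to publish alongside results."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


def is_reproduced(original, rerun, tolerance=1e-9):
    """Treat two results as sufficiently similar if they differ by at most
    `tolerance` (a hypothetical threshold; a real study would justify its own)."""
    return abs(original - rerun) <= tolerance


if __name__ == "__main__":
    original = run_analysis(seed=42)
    rerun = run_analysis(seed=42)  # same seed, same code, same parameters
    print(describe_environment())
    print("reproduced:", is_reproduced(original, rerun))
```

Exposing the seed and sample size as parameters also lets a reader rerun the analysis under modified settings and see how much the result moves, which is one simple way to explore robustness.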
A potentially groundbreaking algorithm and its code implementation only really benefit the community and the wider society if they can be applied to new data and adapted to similar problems. Therefore, reproducibility should be taken a step further by aiming for reusability [22], which enables the application of the methodology to new questions or new data so that future studies build upon past studies and science progresses faster. Reusability requires that the authors take the additional step of explaining and documenting how some decisions were taken or how some parameters were chosen based on the data. The authors should make the extra effort to make the code easy to maintain and to extend. As any derived code should be similarly distributed, the issue arises of a proliferation of different versions of similar code. This is where the use of repository management services such as GitLab can make a difference by allowing researchers to clone existing code, modify it to suit their needs and possibly integrate potential improvements back into the original repository through a pull request. These services allow the community to track the evolution of the initial piece of code accompanying a paper into a widely used toolbox through collaborative science.

Data sharing
Data sharing in medical research is advocated for reasons including verification of results and for unlocking the opportunity to address complex medical questions through the creation of large multi-center datasets [23][24][25][26]. In the U.S., the National Institutes of Health (NIH) has required data sharing for all large funding grants since 2003 while the National Science Foundation (NSF) has required research grant proposals to include data sharing plans since 2011. Similar policies have been introduced by the UK research councils [27] and by the European Commission [28]. Piwowar (2011) examined studies funded by the NIH and the NSF and found that data sharing remains infrequent in spite of the recommendations by the funding agencies [29]. Even in fields such as genomics, with mature policies, repositories and standards, research data sharing levels are low and increasing only slowly. In a recent survey of patients participating in clinical trials, a large majority (82%) indicated that they perceived the benefits of sharing deidentified data to outweigh the negatives [30]. 93% of respondents in the survey indicated willingness to allow their clinical trial data to be shared with scientists. A dominant theme in responses to the survey was the need for clinical data to help others as much as possible. Many of the respondents urged greater cooperation and less competition among scientists. The feeling that overtly competitive behaviour can hamper research progress has been reflected in scientist sentiment: one study found that researchers who perceive their fields as particularly competitive are more likely to withhold data [31]. Numerous studies have suggested that data withholding can have a detrimental effect on scientific training and research [30][31][32][33][34].
Over the last decade, a significant proportion of journals have adopted guidelines for authors that explicitly require data associated with studies to be shared. Simple statements of willingness to share data by investigators rarely translate to true availability when data is requested by independent scientists [9], which motivates a need for sharing by more formal methods (for example, sharing via public repositories). Even where data is findable and accessible, maximum value can be gained where it is interoperable (for example, using standardised vocabularies) and reusable (for example, well described with an open license), as outlined in the FAIR Principles [35].
In a 2011 survey of papers published in 50 popular research journals, fewer than 10% of investigators made their raw data available [32]. Investigators typically cite concerns around patient privacy as the primary reason for withholding data. Privacy is a serious matter and it is appropriate that this concern curtails data sharing to an extent, but approaches that help to address privacy concerns while allowing data to be shared are emerging. These approaches typically include combinations of deidentification (removal of information that allows data to be attributed to a patient), statistical methods such as differential privacy to obscure details, and access control through protected networks.
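To make two of these approaches concrete, the sketch below shows a deliberately simplified deidentification step and a differentially private count using the Laplace mechanism. It is an illustration under stated assumptions, not a recipe used by any study in the collection: the field names, the example records, and the choice of epsilon are hypothetical, and dropping direct identifiers alone is not a complete deidentification procedure.

```python
import random

# Hypothetical set of direct identifiers; real datasets need a reviewed list.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def deidentify(record):
    """Return a copy of the record without the direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Laplace mechanism for a counting query.

    Adding or removing one patient changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon, rng=rng)


if __name__ == "__main__":
    rng = random.Random(0)
    patients = [
        {"name": "A", "phone": "000", "age": 71, "diabetic": True},
        {"name": "B", "phone": "111", "age": 54, "diabetic": False},
        {"name": "C", "phone": "222", "age": 63, "diabetic": True},
    ]
    shared = [deidentify(p) for p in patients]
    print(shared)
    # The released count is noisy; smaller epsilon means more noise.
    print(private_count(shared, lambda r: r["diabetic"], epsilon=1.0, rng=rng))
```

The trade-off is governed by epsilon: smaller values add more noise and give stronger privacy guarantees, at the cost of less accurate released statistics.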
The development of clear and effective policies to regulate data sharing is an ongoing task for governing organisations. Notably, the European Union (EU) General Data Protection Regulation (GDPR) went into effect in May 2018, with the goal of harmonizing data protection across the EU. It applies to data pertaining to any EU resident, regardless of where that data was processed. The policy was crafted with the understanding that health data should be a public good, but the penalty for breach of patient privacy is so steep (up to 4% of annual global revenue or 20 million Euros) that there are concerns that it will curtail momentum around data sharing [36]. The GDPR does not apply to anonymous or anonymized data, but it allows for significant room in the interpretation of key aspects of data protection, including when data are considered anonymized; what safeguards are sufficient for processing data under the research exemption; and what further limitations should be set on processing personal data for research purposes [37].
Despite the obstacles to data sharing, authors of studies submitted to the collection found solutions. In reviewing each paper's approach to data sharing, we applied some general principles: open is better than closed; some data must be restricted, but this should not prevent it from being citable and discoverable; where data is restricted, synthetic or sample data should be provided; it is better to share data in specialist archives than as supplementary files. In general, authors of papers submitted to the collection were receptive to our data sharing requests. Several submissions that simply stated that data would be "available upon request" in initial versions, for example, were subsequently updated to provide links to datasets in public repositories. In one study, the authors determined that while the raw dataset could not be shared for privacy reasons, it was possible to provide code to enable its reproduction. In many cases, simple rewording of data availability statements helped to clarify how a researcher might obtain a protected dataset from a host institution.

Code sharing
There are specific cases where there is potential for private information to leak into code (for example, caution might be needed in sharing a word embedding generated from detailed patient notes), but in general code does not suffer the privacy risks associated with data. For this reason, one might expect code to be widely shared. In practice we find this is not the case, in short because effort outweighs reward. Familiar excuses for not sharing code include: people might find bugs; the code isn't clean enough; and supporting users will be too much work [38,39]. For most studies there is a nugget of truth in each of these points. As a community we should strive to share anyway, accepting that bugs will be found; code can always be improved; and expectations must be managed. There are many excellent reasons to share code and countless ways to do so [40,41].
We believe that reviewing well-documented code can provide as much insight into a study as reviewing a paper itself. Code that is not clear and well-managed raises questions about the quality of a study, even if the paper itself is well-written and apparently scientifically sound. The PLOS ONE editorial team assessed the availability of code for every paper considered for inclusion in the collection. Similar to our approach for data, several basic principles were applied: code should be open unless there are exceptional circumstances; protection of intellectual property will not be cause for exception; user guidelines and a license are required; a fixed version of the code that underpins the study should be permanently archived and linked from the paper with a persistent identifier such as a Digital Object Identifier (DOI).
When code accompanies a paper submitted to a journal, how rigorously should it be peer reviewed? Ideally, the code should be reviewed with the same rigour as the paper, but relying on an already-stretched pool of referees to do this work is a large ask. For this collection, code was made available to referees during the review process where possible, but there was no expectation for it to be reviewed and we did not attempt to execute the code. By ensuring that code is available and discoverable from articles, however, we create the opportunity for post-publication review.

Closing remarks
The papers in this collection present a range of machine learning applications in health and biomedicine. We believe that these papers go beyond what is typical in this field in terms of data and code sharing. For the research community, we hope that the collection sets a standard that encourages sharing more widely. For journal editors, we intend to demonstrate that authors are generally open to sharing when prompted. Finally, for organizations looking to fund machine learning applications in healthcare, we urge investments into the development of tools and platforms that promote reproducibility.