Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff

Background
Significant efforts are underway within the biomedical research community to encourage sharing and reuse of research data in order to enhance research reproducibility and enable scientific discovery. While some technological challenges do exist, many of the barriers to sharing and reuse are social in nature, arising from researchers’ concerns about and attitudes toward sharing their data. In addition, clinical and basic science researchers face their own unique sets of challenges to sharing data within their communities. This study investigates these differences in experiences with and perceptions about sharing data, as well as barriers to sharing among clinical and basic science researchers.

Methods
Clinical and basic science researchers in the Intramural Research Program at the National Institutes of Health were surveyed about their attitudes toward and experiences with sharing and reusing research data. Of 190 respondents to the survey, the 135 respondents who identified themselves as clinical or basic science researchers were included in this analysis. Odds ratio and Fisher’s exact tests were the primary methods to examine potential relationships between variables. Worst-case scenario sensitivity tests were conducted when necessary.

Results and Discussion
While most respondents considered data sharing and reuse important to their work, they generally rated their expertise as low. Sharing data directly with other researchers was common, but most respondents did not have experience with uploading data to a repository. A number of significant differences exist between the attitudes and practices of clinical and basic science researchers, including their motivations for sharing, their reasons for not sharing, and the amount of work required to prepare their data.

Conclusions
Even within the scope of biomedical research, addressing the unique concerns of diverse research communities is important to encouraging researchers to share and reuse data. Efforts at promoting data sharing and reuse should be aimed at solving not only technological problems, but also addressing researchers’ concerns about sharing their data. Given the varied practices of individual researchers and research communities, standardizing data practices like data citation and repository upload could make sharing and reuse easier.


Introduction
The importance of sharing and reusing biomedical research data is well established. Sharing data facilitates agile research that allows for quicker translation of research findings into clinical practice [1][2][3], enhances scientific reproducibility and transparency [4][5][6][7][8], and increases collaboration and interdisciplinary research that helps advance science [9][10][11]. Collaboration and sharing allow for more effective analysis of the massive datasets that characterize certain data-intensive fields of research, including 'omics (such as genomics, proteomics, and metabolomics) and population health [12][13][14]. As the cost of genetic sequencing falls, electronic health records become more widely adopted, and mobile devices incorporate sensors that gather health data from patients, the amount of data available for analysis has exploded [15][16][17][18]. Particularly in the setting of rare disease research, sharing data allows researchers to pool several studies in order to increase statistical power and make findings that they could not have achieved individually [19][20][21].
Funders have also recognized the importance of sharing data and have implemented policies and mandates that encourage researchers to share. Shared data can be repurposed and used in novel ways, thus increasing the return on investment for funded research [22,23]. Proponents of open science suggest that taxpayers should have access to data arising from federally funded research, a view reflected in the United States Office of Science and Technology Policy's 2013 memorandum on access to federally funded research results [24]. Accordingly, funders and governmental bodies in the United States, including the National Institutes of Health (NIH) and the National Science Foundation (NSF), and elsewhere, including the Research Councils UK and the European Commission, have instituted policies and issued statements in support of data sharing and openness [25][26][27][28].
Despite the many arguments in favor of sharing and open science, researchers often do not share their data. A number of concerns may dissuade researchers from sharing, including concern over other researchers beating the original data collector to publication, fear that others may question the data collector's findings or conclusions, and worry about people misusing or misinterpreting the data [4,29]. Practical concerns may also present a roadblock to sharing data; preparing a dataset for sharing can be time-consuming, and researchers are often unaware of repositories available to accept their data [30].
Researchers working with clinical data face their own special set of concerns. Human subject data frequently contain personally identifiable information, and even de-identified data may carry the potential risk of re-identification of subjects [19,29]. In fact, even when complying with data protection policies such as those prescribed by the Health Insurance Portability and Accountability Act (HIPAA), re-identification of data is a possibility [31]. Obtaining subjects' consent for sharing datasets can be difficult, particularly since data may end up being used for secondary analysis well after the original study is complete; it is often impossible to foresee what kind of consent might be needed at the time consent is obtained [32]. Electronic health records (EHRs) present a potentially valuable source of clinical data for research, but most systems were designed for clinicians' ease of use, and frequently lack the kind of structured data that are best suited to sharing and analysis [33].
Sharing basic science research data also presents its own challenges. Data formats change frequently as new technologies and novel experimentation methods arise, making it difficult to coordinate and reuse datasets [34]. Particularly in nascent fields, like proteomics, a lack of standards and formats presents challenges to researchers who would like to share data or collaborate [30]. Working with digital data can be a challenge for researchers who have focused mostly on wet-lab experiments and lack training or a strong background in bioinformatics and computational methods [35].
While concerns over data sharing and reuse are frequently discussed in scientific communities, there are few quantitative studies examining researchers' attitudes, practices, and perceptions around sharing data. This study aims to better understand the motivations and barriers to data sharing, as well as elucidate differences between the sharing practices of clinical and basic science researchers.

Setting and Population
The NIH Library serves the NIH Intramural Research Program, which is the largest biomedical research program in the world, comprising over 1,200 principal investigators and 4,000 postdoctoral fellows [36]. In addition, the NIH Library serves other NIH employees and staff, as well as customers at related institutions within the Department of Health and Human Services.
The NIH Library launched its Data Services program in October 2013. The program is designed to assist researchers and staff with data management at each step of the research cycle, from conception of the study idea to sharing and archiving of the final research data. To address researchers' diverse needs, the program includes specialized consultations for research groups, as well as hands-on training in a variety of data-related topics. The survey discussed in this study was conducted during April-May 2014 in order to gain a better understanding of NIH researchers' data-related training and service needs. The survey sample included a wide variety of respondents in different roles at NIH, including students, fellows, staff scientists, senior scientists, administrators, and other professionals at NIH who collect, utilize, or manage data. However, for the purposes of this paper, only responses from staff scientists and clinical researchers were analyzed.

Research Instrument
The survey question protocol was tested in a pilot study and revised accordingly. The survey instrument consisted of four parts designed to assess respondents' attitudes, experience, and knowledge with regard to a variety of data-related topics. This paper reports on the results from sections 2 and 3.
1. Data Management Tasks: This section assessed two dimensions of respondents' experience with specific data management tasks: relevance of the task to their work and their current level of knowledge or expertise with the task. Questions were designed in a pairwise manner, so the first half of the questions addressed the relevance dimension and the second half the expertise dimension of a specific task. Respondents rated each dimension on a 5-point Likert-type scale, from "1-very low" to "5-very high." Based on feedback from the pilot study that indicated respondents may be so unfamiliar with the tasks that they might not be able to judge relevance and expertise, a non-weighted "not sure" option was also included.
2. Data Management and Sharing Practices: This section elicited information about respondents' experiences with data management and sharing using a nominal scale for dichotomous responses ("yes" or "no"), with related contingency questions.
3. Data Sharing: Depending on their responses in section two, respondents were directed to one of two versions of the Data Sharing questions. Respondents who indicated that they had shared data were asked for additional details about their experience with data sharing. Respondents who answered that they had never shared data nor uploaded to a repository were asked to expand upon their reasons for not sharing data.
4. Demographic Information: The final section gathered information about respondents' roles and research at NIH.
The survey was administered using SurveyMonkey, and all responses were anonymous, except when respondents chose to identify themselves as being willing to be contacted for follow-up. To increase the response rate, the survey was publicized through various NIH email lists, including the NIH Library email list and email lists for NIH special interest groups whose members likely work with digital data, such as the Bioinformatics and Biomedical Computing Special Interest Groups. The period for responding to the survey was also extended by several weeks to achieve a higher response rate.

Analysis Methods
Odds ratios (ORs) with corresponding 95% confidence intervals were the primary analyses [37]. Fisher's exact tests were also used for small samples to avoid effect bias [37]. These two tests examined the potential relationships between variables [37]. When possible, valid responses were aggregated in order to perform OR tests. In the analysis of the Likert-type items, responses such as "not sure" were excluded from the initial analysis because they were not part of the 5-point (i.e., "very low," "low," "medium," "high," and "very high") Likert-type scale. However, they were included in the "worst case" sensitivity analyses to estimate the least favorable results. This approach should reduce the bias that excluding data could introduce into the results. OR and Fisher's exact analyses were calculated with two online tools, MedCalc [38] and VassarStats [39], respectively. The OR is calculated using a two-by-two contingency table, as demonstrated in Table 1.
For OR tests, the p-value was obtained from the z-value, calculated using the formula given in [37].
All figures were created with R [40] and RStudio [41] using ggplot2 [42].
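For readers who want the calculation spelled out, the standard large-sample formulas for a two-by-two table are sketched below. The cell labels a, b, c, and d are assumptions introduced here (a and b are the counts with and without the outcome in the first group, c and d the same in the second group) and may not match the labeling used in Table 1. The z-value is compared against the standard normal distribution to obtain a two-sided p-value.

```latex
\[
\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}, \qquad
\mathrm{SE}\{\ln \mathrm{OR}\} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}
\]
\[
95\%\ \mathrm{CI} = \exp\!\left(\ln \mathrm{OR} \pm 1.96\,\mathrm{SE}\{\ln \mathrm{OR}\}\right), \qquad
z = \frac{\ln \mathrm{OR}}{\mathrm{SE}\{\ln \mathrm{OR}\}}
\]
```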

Ethics Statement
The NIH Office of Human Subjects Research Protections within the Office of Intramural Research determined that this survey did not require review by an institutional review board. In lieu of IRB review, the Director, NIH Office of Research Services, approved the survey instrument. The opening page of the survey noted that survey results could be used for research purposes, but that responses would be anonymized and subjects would not be identified individually. The survey opening page also contained a link to the Library's Privacy Policy, and contact information for the principal investigator. Although respondents could choose to identify themselves for follow-up, all names and email addresses were removed to anonymize the data before analysis.

Demographics of Respondents
Of the 190 respondents to the survey, 20 did not select a response for the question about their position and were therefore excluded from analysis. Of the remaining 170 respondents, 113 (67%) identified themselves as Scientific Staff and 22 (13%) identified themselves as Clinical Research Staff, referred to as "scientific" and "clinical" in the tables hereafter. The 35 respondents (21%) who identified themselves as Administrative Staff were excluded from this analysis. Most respondents were NIH employees (68%) or were at NIH on a fellowship appointment (18%) (see Table 2). Because the focus of this study is researchers, only responses from clinical and scientific staff (n = 135) were used for analysis.

Data Reuse-Relevance and Expertise
Respondents rated how relevant reusing other researchers' data was to their work, as well as their current level of expertise in reusing data (see Table 3). A majority of the respondents rated the relevance of finding and reusing datasets as high (31%) or very high (29%). However, nearly three-quarters of respondents considered their expertise very low (11%), low (33%), or medium (29%). Generally, scientific research staff considered the relevance of reusing data higher (median = 4, "high") than their expertise in doing so (median = 3, "medium"). Clinical staff also rated the relevance of data reuse higher (median = 3, "medium") than their expertise (median = 2, "low"). "Not sure" responses (n = 3) were excluded in the initial analysis because they were not part of the 5-point Likert-type scale. The exclusion rates were 2.22% for both the Relevance and Expertise questions. Next, responses were aggregated to test for differences between the two groups. In considering relevance and expertise, we recoded the 5 ranks of responses into 2 ranks: HIGH (including "medium," "high," and "very high" ranks) and LOW (including "low" and "very low" ranks). Odds ratio tests were conducted to test differences in responses for relevance and expertise in data reuse between scientific and clinical respondents.
Results showed that the odds of ranking data reuse as having HIGH relevance in the scientific group are 4.26 times greater than in the clinical group, and the result is statistically significant (OR = 4.26, 95% CI 1.501 to 12.11, p = 0.0065) (see Table 4). In other words, compared with clinical researchers, scientific researchers are more likely to consider data reuse highly relevant to their work. In terms of expertise, the odds of having HIGH expertise ranks in the scientific group are also greater than in the clinical group, and the result is statistically significant (OR = 3.66, 95% CI 1.322 to 10.165, p = 0.0125) (see Table 4).
In order to test whether the exclusion of the "not sure" responses biased the results, we reinstated these responses and ran worst-case sensitivity analyses. The worst-case scenario method assumed that the "not sure" responses in the scientific group have the worst possible outcome (LOW) while the "not sure" responses in the clinical group have the best possible outcome (HIGH). The OR results under the worst-case scenario were still statistically significant (p < 0.05), indicating that the exclusion of the "not sure" responses did not substantially affect our analysis results. Table 4 summarizes the results.
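The recoding and worst-case reassignment described above can be illustrated with a minimal Python sketch. The response lists are hypothetical and are not the survey data; the names `count_high_low` and `two_by_two` are introduced here for illustration only. The sketch collapses the 5-point scale into HIGH and LOW, drops "not sure" in the initial table, and in the worst-case table counts "not sure" as LOW for the scientific group and HIGH for the clinical group.

```python
# Ranks collapsed into the two analysis categories described in the text.
HIGH_RANKS = {"medium", "high", "very high"}
LOW_RANKS = {"low", "very low"}


def count_high_low(responses, not_sure_as=None):
    """Count HIGH and LOW responses; optionally reassign "not sure"."""
    high = sum(r in HIGH_RANKS for r in responses)
    low = sum(r in LOW_RANKS for r in responses)
    unsure = sum(r == "not sure" for r in responses)
    if not_sure_as == "HIGH":
        high += unsure
    elif not_sure_as == "LOW":
        low += unsure
    # With not_sure_as=None, "not sure" responses are simply excluded.
    return high, low


def two_by_two(scientific, clinical, worst_case=False):
    """Return (a, b, c, d): scientific HIGH/LOW, then clinical HIGH/LOW."""
    a, b = count_high_low(scientific, "LOW" if worst_case else None)
    c, d = count_high_low(clinical, "HIGH" if worst_case else None)
    return a, b, c, d


# Hypothetical responses, for illustration only.
scientific = ["high"] * 6 + ["low"] * 2 + ["not sure"] * 2
clinical = ["medium"] * 2 + ["very low"] * 3 + ["not sure"]

initial = two_by_two(scientific, clinical)
worst = two_by_two(scientific, clinical, worst_case=True)
print(initial, worst)
```

Because the reassignment pushes each group toward the other's result, an association that remains significant in the worst-case table is unlikely to be an artifact of the exclusions.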

Uploading to Repositories-Relevance and Expertise
Respondents also rated relevance and expertise regarding depositing data in a repository (see Table 5). About half of the respondents rated uploading to data repositories as very highly (27%) or highly (24%) relevant to their work, but the majority considered their level of expertise very low (11%), low (34%), or medium (24%). Scientific staff ranked the relevance of sharing data in a repository more highly (median = 4, "high") than they ranked their expertise in doing so (median = 3, "medium"). Clinical staff also ranked relevance more highly (median = 3, "medium") than expertise (median = 2, "low"). Fig 2 demonstrates the relationship between expertise in and relevance of repository use among scientific and clinical research staff. Following the same procedures as described above for data reuse, we excluded the "not sure" responses (9 for the Relevance question, and 7 for the Expertise question). The exclusion rates were 6.7% and 5%, respectively. Next, responses were aggregated to test for differences between the two groups. The same re-coding criteria were used: HIGH includes "medium," "high," and "very high" ranks; LOW includes "low" and "very low" ranks. Odds ratio results showed that the odds of having HIGH relevance in the scientific group are 5.76 times larger than in the clinical group, and the result is statistically significant (OR = 5.757, 95% CI 1.9341 to 17.1396, p = 0.0017) (see Table 6). This result indicates that scientific researchers are more likely to consider sharing data in a repository relevant to their work. The odds of having HIGH expertise in this task in the scientific group are also greater than in the clinical group (OR = 1.9974), but the result was not significant (95% CI 0.7651 to 5.2146, p = 0.1576) (see Table 6). Again, we ran the worst-case sensitivity analyses to test whether the exclusion of the "not sure" responses biased the results. 
The worst-case scenario method assumed that the "not sure" responses in the scientific group have the worst possible outcome (LOW) while the "not sure" responses in the clinical group have the best possible outcome (HIGH). The worst-case OR results were consistent with the initial results, with statistical significance in the Relevance question and no statistical significance in the Expertise question. This result indicates that the exclusion of the "not sure" responses did not substantially affect our analysis results. Table 6 summarizes the comparative results.

Experiences with Sharing Data
Overall, most respondents (61%) reported that they had never uploaded data to a repository (see Table 7). The odds of scientific researchers uploading data to a repository for sharing were  Despite the low levels of sharing in repositories, a majority of respondents (71%) said that they had shared data directly with another researcher (see Table 8). Among scientific staff, almost three-quarters (73%) reported that they had shared data with another researcher, and a majority of clinical research staff (64%) had done so as well. Although there is a 1.5-fold increased odds of sharing data in the scientific group (OR = 1.51, 95% CI: 0.577 to 3.955), this result is not statistically significant (p = 0.399).
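The reported odds ratio for direct sharing can be checked with a short Python sketch that uses only the standard library. The cell counts below (82 of 113 scientific and 14 of 22 clinical respondents having shared) are assumptions back-calculated from the reported 73% and 64%; they are not taken from Table 8. The function name `odds_ratio_test` is illustrative, and the sketch applies the usual normal approximation on the log odds ratio rather than reproducing the MedCalc tool itself.

```python
import math


def odds_ratio_test(a, b, c, d, z_crit=1.959964):
    """Odds ratio, 95% CI, z, and two-sided p for a 2x2 table.

    a, b: first group with / without the outcome
    c, d: second group with / without the outcome
    All four cells must be non-zero for this approximation.
    """
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(log_or - z_crit * se)
    ci_high = math.exp(log_or + z_crit * se)
    z = log_or / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, ci_low, ci_high, z, p_value


# Assumed counts reconstructed from the reported percentages (not from Table 8):
# scientific: 82 shared, 31 did not; clinical: 14 shared, 8 did not.
odds_ratio, ci_low, ci_high, z, p_value = odds_ratio_test(82, 31, 14, 8)
print(f"OR = {odds_ratio:.2f}, 95% CI {ci_low:.3f} to {ci_high:.3f}, p = {p_value:.3f}")
```

Under these assumed counts the sketch gives values close to the reported OR of 1.51 and confidence interval of 0.577 to 3.955, which suggests the reconstruction is plausible, though the true cell counts may differ.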

Motivations for Sharing Data
Respondents who indicated that they had previously shared data, either directly with another researcher or by uploading to a repository, were asked about their motivations for doing so. 106 participants provided responses (see Table 9). The most common reason for sharing was to collaborate with a researcher who requested the data (69%). Respondents were also highly motivated by a desire to advance science in a particular area (64%) and to assist a known colleague (49%).
We used OR tests to analyze whether any of the reasons are associated more with one of the two research groups. For small samples (fewer than 5 responses), Fisher's exact test was also used to avoid bias (see Table 10). None of the results showed any statistical significance.
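For readers unfamiliar with Fisher's exact test, the following is a minimal Python sketch of the standard two-sided version for a two-by-two table. It is not the VassarStats implementation, the function name `fisher_exact_two_sided` is introduced here for illustration, and the example table is hypothetical rather than drawn from Table 10. The test sums the hypergeometric probabilities of all tables with the same margins that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # holding the row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical table: 3 of 4 in one group and 1 of 4 in the other.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p, 4))  # 34/70, about 0.4857
```

Because the p-value is computed exactly from the hypergeometric distribution, the test does not rely on the large-sample approximation behind the odds ratio z-test, which is why it suits cells with fewer than 5 responses.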

Sharing Practices
Sharing a dataset alone may not be enough for an outside researcher to be able to understand and reuse the data; additional information, like metadata or a codebook, may be necessary to contextualize and explain the data. Datasets may also need additional preparation to make them usable to other researchers, such as documenting shorthand or abbreviations, adding metadata, or changing formats. Respondents were asked about how much work was required to prepare their datasets and what additional information they supplied to requesters or repositories. A great deal of variation existed in how much time respondents needed to prepare their data for sharing (see Table 11). Overall, almost a third of respondents (28%) needed more than 10 hours to adequately prepare their data, but a nearly equivalent number (29%) needed no additional time at all, as their data were already ready for sharing. However, none of the clinical research staff responded that their data already existed in a shareable format.
Most respondents (76 out of 106 people, or 72%) indicated that they had included some additional materials when they shared their data (see Table 12). The most common supplementary material respondents had shared was contextualizing information about the data, such as metadata or a description of the experimental protocol (47%). Fisher's exact tests were conducted on 2-by-2 tables to identify differences regarding supplementary materials that were shared. No significance was found in any of the tables (Fisher's exact, p > 0.1) (see Table 13). In other words, the odds of providing any of the listed supplementary materials did not appear different between the two groups. Although no single type of supplementary information emerged as a more common method for providing documentation, it is encouraging that none of the respondents indicated that they had failed to provide documentation that would be necessary for the requester.

Acknowledgment of Sharing
Respondents who had shared data were asked how they had been acknowledged for contributing their data. Since more than one publication could have arisen from sharing, respondents could select multiple options. 104 participants provided responses. In most cases of data sharing, publication had arisen as a result of the data being shared; only 31% of the respondents said that no publication had yet arisen from the analysis of the shared data (see Table 14).
About half of the respondents had been included as a co-author on a publication (51%). The next most common method of noting the contribution of data was recognition in the acknowledgement section of the publication (35%). Several respondents indicated that they had been cited in the bibliography of the publication (22%). However, in a number of cases (15%), respondents reported that they had not been acknowledged for sharing their data. Fisher's exact tests were conducted on 2-by-2 tables to identify differences in the ways scientific and clinical researchers were acknowledged. No significance was found in any of the tables (p > 0.2). In other words, no significant difference was found between the two groups with regard to any of the listed methods for acknowledging data sharing (see Table 15).

Reasons for Not Sharing Data
Respondents who indicated that they had neither shared data with a researcher nor uploaded to a repository were directed to a question to elicit information about why they had never shared data (see Table 16). Respondents could select more than one of the fifteen possible responses, since multiple reasons might drive their decision not to share. While the list of reasons for not sharing is not completely comprehensive, the fifteen options were based on common reasons for not sharing identified in the existing literature [7,14,17,33]. Twenty participants provided responses.
Given the small sample sizes (15 vs. 5) in this section and the small values (less than 5) for most of the responses, no inferential statistical tests were conducted here to compare the two groups. However, the top concerns of scientific and clinical researchers seemed different. All of the clinical researchers cited subjects' privacy as a reason for not sharing, while only two (13%) of the scientific researchers shared this concern. In general, researchers in both categories had diverse reasons for not sharing their data, though many involved a lack of adequate knowledge on how to share data, such as unfamiliarity with existing repositories or data preparation standards.

Limitations
This study is primarily exploratory in nature and results may not be broadly generalizable. The small size of the sample for this study limits the ability to draw conclusions about the population of NIH researchers as a whole. In particular, clinical researchers were underrepresented. Moreover, the population of NIH researchers may not be representative of the larger biomedical research community on the whole; researchers who work at academic institutions or in the private sector may have different attitudes about sharing data than those who choose a career with a federal agency.

Conclusions
Sharing research data is a complex issue presenting many challenges that can only be effectively addressed by enlisting the efforts of a variety of stakeholders. While technological barriers to data sharing must be addressed, the scientific community must also evolve in its attitudes and practices to facilitate, encourage, and reward data sharing and reuse. As this study demonstrates, clinical and scientific researchers are not identical in their concerns. Effective methods for encouraging data sharing must take into account the unique needs and challenges of diverse scientific communities. Though a majority of respondents had shared data with other researchers, or at least indicated they would be willing to do so, fewer researchers had shared data in repositories. Sharing among researchers is a good first step toward increasing access, but systematized methods of sharing may facilitate more widespread access to and reuse of research data. With many different repositories available, including institutional repositories, discipline-specific repositories, and more generalized repositories like Dryad and Figshare, determining where to upload data can be confusing for researchers. Resources like BioMart, a federated search tool that allows users to search across multiple domains at once, and Databib, a curated list of repositories, can help make the task of finding an appropriate repository easier for researchers [34]. Though this study specifically asked about sharing in repositories, new platforms and mechanisms for sharing data merit further exploration. For example, data journals allow authors to publish their data in a way that can be easily cited and may provide ways of sharing data that fit within the framework of more traditional scholarly communication [43]. Improving standards for metadata, provenance, and data publishing is also essential to facilitate sharing and reuse [44]. 
As this study indicated, many researchers, particularly clinical staff, do not see sharing data in a repository as relevant to their work. Preparing data for sharing in general, and particularly for sharing in a repository, can be a time-consuming process with little payoff for the researcher who is doing the sharing. Funders, institutions, and publishers can all play a role in incentivizing and encouraging data sharing. Many funders, including NIH and NSF, have already begun requiring some grantees to share datasets. A number of publishers also stipulate as a condition of publication that supporting data must be publicly available. Institutions can play a role in encouraging sharing by creating policies and providing space for researchers to upload data [45]. Universities can build upon successes with open access policies that encourage or mandate sharing of publications [46].
Clinical researchers' lower perceived relevance of uploading to a repository may reflect differences in data practices between clinical and basic science research. Because clinical research usually involves human subjects, privacy concerns and regulations may deter clinical researchers from sharing data in repositories. Indeed, among clinical researchers who indicated that they had not shared data, concern for research subjects' privacy was the most common reason cited in this study. The necessity of de-identifying patient data may also account for the increased likelihood in this study that clinical researchers would need time to prepare their data for sharing. Finally, more specialized or subject-specific repositories exist for basic science research data than for clinical data. For example, of the 57 data repositories listed on NIH's Data Sharing Repositories website, 37 of them (65%) accept primarily basic science rather than clinical data [47].
As this study demonstrates, little consistency exists with regard to how researchers are acknowledged by those who reuse their shared data. Standardizing a mechanism for data citation could help incentivize sharing by giving researchers credit for their contribution to the scientific community, in much the same way that they receive credit in the form of article citations for their intellectual contributions in the scientific literature. Though a number of respondents in this study indicated they had been co-authors on articles that cited their shared data, co-authorship may not be an appropriate mechanism for acknowledging the contribution of shared data. The International Committee of Medical Journal Editors defines four criteria for authorship: contributing to the design of the work or collection, analysis, or interpretation of data; drafting or significantly revising the work; approving the final draft; and agreeing to be accountable for all questions of integrity or accuracy of the final work [48]. While researchers who share data meet the first criterion, they may not meet the other three, in which case it would be more appropriate to acknowledge their contribution through citation of the dataset, rather than co-authorship. Creating standards for citing datasets is important to ensure that researchers who share data receive credit in ways that appropriately recognize their contribution.
While incentivizing sharing is important, regulatory and policy changes may be needed to remove barriers to sharing and mitigate unintended negative consequences. In addition to creating adequate infrastructure and awareness of outlets for sharing, mechanisms must be created for protecting researchers' data and ensuring that data are reused responsibly. Particularly with regard to patient data, access to data should be mediated as appropriate for the level of sensitivity of the dataset. Mechanisms like peer review of proposals for reusing research data, data sharing agreements that clearly specify how a dataset may be used, and approval or exemption of data reuse projects by institutional review boards can all help ensure that data are reused with respect for the subjects and the original researchers who gathered the data [7,49].
Outreach to researchers may help increase awareness about why sharing is important to the biomedical research community, and training and assistance for researchers preparing data for sharing may also be useful. It is essential that the biomedical research community continue to work toward identifying and addressing the challenges that hinder the effective sharing and reuse of research data. This exploratory study has established some possible concerns and perspectives of biomedical researchers, and we hope that it will serve as a foundation for future studies that will further elucidate the barriers to and incentives for sharing within the broader biomedical research community.