Preprinting the COVID-19 pandemic

The world continues to face an ongoing viral pandemic that presents a serious threat to human health. The virus underlying the COVID-19 disease, SARS-CoV-2, has caused over 29 million confirmed cases and 925,000 deaths since January 2020. Although the last pandemic occurred only a decade ago, the way science operates and responds to current events has undergone a paradigm shift in the interim. The scientific community responded rapidly to the COVID-19 pandemic, releasing over 16,000 COVID-19 scientific articles within 4 months of the first confirmed case, of which 6,753 were hosted by preprint servers. Focussing on bioRxiv and medRxiv, two growing preprint servers for biomedical research, we investigated the attributes of COVID-19 preprints, their access and usage rates, and the characteristics of their sharing across online platforms. Our results highlight the unprecedented role of preprint servers in the dissemination of COVID-19 science, and the impact of the pandemic on the scientific communication landscape.


Introduction
Following a steep increase in the posting of COVID-19 research, traditional publishers adopted new policies to support the ongoing public health emergency response efforts. After multiple public calls from scientists [18], over 30 publishers agreed to make all COVID-19 work freely accessible by the 16th March [19,20]. Shortly afterwards, publishers (for example eLife [21]) began to alter peer-review policies in an attempt to fast-track COVID-19 research. Towards the end of April, OASPA issued an open letter of intent to maximise the efficacy of peer review [22]. The number of open-access COVID-19 journal articles suggests that journals have largely been successful in implementing these new policies (Supplemental Fig. 1B).

Attributes of COVID-19 preprints posted between January and April 2020
To explore the attributes of COVID-19 preprints in greater detail, we focused our subsequent investigation on two of the most popular preprint servers in the biomedical sciences: bioRxiv and medRxiv.
Between January and April 2020, 14,812 preprints were deposited in total to bioRxiv and medRxiv, of which the majority (12,285, 82.9%) were non-COVID-19 preprints (Fig. 2A). While the weekly number of non-COVID-19 preprints changed little during this period, COVID-19 preprint posting increased, peaking at over 250 preprints per week in early April. When the data were broken down by server, it was evident that whilst the posting of COVID-19 preprints to bioRxiv remained relatively steady, preprints posted to medRxiv increased with time (Supplemental Fig. 2A).
The increase in the rate of preprint posting poses challenges for timely screening. Only marginally faster screening was detected for COVID-19 preprints than for non-COVID-19 preprints (Fig. 2B) when adjusting for differences between servers (two-way ANOVA, interaction term; F(1, 14808) = 69.13, p < 0.001).
Whilst COVID-19 preprints were screened less than one day quicker than non-COVID-19 preprints within both servers (Tukey HSD; both p < 0.001), larger differences were observed between servers (Supplemental Fig. 2B), with bioRxiv screening preprints approximately 2 days quicker than medRxiv for both preprint types (both p < 0.001).
The number of authors may give an indication of the amount of work, the resources used, and the extent of collaboration in a paper. While the average number of authors of COVID-19 and non-COVID-19 preprints did not differ, COVID-19 preprints showed slightly more variability in authorship team size (median, 6 [IQR 8] vs. 6 [IQR 5]). Single-author preprints were almost three times more common among COVID-19 than non-COVID-19 preprints (Fig. 2C).
Researchers may be shifting their publishing practice in response to the pandemic. Among all identified corresponding authors of preprints posted during the pandemic, we found a significant association between preprint type and whether this was the author's first bioRxiv or medRxiv preprint (Chi-square, χ2 = 215.2, df = 1, p < 0.001). Among COVID-19 corresponding authors, 83% were posting a preprint for the first time, compared to 68% of non-COVID-19 corresponding authors in the same period. To further understand which authors have been drawn to preprints since the pandemic began, we additionally stratified these groups by country. Corresponding authors based in China showed the greatest increase in representation among authors of COVID-19 preprints, relative to the expectation set by non-COVID-19 preprints (Fig. 2D). Additionally, India had a higher representation among COVID-19 authors specifically using preprints for the first time compared to non-COVID-19 posting patterns. Moreover, we found that most countries posted their first COVID-19 preprint close to the time of their first confirmed COVID-19 case (Supplemental Fig. 2C). Authors of COVID-19 preprints were more likely to choose the more restrictive CC-BY-NC-ND or CC-BY-ND licenses than those of non-COVID-19 preprints, and less likely to choose CC-BY or CC0 (Fig. 2E).
Preprint servers offer authors the opportunity to post new versions of a preprint, to improve upon or correct mistakes in an earlier version. The majority of preprints existed as only a single version for both COVID-19 and non-COVID-19 work, with very few preprints existing beyond two versions (Fig. 2F). This may partly reflect the relatively short time-span of our analysis period. COVID-19 preprints did not discernibly differ in number of versions compared with non-COVID-19 preprints, but were significantly shorter in word count (Mann-Whitney, p < 0.001) (Fig. 2G).
This supports anecdotal observations that preprints are being used to share works-in-progress rather than complete stories. We also found that COVID-19 preprints contain fewer references than non-COVID-19 preprints (median, 30).
We assessed differences in publication outcomes for COVID-19 versus non-COVID-19 preprints during our analysis period, which may be partially related to differences in preprint quality. Published status (published/unpublished) was significantly associated with preprint type (Chi-square; χ2 = 6.77, df = 1, p = 0.009); within our timeframe, 4% of COVID-19 preprints were published by the end of April, compared to 3% of non-COVID-19 preprints (Fig. 2I). These published COVID-19 preprints were split across many journals, with the clinical or multidisciplinary journals surveyed tending to publish the most papers that were previously preprints (Supplemental Fig. 2E). To determine how publishers were prioritising COVID-19 research, we compared the time from preprint posting to publication in a journal. The delay from posting to subsequent publication was significantly shorter for COVID-19 preprints, by a mean difference of 26.2 days compared to non-COVID-19 preprints posted in the same time period (mean, 22.5 days [SD 15.7] vs. 48.7 days [SD 25.6]; two-way ANOVA; F(1, 289) = 69.8, p < 0.001). This did not appear to be driven by temporal changes in publishing practices, as publication times of non-COVID-19 preprints were similar to those in our control timeframe of September to January (Fig. 2J). The acceleration in publishing of COVID-19 preprints was consistent regardless of publisher (two-way ANOVA, interaction term; F(6, 283) = 0.41, p = 0.876) (Supplemental Fig. 2F).
However, data aggregated across several publishers revealed that, on average, non-COVID-19 manuscripts had a 10.6% higher acceptance rate than COVID-19 manuscripts, regardless of preprint availability (Supplemental Fig. 2G).

Extensive access of preprint servers for COVID-19 research
To confirm that the usage of COVID-19 and non-COVID-19 preprints was not an artefact of differing preprint server reliance during the pandemic, we compared usage over September 2019 - April 2020, which includes a non-pandemic control period. We observed a slight decrease in abstract views (Supplemental Fig. 3A) and pdf downloads (Supplemental Fig. 3B) in March 2020, but otherwise the usage data did not differ from that prior to the pandemic.
Secondly, we investigated usage across additional preprint servers (data kindly provided by each of the server operators). We found that COVID-19 preprints were consistently downloaded more than non-COVID-19 preprints during our timeframe, regardless of which preprint server hosted the manuscript (Supplemental Fig. 3C), though the gap in downloads varied between servers (two-way ANOVA, interaction term; F(4, 276544) = 586.9, p < 0.001). Server usage differences were more pronounced for COVID-19 preprints; multiple post-hoc comparisons confirmed that bioRxiv and medRxiv received significantly higher usage per COVID-19 preprint than all other servers for which data were available (Tukey HSD; all p values < 0.001). However, for non-COVID-19 preprints, the only observed pairwise differences between servers indicated greater bioRxiv usage than SSRN or Research Square (Tukey HSD; all p values < 0.001). This suggests that attention has been given disproportionately to bioRxiv and medRxiv as repositories for COVID-19 research.
COVID-19 preprints were shared more widely than non-COVID-19 preprints
Based on citation data from Dimensions, we found that COVID-19 preprints are cited more often than non-COVID-19 preprints (time-adjusted negative binomial regression; rate ratio = 71.1, z = 49.2, p < 0.001) (Fig. 4A), although it should be noted that only a minority of preprints received at least one citation in either group (30.6% vs. 5.5%). The highest-cited preprint had 127 citations, with the 10th most-cited COVID-19 preprint receiving 48 citations (Table 1).
We also investigated the sharing of preprints on Twitter to assess the exposure of wider public audiences to preprints, using data from Altmetric. COVID-19 preprints were tweeted at a greater rate than non-COVID-19 preprints (rate ratio = 14.8, z = 91.55, p < 0.001) (Fig. 4B). The most-tweeted non-COVID-19 preprint received 1,323 tweets, whereas 8 of the top 10 tweeted COVID-19 preprints were tweeted over 10,000 times each (Table 2). Many of the top 10 tweeted COVID-19 preprints were related to transmission, re-infection or seroprevalence, and the association with the BCG vaccine. The most-tweeted COVID-19 preprint (29,984 tweets) was a study investigating antibody seroprevalence in California [25], whilst the second most-tweeted COVID-19 preprint was a widely criticised (and later withdrawn) study linking the SARS-CoV-2 spike protein to HIV-1 glycoproteins [26].

To better understand the main discussion topics associated with the top 10 most-tweeted preprints, we analysed the hashtags used in original tweets (i.e., excluding retweets) mentioning those preprints (Supplemental Fig. 4A). After removing generic or overused hashtags directly referring to the virus (e.g. "#coronavirus", "#COVID-19"), we found that the most dominant hashtag among tweets referencing preprints was "#hydroxychloroquine", a major controversial topic associated with two of the top ten most-tweeted preprints. Other prominent hashtags contained a mixture of direct, neutral references to the disease outbreak, such as "#coronavirusoutbreak" and "#Wuhan", and more politicised terms, such as "#fakenews" and "#covidisalie", associated with conspiracy theories.
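The hashtag tallying described above can be sketched as follows. This is an illustrative Python re-implementation, not the study's actual pipeline (which was carried out in R); the example tweets and the exclusion list are hypothetical, following the examples given in the text.

```python
import re
from collections import Counter

# Hashtags directly naming the virus are excluded before counting
# (illustrative stoplist, based on the examples in the text).
GENERIC = {"#coronavirus", "#covid-19", "#covid19", "#sarscov2"}

def top_hashtags(tweets, n=5):
    """Count non-generic hashtags across original tweets, case-insensitively."""
    counts = Counter()
    for text in tweets:
        for tag in re.findall(r"#[\w-]+", text.lower()):
            if tag not in GENERIC:
                counts[tag] += 1
    return counts.most_common(n)

tweets = ["Promising? #hydroxychloroquine #COVID-19",
          "No effect seen #hydroxychloroquine",
          "#Wuhan outbreak data #coronavirus"]
print(top_hashtags(tweets))  # [('#hydroxychloroquine', 2), ('#wuhan', 1)]
```

Lower-casing before matching merges variants such as "#Wuhan" and "#wuhan" into a single count.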

As well as featuring heavily on social media, COVID-19 research has also pervaded print and online news media. COVID-19 preprints were used in news articles at a rate over two hundred times that of non-COVID-19 preprints (rate ratio = 220.4, z = 39.27, p < 0.001), although, as with citations, only a minority were mentioned in news articles at all (26.9% vs. 6.7%) (Fig. 4C). The top 10 non-COVID-19 preprints were reported in fewer than 100 news articles in total, whereas the top COVID-19 preprints were reported in over 300 news articles (Table 3). Similarly, COVID-19 preprints were also used in blogs at a significantly greater rate than non-COVID-19 preprints (rate ratio = 9.48, z = 29.2, p < 0.001) (Fig. 4D; Table 4). We noted that several of the most widely disseminated non-COVID-19 preprints featured topics relevant to infectious disease research, e.g. human respiratory physiology and personal protective equipment (Tables 2 and 3).
Independent COVID-19 review projects have arisen to publicly review COVID-19 preprints [34]. To investigate engagement with preprints directly on the bioRxiv and medRxiv platforms, we quantified the number of comments for preprints posted between January and April. We found that non-COVID-19 preprints were rarely commented upon when compared to COVID-19 preprints (time-adjusted negative binomial regression; rate ratio = 27.9, z = 32.0, p < 0.001) (Fig. 4E); the most-commented non-COVID-19 preprint received only 15 comments, whereas the most-commented COVID-19 preprint had over 500 comments as of 30th April (Table 5). One preprint, which had 127 comments, was retracted within 3 days of being posted following intense public scrutiny [35]. Collectively, these data suggest that the most discussed or controversial COVID-19 preprints are being rapidly and publicly scrutinised, with commenting systems being used for direct feedback and discussion of preprints.
Citations to preprints were present in the set of 26 COVID-19 policy documents we examined, which cited preprints hosted on several servers (bioRxiv, medRxiv, SSRN, Research Square, arXiv). However, these citations occurred at a relatively low rate, typically constituting less than 20% of the total citations in these 26 documents (Fig. 4F). Fifty-eight individual COVID-19 preprints from bioRxiv or medRxiv were cited in the examined policy documents, of which 17 were cited more than once and 4 were cited more than twice. Most preprint citations occurred in documents from the ECDC, UK POST and WHO SB, with no preprints cited in the analysed documents from the US HSSCC. In comparison, only two instances of citations to preprints were observed among 26 manually collected non-COVID-19 policy documents from the same sources.
To understand how different usage indicators may represent the sharing behaviour of different user groups, we calculated the correlations between the usage indicators presented above (citations, tweets, news articles, comments). For COVID-19 preprints, we found weak correlations between the numbers of citations and Twitter shares (Spearman's ρ = 0.37, p < 0.001), and between the numbers of citations and news articles (Spearman's ρ = 0.41, p < 0.001) (Fig. 4G), suggesting that the preprints cited mostly within the scientific literature differed from those most shared by the wider public on other online platforms. There was a stronger correlation between the COVID-19 preprints that were most blogged and those receiving the most attention in the news (Spearman's ρ = 0.58, p < 0.001). Moreover, there was a strong correlation between the COVID-19 preprints that were most tweeted and those receiving the most attention in the news (Spearman's ρ = 0.53, p < 0.001), suggesting similarity between preprints shared on social media and in news media (Fig. 4G). There was a weaker correlation between the most-tweeted and the most-commented-upon COVID-19 preprints (Spearman's ρ = 0.41, p < 0.001).
Taking the top ten COVID-19 preprints by each indicator, there was substantial overlap between all indicators except citations (Supplemental Fig. 4B). We observed much weaker correlations between all indicators for non-COVID-19 preprints (Fig. 4H).

Discussion
Our results show that preprints have been widely adopted for the dissemination and communication of COVID-19 research, and that, in turn, the pandemic has greatly impacted the preprint and science publishing landscape.
Changing attitudes towards, and acceptance of, preprint servers within the life sciences may be one reason why COVID-19 research is being shared so readily as preprints compared to past epidemics. In addition, the need to rapidly communicate findings prior to a lengthy review process might be responsible for this observation (Fig. 3). A recent study involving qualitative interviews of multiple research stakeholders found "early and rapid dissemination" to be amongst the most often cited benefits of preprints [28].
The fact that news outlets are reporting extensively on COVID-19 preprints (Fig. 4C and 4G) represents a marked change in journalistic practice: pre-pandemic, bioRxiv preprints received very little coverage in comparison to journal articles [27]. This cultural shift provides an unprecedented opportunity to bridge the scientific and media communities to create a consensus on the reporting of preprints [36]. Another marked change was observed in the use of preprints in policy documents (Fig. 4F). Preprints were remarkably absent from non-COVID-19 policy documents yet present, albeit at relatively low levels, in COVID-19 policy documents. In a larger dataset, two of the top 10 "journals" cited in policy documents were found to be preprint servers (medRxiv and SSRN, in 5th and 8th position respectively) [37]. This suggests that preprints are being used to directly influence policy-makers and decision-making.
We only investigated a limited set of policy documents, largely restricted to Europe and the US, and whether this extends more globally remains to be explored. In the near future, we aim to examine the use of preprints in policy in more detail to address these questions.
As most COVID-19 preprints were not yet published, concerns regarding quality will persist [38]. Independent review projects and individual scientists have responded by publicly commenting on COVID-19 preprints (Fig. 4). Moreover, prominent scientists are using social media platforms such as Twitter to publicly share concerns about poor-quality COVID-19 preprints or to amplify high-quality preprints [43]. The use of Twitter to "peer-review" preprints provides additional, public scrutiny of manuscripts that can complement the more opaque and slower traditional peer-review process. Although these new review platforms partially combat poor-quality preprints, it is clear that there is a dire need to better understand the general quality and trustworthiness of preprints compared to peer-reviewed articles. We found that comparable proportions of preprints had been published within our short timeframe (Fig. 2), and that acceptance rates at several journals were only slightly reduced for COVID-19 research compared to non-COVID-19 articles (Supplemental Fig. 2), suggesting that, generally, preprints were of relatively good quality. Furthermore, recent studies have suggested that the quality of reporting in preprints differs little from that of their later peer-reviewed articles [44], and we ourselves are currently undertaking a more detailed analysis (see version 1 of our preprint for an initial analysis of published COVID-19 preprints [45]). However, the problem of poor-quality science is not unique to preprints, and ultimately a multi-pronged approach is required to solve some of these issues. For example, scientists must engage more responsibly with journalists and the public, in addition to upholding high standards when sharing research.
More significant consequences for academic misconduct and the swift removal of problematic articles will be essential in aiding this.
Moreover, the politicisation of science has become a polarising issue and must be prevented at all costs. Finally, transparency within the scientific process is essential for improving the understanding of its internal dynamics and providing accountability.
Our data demonstrate the indispensable role that preprints, and preprint servers, are playing during a global pandemic. By communicating science through preprints, we are sharing it at a faster rate than the current journal infrastructure allows. Furthermore, we provide evidence for important future discussions around scientific publishing.

Methods

Preprint Metadata for bioRxiv and medRxiv
We retrieved basic preprint metadata (DOIs, titles, abstracts, author names, corresponding author name and institution, dates, versions, licenses, categories and published article links) for bioRxiv and medRxiv preprints via the bioRxiv Application Programming Interface (API; https://api.biorxiv.org). The API accepts a 'server' parameter to enable retrieval of records for both bioRxiv and medRxiv. We initially collected metadata for all preprints posted from the time of each server's launch, corresponding to November 2013 for bioRxiv and June 2019 for medRxiv, until the end of our analysis period on 30th April 2020 (N = 84,524). All data were collected on 1st May 2020. Note that where multiple preprint versions existed, we included only the earliest version and recorded the total number of subsequent revisions. Preprints were classified as "COVID-19 preprints" or "non-COVID-19 preprints" on the basis of the following terms contained within their titles or abstracts (case-insensitive): "coronavirus", "covid-19", "sars-cov", "ncov-2019", "2019-ncov", "hcov-19", "sars-2". For comparison of preprint behaviour between the COVID-19 outbreak and previous viral epidemics, namely the Western African Ebola virus and Zika virus epidemics (Supplemental Fig. 1), the same procedure was applied using the keywords "ebola" or "zebov", and "zika" or "zikv", respectively.
For all preprints contained in the subset, disambiguated author affiliation and country data for corresponding authors were retrieved by querying raw affiliation strings against the Research Organization Registry (ROR) API (https://github.com/ror-community/ror-api). The API provides a service for matching affiliation strings against institutions contained in the registry, on the basis of multiple matching types (named "phrase", "common terms", "fuzzy", "heuristics", and "acronyms"). The service returns a list of potential matched institutions and their countries, as well as the matching type used, a confidence score with values between 0 and 1, and a binary "chosen" indicator identifying the most confidently matched institution. A small number (~500) of raw affiliation strings returned from the bioRxiv API were truncated at 160 characters; for these records, we conducted web scraping using the rvest package for R [46] to retrieve the full affiliation strings of corresponding authors from the bioRxiv public webpages, prior to matching. For the purposes of our study, we aimed for higher precision than recall, and thus only included matched institutions where the API returned a confidence score of 1. A manual check of a sample of returned results also suggested higher precision for results returned using the "phrase" matching type, and thus we only retained results using this matching type.
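The keyword-based classification used to label COVID-19 preprints reduces to a case-insensitive substring check over titles and abstracts. A minimal Python sketch (the study's own analyses were carried out in R; the term list is taken verbatim from the Methods):

```python
# Case-insensitive terms used to flag COVID-19 preprints (from the Methods);
# the epidemic comparisons use "ebola"/"zebov" and "zika"/"zikv" analogously.
COVID_TERMS = ("coronavirus", "covid-19", "sars-cov", "ncov-2019",
               "2019-ncov", "hcov-19", "sars-2")

def is_covid_preprint(title, abstract):
    """Return True if any term appears in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return any(term in text for term in COVID_TERMS)

print(is_covid_preprint("Estimating the reproduction number of SARS-CoV-2", ""))  # True
print(is_covid_preprint("Neural basis of olfaction in mice", "We image odor responses"))  # False
```

Note that "sars-cov" also matches "SARS-CoV-2" after lower-casing, so the shorter term subsumes the longer variant.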
In a final step, we applied manual corrections to the country information for a small subset of records where false positives would be most likely to influence our results, by a) iteratively examining the chronologically first preprint associated with each country following affiliation matching and applying manual rules to correct mismatched institutions until no further errors were detected (n = 8 institutions); and b) examining the top 50 most common raw affiliation strings and applying manual rules to correct any mismatched or unmatched institutions (n = 2 institutions). In total, we matched 19,002 preprints to a country (73.2%); for COVID-19 preprints alone, 1,716 preprints (67.9%) were matched to a country. Note that a similar, albeit more sophisticated, method of matching bioRxiv affiliation information with the ROR API service was recently documented by Abdill et al. [47].
Word counts and reference counts for each preprint were also added to the basic preprint metadata via scraping of the bioRxiv public webpages (medRxiv does not currently display full HTML texts, and so calculating word and reference counts was limited to bioRxiv preprints); web scraping was performed using the rvest package for R [46].

Usage data
Usage data (abstract views and pdf downloads) were scraped from the bioRxiv and medRxiv public webpages, using the rvest package for R [46]. bioRxiv and medRxiv webpages display abstract views and pdf downloads on a calendar-month basis; for subsequent analysis (e.g. Figure 4), these were summed to generate total abstract views and downloads since the time of preprint posting. In total, usage data were recorded for 25,865 preprints (99.9%); a small number were not recorded, possibly due to server issues during the web scraping process. Note that bioRxiv webpages also display counts of full-text views, although we did not include these data in our final analysis. This was partially to ensure consistency with medRxiv, which does not currently display full HTML texts, and partially due to ambiguities in the timeline of full-text publishing: the full text of a preprint is added several days after the preprint first becomes available, but the exact delay appears to vary from preprint to preprint. We also compared rates of PDF downloads for bioRxiv and medRxiv preprints with those of a number of other preprint servers (Preprints.org, SSRN, and Research Square) (Supplemental Fig. 3C).

Altmetrics and citations
Counts of multiple altmetric indicators (mentions in tweets, blogs, and news articles) were retrieved via Altmetric (https://www.altmetric.com), a service that monitors and aggregates mentions of scientific articles on various online platforms. Altmetric provide a free API (https://api.altmetric.com) against which we queried each preprint DOI in our analysis set. Importantly, Altmetric only contains records where an article has been mentioned in at least one of the sources tracked; thus, if our query returned an invalid response, we recorded counts for all indicators as zero. Coverage of each indicator (i.e. the proportion of preprints receiving at least a single mention in a particular source) for all preprints was 99.1%, 9.6%, and 3.5% for mentions in tweets, blogs and news articles, respectively. The high coverage on Twitter is likely driven, at least in part, by automated tweeting of preprints by the official bioRxiv and medRxiv Twitter accounts. For COVID-19 preprints, coverage was 100.0%, 16.6% and 26.9% for mentions in tweets, blogs and news articles, respectively.
To quantitatively capture how high-usage preprints were being received by Twitter users, we retrieved all tweets linking to the top ten most-tweeted preprints.
Tweet IDs were retrieved via the Altmetric API service, and then queried against the Twitter API using the rtweet package for R [50] to retrieve full tweet content.
Citation counts for each preprint were retrieved from the scholarly indexing database Dimensions (https://dimensions.ai). An advantage of using Dimensions over more traditional citation databases (e.g. Scopus, Web of Science) is that Dimensions also includes preprints from several sources in its database (including bioRxiv and medRxiv), as well as their respective citation counts. When a preprint was not found, we recorded its citation count as zero. Of all preprints, 3,707 (14.3%) recorded at least a single citation in Dimensions. For COVID-19 preprints, 774 preprints (30.6%) recorded at least a single citation.

Comments
bioRxiv and medRxiv HTML pages feature a Disqus (https://disqus.com) comment platform to allow readers to post text comments. Comment counts for each bioRxiv and medRxiv preprint were retrieved via the Disqus API service (https://disqus.com/api/docs/). Where multiple preprint versions existed, comments were aggregated over all versions. As with preprint perceptions among public audiences on Twitter, we then examined perceptions among academic audiences by examining comment sentiment. The text content of comments for COVID-19 preprints was provided directly by the bioRxiv development team.

Screening time for bioRxiv and medRxiv
To calculate screening time, we followed the method outlined by Steve Royle [51]. In short, we calculated the screening time as the difference in days between the preprint posting date and the date stamp of submission approval contained within bioRxiv and medRxiv DOIs (only available for preprints posted after 11th December 2019).
bioRxiv and medRxiv preprints were filtered to those posted between 1st January and 30th April 2020, accounting for the first version of a posted preprint.
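This DOI-based calculation can be sketched as follows (illustrative Python; the example DOI suffix is hypothetical but follows the date-stamped `YYYY.MM.DD.xxxxxx` format the servers adopted in December 2019):

```python
from datetime import date

def screening_time(doi, posted):
    """Screening time in days: posting date minus the submission date stamp
    embedded in bioRxiv/medRxiv DOIs issued after 11 December 2019,
    e.g. 10.1101/2020.02.28.970434 -> 2020-02-28 (hypothetical DOI)."""
    y, m, d = doi.split("/")[1].split(".")[:3]
    submitted = date(int(y), int(m), int(d))
    return (posted - submitted).days

print(screening_time("10.1101/2020.02.28.970434", date(2020, 3, 2)))  # -> 3
```

The approach assumes the DOI suffix begins with the approval date, which is why it only applies to preprints posted after the DOI format change.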

To describe the level of reliance upon preprints in policy documents, a set of policy documents was manually collected from the following institutional sources: the European Centre for Disease Prevention and Control (ECDC), the UK Parliamentary Office of Science and Technology (UK POST), the World Health Organization Scientific Briefs (WHO SB), and the US HSSCC. References within these documents were then text-mined and manually verified to calculate the proportion of references that were preprints.

Statistical analyses
Preprint counts were compared across categories (e.g. COVID-19 or non-COVID-19) using Chi-square tests or, in cases where any expected values were < 5, with Fisher's exact tests using Monte Carlo simulation. Quantitative preprint metrics (e.g. word count, comment count) were compared across categories using Mann-Whitney tests, and correlated with other quantitative metrics using Spearman's rank tests for univariate comparisons.
For time-variant metrics (e.g. views and downloads, which may be expected to vary with the length of preprint availability), we analysed the difference between COVID-19 and non-COVID-19 preprints using generalised linear regression models with calendar days since 1st January 2020 as an additional covariate and negative binomially distributed errors. This allowed estimates of time-adjusted rate ratios comparing COVID-19 and non-COVID-19 preprint metrics. Negative binomial regressions were constructed using the function 'glm.nb' in the R package MASS [52]. For multivariate categorical comparisons of preprint metrics (e.g. screening time between preprint type and preprint server), we constructed two-way factorial ANOVAs, testing for interactions between both category variables in all cases. Pairwise post-hoc comparisons of interest were tested using Tukey's honest significant difference (HSD) while correcting for multiple testing, using the function 'glht' in the R package multcomp [53].