Scientific quality of COVID-19 and SARS CoV-2 publications in the highest impact medical journals during the early phase of the pandemic: A case control study

Background A debate about the scientific quality of COVID-19 themed research has emerged. We explored whether the quality of evidence of COVID-19 publications is lower when compared to nonCOVID-19 publications in the three highest ranked scientific medical journals. Methods We searched the PubMed Database from March 12 to April 12, 2020 and identified 559 publications in the New England Journal of Medicine, the Journal of the American Medical Association, and The Lancet, which were divided into COVID-19 (cases, n = 204) and nonCOVID-19 (controls, n = 355) associated content. After exclusion of secondary, unauthored, response letters and non-matching article types, 155 COVID-19 publications (including 13 original articles) and 130 nonCOVID-19 publications (including 52 original articles) were included in the comparative analysis. The hierarchical level of evidence was determined for each publication included and compared between cases and controls as the main outcome. A quantitative scoring of quality was carried out for the subgroup of original articles. The numbers of authors and citation rates were also compared between groups. Results The 130 nonCOVID-19 publications were associated with higher levels of evidence on the level of evidence pyramid, with a strong association measure (Cramer's V: 0.452, P <0.001). The 155 COVID-19 publications were 18-fold more likely to be of lower evidence (95% confidence interval [CI] for odds ratio, 7.0–47; P <0.001). The quantitative quality score (maximum possible score, 28) was significantly different in favor of nonCOVID-19 (mean difference, 11.1; 95% CI, 8.5–13.7; P <0.001). There was a significant difference in the early citation rate of the original articles that favored the COVID-19 original articles (median [interquartile range], 45 [30–244] vs. 2 [1–4] citations; P <0.001).
Conclusions We conclude that the quality of COVID-19 publications in the three highest ranked scientific medical journals is below the quality average of these journals. These findings need to be verified at a later stage of the pandemic.


Introduction
Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS CoV-2), and it is a rapidly spreading pandemic that is putting extraordinary stress on healthcare systems across the globe (for simplicity, we will use COVID-19 in reference to both the virus and the disease). While everyone waits for a breakthrough of a specific COVID-19 therapy and an effective vaccine, scientists are redirecting their efforts into COVID-19-themed research to build up our knowledge of this new disease [1]. A search for "COVID-19 or SARS-CoV2" in the PubMed database revealed 4,670 publications between January 1, 2020, and April 12, 2020. This need to publish COVID-19-related findings has been supported by many Ethical Committees, grant providers, and journal editors, who have 'fast-tracked' COVID-19 publications so that they can be processed at record speed [2][3][4]. However, concerns are emerging that scientific standards are not being met.
The first report of COVID-19 transmission in asymptomatic individuals [5] was later considered to have been flawed, because the patient showed symptoms at the time of transmission [6]. A similar example occurred in The Lancet, whereby the authors retracted a publication after admitting irregularities in the first-hand account of the front-line experience of two Chinese nurses [7]. While our article was under review, two major analyses on the use of hydroxychloroquine and cardiovascular mortality associated with COVID-19 were retracted by The Lancet [8] and the New England Journal of Medicine [9] because the source data could not be verified.
Such situations raise concerns as to the quality of the data, the conclusions presented by the authors, and the peer review by the editors, due to the pressure to publish highly coveted information on COVID-19. The urgency of the outbreak suddenly appears to legitimize key limitations of studies, such as small sample sizes, lack of randomization or blinding, and unvalidated surrogate endpoints [10,11].
While clinicians and the public long for effective treatments, a debate about the quality of this surge of research and the potential violations of scientific rigor has emerged [10,12,13]. Despite this massive publication effort, current guidelines remain without any recommendations on core topics for patient management and care [14,15]. The combination of clinical urgency, weak evidence, pre-print publications without prior peer review [16], and public pressure [17] might lead to inappropriate public health actions and incorrect translation into clinical practice [18], with the potential for worrying breaches in patient safety [19]. A further concern is the inflation of publication metrics, particularly in terms of journal impact factors. Citation-based metrics are used by researchers to maximize the citation potential of their articles [20]. The expectation of a high citation rate might be used by journals to publish papers of questionable scientific value on 'trendy' topics [21].
To date, the quality of COVID-19 publications in the top three general medical journals by impact factor (i.e., the New England Journal of Medicine, The Lancet, and the Journal of the American Medical Association, each with an impact factor > 50) has not been formally assessed. We hypothesized that the quality of recent publications on COVID-19 in the three most influential medical journals is lower than for nonCOVID-19 articles published during the same time period. We also determined the early research impact of COVID-19 original articles versus nonCOVID-19 original articles.

Materials and methods
This report follows the applicable STROBE guidelines for case-control studies. We searched the PubMed database for publications in the New England Journal of Medicine, the Journal of the American Medical Association, and The Lancet between March 12 and April 12, 2020. The resulting publications were stratified into COVID-19-related and nonCOVID-19-related. We matched the nonCOVID-19 publications with COVID-19 publications according to article types within each journal, with the exclusion of nonmatching article types. Secondary studies, correspondence letters on previously published articles, unauthored publications, and specific article types not matching any of the six categories on the levels of the evidence pyramid [22][23][24] (e.g., infographic, erratum) were excluded (Fig 1).

Multi-step design
We performed a multi-step 360-degree assessment of the studies. It consisted of classification according to level of evidence, a quantitative appraisal of methodological quality using a validated tool, and a narrative analysis of the strengths and weaknesses of the COVID-19 publications, as is often used in the social sciences [25]. The early citation frequencies of the original articles were also determined.

Levels of evidence
All of the publications included were assessed for number of authors and level of evidence. We used the Oxford Quality Rating Scheme for Studies and Other Evidence [22] to categorize the level of evidence, as adjusted to include animal and in-vitro research [23,24]. The highest level is attributed to research such as randomized trials, followed by nonrandomized controlled studies and cohort studies. The lower levels are represented by descriptive studies, expert opinion, and animal or in-vitro research, commonly represented in the form of a pyramid [22,23,26]. For secondary analysis, we split the six levels of evidence into the upper and lower halves, which reflected higher (i.e., 1-3) and lower (i.e., 4-6) levels of evidence, respectively. The number of authors per publication was counted manually. After the hierarchical grading of the included publications, the original articles (i.e., published as 'original research articles' in each of the journals; Fig 1) were defined for further in-depth analysis using the study quality checklist proposed by Kmet et al. [27]. This checklist is consistent with the recommendations from the Center for Reviews and Dissemination [28,29]. Four authors in pairs (MZ-DB, JBE-BZ; each pair assessing one half of the publications) independently assessed the original articles on 14 quality criteria (see S1 File). The 14 items covered the research question, design, measures to reduce bias, and data reporting and interpretation, and these were scored according to the degree to which each specific criterion was met ("yes" = 2; "partial" = 1; "no" = 0; "not applicable" = n/a) with the help of a prespecified manual [27]. The total score ranged from 0 to 28. The summary percentage score was calculated for each original article by summing the total score obtained across the applicable items, dividing by the total possible score (i.e., 28 − [number of "n/a" items] × 2), and multiplying by 100.
Disagreements between the reviewers (defined as a >2 difference in the total score, or a >10% difference in the summary percentage scores) were resolved through one round of discussion between each 2-author pair.
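As an illustration only (this is not the authors' code), the QUALSYST-style scoring described above can be sketched in a few lines of Python; the item scores below are hypothetical:

```python
# Illustrative sketch of the scoring scheme described above: 14 items scored
# 2 ("yes"), 1 ("partial"), 0 ("no"), or None ("not applicable"), with the
# summary percentage computed over the applicable items only.

def summary_percentage(item_scores):
    """Return (total_score, summary_percentage) for one article.

    item_scores: list of 14 entries, each 2, 1, 0, or None (n/a).
    """
    applicable = [s for s in item_scores if s is not None]
    n_na = len(item_scores) - len(applicable)
    total = sum(applicable)
    # 14 items x 2 points, minus 2 points for each n/a item
    max_possible = 28 - n_na * 2
    return total, 100.0 * total / max_possible


# Hypothetical article: 10 items fully met, 2 partial, 1 not met, 1 n/a
scores = [2] * 10 + [1, 1, 0, None]
total, pct = summary_percentage(scores)
print(total, round(pct, 1))  # 22 points out of a possible 26
```

This makes explicit that "not applicable" items shrink the denominator rather than counting as zero, which is why the summary percentage, not the raw total, is comparable across study types.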

Narrative analysis of COVID-19 original articles
The COVID-19 original research articles (n = 13) were assessed in narrative form to report on their major weaknesses, potential conflicts of interest, and likely influence on further research and clinical practice.

Citation frequencies
The early citation frequencies were tracked every 5 days from April 25 to May 25, 2020, for all of the original scientific articles through Google Scholar [30], to determine how strongly these COVID-19 original articles had impacted upon further publications, in comparison to the nonCOVID-19 original articles. A comparison to a set of original articles from the same time frame in 2019 was also performed. Citations per month were calculated to reduce lead time bias. The Google Scholar search engine has been shown to reliably identify the most highly cited academic documents [31].

Statistical analysis
The distributions of the COVID-19 and nonCOVID-19 publications on the levels of evidence pyramid were analyzed using Pearson's Chi-squared statistics and Cramer's V as the measure of strength of association (weak: >0.05; moderate: >0.10; strong: >0.15; very strong: >0.25) [32]. Further effect size estimations were performed on two by two contingency tables (split by level of evidence into high and low quality groups) and are reported as odds ratios with 95% confidence intervals (CI).
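For readers unfamiliar with these measures, the computation of the chi-squared statistic, Cramer's V, and the odds ratio with its Wald 95% CI from a 2 × 2 table can be sketched as follows; the counts in the table are made up for illustration and are not the study's data:

```python
import math

# Hypothetical 2x2 table (illustrative counts, not the study's data):
# rows = COVID-19 / nonCOVID-19, columns = lower / higher level of evidence.
table = [[120, 35],   # COVID-19: lower, higher
         [40, 90]]    # nonCOVID-19: lower, higher

a, b = table[0]
c, d = table[1]
n = a + b + c + d

# Pearson chi-squared statistic for the 2x2 table
expected = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
            [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
chi2 = sum((obs - exp) ** 2 / exp
           for row_obs, row_exp in zip(table, expected)
           for obs, exp in zip(row_obs, row_exp))

# Cramer's V; for a 2x2 table, min(rows, cols) - 1 = 1
v = math.sqrt(chi2 / n)

# Odds ratio with a Wald 95% CI computed on the log scale
odds_ratio = (a * d) / (b * c)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se)
print(round(v, 3), round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))
```

Note the asymmetry this produces: the CI is symmetric around the log odds ratio, not around the odds ratio itself, which matches the wide intervals typical of small higher-evidence cell counts.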
The retrospectively calculated sample size for the comparison of summary percentage scores [27] between the COVID-19 and nonCOVID-19 original articles on a t-test, with a standard deviation of 15, 85% power, and an alpha of 0.05, was 8 original articles [33,34]. Thus, we deemed our collected data sufficient.
We also planned for a secondary analysis if the comparison above resulted in a significant difference (defined as P <0.05) in the mean percentage scores between the COVID-19 and nonCOVID-19 original articles. The secondary analysis used a 2:1 allocation of nonCOVID-19:COVID-19 original articles, comparing the 26 nonCOVID-19 original articles with the lowest summary percentage scores against all 13 COVID-19 original articles. The threshold p-value for significance was set at P <0.025, to adjust for multiple testing.
Assessment of the original articles' quality is reported as a two-reviewer mean score (95% CI) and was analyzed using Welch's t-tests. Hedges's g was used as the effect size measure based on the standardized mean difference [35] (small: 0.20; medium: 0.50; large: 0.80; very large: 1.20; huge: >2.00) [36,37]. To confirm the reliability of the scoring, Cronbach's alpha was calculated for the total score and the summary percentage score (internal consistency), and the Intraclass Correlation Coefficient with absolute agreement for the inter-rater reliability. The percentage agreement between the two reviewers was also calculated for each individual item (see S2 File).
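Hedges's g is Cohen's d with a small-sample bias correction, which matters at the group sizes involved here. A minimal sketch, using invented quality scores purely for illustration:

```python
import math

# Minimal sketch of a standardized mean difference with Hedges's
# small-sample correction. The score lists below are hypothetical and
# are not data from the study.

def hedges_g(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Sample variances (n - 1 denominator)
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    # Pooled standard deviation
    sp = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    d = (mx - my) / sp
    # Hedges's correction factor J for small samples
    j = 1 - 3 / (4 * (nx + ny) - 9)
    return d * j

noncovid_scores = [24, 26, 22, 25, 27, 23]
covid_scores = [14, 16, 12, 15, 13, 17]
g = hedges_g(noncovid_scores, covid_scores)
print(round(g, 2))
```

With these invented scores the effect lands in the "huge" band (>2.00) of the thresholds cited above; the correction factor J shrinks the raw d slightly, counteracting the upward bias of small samples.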
The data distributions were tested for normality with Kolmogorov-Smirnov tests, and are reported accordingly. Tests between two groups were performed with Mann-Whitney tests, and between multiple groups with Kruskal-Wallis tests. Significance was set at P <0.05 or adjusted for multiple testing. All of the tests were two-tailed. The statistical analysis was performed using SPSS Statistics 20 (IBM Inc., Armonk, NY, USA) and Prism 8 (GraphPad Software, San Diego, CA, USA).

Results
Out of 559 publication entries on PubMed for the selected journals, 155 publications on COVID-19 and 130 publications on other (nonCOVID-19) topics were included in the level of evidence analysis. The subsequent analysis of quality was performed on 13 COVID-19 original articles in comparison with 52 nonCOVID-19 original articles (Fig 1).

Levels of evidence and number of authors
The nonCOVID-19 publications were associated with higher quality on the level of evidence pyramid (P <0.001; Chi-squared), with a strong association measure (Cramer's V: 0.452, Table 1). When comparing the higher evidence group to the lower evidence group, the COVID-19 publications were 18-fold more likely (i.e., odds ratio) to be in the lower evidence group (95% CI: 7.0-47; P <0.001). When comparing only the original articles on the levels of evidence pyramid (Table 2), the nonCOVID-19 publications were also associated with higher levels of evidence. The number of authors was higher in the COVID-19 publications than in the nonCOVID-19 publications (median [interquartile range]: 2 [1-3]) (P <0.001; Mann-Whitney). Obvious outliers were a NEJM case report [38] with 35 authors, an opinion correspondence piece in The Lancet [39] with 29 authors, and a comment piece in The Lancet with 77 authors in a coalition [40].

Quantitative appraisal
Due to a >2 difference in the total scores, or a >10% difference in the summary percentage scores, the reviewer pairs discussed 8 (of 32) and 12 (of 33) of the original articles, respectively, after the individual scoring. The internal consistency reliability of the total score was 0.987, and of the summary percentage score 0.964 (Cronbach's alphas) for the reviewer pair MZ-DB, and 0.988 and 0.928, respectively, for the reviewer pair JBE-BZ (P <0.001, for all). The inter-rater reliability of the total score was 0.975, and of the summary percentage score 0.930 (Intraclass Correlation Coefficient, absolute agreement) for pair MZ-DB, and 0.974 and 0.860, respectively, for pair JBE-BZ (P <0.001, for all).

Citation frequency
There was a significant difference in the median number of citations according to Google Scholar at each of the seven dates tested, favoring the COVID-19 original research papers (P <0.001, for all; Mann-Whitney; Table 3).

Narrative appraisal
The major weaknesses of the 13 COVID-19 original research articles were assessed (Table 4). The selection included one randomized trial [41], four retrospective cohort studies or case series [42][43][44][45], five epidemiological descriptive studies [46][47][48][49][50], and three epidemiologic modeling studies [51][52][53], with most of the designs reflecting low grades of evidence [22]. Most of these studies had limitations in terms of missing data or under-reporting. The randomized trial was not blinded. Ten studies showed no apparent conflicts of interest. Two studies were based on data collected by the World Health Organization [51,52], and in another study [54] a pharmaceutical company screened the patients for treatment, collected the data, and supported the trial financially. Two studies had a patient:author ratio <1 [43,46], and two further studies had ratios close to 1 [55,56]. Three studies were considered not relevant for further research [46,48,55], and four studies were deemed not relevant for clinical practice [43,46,55,56], because the findings were neither new nor generalizable. The 13 COVID-19 original articles have already been cited in 52 sets of published guidelines.

Discussion
The main finding of our study is that the COVID-19-related research in these highly ranked medical journals is of lower quality than research on other topics in the same journals for the same period of time, with strong measures for effect size. We also demonstrated that the number of publications on COVID-19 alone is almost the same as the number of publications on all other topics. These findings provide evidence for the debate on the scientific value, ethics, and information overload of COVID-19 research [10,13,19].

There are several limitations to the present study. Even though our data were less than a month old at first submission, the results may soon become obsolete, as new COVID-19 research emerges on a daily basis. We tried to overcome potential bias with a clear search strategy and simple analysis, making our findings highly reproducible. We chose Lander's method because it allowed inclusion of in-vitro and animal research [23], and we refined the hierarchical grading of the level of evidence using a quantitative tool [27]. Given the vast choice [57], we chose the QUALSYST tool on the basis that it allows assessment and comparison across multiple study types [27]. Even though summary scoring might be biased for a methodological quality assessment [57], "composite quality scales can provide useful overall assessments when comparing populations of trials" [57]. The QUALSYST tool has been validated and is easy to use, which may facilitate additional similar studies at a later stage of the pandemic. Compared to an in-depth analysis of a study's peer-review process prior to acceptance for publication, however, our assessment must remain superficial. We did not expand our analysis to check source data. The data scandal leading to the retraction of two major studies [8,9] emerged while our article was under peer review; the tools we used would not have been able to detect this. Public data repositories and an "open science" approach may facilitate data validation [58].
The imbalance between the two cohorts in our study might come from a lack of randomized trials and a proliferation of opinion articles and cluster descriptions among the COVID-19 publications. It can be argued that in the early phases of a pandemic, case-defining reports are mandatory for the evolving dynamics of the outbreak, and that such studies will suffer from the usual limitations of initial investigations and will score lower on quality, even when they are carried out to high standards. However, in our secondary analysis, after exclusion of the highest-quality nonCOVID-19 publications, the significant quality difference remained. One might argue that a comparison to a historical control group, for example the same time frame in 2019, when there was no pandemic effect on research, would have been more appropriate. Our hypothesis was that COVID-related research showed lower quality than non-COVID research. A historical control group may introduce a selection bias, since conditions for research then would have been clearly different. We would therefore argue that the control group has to be subject to the same conditions as the test group when methodological quality is assessed. This may be different for other endpoints, such as total research output. In line with our results, Stefanini et al. reported, in an oral presentation at the European Society of Cardiology Congress 2020, similar findings of lower quality associated with COVID-19 in the same journals and timeframe as our work, using a historical control group from 2019. Thus, both historical and contemporary control groups lead to the same conclusions.
The COVID-19 theme per se might have attracted more readers and researchers, which will have led to more citations and greater incorporation into secondary studies, as we have also demonstrated. Such a 'double-whammy' of lower-quality literature and high dissemination potential can have grave consequences, as it might urge clinicians to take actions and use treatments that are compassionately based but supported by little scientific evidence. Indeed, apart from exposing patients to potential side effects of some drugs [46,59,60], treatment strategies based on case reports are generally futile [61]. While multiple diagnostic, therapeutic, and preventive interventions for COVID-19 are being trialed [62], clinicians should sometimes resist the wish "to at least do something", and maintain clinical equipoise while fully gathering and evaluating the data that are available [12,61]. This responsibility needs to be shared by the high-impact journals, which should continue to maintain publication standards as for other nonCOVID-19 research. It must be acknowledged, though, that a citation is not necessarily positive for a study or author when the context of the citations, i.e., criticism or discussions about retractions and corrections, is considered. This is beyond the scope of our work.
The pandemic took a toll on all aspects of life. Clearly, journal reviewers were restricted in the time they were able to invest into their valuable, voluntary, and honorary work. To what extent changes in their practices have occurred is not accessible to us, since the peer-review process was blind and confidential. Assessing journals with open peer review during the pandemic may shed light on such phenomena, but this was beyond the scope of our study.
We also demonstrated a worrying trend toward increasingly long author lists in lower quality COVID-19 publications, with the almost 'anecdotal' finding that some of the publications actually had more authors than patients [38,43,46]. The current demand for publications appears to have led authors to send their COVID-19 findings to higher-impact journals. As the authors of the present report, we are exposed to the same allegations.
At present, we can only issue a plea to both authors and editors to maintain their ethical and moral responsibilities in terms of the International Committee of Medical Journal Editors authorship standards. Being at the forefront of medical discovery, these journals should not publish lower quality findings just to promote citations. The risk of bias and unintended consequences for patients is relevant [61], and scientific standards must not be 'negotiable' [10].

Conclusions
The quality of the COVID-19-related research in the top three scientific medical journals is below the quality average of these journals. Unfortunately, our numbers do not contribute to a solution as to how to preserve scientific rigor under the pressure of a pandemic.