The incremental value of the contribution of a biostatistician to the reporting quality in health research—A retrospective, single center, observational cohort study



Abstract
Background: The reporting quality in medical research has recently been critically discussed. While reporting guidelines intend to maximize the value of funded research, and initiatives such as the EQUATOR network have been introduced to advance high-quality reporting, the uptake of the guidelines by researchers could be improved. The aim of this study was to assess and quantify the contribution of a biostatistician to the reporting and methodological quality of health research, and to identify methodological knowledge gaps. Methods: In a retrospective, single-center, observational cohort study, two groups of publications were compared. The group of exposed publications had an academic biostatistician on the author list, whereas the group of non-exposed publications did not include a biostatistician from the evaluated group. Rating of reporting quality was done in blinded fashion and in duplicate. The primary outcome was a sum score based on six dimensions, ranging between 0 (worst) and 11 (best). The study protocol was reviewed and approved as a registered report. Results: There were 131 publications in the exposed group published between 2017 and 2018. Of these, 95 were RCTs, observational studies, or prediction / prognostic studies. Corresponding matches in the group of non-exposed publications were identified in a reproducible manner. Comparison of overall reporting quality revealed a 1.60 units (95% CI from 0.92 to 2.28, p < 0.0001) higher reporting quality for exposed publications. A subgroup analysis within study types showed higher reporting quality across all three study types. Conclusion: Our study is the first to report an association between a biostatistician on the author list and higher reporting quality and methodological strength in health research publications. The higher reporting quality persisted through subgroups of study types and dimensions.
Methodological knowledge gaps were identified for prediction / prognostic studies, and for reporting on statistical methods in general and missing values specifically.

Introduction

[…] are only around 50% [2]. When it comes to the reporting of randomized trials, Dechartres et al. [3] have systematically evaluated the reporting of more than 20'000 trials included in Cochrane reviews. They conclude that poor reporting has decreased over time, but that especially lower impact factor journals show room for improvement. Reporting quality of clinical prediction models has recently been evaluated systematically in the context of research on Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [4]. The authors concluded that almost all published models for predicting mortality were poorly reported, and that the corresponding Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD [5]) guideline was largely disregarded.

In Switzerland, the government paid 22.9 billion Swiss francs for research and development, representing more than 3% of the gross domestic product in 2019.

Publications in the field of "clinical medicine" represent 25% of all publications [6], and given the large amounts of resources, the value from research and publications should be maximized.

The objectives of the current study were, first, to quantify the contribution of a biostatistician to the reporting and methodological quality of health research publications and, second, to identify methodological knowledge gaps.

Methods

The study is a retrospective, single-center observational cohort study, conducted at the University of Zurich (UZH) and its University Hospital (USZ).

December 2, 2021 2/11

Selection of exposed and non-exposed publications

In this study, two groups of publications were compared. The group of "exposed" publications was defined according to their exposure to one or more of a set of 13 academic biostatisticians from the Epidemiology, Biostatistics and Prevention Institute, and the Institute of Mathematics, both located at the University of Zurich, as a co-author. This group will be referred to as biostatisticians in the following. The exposed publications were published between 2017 and 2018, and they were retrieved in a PubMed search on Dec 9, 2019, with a search string as specified in S1 Appendix.

Methodological publications as well as non-English language publications were excluded. To define the group of "non-exposed" publications for comparison, all medical […] number of non-exposed publications resulting from the affiliation list was used in a random but replicable order, aiming to remove potential chronological or any other systematic ordering while adhering to high standards of reproducibility.
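A "random but replicable order" can be implemented with a seeded pseudo-random shuffle: the ordering is arbitrary, yet exactly reproducible on every run. A minimal sketch in Python (the study itself used R; the function name, seed value, and identifiers below are illustrative assumptions, not the authors' code):

```python
import random

def replicable_order(publication_ids, seed=2019):
    """Return the publications in a random but reproducible order.

    Shuffling with a fixed seed removes chronological or other
    systematic ordering while keeping the result replicable.
    """
    ordered = sorted(publication_ids)   # deterministic starting point
    rng = random.Random(seed)           # fixed seed -> same order every run
    rng.shuffle(ordered)
    return ordered

# The same seed always yields the same ordering.
ids = ["pmid:101", "pmid:102", "pmid:103", "pmid:104"]
assert replicable_order(ids) == replicable_order(ids)
```

Sorting before shuffling makes the result independent of the (possibly non-deterministic) retrieval order of the PubMed export.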

Categorization into study types

For each of the exposed publications, the study type was determined, and the subset of all RCTs, observational studies, and prediction / prognostic studies was evaluated further. Categorization into study types was performed by the set of biostatisticians.

For most publications, the authors themselves determined the study type. For some publications, the biostatistician as co-author had left the department, and thus the study type was categorized independently and in duplicate by two authors (UH, EF).

After consensus on study type was reached, the record count for each study type in each publication year was obtained. The three study types RCT, observational study, and prediction / prognostic study were the most frequent types. Other types (e.g. systematic reviews) had been excluded a priori.

The number of non-exposed publications was much larger than the number of exposed publications. For that reason, the categorization of the non-exposed publications into RCTs, observational studies, and prediction / prognostic studies was performed in random but replicable order until the numbers of non-exposed publications of these study types matched the corresponding numbers of exposed publications per […] whether the corresponding reporting guideline was mentioned.

The rating of publications regarding these six items was operationalized and piloted, such that they could be used efficiently and robustly to rate each publication consistently. Each dimension had different possible answer categories, also dependent on study type, resulting in a rating varying between 0 (lowest) and 2 (highest) for dimensions 1 to 5, plus an additional point for mentioning the corresponding reporting guideline. Details of the operationalization can be found in S3 Appendix. The range of the total score was from 0 (lowest) to 11 (highest).
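The scoring rule can be written down directly: five dimensions scored 0-2 plus one bonus point for mentioning the reporting guideline yield a total between 0 and 11. A minimal Python sketch (the study's own tooling was an R Shiny app; this function and its names are illustrative):

```python
def sum_score(dimension_scores, guideline_mentioned):
    """Total reporting-quality score: five dimensions rated 0-2 each,
    plus one point if the reporting guideline is mentioned (range 0-11)."""
    if len(dimension_scores) != 5:
        raise ValueError("expected scores for exactly five dimensions")
    if any(s not in (0, 1, 2) for s in dimension_scores):
        raise ValueError("each dimension is rated 0, 1, or 2")
    return sum(dimension_scores) + (1 if guideline_mentioned else 0)

# A publication rated 2, 2, 1, 0, 1 that cites its guideline scores 7.
assert sum_score([2, 2, 1, 0, 1], guideline_mentioned=True) == 7
```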

Outcomes

The primary outcome of this study was the sum score of reporting quality and methodological strength in exposed and non-exposed publications, with respect to the six dimensions. The primary outcome was assessed in blinded fashion and in duplicate by two independent raters. The raters were recruited from outside of the departments. Blinding to whether the publication belonged to the exposed or non-exposed group was ensured by removing author names, affiliation lists, journal name, corresponding author name, author contributions, date, acknowledgements, references, and DOI from every publication's PDF. Discrepancies in the ratings between the two raters were resolved by a third rating and discussion until consensus was reached.

The secondary outcome of this study was the number of citations in the group of exposed and non-exposed publications at a fixed date (July 20, 2021).

Outcome rating and rater training

The outcome rating and its operationalization were developed by four authors (UH, KS, MH, EF). After the operationalization was finalized, the resulting questions for each study type were programmed to be evaluated through an R Shiny app, which underwent quality review and a testing period. The questionnaire can be found in S3 Appendix. To find raters outside of the core study team and outside of the departments, PhD programs in health research across Switzerland, as well as groups of researchers interested in Research on Research, were contacted. Each candidate rater could choose a study type and received written instructions for the rating task. The candidate raters were instructed and trained by rating vignette publications for calibration. These vignette publications of all study types were similar to the publications under study, but they were published in 2019 and had been rated with scrutiny by the study authors, including detailed explanations. Only upon successful completion of the test ratings did the raters receive sets of 11-12 papers of the same study type for rating. The raters were obliged to rate the reporting quality based on the blinded PDFs alone, and not to use additional information from the internet while doing so. Ratings were performed in blinded fashion, meaning that the raters were unaware of the classification of publications as exposed or non-exposed, and of the authors of the publications. The ratings were performed in duplicate, and any discrepancies were resolved by a third independent rating. The raters were reimbursed with vouchers for every set of 11-12 publications.

Additionally, raters were asked for co-authorship after completion of 33 or more ratings. In total, 15 raters were recruited. The ratings were done between May and July 2021.

Sample size

The sample size was justified a priori, based on the consideration that with 95 publications in the exposed group and 95 publications in the non-exposed group, at a significance level of 5% and with a power of 80%, an effect size of 0.41 (Cohen's d) could be detected, using a two-sided, two-sample t-test with equal variance assumption. This would be considered a medium effect size. The number of 95 publications corresponded to all publications in the exposed group in the years 2017 and 2018.
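The sample-size reasoning can be checked numerically. The sketch below uses only Python's standard library and the common normal approximation to the noncentral t distribution; it is an illustrative re-derivation, not the authors' calculation (which was presumably done in R):

```python
import math
from statistics import NormalDist

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test with equal
    variances, via the normal approximation to the noncentral t.
    d is Cohen's d; n_per_group is the number of publications per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # two-sided critical value
    nc = d * math.sqrt(n_per_group / 2.0)    # noncentrality parameter
    return z.cdf(nc - z_crit) + z.cdf(-nc - z_crit)

# With 95 publications per group and alpha = 5%, an effect of
# d = 0.41 is detected with roughly 80% power, as stated above.
power = power_two_sample_t(0.41, 95)
```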

Data management

Data collection in the context of this study had to cover two different aspects. First, categorization of the exposed and non-exposed publications into the three study types was performed with the help of a specifically programmed R Shiny app, in which the title and abstract, as well as the link to the full text, were provided, such that the categorization could be performed independently and in duplicate and that any discrepancies could be detected and resolved by discussion. Second, the reporting quality rating was performed using another R Shiny app, implementing the operationalized quality dimensions. The electronic records of the two independent ratings and of the consensus rating were saved. The use of R Shiny apps in this research supported highly reliable data entry.

Risk of bias

The study was designed to compensate for the following biases a priori. Risk of detection […] partially addressed by comparing the numbers of citations of exposed and non-exposed publications, under the hypothesis that equal citation numbers would indicate that less confounding by indication was present. […] based on the categorization suggested by Altman [8].

Statistical methods for the primary outcome included visualization of the results with dot plots (lollipop plots), in which the means of the outcome in the exposed and non-exposed publications are shown, overall (score 0 to 11) and in subgroups of study type (score 0 to 11) and reporting quality dimension (score 0 to 2). Besides that, the estimated between-group differences, overall and in subgroups of study type, were reported with 95% confidence intervals (CI). The two-sided, two-sample t-test under the assumption of equal variances was used to test the hypothesis of no difference in reporting quality between exposed and non-exposed publications. Corresponding […] The number of citations was reported overall, and in subgroups of study type, with medians and interquartile ranges, as the distribution was right-skewed. The non-parametric exact Wilcoxon-Mann-Whitney method was used to test the hypothesis of no difference in the number of citations between exposed and non-exposed publications, and to estimate a confidence interval. The between-group difference in location was estimated and reported with 95% CI, based on rank statistics.
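For concreteness, the equal-variance between-group comparison can be sketched as follows. This is a stdlib-only Python illustration on made-up data; the study's analysis was done in R, and in practice the exact t quantile would replace the normal critical value used here for brevity:

```python
import math
from statistics import NormalDist, mean, stdev

def two_sample_t(x, y, alpha=0.05):
    """Equal-variance two-sample comparison: returns the estimated
    between-group difference, Cohen's d (difference over the pooled SD),
    and an approximate 95% CI (normal critical value, not the t quantile)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    diff = mean(x) - mean(y)
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))   # standard error of the difference
    d = diff / math.sqrt(sp2)                 # Cohen's d from the pooled SD
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, d, (diff - z * se, diff + z * se)

# Toy data: exposed scores tend to be higher than non-exposed scores.
exposed = [9, 8, 10, 7, 9, 8, 11, 10]
non_exposed = [7, 6, 8, 5, 7, 9, 6, 8]
diff, d, ci = two_sample_t(exposed, non_exposed)
```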

All analyses, including the subgroup analyses described above, were pre-specified in the registered report study protocol [9]. The unit of analysis was the individual publication, or the reporting quality dimension.

Statistical programming was performed with R 4.1.1 [10], in combination with dynamic reporting. Statistical programming included downloading all potential non-exposed publications, random reordering, development of an R Shiny app for categorization of the publications, development of an R Shiny app for the recording of reporting quality ratings, as well as statistical programming of the methods for data analysis and visualization. Results of the study were reported according to the STROBE guidelines [11]. All anonymized data were uploaded to an OSF repository.

Results
In total, there were 131 exposed papers published in 2017 and 2018. Of these, 95 publications were of the study types RCT, observational study, or prediction / prognostic study. There were six RCTs, 77 observational studies, and 12 prediction / prognostic studies. The literature search for non-exposed publications with first and / or second author with a suitable affiliation and year resulted in a total of 3420 publications. Four hundred of these publications were categorized, in random order, into one of the three study types RCT, observational, or prediction / prognostic study, and the retrieved case numbers of the exposed papers could be frequency matched individually for 2017 and 2018. The corresponding flow chart is shown in Fig. 1. All data were made available on OSF [12].

Ten of the exposed publications and two of the non-exposed publications mentioned the corresponding reporting guideline. In 48 of the exposed publications, and in 14 of the non-exposed publications, the programming language R was used for the statistical analysis. All descriptive results can be found in Table 1. […] The agreement between the two ratings of each publication was 0.52 (95% CI from 0.46 to 0.57) overall, indicating moderate agreement according to Altman [8].

For the three different study types, however, the agreement varied between 0.31 (95% CI from 0.05 to 0.52) for RCTs and 0.52 (95% CI from 0.46 to 0.59) and 0.53 (95% CI from 0.35 to 0.68) for observational and prediction studies, respectively. To reach consensus for all ratings with discrepancies, a third blinded rater was involved.
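Agreement between duplicate categorical ratings of this kind is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance; Altman's categorization then labels 0.41-0.60 as "moderate". A self-contained Python sketch on made-up ratings (the paper does not print its agreement formula, so this is an assumed illustration):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / n ** 2  # chance agreement
    return (observed - expected) / (1 - expected)

# Two raters scoring ten publications on a 0-2 dimension scale.
r1 = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]
r2 = [0, 1, 2, 2, 0, 2, 1, 1, 0, 0]
kappa = cohens_kappa(r1, r2)   # about 0.55: "moderate" per Altman
```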

Primary outcome

The estimated between-group difference for the primary outcome was 1.60 (95% CI from 0.92 to 2.28, p < 0.0001) in favor of the exposed publications. This result corresponds to a Cohen's d of 0.67 (95% CI from 0.38 to 0.97). In the pre-specified subgroups of study type, the estimated between-group difference was 3.33 (95% CI from -0.84 to 7.51), 1.39 (95% CI from 0.68 to 2.09), and 2.08 (95% CI from 0.12 to 4.04) for randomized, observational, and prediction / prognostic studies, respectively (Fig. 2), showing higher reporting quality across all study types. In addition to the estimation of the between-group difference, the representation of each subgroup's mean values shows that reporting quality was generally higher for RCTs than for observational and prediction / prognostic studies.

Dimension-specific score values

For each of the five reporting dimensions, the between-group difference was estimated. The corresponding range of values was between 0 (worst) and 2 (best). Again, the results are shown in a graphical representation (Fig. 3). The dimension "Variables" had a smaller between-group difference and a higher reporting quality overall, whereas the "Missing data" and "Statistical methods" dimensions were generally reported with less detail. The mean reporting quality in the exposed publications was higher throughout than that of the non-exposed publications.

Number of citations
The number of citations, extracted on July 20, 2021, had a non-normal, right-skewed distribution, and for that reason the non-parametric exact Wilcoxon-Mann-Whitney method was used for the estimation of the between-group difference and its confidence interval. The estimate was -2 (95% CI from -4 to 0, p = 0.07), indicating weak evidence for higher citation numbers for the non-exposed publications. All descriptive statistics for the number of citations can be found in Table 2. The number of citations was relatively balanced for observational studies and prediction / prognostic studies, whereas for RCTs the number of citations was much larger in the non-exposed group of publications as compared to the exposed publications.

Discussion

Our study suggests that academic biostatisticians as co-authors have a positive impact on reporting quality and methodological strength in health research publications, overall and in subgroups of study types. In addition to that, the subgroup analyses showed evidence for a higher reporting quality in the exposed publications for observational studies and prediction / prognostic studies. The

CONSORT statement seems to have been taken up well, because reporting quality was generally highest in RCTs, for both exposed and non-exposed publications. Citation numbers were comparable between observational and prediction / prognostic studies, but the average number of citations for RCTs was higher in the non-exposed group of publications. Generally, the number of RCTs was very low.

Methodological knowledge gaps seemed to be more prominent in the areas of statistical methods and missing values. Nevertheless, the mean reporting quality was higher in the exposed publications throughout all subgroups. While it seems reasonable to assume that in the exposed papers the biostatisticians knew the methods well, there was still suboptimal reporting of these. The rating of reporting quality was performed in duplicate, and the agreement between first and second ratings was moderate to good, overall. The difficulties in the rating task were themselves an indicator of suboptimal reporting. Our study is, to our knowledge, the first to develop and use a rating score that is usable across study types, and which allows the comparison of reporting quality across study types. Low citation numbers of the corresponding reporting guidelines in both the group of exposed and the group of non-exposed publications may be an indication of a lack of awareness among study authors. […] [13]. Observational studies were the most frequent study type in the sample at hand, and reporting as well as methodological quality was only moderately higher in the exposed publications than in the non-exposed publications. Although the STROBE reporting […] [14]. […] bias assessment could be facilitated if reporting quality was higher generally. Another limitation of our study was the low agreement between ratings for the RCTs, which turned out to be only fair. An explanation for this could be the small number of RCTs, which were rated by only two different raters. Both raters were relatively consistent in their ratings: one of them being somewhat strict, and the other one relatively lenient.

These discrepancies meant that many questions had to be rated by a third rater to reach consensus.

Our study has several strengths. First of all, we had written a clear study protocol, which received external review as a registered report. Upon review of the protocol, the study design and operationalization could be revised and improved. Second, several measures were taken to compensate for different sources of bias, as our study was observational and retrospective. These included double ratings of reporting quality, unbiased assessment of reporting quality through blinded PDFs, and highly reliable data entry through the specifically designed R Shiny apps.

Our study has several implications for future research. First of all, the study design can repeatedly be applied for future assessments of reporting quality in our group or in other academic centers over time. The continuing discussions about the assessment already had an impact on the awareness of the topic among the people involved. In addition, the setup can be generalized to address other documents, e.g., systematic reviews (based on PRISMA [15]), statistical analysis plans [16], or research proposals (SPIRIT [17]).

Academic biostatisticians should take more responsibility in the review of final