Statistical Reviewers Improve Reporting in Biomedical Articles: A Randomized Trial

Background
Although peer review is widely considered to be the most credible way of selecting manuscripts and improving the quality of accepted papers in scientific journals, there is little evidence to support its use. Our aim was to estimate the effects on manuscript quality of adding a statistical peer reviewer, of suggesting the use of checklists such as CONSORT or STARD to clinical reviewers, or of both.

Methodology and Principal Findings
Interventions were defined as 1) the addition of a statistical reviewer to the clinical peer review process, and 2) suggesting reporting guidelines to reviewers; with “no statistical expert” and “no checklist” as controls. The two interventions were crossed in a 2×2 balanced factorial design including original research articles consecutively selected, between May 2004 and March 2005, by the Medicina Clinica (Barc) editorial committee. We randomized manuscripts to minimize differences in terms of baseline quality and type of study (intervention, longitudinal, cross-sectional, others). Sample-size calculations indicated that 100 papers provide 80% power to detect a standardized difference of 55%. We specified the main outcome as the increment in quality of papers as measured on the Goodman Scale. Two blinded evaluators rated the quality of manuscripts at initial submission and at the final post peer review version. Of the 327 manuscripts submitted to the journal, 131 were accepted for further review, and 129 were randomized. Of those, 14 were lost to follow-up; they showed no differences in initial quality from the followed-up papers. Hence, 115 were included in the main analysis, with 16 rejected for publication after peer review. Of the 115 included papers, 21 (18.3%) were interventions, 46 (40.0%) were longitudinal designs, 28 (24.3%) cross-sectional and 20 (17.4%) others. The 16 (13.9%) rejected papers had a significantly lower initial score on the overall Goodman scale than accepted papers (difference 15.0, 95% CI: 4.6–24.4). Suggesting a guideline to the reviewers had no significant effect on change in overall quality as measured by the Goodman scale (0.9, 95% CI: −0.3–+2.1). The estimated effect of adding a statistical reviewer was 5.5 (95% CI: 4.3–6.7), showing a significant improvement in quality.

Conclusions and Significance
This prospective randomized study shows the positive effect of adding a statistical reviewer to the field-expert peers in improving manuscript quality. We did not find a statistically significant positive effect of suggesting that reviewers use reporting guidelines.


INTRODUCTION
Despite being widely accepted as the best way to filter low-quality research, to detect flaws in scientific communications and to improve papers with significant contributions to their fields [1][2][3], peer review has also attracted many criticisms [4][5][6]. As a process carried out by humans it certainly has its weaknesses, and many initiatives have therefore been developed to improve it. For instance, some methodological mistakes that referees fail to detect or do not consider properly are continually repeated in published papers and proposals; because of this, both the development of reporting guidelines [7][8][9][10][11] and the suggestion of adding methodological experts to the referee panel have been promoted [12] and implemented [13]. Given this, one might expect strong scientific evidence in favor of peer review. Surprisingly, however, there have not been many attempts to determine its effect through measurable variables “with the scientific rigor they demand of their authors” [14], and little evidence supports its use [15,16]. Attempts to quantify its effects cannot rely on comparisons in which reports are accepted without any intervention between submission and final publication. In fact, due to ethical and practical considerations, special efforts have been made not to delay or interrupt the normal screening processes, so that the only alternative for evaluating the effects of peer review has been to assess surrogate variables.
The suggestion of adding a methodological expert to the reviewers pool of a Spanish biomedical journal gave us the chance to conduct a masked, randomized experiment to assess the effect of peer review, not only without interfering with the regular course of the editorial process, but also describing more realistically the true role of peer reviewing in improving the quality of papers. The two main objectives of our investigation were to assess the effects of (1) adding a statistical peer reviewer and (2) suggesting reporting guidelines [17][18][19][20][21][22] to reviewers, by taking into account a direct measurement of the final quality of papers instead of relying on surrogate variables.

METHODS

Setting
Medicina Clínica (www.doyma.es/medicinaclinica) is a peer-reviewed weekly Spanish biomedical journal included in the Science Citation Index, the Current Contents, the Index Medicus and Excerpta Medica. It aims to publish original research papers, review articles, brief clinical notes, challenging editorials and the opinions of readers in the “letter to the editor” section. All submitted original research manuscripts are first evaluated by the journal's editorial committee, who decide which papers meet the journal's criteria and standards of relevance, and which are consequently sent for external peer review, usually by two referees from the journal pool who are particularly familiar with the subject matter of the paper.
For this study, we included manuscripts sent to the “original” and “brief original” sections, which have to contain original primary research and include statistical analysis.
Qualitative research reports, case series, editorials, nonsystematic reviews and letters to the editor were systematically excluded from the study.

Data and allocation
Original research articles submitted consecutively to Medicina Clínica between May 2004 and March 2005 were assessed for eligibility (JMR, FC). Articles not fitting the journal's editorial policy were excluded.
We randomly allocated the manuscripts accepted for review into four groups defined by the interventions: Clinical reviewers (C) as normal procedure; Clinical reviewers plus a Statistical reviewer (CS); Clinical reviewers with checKlist (CK); and Clinical reviewers plus a Statistical reviewer and checKlist (CSK). Group C, acting as the control group, received only a clinical review: each article was sent to two clinical reviewers chosen from the usual pool used by the journal. Papers were randomized once the two clinical peers had been chosen. Those allocated to the CS group were also sent to an expert statistical reviewer selected from the Medicina Clínica referee pool, which includes mainly senior methodological experts and graduate statisticians. A total of 39 methodological experts (table 1) were employed as statistical reviewers. Due to the late introduction, during the nineties, of formal statistical studies in Spain, reviewers with formal academic degrees in statistics were much younger (32.9, SD 2.7) than statistical reviewers with academic degrees from applied bio-health disciplines. Although both authors and reviewers were informed that their material would be used to evaluate quality improvement during the editorial process, they were not told the specific objectives.
Manuscripts sorted into the CK intervention group were simply sent to the two clinical reviewers with a standard letter [''To facilitate your revision, you will find enclosed the reporting guideline from Bosch  . Reviewers were not asked to report whether they used the reporting guideline in reviewing the manuscript. Finally, manuscripts from the CSK group were defined by the use of both interventions.
Each paper was appraised (on a 9-point Likert scale) and classified (by EC) by study type (1: intervention, if a treatment different from the standard was given to patients; 2: longitudinal, if observation covered more than one time point; 3: cross-sectional, if observation was made at just one time point; and 4: others, if the study could not fit into the previous groups, for example, non-human units, population data, meta-analyses and so on). Then, manuscripts were randomly allocated (by AS) using a computer program that first stratifies by study type, and second allocates to intervention groups while minimizing differences in initial quality. The manuscripts then followed the proper editorial procedure, according to their assigned group. The authors' and reviewers' agreement is given in supporting information file S1.
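The allocation step can be illustrated with a minimal sketch. The text says only that the program stratifies by study type and minimizes differences in initial quality, so the concrete rule below is an assumption, as are all function and variable names: within each stratum, a manuscript goes to one of the currently smallest groups, chosen so that the spread of group mean quality stays as small as possible, with ties broken at random.

```python
import random
from collections import defaultdict

GROUPS = ["C", "CS", "CK", "CSK"]


def allocate(manuscripts, seed=0):
    """Assign (id, study_type, initial_quality) tuples to the four groups.

    Hypothetical rule, not the trial's actual program: within a study-type
    stratum, only the groups with the fewest manuscripts are candidates, and
    the candidate that yields the smallest spread of group mean quality wins.
    """
    rng = random.Random(seed)
    # stratum -> group -> list of initial quality scores already allocated
    strata = defaultdict(lambda: {g: [] for g in GROUPS})
    assignment = {}
    for ms_id, study_type, quality in manuscripts:
        cells = strata[study_type]
        smallest = min(len(scores) for scores in cells.values())
        candidates = [g for g in GROUPS if len(cells[g]) == smallest]
        rng.shuffle(candidates)  # random order so ties are broken at random

        def spread_if(group):
            # Range of group means if this manuscript joined `group`.
            means = []
            for g in GROUPS:
                scores = cells[g] + ([quality] if g == group else [])
                if scores:
                    means.append(sum(scores) / len(scores))
            return max(means) - min(means)

        best = min(candidates, key=spread_if)
        cells[best].append(quality)
        assignment[ms_id] = best
    return assignment


# Example: eight longitudinal manuscripts rated on the 9-point initial scale.
example = [(i, "longitudinal", q) for i, q in enumerate([5, 7, 3, 9, 4, 6, 8, 2])]
example_assignment = allocate(example)
```

Restricting candidates to the smallest groups keeps group sizes within each stratum balanced to within one manuscript; other minimization schemes would trade this off differently.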

Assessment and Procedure
A modified version of the Manuscript Quality Assessment Instrument (MQAI) designed by Goodman et al. (see supporting information file S2) [23] was used to assess the outcome. Each item assessed quality using a 5-point Likert scale from 1 (low) to 5 (high). Two specific items were added: one related to misconduct (item 1b), which includes suspicion of forgery, and one related to sample-size calculations (item 4b).
Two evaluators (EC, RD) independently rated the reporting quality of manuscripts at initial submission and following peer review and revision, according to the MQAI. Both knew the initial and final status but were blinded to the intervention group. The final score awarded to each scale item was reached by averaging the two evaluators' item scores after allowing each one to modify his or her score once the reasons for the other evaluator's score were made known. The primary outcome was defined as the difference in the quality of papers between the initial and final submission, expressed as the sum of the 36 specific MQAI items, resulting in a minimum of 36 (lowest quality) and a maximum of 180 points.
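The scoring rule described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the item scores are invented, and the sketch assumes that all 36 items apply to the manuscript.

```python
def overall_score(rater_a, rater_b):
    """Average the two evaluators' scores item by item, then sum over items."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both evaluators must score the same items")
    return sum((a + b) / 2 for a, b in zip(rater_a, rater_b))


def quality_change(initial, final):
    """Primary outcome: final overall score minus initial overall score.

    `initial` and `final` are (rater_a_scores, rater_b_scores) pairs.
    """
    return overall_score(*final) - overall_score(*initial)


# Invented example with 36 items scored from 1 (low) to 5 (high).
initial_scores = ([2] * 36, [3] * 36)   # overall 36 * 2.5 = 90
final_scores = ([3] * 36, [3] * 36)     # overall 36 * 3.0 = 108
change = quality_change(initial_scores, final_scores)  # 18.0
```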
In order to evaluate the success of masking, the evaluators tried to guess which papers had been revised by a statistical reviewer (or with the help of a guideline) from the changes they observed in each article. The evaluators had to answer the following question by consensus: “Has it been reviewed by a statistician (or with the help of a guideline)?” The possible answers were “Yes”, “No” and “I don't know”. The blinding process was analyzed and considered successful if the evaluators' hit proportion was not greater than that expected by chance (50%). The cases with an answer “I don't know” were not included in the analysis. Only after the results had been validated and entered into the database was the group corresponding to each article revealed to the evaluators so that they could carry out the analysis.
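One way to compare a hit proportion with the 50% expected by chance is an exact one-sided binomial test. The sketch below is an assumption about how such a check could be done, not the procedure reported by the authors, and the counts are invented.

```python
from math import comb


def binomial_upper_tail(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p): chance of at least this many hits."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))


# Invented example: 12 correct guesses out of 20 definite ("Yes"/"No") answers.
p_value = binomial_upper_tail(12, 20)
```

A large tail probability means the observed hit proportion is compatible with guessing, which is what successful masking would look like under this check.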

Study populations
Three different populations were considered. The “complete” population, which included all randomized manuscripts not lost to follow-up, was the population for the main analysis. Two other populations were defined for the sensitivity analyses: one taking into consideration all “randomized” manuscripts and one including only those manuscripts accepted for publication. Those manuscripts rejected due to reviewers' comments and those lost to follow-up were analyzed considering two different values for their final quality: 1) the initial overall quality was imputed as the final overall quality (interpreted as no change in quality during the editorial process), or 2) the final overall quality was assigned a value equal to the mean final quality of the final versions of the received articles. The second method accounted for the positive effects of rejecting the low-quality manuscripts, since the lower the initial score of the manuscript, the larger the initial-final difference assigned.
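The two imputation rules can be made concrete with a short sketch. The function name, the method labels and the scores are hypothetical; only the two rules themselves come from the description above.

```python
def imputed_changes(records, method):
    """Return final-minus-initial changes, imputing missing final scores.

    records: list of (initial, final) pairs, with final set to None for
             manuscripts that were rejected or lost to follow-up.
    method:  "no_change"  -> impute the manuscript's own initial score
             "mean_final" -> impute the mean of the observed final scores
    """
    if method not in ("no_change", "mean_final"):
        raise ValueError(f"unknown method: {method}")
    observed = [final for _, final in records if final is not None]
    mean_final = sum(observed) / len(observed)
    changes = []
    for initial, final in records:
        if final is None:
            final = initial if method == "no_change" else mean_final
        changes.append(final - initial)
    return changes


# Invented scores: the third manuscript has no final version.
records = [(80, 90), (100, 110), (60, None)]
no_change = imputed_changes(records, "no_change")    # [10, 10, 0]
mean_final = imputed_changes(records, "mean_final")  # [10, 10, 40.0]
```

In the invented data, the manuscript with the lowest initial score receives the largest imputed change under the second rule, which is the behavior the text attributes to that method.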

Analysis and Sample Rationale
A 2×2 analysis of variance of the primary variable (change in quality score) was carried out to test the two hypotheses: 1) adding a statistical reviewer to the field-expert peers and 2) suggesting reporting guidelines to reviewers. Since the two hypotheses addressed different objectives, no adjustment of the alpha risk consumption was made. Sample-size calculations indicated that a hundred articles were needed to reach 80% power to detect a difference in means equivalent to 55% of the intra-group standard deviation (α = 0.05; two-sided testing). Secondary analyses were also based on ANOVA and included the same comparison of quality improvement for each individual item, as well as the separate analysis of the rejected manuscripts and comparisons of initial quality between the originals that completed the editorial process and those which were lost to follow-up, by means of t-tests and χ² tests. To check the effectiveness of the masking procedure, the percentage of matches between the appraisers' assessment and the real allocation of each article was computed. The main analysis was repeated stratifying by response to the masking question in order to analyze whether it was able to account for the intervention effect.
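The stated sample size can be checked approximately. The sketch below assumes a normal approximation to the two-sample comparison and assumes that, in a balanced 2×2 design with 100 papers, each main effect compares 50 papers against 50; the authors' own calculation method is not given, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test for a standardized difference d.

    Normal approximation; the negligible rejection probability in the
    opposite tail is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * sqrt(n_per_group / 2) - z_crit)


# 100 papers, 50 per level of each factor, standardized difference 0.55.
approx_power = power_two_groups(0.55, 50)
```

Under these assumptions the result is slightly under 0.80, in line with the 80% power quoted for a hundred articles.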

Enrolment and Randomization
Of the 327 originals received between May 2004 and March 2005, 196 (59.9%) were directly rejected by the editorial team. The remaining 131 (40.1%) were selected by the editorial committee as possible publications and were therefore eligible for randomization. Of these, 2 were excluded either as a result of an administrative error (n = 1) or because the authors refused to participate (n = 1). Of the 129 randomized manuscripts, 14 were lost to follow-up because authors missed the deadline and the masked allocation was revealed. Of the 115 included papers, 21 (18.3%) were “interventions” (only 3 of which were randomized clinical trials), 46 (40.0%) were longitudinal designs, 28 (24.3%) cross-sectional and 20 (17.4%) others. In addition, 16 were rejected by the editorial team after evaluating peer-review reports. The rejected papers had a significantly lower initial score on the overall Goodman scale than the accepted papers (difference 15.0, 95% CI from 4.6 to 24.4). No significant differences in initial quality were found between the lost to follow-up articles and the ones studied. Figure 1 shows the distribution of manuscripts among randomization groups.

Descriptive Analysis of Initial Quality
The initial mean overall quality of the 115 originals analyzed was 84.5 (SD 19.1), without significant differences between the intervention groups. Table 2 shows the baseline overall Goodman scale score by study type and allocated intervention group, and Table 3 shows the baseline characteristics for each item by intervention group. In general, the manuscripts classified as “others”, followed by those reporting interventions, showed the lowest scores while, on average, longitudinal designs rated above the other types of study.

The estimated effect of adding a statistical reviewer was 5.5 (95% CI from 4.3 to 6.7) and the effect of sending a guideline to the reviewers was 0.9 (95% CI from −0.3 to +2.1), with no significant interaction effect between them (1.1, 95% CI from −0.1 to +2.3). Adding a statistical reviewer had a reasonably homogeneous effect among the four study types (fig. 3), but the suggestion of employing a checklist had a negative effect on the group of intervention papers. In the sensitivity-analysis populations, we reached the same conclusions about those effects. Regarding the impact of incomplete data, the 14 papers lost to follow-up had a heterogeneous distribution among the randomized groups (figure 1) but did not differ, in terms of baseline quality, from the originals in the accepted manuscripts population. We performed several sensitivity analyses that included those papers with different imputed values for the final version; these produced very similar conclusions.

DISCUSSION
We have shown that the addition of a supplementary statistical reviewer improves manuscript quality during the editorial process. As the control intervention was “two clinical reviewers”, the estimated effect can be attributed either to the addition of an extra reviewer or to the inclusion of a statistical expert, both of which confirm that peer review improves overall quality as measured by the MQAI. This result is consistently sustained by the alternative analyses of the sensitivity populations. Stratifying by the evaluators' guess (yes/no/don't know) about the statistician intervention did not remove the observed effect. The size of the effect found can be interpreted in terms of specific item improvement: the 5.5 effect value corresponds to a 1-point quality improvement on a 5-point scale in more than five specific items. Although this effect is significant and positive, its size is very small relative to the scale range (3.8%) but medium (85.9%) relative to the improvement variability (6.4), which may be due to some prudence or cautiousness during masked evaluation. In any case, in the light of the two evaluators' scores, there is still room for improvement.
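The two relative effect sizes quoted above can be reproduced with simple arithmetic. The sketch assumes that the scale range is the difference between the maximum (180) and minimum (36) overall scores, which is the reading that reproduces the quoted 3.8%; the variable names are illustrative.

```python
effect = 5.5                      # estimated effect of the statistical reviewer
scale_min, scale_max = 36, 180    # bounds of the overall MQAI score
sd_change = 6.4                   # variability of the improvement

# Effect relative to the scale range: 5.5 / 144
relative_to_range = effect / (scale_max - scale_min)

# Effect relative to the improvement variability: 5.5 / 6.4
standardized_effect = effect / sd_change

print(f"{100 * relative_to_range:.1f}% of the scale range")
print(f"{100 * standardized_effect:.1f}% of the improvement variability")
```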
Mainly because it is not easy to check peer review without interfering with the editorial process, but also because it is considered a self-evident idea, the scientific testing of a process that is essential for science, which filters and shapes scientific communications and decides major research funding, has barely deserved the interest of researchers. Some research groups have tried to assess the effect on the quality of peer review of training evaluators [24], blinding and unmasking [25][26][27], referee characteristics and publication language [28], and feedback by editors [29]. Some assessed the blinding effect on the acceptance of papers, either in randomized [30] or non-randomized comparisons [31]. Others analyzed, in a historical cohort, the report of “positive” findings [32]. Schriger et al studied the changes after peer review and editing in tables and figures in a cohort of 62 randomized clinical trials submitted to BMJ [33]. However, only the randomized retrospective evaluation of Goodman et al [23] and the paired comparison of Pierie et al [34] measured reporting quality. A systematic review undertaken in 2002 concluded “Peer review, although widely used, is largely untested and its effects are uncertain” [35]. Our searches in Medline found only one (although non-randomized) study, suggesting that a statistical checklist could improve report quality [36].

Table 3. Mean initial individual quality scores assessed by two blinded evaluators on the Goodman scale (lowest = 1, highest = 5). Number of manuscripts is specified when that item did not apply to all manuscripts. Standard deviations ranged from 0.9 to 1.5.
Surprisingly, very few studies have analyzed the true outcome of peer review: manuscript quality instead of review quality. Although it is possible to think of further indicators of research quality, perhaps related to positive impact [37], the MQAI [23] has the advantage of being the only scale developed out of a randomized study that measures reporting quality and can be applied to a broad set of studies. As far as we know, this is the first prospective randomized trial assessing the effect of peer review to have a positive result. In 2001 we carried out a similar study on 43 manuscripts to estimate the effect of reviewers, which was significant in several secondary variables but not in the principal one, although the latter did show a trend towards a positive effect [38]. Those results encouraged us to extensively review our design and methods. Basically, what we have added to this new study is a complete follow-up, including those manuscripts finally rejected, together with alternative sensitivity analyses.
The 14 papers lost to follow-up did not differ, in terms of baseline quality, from the originals in the complete population. On the other hand, the 16 rejected papers had a significantly lower initial score on the overall Goodman scale than the 99 accepted papers. However, we have to be careful when interpreting these results because the two evaluators, although blinded to the intervention group, knew the editorial decision, as they did not have the final manuscript version.
For this study we have concentrated on the quantitative results. We do not provide qualitative information about what the statistical reviewers actually did to improve the manuscripts or how they differ from clinical reviewers. Furthermore, we did not study whether authors followed all of the reviewers' suggestions, either clinical or methodological, or whether there are manuscript characteristics related to the potential improvement introduced by peer review.
It should also be stressed that our target population was a single journal with an impact factor of just over 1: external validity of our results may be compromised if the positive effect of including a methodological reviewer depends upon journal, paper or reviewer characteristics. If journals with higher impact factors have better methodological papers [39], their room for improvement may be lower. But on the other hand, it could also be considered that those journals may also have better methodological reviewers.
We did not find a statistically significant effect of enclosing a Spanish checklist [17] and suggesting English reporting guidelines such as CONSORT or STARD to Spanish reviewers. Unfortunately, because referees were not asked to return the completed checklists, we are not able to determine whether they employed the proper guide for the study reviewed or whether they misused the guide. It is very difficult to interpret the negative result in the intervention subgroup. Both alpha risk consumption and regression to the mean (since baseline quality was higher for group 3) may lead to an erroneous conclusion, but as very few studies were randomized clinical trials and most of them were before-after studies, without well-known guidelines, a negative effect may have been produced. In any event, the fact that the evaluators were able to guess the presence of the reporting guideline in 65.3% (95% CI: 53.5 to 76.0%) of papers suggests the need for a new trial with an improvement in checklist delivery and feedback.
Here we have shown scientific evidence that peer review has a positive effect on the final quality of papers, by demonstrating, in a randomized trial with masked evaluation, the effects of adding a methodological expert to the review panel. Nonetheless, there is still a long way to go to ensure that scientific communications achieve the maximum quality, even if peer review is not the last system to improve research or, at least, to improve scientific journals and reporting [40].

SUPPORTING INFORMATION
Text S1 Authors and reviewers agreement Found at: doi: 10