
Conceived and designed the experiments: LB. Performed the experiments: RM. Analyzed the data: RM. Wrote the paper: LB HDD.

The authors have declared that no competing interests exist.

This paper presents the first meta-analysis of the inter-rater reliability (IRR) of journal peer reviews. IRR is defined as the extent to which two or more independent reviews of the same scientific document agree.

Altogether, 70 reliability coefficients (Cohen's Kappa, intra-class correlation [ICC], and Pearson product-moment correlation [r]) from 48 studies were taken into account in the meta-analysis. The studies were based on a total of 19,443 manuscripts; on average, each study had a sample size of 311 manuscripts (minimum: 28, maximum: 1983). The results of the meta-analysis confirmed the findings of the narrative literature reviews published to date: the level of IRR (mean ICC/r² = .34, mean Cohen's Kappa = .17) was low. To explain the study-to-study variation of the IRR coefficients, meta-regression analyses were calculated using seven covariates. Two covariates emerged as statistically significant in the meta-regression analyses needed to attain an approximate homogeneity of the intra-class correlations: first, the more manuscripts a study is based on, the smaller the reported IRR coefficients are; second, studies that reported the rating system for reviewers showed smaller IRR coefficients than studies that did not convey this information.

Studies that report a high level of IRR are to be considered less credible than those with a low level of IRR. According to our meta-analysis the IRR of peer assessments is quite limited and needs improvement (e.g., reader system).

Science rests on journal peer review

According to Marsh, Bond, and Jayasinghe

In this study, we test whether the result of the narrative techniques used in the reviews – that there is a generally low level of IRR in peer reviews – can be confirmed using the quantitative technique of meta-analysis. Additionally, we examine how the study-to-study variation of the reported reliability coefficients can be explained by covariates. What are the determinants of a high or low level of IRR?

We performed a systematic search of publications of all document types (journal articles, monographs, collected works, etc.). In a first step, we located several studies that investigated the reliability of journal peer reviews using the reference lists provided by narrative overviews of research on this topic

The search for publications identified 84 studies published between 1966 and 2008. Fifty-two out of the 84 studies reported all information required for a meta-analysis: reliability coefficients and number of manuscripts. Nearly all of the studies provided the following quantitative IRR coefficients: Cohen's Kappa, intra-class correlation (ICC), and Pearson product-moment correlation (r). If different coefficients were reported for the same sample in one single study, ICCs were included in the meta-analyses (n = 35). The ICC measures inter-rater reliability and inter-rater agreement of single reviewers
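To make concrete what these three coefficients measure, the following sketch computes Cohen's Kappa, a one-way single-rater ICC, and Pearson's r for two reviewers rating the same manuscripts. The ratings, the 1–5 scale, and the function names are invented for illustration only and are not taken from any of the included studies.

```python
from statistics import mean

# Hypothetical data: two reviewers rate the same 8 manuscripts on a 1-5 scale.
r1 = [4, 3, 5, 2, 4, 3, 1, 5]
r2 = [5, 3, 4, 2, 3, 4, 2, 5]

def pearson_r(x, y):
    """Pearson product-moment correlation between two raters' scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def icc_oneway(x, y):
    """ICC(1,1): one-way random effects, single rater, from ANOVA mean squares."""
    k, n = 2, len(x)
    grand = mean(x + y)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(x, y, row_means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def cohens_kappa(x, y):
    """Chance-corrected agreement, treating each rating as a category."""
    n = len(x)
    p_obs = sum(a == b for a, b in zip(x, y)) / n
    p_exp = sum((x.count(c) / n) * (y.count(c) / n) for c in set(x) | set(y))
    return (p_obs - p_exp) / (1 - p_exp)

# Kappa penalizes a near-miss (4 vs. 5) as harshly as gross disagreement,
# which is one reason reported Kappas tend to be lower than ICCs or r.
print(cohens_kappa(r1, r2))  # 0.2 for these invented ratings
print(round(icc_oneway(r1, r2), 2), round(pearson_r(r1, r2), 2))
```

On these invented ratings, Kappa is far below the ICC and r even though the two reviewers rarely differ by more than one scale point, mirroring the gap between the mean Kappa and mean ICC/r reported above.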

Reliability generalization studies were originally introduced by Vacha-Haase

As opposed to fixed effects models, the objective of random effects models is not to estimate a fixed reliability coefficient but to estimate the average of a distribution of reliabilities. Random effects models assume that the population effect sizes themselves vary randomly from study to study and that the true inter-rater reliabilities are sampled from a universe of possible reliabilities (“super-population”).

Whereas fixed effects models only allow generalizations about the studies that are included in the meta-analysis, in random effects models the studies are assumed to be a sample of all possible studies that could be done on a given topic, about which the results can be generalized
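The random effects logic can be sketched with the classic DerSimonian–Laird estimator mentioned below: the between-study variance is estimated first and then added to each study's sampling variance when weighting. The study values here are made up for illustration and are not the coefficients from our data set.

```python
import math

# Invented Fisher-z transformed reliability coefficients and sample sizes.
z = [0.30, 0.45, 0.25, 0.55, 0.35]
n = [120, 80, 400, 60, 250]
v = [1 / (ni - 3) for ni in n]          # sampling variance of Fisher z

# Fixed effects pooled estimate: inverse-variance weighted mean.
w = [1 / vi for vi in v]
z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(z) - 1)) / c)

# Random effects estimate: weights flatten as tau^2 grows, so large
# studies dominate less than under the fixed effects model.
w_re = [1 / (vi + tau2) for vi in v]
z_random = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
r_random = math.tanh(z_random)          # back-transform to the correlation scale
```

Because the weighting changes, the random effects mean generally differs from the fixed effects mean, which is why the two model families in our results table yield different pooled reliabilities from the same coefficients.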

Multilevel models are an improvement over fixed and random effects models, as they allow simultaneous estimation of the overall reliability and the between-study variability and do not assume the independency of the effect sizes or correlations. If a single study reports results from different samples, the results might be more similar than results reported by different studies. Statistically speaking, the different reliability coefficients reported by a single study are not independent. This lack of independence may distort the statistical analyses – particularly the estimation of standard errors
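In generic notation (the symbols here are ours, not taken from the study), a three-level structure of this kind can be sketched as:

```latex
z_{ij} = \mu + u_j + v_{ij} + e_{ij}, \qquad
u_j \sim N\left(0, \tau_{(3)}^{2}\right), \quad
v_{ij} \sim N\left(0, \tau_{(2)}^{2}\right), \quad
e_{ij} \sim N\left(0, \sigma_{ij}^{2}\right),
```

where z_ij is the i-th reliability coefficient reported by study j, u_j is the study-level random effect (level 3), v_ij is the coefficient-level random effect (level 2), and e_ij is the sampling error with approximately known variance (level 1). Because all coefficients from the same study share u_j, they are modeled as correlated rather than independent.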

In this study we used a multilevel model (especially a three-level model) suggested by several researchers, including DerSimonian and Laird

When there is a high level of between-study variation (study heterogeneity), it is important to look for explanatory variables (covariates) to explain this variation. As Egger, Ebrahim, and Smith

The following covariates were included in the meta-regression analysis:

The first covariate was the number of manuscripts on which the reliability coefficients in the individual studies were based. The number of manuscripts was divided by 100 to obtain a regression parameter that is not too small; this both warrants the accuracy of estimation and eases the interpretation of the results. The influence of the commonly called “publication bias” or “file drawer problem”

According to the findings of an analysis by Cicchetti and Conn

As a third covariate the scientific discipline was included in the meta-regression analysis: (1) economics/law, (2) natural sciences, (3) medical sciences, or (4) social sciences. For Weller

The fourth covariate is based on the object of appraisal. According to Weller

Further, the covariate cohort captures the period on which the data in a study are based and is included in the meta-regression analyses. In general, a study investigated the IRR for manuscripts submitted to a journal within a certain period of time (e.g., one year). For the meta-analysis, we classified these periods into four categories (e.g., 1980–1989). The covariate cohort tests whether the level of IRR has changed since 1950.

“In an attempt to eliminate some of the drawbacks of the peer review system, many journals resort to a double-blind review system, keeping the names and affiliations of both authors and referees confidential”

Finally, the type of rating system used by the reviewers in a journal peer review process (analyzed in a reliability study) was included as a covariate. This tests whether various rating systems (metric or categorical) are connected to different levels of IRR. Strayhorn, McDermott, and Tanguay

All analyses were performed using PROC MIXED in SAS, version 9.1.3

Using the above mentioned meta-analysis methods, three analyses were calculated based on r coefficients and ICC coefficients (see

Model | Approach | Levels | k | Mean reliability | 95% CI (lower) | 95% CI (upper) |
Fixed effects (ICC/r) | van Houwelingen, Arends and Stijnen | 2 | 44 | .234 | .222 | .246 |
Random effects (ICC/r) | van Houwelingen, Arends and Stijnen | 2 | 44 | .341 | .289 | .392 |
Random effects (ICC/r) | van Houwelingen, Arends and Stijnen | 3 | 44 | .340 | .283 | .396 |
Random effects (Cohen's Kappa) | Hunter & Schmidt | 2 | 26 | .17 | .13 | .21 |

One further model was calculated on the basis of Cohen's Kappa (see

The forest plot shows the predicted inter-rater reliability for each individual study.

The 95% confidence interval of the mean value (vertical line) is shaded grey. Predicted values for the same author and year but with different letters (e.g., Herzog 2005a and Herzog 2005b) belong to the same study.

Variable (metric) | Range | Mean | Standard Deviation |
Number of manuscripts | 15–1983 | 321.98 | 398.13 |

Variable (categorical) | Range | Frequency | Percent |
Method: ICC | 0–1 | 35 | 80 |
Method: r (RC) | 0–1 | 9 | 20 |
Discipline: Economics/Law | 0–1 | 3 | 7 |
Discipline: Natural Sciences | 0–1 | 7 | 16 |
Discipline: Medical Sciences | 0–1 | 11 | 25 |
Discipline: Social Sciences (RC) | 0–1 | 23 | 52 |
Object of Appraisal: Paper | 0–1 | 29 | 66 |
Object of Appraisal: Abstract (RC) | 0–1 | 15 | 34 |
Cohort: 1950–1979 | 0–1 | 15 | 34 |
Cohort: 1980–1989 | 0–1 | 9 | 21 |
Cohort: 1990–1999 | 0–1 | 5 | 11 |
Cohort: 2000–2008 | 0–1 | 8 | 18 |
Cohort: Unknown (RC) | 0–1 | 7 | 16 |
Blinding: Single | 0–1 | 22 | 50 |
Blinding: Double | 0–1 | 3 | 7 |
Blinding: Unknown (RC) | 0–1 | 19 | 43 |
Rating System: Categorical | 0–1 | 35 | 80 |
Rating System: Metric | 0–1 | 5 | 11 |
Rating System: Unknown (RC) | 0–1 | 4 | 9 |

RC = reference category.

Model | Model 0: Intercept | Model 1: Number of Manuscripts | Model 2: Method | Model 3: Discipline | Model 4: Object of Appraisal | Model 5: Cohort | Model 6: Blinding | Model 7: Rating System | Model 8: Number of Manuscripts, Rating System |
Intercept | .67 / .04 | .77 / .04 | .69 / .07 | .66 / .05 | .74 / .07 | .60 / .08 | .71 / .05 | 1.03 / .13 | 1.06 / .11 |
Number of Manuscripts/100 | | −.03 / .007 | | | | | | | −.03 / .006 |
Method (RC = r): ICC | | | −.02 / .08 | | | | | | |
Discipline (RC = Social Sciences): Economics/Law | | | | .08 / .13 | | | | | |
Discipline: Natural Sciences | | | | −.02 / .09 | | | | | |
Discipline: Medical Sciences | | | | −.006 / .09 | | | | | |
Object of Appraisal (RC = Abstract): Paper | | | | | −.06 / .08 | | | | |
Cohort (RC = unknown): 1950–1979 | | | | | | .10 / .09 | | | |
Cohort: 1980–1989 | | | | | | .07 / .11 | | | |
Cohort: 1990–1999 | | | | | | .15 / .15 | | | |
Cohort: 2000–2008 | | | | | | −.007 / .12 | | | |
Blinding (RC = unknown): Double | | | | | | | .15 / .11 | | |
Blinding: Single | | | | | | | −.05 / .08 | | |
Rating System (RC = unknown): Categorical | | | | | | | | −.40 / .14 | −.32 / .11 |
Rating System: Metric | | | | | | | | −.33 / .16 | −.33 / .13 |
Variance components (σ²): Study Level 3 | .03 / .01 | .016 / .009 | .03 / .012 | .03 / .01 | .027 / .01 | .03 / .01 | .03 / .01 | .017 / .01 | .0036 / .02 |
Variance components: Coefficient Level 2 | .01 / .009 | .007 / .007 | .009 / .009 | .009 / .009 | .009 / .009 | .007 / .008 | .005 / .007 | .01 / .01 | .01 / .02 |
−2LL | −8.4 | −23.7 | −8.5 | −8.8 | −12.7 | −10.7 | −10.7 | −15.3 | −30.0 |
BIC | 2.0 | −9.9 | 1.9 | 8.5 | −2.4 | 10.1 | 3.2 | −1.4 | −12.6 |

Entries are regression coefficient / standard error; RC = reference category.

We carried out a series of meta-regression analyses in which we explored the effects of each covariate in isolation and in combination with other covariates. The focus was particularly on tests of the a priori predictions about the effects of the covariates (e.g., publication bias). As

The loglikelihood statistic provided by SAS PROC MIXED (−2LL) can be used to compare different models, as can the Bayesian Information Criterion (BIC); the smaller the BIC, the better the model. Compared to the null model, only models 1, 7, and 8 exhibited significant differences in the loglikelihood and BIC, with statistically significant regression coefficients. The covariates method, discipline, object of appraisal, cohort, and blinding were accordingly not significantly related to the study-to-study variation (see models 2, 3, 4, 5, and 6).

The statistically significant regression coefficient of −.03 in model 1 can be interpreted as follows: the more manuscripts (divided by 100) a study is based on, the smaller the reported reliability coefficients are. If the number of manuscripts increases, for instance, from 100 to 500, the reliability decreases from .40 to .34. By including this covariate, the study-to-study random effects variance declined from .03 (model 0) to .016 (model 1), i.e., 46.6% of the variance between the studies could be explained by the number of manuscripts. This result indicates a distinctly marked publication bias in studies on the reliability of peer review. Even when the statistical significance level was adjusted by Bonferroni correction (α divided by the number of single tests), the regression parameter remained statistically significant. There is much evidence in the meta-analysis literature that studies reporting relatively high correlations or effect sizes are more likely to be published than studies reporting low correlations or effect sizes
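As a back-of-the-envelope check, the proportion of explained between-study variance can be recomputed from the rounded variance components of models 0 and 1 (the published 46.6% presumably rests on unrounded values):

```python
# Study-level variance components as reported for models 0 and 1.
var_model0 = 0.030   # intercept-only model
var_model1 = 0.016   # model adding the number-of-manuscripts covariate

# Share of between-study variance accounted for by the covariate.
explained = (var_model0 - var_model1) / var_model0
print(f"{explained:.1%}")  # 46.7% with these rounded inputs
```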

A further significant covariate is the rating system. Even if the statistical significance level is adjusted by Bonferroni correction, the regression parameter of the categorical rating remains statistically significant. What was decisive was whether the rating system was reported in a study or not. If the information was conveyed, then this was associated with smaller reliability coefficients (regression coefficients in

When the statistically significant covariates in models 1 and 7 – number of manuscripts and rating system – were included in a multiple meta-regression analysis, the study-to-study variance fell from 0.03 (model 0) to 0.0036 (model 8), i.e., 86.6% of the variance could be explained with both variables. As the variance component was no longer statistically significant in this model, an approximate homogeneity of the intra-class correlations was present, i.e., the residuals of the meta-regression analysis almost only varied due to sampling error (the desired final result of a meta-analysis).

Meta-analysis tests the replicability and generalizability of results – the hallmark of good science. In this study we present the first meta-analysis of the reliability of journal peer reviews. The results of our analyses confirmed the findings of narrative reviews: a low level of IRR (.34 for ICC and r in the random effects model, and .17 for Cohen's Kappa). Even when we used different models for calculating the meta-analyses, we arrived at similar results. With respect to Cohen's Kappa, a meta-analysis of studies examining the IRR of the standard practice of peer assessments of quality of care published by Goldman

To explain the study-to-study variation for the inter-rater reliabilities, we calculated meta-regression analyses regarding the metric reliability coefficients. It emerged that neither the type of blinding nor the discipline corresponded to the level of the IRR. With double-blinding, which is already used by many journals as a measure against biases in refereeing

Two covariates emerged in the analyses as significant, to achieve approximate homogeneity of the intra-class correlations. On the one hand, the number of manuscripts on which a study is based was statistically significant. We therefore assume a distinctly more marked publication bias for studies on IRR: With a small sample, the results are published only if the reported reliability coefficients are high. If the reported reliability coefficients are low, on the other hand, a study has to be based on a large number of manuscripts to justify publication.

Apart from the number of manuscripts upon which a study is based, the covariate rating system was also statistically significant. Studies that do not provide information on the rating system report higher IRR coefficients than studies that provide detailed information on the rating system. Failure to mention the rating system must be viewed as an indication of low quality of a study.

The main conclusion of our meta-analysis is that studies that report a high level of IRR are to be considered less credible than those with a low level of IRR. The reason is that high IRR coefficients are mostly based on smaller sample sizes than low IRR coefficients, which mostly rest on large sample sizes. In contrast to narrative literature reviews, quantitative meta-analysis weights the study results according to their standard errors to obtain unbiased estimates of the mean IRR. Therefore, meta-analysis should be preferred over narrative reviews. However, future primary studies on the IRR of peer reviews that could be included in later meta-analyses should be based on large sample sizes and describe the evaluation sheet/rating system for reviewers in detail.

Very few studies have investigated reviewer agreement with the purpose of identifying the actual reasons behind reviewer disagreement, e.g., by carrying out comparative content analyses of reviewers' comment sheets

Studies included in the meta-analyses are marked with an asterisk.

The authors wish to express their gratitude to three reviewers for their helpful comments.