On the agreement between bibliometrics and peer review: Evidence from the Italian research assessment exercises

  • Alberto Baccini,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    alberto.baccini@unisi.it

    Affiliation Department of Economics and Statistics, University of Siena, Siena, Italy

  • Lucio Barabesi,

    Roles Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Economics and Statistics, University of Siena, Siena, Italy

  • Giuseppe De Nicolao

    Roles Conceptualization, Data curation, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy

Abstract

This paper analyzes the concordance between bibliometrics and peer review. It draws evidence from the data of two experiments performed by the Italian governmental agency for research evaluation. The agency ran the experiments in order to validate the adoption, in the Italian research assessment exercises, of a dual system of evaluation, in which some outputs were evaluated by bibliometrics and others by peer review. The two experiments were based on stratified random samples of journal articles. Each article was scored both by bibliometrics and by peer review, and the degree of concordance between the two evaluations was then computed. The correct setting for analyzing the experiments is defined by developing the design-based estimation of the Cohen’s kappa coefficient and some testing procedures for assessing the homogeneity of missing proportions between strata. The results of both experiments show that, for each research area of science, technology, engineering and mathematics, the degree of agreement between bibliometrics and peer review is—at most—weak at the individual article level. Thus, the outcome of the experiments does not validate the use of the dual system of evaluation in the Italian research assessments. More generally, the very weak concordance indicates that metrics should not replace peer review at the level of the individual article. Hence, the use of the dual system in a research assessment might worsen the quality of information compared to the adoption of peer review only or bibliometrics only.

1 Introduction

Efficient implementation of a research assessment exercise is a common challenge for policy makers. Even if attention is limited to scientific quality or scientific impact, there is a trade-off between the quality of information produced by a research assessment and its costs. Until now, two models have prevailed [1]: a first model based on peer review, such as the British Research Excellence Framework (REF), and a second model based on bibliometric indicators, such as the Australian Excellence in Research for Australia (ERA), for the years preceding 2018. The first model is considered more costly than the second. In the discussion of the pros and cons of the two models, a central topic is the agreement between bibliometrics and peer review. Most of the scholarly work has analyzed the REF by adopting a post-assessment perspective [2]. Indeed, the results of the REF at various levels of aggregation are compared with those obtained by using bibliometric indicators. Clear statistical evidence of concordance between bibliometrics and peer review would represent a very strong argument in favor of substituting the latter with the former. Indeed, the claim for such a substitution—based on agreement and lower costs—would likely appear pragmatic, and hence more acceptable to academics, than the argument based on the juxtaposition of “objective bibliometric data” and “subjective peer reviews” (among others, see e.g. [3]).

However, there are two problems hindering the adoption of the bibliometric model for research assessment. The first is how to handle the scientific fields for which bibliometrics is not easily applicable, namely social sciences and humanities. The second is how to manage research outputs not covered in bibliographic databases, such as books or articles in national languages. In these cases, no substitution is possible and peer review appears as the unique possible tool for evaluating research outputs.

As a consequence, a third model of research assessment has emerged, in which bibliometrics and peer review are jointly adopted: some research outputs are evaluated by bibliometrics and others by peer review. The evaluations produced by the two techniques are subsequently combined for computing synthetic indicators at various levels of aggregation. The Italian governmental agency for research evaluation (ANVUR) applied this model extensively in its research assessment exercises (VQR) and called it the “dual system of evaluation” [4]. For this model, the question of the agreement between bibliometrics and peer review has a constitutive nature. Indeed, a high agreement would ensure that the final results of a research assessment—at each possible level of aggregation—are not biased by the adoption of two different instruments of evaluation. In the simplest scenario, this happens when bibliometrics and peer review produce scores that substantially agree, for instance when the research outputs evaluated by bibliometrics would receive the same score from peer review—except for random errors. In contrast, consider a second scenario in which the scores produced by bibliometrics and peer review do not agree: for instance, bibliometrics produces scores systematically lower or higher than peer review. In this more complex case, the disagreement might not be a problem only if the two systems of evaluation are distributed homogeneously, e.g. at random, among the units of assessment. Even if the concordance is poor at the individual article level, the errors may offset each other at an aggregate level [2, 5]. In sum, the agreement between bibliometrics and peer review is functional for validating the results of the assessment.

ANVUR tried to validate the use of the dual system of evaluation by implementing two extensive experiments on the agreement between bibliometrics and peer review, one for each national research assessment, covering the years 2004-2010 (VQR1) and 2011-2014 (VQR2). The two experiments are hereinafter indicated as EXP1 and EXP2, respectively. They consisted of evaluating a random sample of articles by using both bibliometrics and peer review and, subsequently, of assessing their degree of agreement at the individual publication level. ANVUR presented the results of EXP1 and of EXP2 as evidence of a substantial concordance between bibliometrics and peer review. In turn, this agreement would validate the use of the dual system of evaluation and the final results of the research assessments.

Two of the authors of the present paper documented the flaws of EXP1 and contested the interpretation of the data as indicative of a substantial agreement [6–9]. The present paper takes advantage of the recent availability of the raw data of the two experiments in order to deepen the analysis and reach conclusive results on issues that had remained open owing to the availability of aggregated data only. Therefore, this paper aims to replicate the ANVUR analysis in order to draw solid evidence on the concordance between bibliometrics and peer review.

The paper is organized as follows. In Section 2 the literature on the two Italian experiments is framed within the general discussion on the agreement between bibliometrics and peer review. Section 3 presents the structure of EXP1 and EXP2 by recalling the essential features of the Italian research assessment exercises. Section 4 introduces the main research questions on the sampling design and the measures of agreement. Section 5 develops the correct framework for the design-based estimation of the Cohen’s kappa coefficient. Section 6 presents the estimates of the Cohen’s kappa coefficients for EXP1 and EXP2, comparing the current results with ANVUR’s findings. In Section 7, a further problem concerning missing data in EXP2 is presented and the homogeneity of missing proportions between scientific areas is assessed. Section 8 discusses the results and concludes with some suggestions for research evaluation policy.

2 A short review of the literature

Most of the literature on the agreement between bibliometrics and peer review considers the British REF. Overviews of this literature are provided by [2, 5, 10]. It is therefore possible to limit the discussion to a central issue which is functional to the development of this paper. By and large, results on agreement do not converge when different approaches and statistical tools are used. Notably, the analysis conducted by the Higher Education Funding Council for England (HEFCE) in the so-called Metric Tide report “has shown that individual metrics give significantly different outcomes from the REF peer review process, showing that metrics cannot provide a like-for-like replacement for REF peer review” [11]. This analysis was performed at the individual article level by comparing the quality profile attributed by peer review with a set of bibliometric indicators for the articles submitted to the REF. Traag and Waltman [2] criticized the results of the Metric Tide report by arguing that the individual publication level “is not appropriate in the context of REF”. They claimed that the appropriate level is the institutional one, since “the goal of the REF is not to assess the quality of individual publications, but rather to assess ‘the quality of research in UK higher education institutions’. Therefore, the question should not be whether the evaluation of individual publications by peer review can be replaced by the evaluation of individual publications by metrics but rather whether the evaluation of institutions by peer review can be replaced by the evaluation of institutions by metrics”. In a similar vein, Pride and Knoth [5] documented that a high concordance between bibliometric and peer-review indicators for the REF is achieved when the analysis is conducted at an institutional level.

These claims should be framed in a “post-assessment” perspective, where the issue at stake is to verify the coherence between the results obtained by applying one evaluative technique or the other at the desired institutional level. In the case of the REF, the coherence to be verified is between the adopted technique, i.e. peer review, and the alternative, i.e. bibliometrics. This viewpoint is very different from the one developed in the Italian experiments and considered in this paper. In the present case, the question is whether it is possible to use bibliometrics and peer review interchangeably at the individual article level. To this end, the analysis of the agreement between bibliometrics and peer review at the level of individual publications is fully justified. Indeed, Traag and Waltman [2] themselves support the study of concordance at the individual publication level when the issue is whether bibliometrics can replace peer review at that level. In reference to the Metric Tide report, they explicitly wrote that “the analysis at the level of individual publications is very interesting. The low agreement at the level of individual publications supports the idea that metrics should generally not replace peer review in the evaluation of a single individual publication” [2].

As anticipated, ANVUR implemented EXP1 and EXP2 in order to justify the use of a dual system of evaluation in VQR1 and VQR2. As to EXP1, the results were initially published as part of the official report of the research assessment exercise [12]. In the official report the results are synthesized by stating that “there is a more than adequate concordance between evaluation carried out through peer reviews and through bibliometrics. This result fully justifies the choice (…) to use both techniques of assessment” [12, Appendix B, pp. 25-26, translation by the authors] (see also [6]). Ancaiani et al. [4] republished the complete results of EXP1, claiming a “fundamental agreement” between bibliometrics and peer review “supporting” the choice of using both techniques in the VQR1. Moreover, they also interpreted the experiment as indicating that “combining evaluations obtained with peer review and bibliometric methods can be considered more reliable than the usual practice of combining two or more different evaluations obtained by various reviewers of the same article”.

The specific results obtained in EXP1 for the field of Economics and Statistics were widely disseminated. Bertocchi and coauthors published as many as five identical working papers in which they interpreted the results of EXP1 as showing that bibliometrics and peer review “are close substitutes” (among others, [13]). In the version finally published in a scholarly journal, they concluded that “the agencies that run these evaluations could feel confident about using bibliometric evaluations and interpret the results as highly correlated with what they would obtain if they performed informed peer review” [14].

The results and the interpretation of EXP1 were challenged by two of the authors of the present paper on the basis of the published data only, since the raw data were at the time undisclosed by ANVUR (the whole thread of papers, comments and replies includes [6–9, 15, 16]). The first critical appraisal concerned the interpretation of the degree of concordance. Baccini and De Nicolao [6, 7] argued that, according to the available statistical guidelines, the degree of concordance between bibliometrics and peer review has to be interpreted as “unacceptable” or “poor” for all the considered research fields. The unique exception—confirmed by a statistical meta-analysis of the data—was Economics and Statistics, for which the protocol of the experiment was substantially modified with respect to the other fields. Baccini and De Nicolao [8, 9] also raised some questions on the sampling protocol used for EXP1, which are also considered in detail in this paper.

As for EXP2, the results were published in the official report [17] and presented at a conference [18]. The synthesis of the results apparently confirmed the outcome of EXP1. The results of EXP2, summarized in the conclusion of the report, state that there is a “non-zero correlation” “between peer review evaluation and bibliometric evaluation”. The degree of agreement is “modest but significant. Of particular importance is the result that the degree of concordance (class and inter-row) between the bibliometric evaluation and the peer evaluation is always higher than the one existing between the two individual peer reviews” [17, Appendix B, p. 33, translation by the authors]. These results are interpreted as indicating that “the combined use of bibliometric indicators for citations and journal impact may provide a useful proxy for peer review judgements” [18].

As anticipated, this paper aims to draw definitive evidence from the two experiments. This analysis is possible since ANVUR agreed to disclose the anonymized individual data of both EXP1 and EXP2. The mail to the President of ANVUR containing the request is dated March 12th, 2019. The decision to disclose the data was communicated by mail dated March 26th, 2019. Access to the data was opened on April 9th, 2019. It is therefore possible to replicate the results of EXP1 and EXP2 and to verify ANVUR’s methods and calculations in detail. Replication is only possible at the research area level since—according to a communication dated March 16th, 2019—the data for the sub-areas “are no longer available” in the ANVUR archives. For a correct understanding of the research questions, the following section presents a description of EXP1 and EXP2 in the context of the Italian research assessments.

3 A brief description of the Italian experiments

EXP1 and EXP2 were designed and performed during VQR1 and VQR2, respectively. The Italian research assessment exercises aimed to evaluate research institutions, research areas and fields, both at the national and the institutional level (i.e. universities and departments). Synthetic indicators were obtained by aggregating the scores received by the research outputs submitted by the institutions. All researchers with a permanent position had to submit a fixed number of research outputs (3 in VQR1 and 2 in VQR2), with few exceptions. VQR1 and VQR2 were organized in 16 research area panels. Research areas were divided into “bibliometric areas”, i.e. science, technology, engineering and mathematics (namely Mathematics and Informatics (Area 1), Physics (Area 2), Chemistry (Area 3), Earth Sciences (Area 4), Biology (Area 5), Medicine (Area 6), Agricultural and Veterinary Sciences (Area 7), Civil Engineering (Area 8b), Industrial and Information Engineering (Area 9)), and “non-bibliometric areas”, i.e. social sciences and humanities (namely Architecture (Area 8a), Antiquities, Philology, Literary Studies, Art History (Area 10), History, Philosophy, Pedagogy and Psychology (Areas 11a and 11b), Law (Area 12), Economics and Statistics (Area 13), Political and Social Sciences (Area 14)).

Both research assessments evaluated the submitted research outputs by using a “dual system of evaluation”, in which some outputs were evaluated by bibliometric algorithms and others by “Informed Peer Review” (IPR). Informed peer review indicates that reviewers were asked to evaluate a submitted research item after being provided with its complete metadata and, if available, with its bibliometric indicators. Actually, this dual system of evaluation concerned only the bibliometric areas plus Economics and Statistics (Area 13). Indeed, in the non-bibliometric areas the panels evaluated all the submitted research products exclusively by peer review. In the bibliometric areas, instead, books, book chapters and articles in non-indexed journals were evaluated by IPR, while journal articles were for the most part evaluated by applying bibliometric algorithms. VQR1 and VQR2 adopted two different bibliometric algorithms. Both algorithms combined the number of citations received by an article and a journal indicator, e.g. the impact factor. The complete description of the algorithms and their critical appraisal can be found in [6, 19–21]. Both algorithms were built in such a way that, if the two indicators were coherent, they generated a categorical score (B-score) and a corresponding numerical value used for computing aggregate results for institutions. Namely, in VQR1 there were four categories: Excellent (A, score 1), Good (B, score 0.8), Acceptable (C, score 0.5), Limited (D, score 0); in VQR2 there were five categories: Excellent (A, score 1), Elevated (B, score 0.7), Fair (C, score 0.4), Acceptable (D, score 0.1), Limited (E, score 0). If the two bibliometric indicators gave incoherent indications for an article, e.g. a high number of citations and a low impact factor or vice versa, the algorithm classified it as “IR” (Inconclusive Rating) and it was evaluated by IPR. In both VQR1 and VQR2, Area 13 (Economics and Statistics) did not adopt the bibliometric algorithms for evaluating articles. They were replaced by classifications of journals developed directly by the area panel, with the same number of categories as in the algorithms. Therefore, each article received the score of the journal in which it was published, and no article was classified as IR.
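
As a purely illustrative sketch of the mechanism just described—not of ANVUR’s actual algorithm, whose exact indicators and thresholds are documented in [6, 19–21]—the snippet below shows how a categorical score can be derived from two indicator-based classes, with an inconclusive rating (IR) returned when the two indications are incoherent. All category labels, the coherence rule and the example inputs are hypothetical.

```python
# Illustrative sketch only: NOT the VQR algorithm. The coherence rule and the
# mapping to categories are hypothetical placeholders for the mechanism above.

VQR1_CATEGORIES = ["A", "B", "C", "D"]  # Excellent, Good, Acceptable, Limited

def classify(citation_class: int, journal_class: int) -> str:
    """Combine two categorical indicators (0 = best class, 3 = worst class).

    If the two indicators point to the same or adjacent classes, return the
    better one; otherwise the rating is inconclusive ("IR") and the article
    would be sent to informed peer review.
    """
    if abs(citation_class - journal_class) <= 1:   # coherent indications
        return VQR1_CATEGORIES[min(citation_class, journal_class)]
    return "IR"                                    # incoherent: needs IPR

if __name__ == "__main__":
    print(classify(0, 1))  # -> "A": high citations, good journal indicator
    print(classify(0, 3))  # -> "IR": high citations, low journal indicator
```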

IPR was organized identically in the two research assessments. A publication was assigned to two members of the area panel, who independently chose two anonymous reviewers. The two reviewers performed the IPR of the article by using a predefined format—slightly different between the two research assessments and also between panels within the same assessment. Each referee assigned a final evaluation according to the same categories adopted for bibliometrics. These final evaluations are conventionally indicated as P1-score and P2-score. The referee reports were then received by the two members of the area panel, who formed a so-called “Consensus Group” (CG) that decided the final score of the article (P-score).

In order to validate the dual system of evaluation, EXP1 and EXP2 considered only the “bibliometric areas” plus Area 13. They had a similar structure. Figs 1 and 2 report the flowcharts of the two experiments. The rationale of both experiments was very simple: a sample of the journal articles submitted to the research assessment was scored by the two methods of evaluation, i.e. by the bibliometric algorithm and by IPR. In this case, IPR was implemented by involving two reviewers, according to the same rules adopted in the research assessment. These raw data were then used for analyzing (i) the agreement between the evaluations obtained through IPR (P-score) and through the bibliometric algorithms (B-score) and (ii) the agreement between the scores assigned by the two reviewers (P1-score and P2-score). The agreement between the scores was computed by using the weighted Cohen’s kappa coefficient [22], a popular index of inter-rater agreement for categorical ratings (see e.g. [23]). A high level of agreement between the IPR and bibliometric scores was interpreted as validating the dual system of evaluation.

Fig 1. Flowchart of EXP1.

The flowchart was drawn with diagrams.net, adopting its symbols and conventions.

https://doi.org/10.1371/journal.pone.0242520.g001

Fig 2. Flowchart of EXP2.

The flowchart was drawn with diagrams.net, adopting its symbols and conventions.

https://doi.org/10.1371/journal.pone.0242520.g002

EXP1 and EXP2 differed in the timing of their realization. EXP1 was performed simultaneously with VQR1. Hence, the reviewers were unaware that they were participating in EXP1. Indeed, they were unable to distinguish the papers of the EXP1 sample from those they had to evaluate for the research assessment. The unique exception was Area 13, where panelists and referees knew that all the journal articles belonged to the EXP1 sample—since all the journal articles submitted to the research assessment were evaluated automatically according to the journal ranking [6]. In contrast, EXP2 started after the conclusion of the activities of the research assessment. Therefore, panelists and reviewers knew that they were participating in EXP2. A second consequence of the different timing was that in EXP1 all the papers of the sample were peer-reviewed, since the successful administrative completion of the research assessment required the evaluation of all submitted articles. On the contrary, in EXP2 some papers did not receive a peer-review evaluation because some reviewers declined to provide one. Therefore, in EXP2 there were missing data in the sample, which were not accounted for by ANVUR when the concordance indexes were computed.

4 Measures of agreement, sampling and data

The first step of this work consists in replicating ANVUR’s computations. This entails adopting the measure of agreement chosen by ANVUR, namely the Cohen’s kappa coefficient and its weighted generalization, a commonly adopted measure of agreement between the classifications of two raters [22, 24]. Although the Cohen’s kappa coefficient has been criticized for some methodological drawbacks (for more details, see [25, 26] among others), practitioners often adopt this index in order to assess inter-rater agreement for categorical ratings, while its weighted counterpart is preferred when the categories can be considered ordinal (see e.g. [27, p. 548] and [28, p. 596]). Rough guidelines for interpreting Cohen’s kappa values are available; a survey is provided by [6]. The guideline generally adopted is the one by Fagerland et al. [27, p. 550], based on Landis and Koch [29] and slightly modified by Altman [30].

The replication of ANVUR’s computations is surely useful, albeit not sufficient to reach a complete appreciation of the results of the two experiments. Indeed, some research questions should be carefully addressed. For EXP1 and EXP2, ANVUR [12, 17, Appendix B] adopted a stratified random sampling, where the target population was constituted by the journal articles submitted to the two research assessments. The sizes of the article populations in EXP1 and EXP2 are 99,005 and 77,159, respectively. The sample size was about 10% of the population size, i.e. 9,199 and 7,667 articles for EXP1 and EXP2, respectively. The stratified random samples were proportionally allocated with respect to the sizes of the research areas. The sizes of the strata in EXP1 and EXP2 are reported in Tables 1 and 2. Indeed, the Final Report remarks that: “The sample was stratified according to the distribution of the products among the sub-areas of the various areas” [17, Appendix B, p. 1, our translation]. For EXP1 results were published at a sub-area level, while for EXP2 results were solely published for areas. Moreover, the raw data at the sub-area level are not yet available.

Table 1. Population, sample and sub-sample sizes for scientific areas in EXP1.

https://doi.org/10.1371/journal.pone.0242520.t001

Table 2. Population, sample, sub-sample sizes and number of missing articles for scientific areas in EXP2.

https://doi.org/10.1371/journal.pone.0242520.t002

A first research question deals with the statistical methodology adopted in the experiments. From this perspective, the two experiments were actually implemented in a design-based framework. Hence, their analysis requires a correct inferential setting in order to obtain the estimates of the considered concordance measures. To this aim, in Section 5 the model-based estimation of the weighted Cohen’s kappa coefficient is reviewed and the design-based estimation of this coefficient is originally developed. On the basis of these theoretical results, it is possible to check whether ANVUR’s estimates of agreement are correct. In particular, ANVUR’s estimates of the Cohen’s kappa coefficients and the corresponding confidence intervals may be compared with the appropriate design-based counterparts.

ANVUR computed the final results of EXP1 and EXP2 by considering only a sub-sample of articles—not the whole sample. This is illustrated in Figs 1 and 2, where the sizes of the populations, of the samples and of the final sub-samples are reported. Indeed, ANVUR dropped from the computation of the concordance indexes the articles with an inconclusive bibliometric score (IR), i.e. articles that received an IPR evaluation but were not considered for agreement estimation. For EXP1, the reduction of the sample due to the exclusion of the papers classified as IR was disclosed neither in ANVUR’s official reports nor in [4]. Tables 1 and 2 report the sizes of the sub-samples for EXP1 and EXP2, respectively. The exclusion of the IR papers might have boosted the value of the agreement measures, as argued by Baccini and De Nicolao [8]. The conjecture is as follows. ANVUR removed from EXP1 the most problematic articles, for which the bibliometric algorithm was unable to reach a score. It cannot be excluded that these articles were also particularly difficult to evaluate for peer reviewers. Hence, ANVUR calculated the agreement indicators on sub-samples of articles that were “more favorable” to agreement than the complete samples.

The second research question, therefore, deals with the adoption of concordance measures that take into account the number of IR articles dropped by ANVUR, as well as the number of missing articles. Actually, these articles could ideally be considered as belonging to a rating category for which agreement is not required. In such a case, there exist alternative variants of the weighted Cohen’s kappa which may suitably handle this issue. Hence, in Section 5 the design-based estimation of these variants of the weighted Cohen’s kappa is also developed. In turn, in Section 6 the resulting point estimates and the corresponding confidence intervals are computed for EXP1 and EXP2, respectively.

A third and last question—which is limited to EXP2—deals with the distribution of missing papers per research area, i.e. the papers for which a peer-review score is unavailable. As previously remarked, Table 2 reports the number of missing papers per area. Drawbacks would arise if the missing articles were distributed in a non-proportional way between the strata, since in this case some research areas could be more represented than others. ANVUR [17] claimed that this was not the case. Thus, in Section 7 a new testing procedure for the homogeneity of missing proportions between strata is developed and applied to the EXP2 data.

These three questions are addressed by using the raw data of the two ANVUR experiments. The articles in each database have a unique anonymous identifier. For each article, the available variables are: (i) the research area; (ii) the bibliometric score (B); (iii) the score assigned by the first reviewer (P1); (iv) the score assigned by the second reviewer (P2); (v) the synthetic peer-review score (P). Data are available as S1 File (downloadable also from https://doi.org/10.5281/zenodo.3727460).
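
As a minimal sketch of how such raw data might be loaded and inspected—assuming the S1 File can be exported as a CSV with one row per article and columns named area, B, P1, P2 and P (the actual file layout and column names may differ)—one could proceed as follows.

```python
# Minimal loading sketch. The file name and the column names (area, B, P1, P2, P)
# are assumptions about the S1 File layout, not a documented interface.
import pandas as pd

df = pd.read_csv("exp1_raw_data.csv")   # hypothetical CSV export of the S1 File

# Cross-tabulate bibliometric (B) vs consensus peer-review (P) ratings per area.
for area, sub in df.groupby("area"):
    table = pd.crosstab(sub["B"], sub["P"], dropna=False)
    print(f"Area {area}:\n{table}\n")
```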

5 Design-based estimation of the Cohen’s kappa coefficient

As anticipated, both EXP1 and EXP2 adopted the weighted Cohen’s kappa coefficient as a measure of agreement. In order to introduce our proposal for the design-based estimation of the Cohen’s kappa coefficient, it is instructive to first review its model-based counterpart (for a general discussion on the two paradigms, see e.g. [31]). In the model-based approach, two potential raters classify items into c categories, which are labeled on the set I = {1, …, c} without loss of generality. The couple of evaluations given by the raters for an item is modeled as a bivariate random vector, say (U, V), which takes values on the set I × I. More appropriately, (U, V) should be defined as a random element, since the categories are indexed by the first c integers for the sake of simplicity—even if they are just labels. The joint probability function of (U, V) is assumed to be
$$P(U = l, V = m) = \vartheta_{lm},$$
where $l, m \in I$, while $\vartheta_{lm} \ge 0$ and $\sum_{l=1}^{c}\sum_{m=1}^{c}\vartheta_{lm} = 1$. Hence, the parameter space for the underlying model is actually given by
$$\Theta = \Big\{(\vartheta_{11},\dots,\vartheta_{cc}) : \vartheta_{lm} \ge 0,\ \sum_{l=1}^{c}\sum_{m=1}^{c}\vartheta_{lm} = 1\Big\}.$$

Moreover, it holds that $\vartheta_{l+} = \sum_{m=1}^{c}\vartheta_{lm}$ and $\vartheta_{+l} = \sum_{m=1}^{c}\vartheta_{ml}$ are the marginal probability distributions of U and V, respectively. In practice, $\vartheta_{lm}$ represents the probability that an item is classified into the l-th category according to the first rating and into the m-th category according to the second rating. Similarly, $\vartheta_{l+}$ and $\vartheta_{+l}$ are the probabilities that the item is categorized into the l-th category according to the first rating and the second rating, respectively. Hence, the definition of the weighted Cohen’s kappa in the model-based approach is given by
$$\kappa_{w,M} = \frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\vartheta_{lm} - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\vartheta_{l+}\vartheta_{+m}}{1 - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\vartheta_{l+}\vartheta_{+m}},$$
where the $w_{lm}$’s are weights which are suitably chosen in order to account for the magnitude of disagreement (see e.g. [27, p. 551]). In particular, the (usual) unweighted Cohen’s kappa coefficient is obtained when $w_{lm} = 1$ if $l = m$ and $w_{lm} = 0$ otherwise.

In order to estimate the weighted Cohen’s kappa under the model-based approach, let us assume that a random sample, say (U1, V1), …, (Un, Vn), of n copies of (U, V) is available. Thus, the maximum-likelihood estimators of the ϑlm’s, the ϑl+’s and the ϑ+l’s are readily seen to be
$$\hat\vartheta_{lm} = \frac{1}{n}\sum_{j=1}^{n} 1_{\{l\}}(U_j)\,1_{\{m\}}(V_j), \qquad \hat\vartheta_{l+} = \sum_{m=1}^{c}\hat\vartheta_{lm}, \qquad \hat\vartheta_{+l} = \sum_{m=1}^{c}\hat\vartheta_{ml},$$
where 1B is the usual indicator function of a set B, i.e. 1B(u) = 1 if u ∈ B and 1B(u) = 0, otherwise. Thus, on the basis of the invariance property of maximum-likelihood estimation (see e.g. Theorem 7.2.10 by Casella and Berger [32]), the maximum-likelihood estimator of κw,M is provided by
$$\hat\kappa_{w,M} = \frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat\vartheta_{lm} - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat\vartheta_{l+}\hat\vartheta_{+m}}{1 - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat\vartheta_{l+}\hat\vartheta_{+m}}.$$

Actually, $\hat\kappa_{w,M}$ is the weighted Cohen’s kappa estimator commonly adopted in practical applications. Finally, it should be remarked that the variance of $\hat\kappa_{w,M}$ is usually estimated by means of large-sample approximations [41, p. 610].
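
The following sketch illustrates the model-based estimator with generic category codes and weights (not the VQR data). It is a minimal reimplementation for exposition; library routines such as cohen_kappa_score in scikit-learn, with its weights option, provide comparable functionality.

```python
import numpy as np

def weighted_kappa(u, v, weights):
    """Model-based (maximum-likelihood) weighted Cohen's kappa.

    u, v    : integer category codes in {0, ..., c-1} for the two raters
    weights : (c, c) agreement-weight matrix with ones on the diagonal
    """
    u, v = np.asarray(u), np.asarray(v)
    c = weights.shape[0]
    theta = np.zeros((c, c))                 # joint relative frequencies
    for l, m in zip(u, v):
        theta[l, m] += 1.0
    theta /= len(u)
    row, col = theta.sum(axis=1), theta.sum(axis=0)      # marginal frequencies
    p_obs = (weights * theta).sum()                      # observed weighted agreement
    p_exp = (weights * np.outer(row, col)).sum()         # chance weighted agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

# Example with c = 4 categories and linear weights w_lm = 1 - |l - m| / (c - 1).
c = 4
W = 1.0 - np.abs(np.subtract.outer(np.arange(c), np.arange(c))) / (c - 1)
u = [0, 1, 2, 3, 1, 2, 0, 3]
v = [0, 2, 2, 3, 0, 3, 1, 3]
print(round(weighted_kappa(u, v, W), 3))
```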

Under the design-based approach, there exists a fixed population of N items which are classified into the c categories on the basis of two ratings. Hence, the j-th item of the population is categorized according to the first evaluation—say uj ∈ I—and the second evaluation—say vj ∈ I—for j = 1, …, N. It should be remarked that in this case the N couples (u1, v1), …, (uN, vN) are fixed and given. Thus, the “population” weighted Cohen’s kappa coefficient may be defined as
$$\frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,p_{lm} - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,p_{l+}p_{+m}}{1 - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,p_{l+}p_{+m}}, \qquad (1)$$
where
$$p_{lm} = \frac{1}{N}\sum_{j=1}^{N} 1_{\{l\}}(u_j)\,1_{\{m\}}(v_j),$$
while $p_{l+} = \sum_{m=1}^{c} p_{lm}$ and $p_{+l} = \sum_{m=1}^{c} p_{ml}$.

In this case, plm is the proportion of items in the population classified into the l-th category according to the first rating and into the m-th category according to the second rating. Similarly, pl+ and p+l are the proportions of items categorized into the l-th category according to the first rating and the second rating, respectively. Thus, for estimation purposes, the Cohen’s kappa coefficient (1) is conveniently expressed as a smooth function of population totals—i.e. the plm’s, the pl+’s and the p+l’s. It is worth remarking that (1) is a fixed population quantity under the design-based approach, while its counterpart κw,M under the model-based approach is an unknown quantity depending on the model parameters.

Let us now assume that a sampling design is adopted in order to estimate (1) and let us consider a sample of fixed size n. Moreover, let S denote the set of indexes corresponding to the sampled items—i.e. a subset of size n of the first N integers—and let πj be the first-order inclusion probability of the j-th item. As an example aimed at the subsequent application, let us assume that the population is partitioned into L strata and that Nh is the size of the h-th stratum, with h = 1, …, L. Obviously, it holds that $\sum_{h=1}^{L} N_h = N$. If a stratified sampling design is considered, the sample is obtained by drawing nh items in the h-th stratum by means of simple random sampling without replacement, in such a way that $\sum_{h=1}^{L} n_h = n$. Therefore, as is well known, it turns out that πj = nh/Nh if the j-th item is in the h-th stratum (see e.g. [33]). When a proportional allocation is adopted, it also holds that nh = nNh/N—and hence it obviously follows that πj = n/N.
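
A small sketch of the proportional allocation and of the resulting inclusion probabilities follows; the stratum sizes are made-up placeholders, not the EXP1/EXP2 strata.

```python
# Proportional allocation for stratified simple random sampling without replacement.
# Stratum sizes below are made-up placeholders, not the EXP1/EXP2 figures.
N_h = {"Area A": 3200, "Area B": 5400, "Area C": 2100}     # population stratum sizes
N = sum(N_h.values())
n = round(0.10 * N)                                        # overall sampling fraction ~10%

n_h = {h: round(n * size / N) for h, size in N_h.items()}  # n_h = n * N_h / N (rounded)
pi = {h: n_h[h] / N_h[h] for h in N_h}                     # pi_j = n_h / N_h within stratum h

print(n_h)   # allocated sample sizes per stratum (rounding may shift totals by a unit)
print(pi)    # inclusion probabilities, all close to n / N under proportional allocation
```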

In order to obtain the estimation of (1), it should be noticed that
$$\hat p_{lm} = \frac{1}{N}\sum_{j \in S}\frac{1_{\{l\}}(u_j)\,1_{\{m\}}(v_j)}{\pi_j}, \qquad \hat p_{l+} = \frac{1}{N}\sum_{j \in S}\frac{1_{\{l\}}(u_j)}{\pi_j}, \qquad \hat p_{+l} = \frac{1}{N}\sum_{j \in S}\frac{1_{\{l\}}(v_j)}{\pi_j}$$
are unbiased Horvitz-Thompson estimators of the population proportions plm, pl+ and p+l, respectively. Thus, by bearing in mind the general comments provided by Demnati and Rao [34] on the estimation of a function of population totals, a “plug-in” estimator of (1) is given by
$$\frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat p_{lm} - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat p_{l+}\hat p_{+m}}{1 - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat p_{l+}\hat p_{+m}}. \qquad (2)$$

Even if estimator (2) is biased, its bias is negligible since (1) is a differentiable function of the population totals with non-null derivatives (for more details on such a result, see e.g. [33, p. 106]).
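
A design-based version of the plug-in estimator (2) can be sketched as follows: each sampled article carries its stratum label and the two ratings, and the design weight 1/πj = Nh/nh is derived from assumed stratum sizes. This is an illustrative reimplementation under those assumptions, not the authors’ Mathematica code; the function name ht_weighted_kappa is hypothetical.

```python
import numpy as np

def ht_weighted_kappa(strata, u, v, weights, N_h, n_h):
    """Design-based plug-in weighted kappa for stratified SRSWOR (sketch).

    strata   : stratum label of each sampled item
    u, v     : category codes in {0, ..., c-1} of the two ratings
    N_h, n_h : dicts of population and sample sizes per stratum
    """
    c = weights.shape[0]
    N = sum(N_h.values())
    p = np.zeros((c, c))
    for h, l, m in zip(strata, u, v):
        p[l, m] += (N_h[h] / n_h[h]) / N       # Horvitz-Thompson weight 1/pi_j = N_h/n_h
    row, col = p.sum(axis=1), p.sum(axis=0)    # estimated marginal proportions
    p_obs = (weights * p).sum()
    p_exp = (weights * np.outer(row, col)).sum()
    return (p_obs - p_exp) / (1.0 - p_exp)
```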

As usual, variance estimation is mandatory in order to achieve an evaluation of the accuracy of the estimator. Since (2) is a rather involved function of sample totals, its variance may be conveniently estimated by the linearization method or by the jackknife technique (see e.g. [34] and references therein). Alternatively, a bootstrap approach—based on a pseudo-population method—may be suitably considered (for more details on this topic, see e.g. [35]).
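
One simple pseudo-population bootstrap for a stratified design is sketched below: each sampled item is replicated roughly Nh/nh times to form a pseudo-population, from which stratified samples of the original sizes are redrawn and the kappa estimator is recomputed. This is a schematic variant for illustration, not the exact scheme of Quatember [35] used in the paper; it reuses the hypothetical ht_weighted_kappa function from the previous sketch.

```python
import numpy as np

def bootstrap_ci(strata, u, v, weights, N_h, n_h, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the design-based weighted kappa (sketch)."""
    rng = np.random.default_rng(seed)
    # Build the pseudo-population stratum by stratum (approximate replication).
    pseudo = {h: [] for h in N_h}
    for h, l, m in zip(strata, u, v):
        pseudo[h].extend([(l, m)] * int(round(N_h[h] / n_h[h])))
    stats = []
    for _ in range(B):
        bs, bu, bv = [], [], []
        for h, items in pseudo.items():
            idx = rng.choice(len(items), size=n_h[h], replace=False)  # SRSWOR redraw
            for i in idx:
                bs.append(h); bu.append(items[i][0]); bv.append(items[i][1])
        stats.append(ht_weighted_kappa(bs, bu, bv, weights, N_h, n_h))
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high
```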

It should be remarked that inconclusive ratings occur in EXP1 and EXP2 and—in addition—missing ratings are also present in EXP2. However, even if ANVUR does not explicitly state this issue, its target seems to be the sub-population of items with two reported ratings. Hence, some suitable variants of the Cohen’s kappa coefficient have to be considered. In order to provide an appropriate definition of the population parameter in this setting, the three suggestions provided by De Raadt et al. [36] could be adopted. For the sake of simplicity, let us suppose that inconclusive or missing ratings are classified into the c-th category. A first way to manage the issue consists in deleting all items which are not classified by both raters and applying the weighted Cohen’s kappa coefficient to the items with two ratings (see also [37]). After some straightforward algebra, this variant of the population weighted Cohen’s kappa coefficient may be written as
$$\frac{\frac{1}{q}\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\tilde p_{l+}\tilde p_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\tilde p_{l+}\tilde p_{+m}}, \qquad (3)$$
where
$$q = \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} p_{lm}, \qquad \tilde p_{l+} = \frac{1}{q}\sum_{m=1}^{c-1} p_{lm}, \qquad \tilde p_{+l} = \frac{1}{q}\sum_{m=1}^{c-1} p_{ml}.$$

It is worth noting that (3) might not be a satisfactory index, since it does not take into account the size of the set of inconclusive or missing ratings. Similarly to (1), the variant (3) can be estimated as
$$\frac{\frac{1}{\hat q}\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat{\tilde p}_{l+}\hat{\tilde p}_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat{\tilde p}_{l+}\hat{\tilde p}_{+m}}, \qquad (4)$$
where
$$\hat q = \sum_{l=1}^{c-1}\sum_{m=1}^{c-1}\hat p_{lm}, \qquad \hat{\tilde p}_{l+} = \frac{1}{\hat q}\sum_{m=1}^{c-1}\hat p_{lm}, \qquad \hat{\tilde p}_{+l} = \frac{1}{\hat q}\sum_{m=1}^{c-1}\hat p_{ml}.$$

The second proposal by De Raadt et al. [36] for a variant of the weighted Cohen’s kappa coefficient is based on Gwet’s kappa [38]. The population weighted Gwet’s kappa may be defined as
$$\frac{\frac{1}{q}\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{l+}\,p_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{l+}\,p_{+m}}, \qquad (5)$$
where q is defined as in (3).

This index takes into account the sizes of the sets of inconclusive or missing ratings. Indeed, even if the observed-agreement term in (5) is the same as in (3), the quantity $\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{l+}p_{+m}$ is actually a weighted sum of products of the type $p_{l+}p_{+m}$, in contrast to the quantity $q^{-2}\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,(p_{l+}-p_{lc})(p_{+m}-p_{cm})$ appearing in (3), which is a weighted sum of products of the type $(p_{l+}-p_{lc})(p_{+m}-p_{cm})$. In turn, (5) may be estimated by means of
$$\frac{\frac{1}{\hat q}\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{l+}\hat p_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{l+}\hat p_{+m}}, \qquad (6)$$
where $\hat q$ is defined as in (4).

The third proposal by De Raadt et al. [36] for a variant of (1) stems from assuming null weights for the inconclusive or missing ratings, i.e. by setting wlm = 0 if l = c or m = c. Hence, this variant is obviously defined as
$$\frac{\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{l+}\,p_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,p_{l+}\,p_{+m}}, \qquad (7)$$
where the marginal proportions are computed over all the c categories, i.e. $p_{l+} = \sum_{m=1}^{c} p_{lm}$ and $p_{+m} = \sum_{l=1}^{c} p_{lm}$.

In turn, (7) may be estimated by means of
$$\frac{\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{l+}\hat p_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\hat p_{l+}\hat p_{+m}}, \qquad (8)$$
where $\hat p_{l+} = \sum_{m=1}^{c}\hat p_{lm}$ and $\hat p_{+m} = \sum_{l=1}^{c}\hat p_{lm}$.
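
The sketch below contrasts the listwise-deletion treatment in (3) with the zero-weight treatment in (7), coding missing or inconclusive ratings as the last category. It is a plain sketch in which the population proportions are replaced by simple sample frequencies (the design-weighted versions follow by replacing the counts with Horvitz-Thompson weights); the data and category labels are made up, and the Gwet-based variant (5) is not shown.

```python
import numpy as np

def kappa_variants(u, v, weights, missing_code):
    """Weighted kappa under two treatments of a 'missing/inconclusive' category.

    u, v         : category codes in {0, ..., c-1}; missing_code marks IR/missing
    weights      : (c, c) weight matrix over all c categories
    Returns the analogues of (3) (listwise deletion) and (7) (null weights).
    """
    u, v = np.asarray(u), np.asarray(v)
    c = weights.shape[0]
    p = np.zeros((c, c))
    for l, m in zip(u, v):
        p[l, m] += 1.0
    p /= len(u)

    def kappa(joint, w):
        row, col = joint.sum(axis=1), joint.sum(axis=0)
        po, pe = (w * joint).sum(), (w * np.outer(row, col)).sum()
        return (po - pe) / (1.0 - pe)

    # Analogue of (3): drop items where either rating is missing, then renormalize.
    keep = [k for k in range(c) if k != missing_code]
    sub = p[np.ix_(keep, keep)]
    kappa_del = kappa(sub / sub.sum(), weights[np.ix_(keep, keep)])

    # Analogue of (7): keep all items but give null weight to the missing category.
    w0 = weights.copy()
    w0[missing_code, :] = 0.0
    w0[:, missing_code] = 0.0
    kappa_zero = kappa(p, w0)
    return kappa_del, kappa_zero
```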

The previous findings are applied to the data collected in EXP1 and EXP2 in the following section.

6 Cohen’s kappa coefficient estimation in the Italian experiments

The theoretical results presented in Section 5 can be applied to the raw data of the two experiments. Therefore, it is possible to compute appropriate estimates of the considered weighted Cohen’s kappa coefficients for the agreement (i) between the bibliometric (B) and peer-review (P) ratings and (ii) between the ratings of the first referee (P1) and the second referee (P2). The dot-plot graphics of the distributions of the ratings are provided as S1–S4 Figs.

Some preliminary considerations are required on the choice of the weights for the computation of the Cohen’s kappa. Let W = (wlm) generally denote the square matrix of order c of the weights. The selection of the weights is completely subjective and the adoption of different sets of weights may obviously modify the concordance level. ANVUR presented results for two sets of weights in EXP1 and EXP2. The first set consisted of the usual linear weights, i.e. $w_{lm} = 1 - |l - m|/(c - 1)$. In such a case, the matrices of linear weights for EXP1 (four categories) and EXP2 (five categories) are given, respectively, by
$$\begin{pmatrix} 1 & 2/3 & 1/3 & 0 \\ 2/3 & 1 & 2/3 & 1/3 \\ 1/3 & 2/3 & 1 & 2/3 \\ 0 & 1/3 & 2/3 & 1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 & 3/4 & 1/2 & 1/4 & 0 \\ 3/4 & 1 & 3/4 & 1/2 & 1/4 \\ 1/2 & 3/4 & 1 & 3/4 & 1/2 \\ 1/4 & 1/2 & 3/4 & 1 & 3/4 \\ 0 & 1/4 & 1/2 & 3/4 & 1 \end{pmatrix}.$$

The second set was originally developed by ANVUR and named “VQR-weights”. The matrices of VQR-weights for EXP1 and EXP2 are respectively given by
$$\begin{pmatrix} 1 & 0.8 & 0.5 & 0 \\ 0.8 & 1 & 0.7 & 0.2 \\ 0.5 & 0.7 & 1 & 0.5 \\ 0 & 0.2 & 0.5 & 1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 & 0.7 & 0.4 & 0.1 & 0 \\ 0.7 & 1 & 0.7 & 0.4 & 0.3 \\ 0.4 & 0.7 & 1 & 0.7 & 0.6 \\ 0.1 & 0.4 & 0.7 & 1 & 0.9 \\ 0 & 0.3 & 0.6 & 0.9 & 1 \end{pmatrix}.$$

The VQR-weights were based on the scores adopted in the research assessments, even if they appear counter-intuitive, since they attribute different weights to the same category distance. For example, in EXP1 a distance of two categories is weighted with 0.5 if it occurs between the first and the third category, while it is weighted with only 0.2 if it occurs between the second and the fourth category. In order to reproduce ANVUR’s results, only the sets of linear weights and VQR-weights are considered. In addition, for improving readability, the analysis and the comments are limited to the computations based on the VQR-weights, while the results for the linear weights are available as S1 Table.
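
A small sketch of how the two sets of weights can be generated follows. The construction of the VQR-weights as one minus the absolute difference of the VQR category scores is an assumption that reproduces the examples given above, not a documented formula.

```python
import numpy as np

def linear_weights(c):
    """Linear weights w_lm = 1 - |l - m| / (c - 1)."""
    idx = np.arange(c)
    return 1.0 - np.abs(np.subtract.outer(idx, idx)) / (c - 1)

def score_based_weights(scores):
    """Assumed VQR-style weights: w_lm = 1 - |s_l - s_m|, with s the category scores.

    This reproduces the examples in the text (0.5 for the first-third pair and
    0.2 for the second-fourth pair in EXP1) but is a reconstruction, not ANVUR's
    published definition.
    """
    s = np.asarray(scores, dtype=float)
    return 1.0 - np.abs(np.subtract.outer(s, s))

print(linear_weights(4))                               # EXP1: four categories
print(score_based_weights([1.0, 0.8, 0.5, 0.0]))       # EXP1 VQR scores A-D
print(score_based_weights([1.0, 0.7, 0.4, 0.1, 0.0]))  # EXP2 VQR scores A-E
```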

At first, the estimation of (3), (5) and (7) is considered for the agreement between the bibliometric and peer-review ratings, by means of the estimators (4), (6) and (8). The estimation was carried out for each area and for the global population in both EXP1 and EXP2. Variance estimation was carried out by means of the Horvitz-Thompson based bootstrap—based on the use of a pseudo-population—described by Quatember [35, p. 16 and p. 80]. The whole computation was implemented by means of the algebraic software Mathematica [39]. The corresponding Mathematica notebooks are available on request. The point and interval estimates are given in Tables 3 and 4. The columns labeled “ANVUR” report the point and interval estimates provided by ANVUR [12, 17]. Moreover, in Figs 3 and 4 the estimates obtained through (4), (6) and (8) and the corresponding confidence intervals at the 95% confidence level are plotted in the “error-bar” style.

Table 3. Cohen’s kappa coefficient estimates (percent) for EXP1 (95% confidence level intervals in parenthesis), bibliometric vs peer review ratings.

https://doi.org/10.1371/journal.pone.0242520.t003

Table 4. Cohen’s kappa coefficient estimates (percent) for EXP2 (95% confidence level intervals in parenthesis), bibliometric vs peer review ratings.

https://doi.org/10.1371/journal.pone.0242520.t004

Fig 3. “Error-bar” plots of the Cohen’s kappa coefficient estimates (percent) for EXP1, bibliometric vs peer review ratings.

The confidence intervals are at the 95% confidence level; the estimates obtained through (4), (6) and (8) are in red, green and blue, respectively.

https://doi.org/10.1371/journal.pone.0242520.g003

Fig 4. “Error-bar” plots of the Cohen’s kappa coefficient estimates (percent) for EXP2, bibliometric vs peer review ratings.

The confidence intervals are at the 95% confidence level; the estimates obtained through (4), (6) and (8) are in red, green and blue, respectively.

https://doi.org/10.1371/journal.pone.0242520.g004

Actually, the point estimates given by ANVUR correspond to those computed by means of (4). Thus, even if this issue is not explicitly stated in its reports [12, 17], ANVUR focused on the sub-population of articles with two reported ratings and considered the estimation of (3). Hence, the Cohen’s kappa coefficient assumed by ANVUR does not account for the size of the set of inconclusive ratings in EXP1, nor of inconclusive or missing ratings in EXP2. Moreover, the confidence intervals provided by ANVUR—and reported in Tables 3 and 4—are the same as those computed by means of the packages psych (in the case of EXP1) and vcd (in the case of EXP2) of the software R [40]. Unfortunately, these confidence intervals rely on the model-based approximation for large samples described by Fleiss et al. [41, p. 610]. Thus, even if ANVUR apparently adopted a design-based inference, the variance estimation was carried out in a model-based approach. The columns corresponding to (4) in Tables 3 and 4 show the appropriate version of the ANVUR estimates, i.e. the design-based point estimates and the corresponding confidence intervals, which were computed by the bootstrap method. These confidence intervals are generally narrower than those originally computed by ANVUR—consistently with the fact that a stratified sampling design, rather than a simple random sampling design, was carried out.

It is also convenient to consider the two alternative definitions of the weighted Cohen’s kappa coefficient, i.e. (5) and (7), and the corresponding estimators (6) and (8). These concordance measures take into account the sizes of the sets of discarded articles—as formally explained in Section 5. From Tables 3 and 4, for both EXP1 and EXP2, the point and interval estimates obtained through (6) are similar to those obtained through (4). In contrast, the point and interval estimates obtained through (8) tend to be systematically smaller than those obtained through (4) and (6). Arguably, this outcome should be expected. Indeed, (7) is likely to be more conservative than (3) and (5), since it assigns null weights to IR and missing articles.

By considering Fig 3, the first evidence is that Area 13—i.e. Economics and Statistics—is likely to be an outlier. In particular, the point and interval estimates are identical when computed by using (4), (6) or (8), since in Area 13 the use of a simple journal ranking—as remarked in Section 3—did not produce IR scores. More importantly, in Area 13 the value of agreement for EXP1 is higher than 0.50 and much higher than the values of all the other areas. Baccini and De Nicolao [6, 7] documented that in Area 13 the protocol of the experiment was substantially modified with respect to the other areas, and that this modification contributed to boost the concordance between bibliometrics and peer review. In contrast, from Fig 4, Area 13 cannot be considered an outlier in EXP2—even if it shows slightly higher values of agreement with respect to the other areas. Indeed, in EXP2 Area 13 adopted the same protocol as the other areas. Thus, it could be conjectured that the higher agreement was due to the exclusive use of journal ranking for attributing bibliometric scores.

Let us focus on the other areas in EXP1 and EXP2. The confidence intervals obtained through (4) and (6) largely overlap. For most of the areas, the upper bound of the confidence interval obtained through (8) is smaller than the lower bounds of the confidence intervals obtained through (4) and (6). Therefore, ANVUR’s choice of discarding IR and missing articles presumably boosted the agreement between bibliometrics and peer review. Anyway, the upper bounds of the confidence intervals obtained through (4) and (6) are generally smaller than 0.40, and those obtained through (8) are generally smaller than 0.30. A baseline for interpreting these values is provided in Table 13.6 of Fagerland et al. [27, p. 550]. According to this guideline, a value of the simple Cohen’s kappa less than or equal to 0.20 is considered a “poor” concordance and a value in the interval (0.20, 0.40] is considered a “weak” concordance; values in the intervals (0.40, 0.60] and (0.60, 1.00] are considered as indicating a “moderate” and a “very good” concordance, respectively. However, it should be remarked that these thresholds refer to the simple Cohen’s kappa. Hence, the small values of the weighted Cohen’s kappa coefficients can be interpreted as indicating a concordance even worse than weak.

Subsequently, the estimation of the Cohen’s kappa coefficient for the agreement between the ratings assigned to the articles by the two reviewers, i.e. P1 and P2, is also considered.

Thus, the estimation of (3) is carried out for the whole population of articles, for the sub-population of articles receiving a Definite Bibliometric Rating (DBR) and for the sub-population of articles with an Inconclusive bibliometric Rating (IR). The point and interval estimates are reported in Tables 5 and 6, and displayed in Figs 5 and 6 in the “error-bar” style. It should be remarked that—owing to the use of journal ranking—there are no IR articles in Area 13.

Fig 5. “Error-bar” plots of the Cohen’s kappa coefficient estimates (percent) for EXP1, P1 vs P2 ratings.

The confidence intervals are at the 95% confidence level; the estimates obtained through (4) for the whole population, for the DBR sub-population and for the IR sub-population are in red, green and blue, respectively.

https://doi.org/10.1371/journal.pone.0242520.g005

Fig 6. “Error-bar” plots of the Cohen’s kappa coefficient estimates (percent) for EXP2, P1 vs P2 ratings.

The confidence intervals are at the 95% confidence level; the estimates obtained through (4) for the whole population, for the DBR sub-population and for the IR sub-population are in red, green and blue, respectively.

https://doi.org/10.1371/journal.pone.0242520.g006

Table 5. Cohen’s kappa coefficient estimates (percent) for EXP1 (95% confidence level intervals in parenthesis), P1 vs P2 ratings.

https://doi.org/10.1371/journal.pone.0242520.t005

Table 6. Cohen’s kappa coefficient estimates (percent) for EXP2 (95% confidence level intervals in parenthesis), P1 vs P2 rating.

https://doi.org/10.1371/journal.pone.0242520.t006

In Tables 5 and 6, the column labeled “ANVUR” reports the estimates provided by ANVUR [12, 17]. ANVUR did not explicitly state that it aimed to estimate (3) in the sub-population of articles with a definite bibliometric rating. However, this can be inferred from Tables 5 and 6, where—apart from specific errors in the ANVUR computation for some areas—the ANVUR point estimates correspond to the estimates obtained through (4) for the DBR sub-population. The confidence intervals provided by ANVUR are the same as those computed by means of the packages psych and vcd of the software R [40]. Thus, in this case also, ANVUR apparently adopted a design-based inference, even if the variance estimation was carried out in a model-based approach. Therefore, in Tables 5 and 6 the column corresponding to (4) for the DBR sub-population reports the appropriate version of the ANVUR point and interval estimates. The point estimate of (3) between the two reviewers for the whole population of articles, i.e. the column corresponding to (4) for the whole population in Tables 5 and 6, is generally lower than 0.30, with the exception of Area 13. The confidence intervals for the whole population overlap with those for the DBR sub-population. From Figs 5 and 6, it is also apparent that the estimates for the whole population and for the DBR sub-population are generally greater than those for the IR sub-population. This last finding confirms the conjecture by Baccini and De Nicolao [8] that the articles for which the bibliometric rating was inconclusive were also the more difficult to evaluate for reviewers, as shown by the smaller degree of agreement between reviewers for these papers.

For both experiments, ANVUR directly compared the concordance between P1 and P2 with the concordance between peer review and bibliometrics (see [9, p. 8] for a critique of this comparison). As for EXP1, Ancaiani et al. commented on the results of the comparison as follows: “the degree of concordance among different reviewers is generally lower than that obtained between the aggregate peer review and the bibliometric evaluation: in this sense, combining evaluations obtained with peer review and bibliometric methods can be considered as more reliable than the usual practice of combining two or more different evaluations obtained by various reviewers of the same article” [4]. Actually, they compared the level of agreement between bibliometrics and peer review (i.e. the column ANVUR in Table 3) with the agreement between the two referees for the sub-population DBR (more precisely, the column ANVUR in Table 5). When the appropriate estimates are considered, i.e. the second column in Table 5, it is apparent that Ancaiani et al.’s statement is no longer true. Hence, their policy suggestion cannot be considered as evidence-based. Actually, Ancaiani et al.’s statement appears true only for Area 13, where the concordance indexes between bibliometrics and peer review are much higher than the corresponding indexes between the two reviewers (see Tables 3 and 5). Also in this case, the exception of Area 13 is probably due to the modification of the protocol of the experiment, which boosted the agreement between peer review and bibliometrics.

As to EXP2, the agreement between the two reviewers is similar to the agreement between bibliometrics and peer review—even in Area 13, where the experiment was implemented with a protocol identical to that of the other areas. These estimates are at odds with ANVUR’s conclusions: “It is particularly important the result that the degree of agreement between the bibliometric and the peer evaluation is always higher than the one existing between the two individual peer reviews” [17]. Also in this case, ANVUR’s conclusions were based on estimates computed on the sub-population of articles that—as previously remarked—boosted the values of agreement between bibliometrics and peer review.

7 Testing homogeneity of missing proportions between strata

In the case of EXP2, Section 4 considers the sizes of the sets of missing peer ratings as fixed and—accordingly—a design-based approach for the estimation of rating agreement is carried out. However, it could also be interesting to assess the homogeneity of missing proportions in the different areas by assuming a random model for the missing peer ratings, i.e. by considering a model-based approach for missing-proportion estimation and testing. In order to provide an appropriate setting in such a case, let us again suppose a fixed population of N items partitioned into L strata. Moreover, a stratified sampling design is adopted and the notation introduced in Section 5 is assumed. Hence, each item in the h-th stratum may be missing with probability θh ∈ [0, 1]—independently of the other items. Thus, the size of the set of missing items in the h-th stratum, say Mh, is a random variable (r.v.) distributed according to the Binomial law with parameters Nh and θh, i.e. the probability function (p.f.) of Mh turns out to be
$$P(M_h = m) = \binom{N_h}{m}\,\theta_h^{m}(1 - \theta_h)^{N_h - m}, \qquad m = 0, 1, \dots, N_h.$$

Let us assume that the r.v. Xh represents the size of the set of missing items of the h-th stratum in the sample. By supposing that the items are missing independently of the sampling design, the distribution of the r.v. Xh given the event {Mh = m} is the Hypergeometric law with parameters nh, m and Nh, i.e. the corresponding conditional p.f. is given by
$$P(X_h = x \mid M_h = m) = \frac{\binom{m}{x}\binom{N_h - m}{n_h - x}}{\binom{N_h}{n_h}}, \qquad \max(0, n_h - N_h + m) \le x \le \min(n_h, m).$$

Hence, on the basis of this finding and by using the result by Johnson et al. [42, p. 377], the r.v. Xh is distributed according to the Binomial law with parameters nh and θh, i.e. the p.f. of Xh is
$$P(X_h = x) = \binom{n_h}{x}\,\theta_h^{x}(1 - \theta_h)^{n_h - x}, \qquad x = 0, 1, \dots, n_h,$$
for each h = 1, …, L. Obviously, the Xh’s are independent r.v.’s.

Under the frequentist paradigm, let us consider the null hypothesis of missing proportion homogeneity H0: θh = θ, ∀h = 1, …, L, versus the alternative hypothesis H1: θh ≠ θ, ∃h = 1, …, L. For a given realization (x1, …, xL) of (X1, …, XL), such that xh ∈ {0, 1, …, nh}, the likelihood function under the null hypothesis is given by
$$L_0(\theta) = \prod_{h=1}^{L}\binom{n_h}{x_h}\,\theta^{x_h}(1 - \theta)^{n_h - x_h},$$
while the likelihood function under the alternative hypothesis is given by
$$L_1(\theta_1, \dots, \theta_L) = \prod_{h=1}^{L}\binom{n_h}{x_h}\,\theta_h^{x_h}(1 - \theta_h)^{n_h - x_h}.$$

Thus, the maximum-likelihood estimator of θ under the null hypothesis turns out to be $\hat\theta = Y/n$, where $Y = \sum_{h=1}^{L} X_h$. In addition, the maximum-likelihood estimator of (θ1, …, θL) under the alternative hypothesis turns out to be $(\hat\theta_1, \dots, \hat\theta_L)$, where $\hat\theta_h = X_h/n_h$.

The likelihood-ratio test statistic could be adopted in order to assess the null hypothesis. However, in the present setting large-sample results are precluded, since the sample size n is necessarily bounded by N and data sparsity could reduce the effectiveness of large-sample approximations. A more productive approach may be based on conditional testing (see e.g. [43, Chapter 10]). First, the χ2 test statistic—asymptotically equivalent in distribution to the likelihood-ratio test statistic—is considered, which in this case, after some algebra, reduces to
$$R = \sum_{h=1}^{L} \frac{(X_h - n_h\hat\theta)^2}{n_h\hat\theta(1 - \hat\theta)}.$$

It should be remarked that the r.v. Y is sufficient for θ under the null hypothesis. Hence, the distribution of the random vector (X1, …, XL) given the event {Y = y} does not depend on θ. Moreover, under the null hypothesis, the distribution of the random vector (X1, …, XL) given the event {Y = y} is the multivariate Hypergeometric law with parameters y and (n1, …, nL), i.e. the corresponding conditional p.f. is
$$P(X_1 = x_1, \dots, X_L = x_L \mid Y = y) = \frac{\prod_{h=1}^{L}\binom{n_h}{x_h}}{\binom{n}{y}}, \qquad (x_1, \dots, x_L) \in A,$$
where
$$A = \Big\{(x_1, \dots, x_L) : x_h \in \{0, 1, \dots, n_h\},\ \sum_{h=1}^{L} x_h = y\Big\}.$$

Thus, by adopting the conditional approach, an exact test may be carried out. Indeed, if r represents the observed realization of the test statistic R, the corresponding P-value is
$$P(R \ge r \mid Y = y) = \sum_{(x_1, \dots, x_L) \in C_r} \frac{\prod_{h=1}^{L}\binom{n_h}{x_h}}{\binom{n}{y}},$$
where Cr = {(x1, …, xL) ∈ A : R(x1, …, xL) ≥ r}. It should be remarked that the previous P-value may be approximated by means of a Monte Carlo method, by generating realizations of a Hypergeometric random vector with parameters y and (n1, …, nL). The generation of each realization requires (L − 1) Hypergeometric random variates—for which suitable algorithms exist—and hence the method is practically feasible.
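
The conditional Monte Carlo test can be sketched as follows: the L stratum counts are drawn sequentially from univariate hypergeometric laws, which is equivalent to sampling the multivariate Hypergeometric vector with parameters y and (n1, …, nL). The stratum sample sizes and missing counts in the example are placeholders, not the EXP2 figures.

```python
import numpy as np

def chi2_stat(x, n):
    """Homogeneity chi-square statistic R for binomial counts x with sizes n."""
    x, n = np.asarray(x, dtype=float), np.asarray(n, dtype=float)
    theta = x.sum() / n.sum()                   # pooled estimate under H0
    return ((x - n * theta) ** 2 / (n * theta * (1.0 - theta))).sum()

def exact_test_mc(x_obs, n, n_rep=100_000, seed=0):
    """Monte Carlo approximation of the conditional (exact) P-value.

    Given Y = sum(x_obs), the vector (X_1, ..., X_L) is multivariate
    Hypergeometric with parameters y and (n_1, ..., n_L); it is sampled with
    L - 1 sequential univariate hypergeometric draws.
    """
    rng = np.random.default_rng(seed)
    n = np.asarray(n, dtype=int)
    y, r_obs = int(np.sum(x_obs)), chi2_stat(x_obs, n)
    exceed = 0
    for _ in range(n_rep):
        good, total = y, int(n.sum())           # remaining "missing" items and capacity
        x = np.empty(len(n), dtype=int)
        for h in range(len(n) - 1):
            x[h] = rng.hypergeometric(good, total - good, int(n[h]))
            good -= int(x[h])
            total -= int(n[h])
        x[-1] = good                            # the last stratum takes the remainder
        if chi2_stat(x, n) >= r_obs:
            exceed += 1
    return exceed / n_rep

# Placeholder counts (NOT the EXP2 data): missing articles and sample sizes per stratum.
print(exact_test_mc(x_obs=[10, 45, 12, 60], n=[400, 600, 350, 700], n_rep=20_000))
```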

Alternatively, under the Bayesian paradigm, the hypothesis of missing probability homogeneity between strata may be specified as the model which assumes that Xl is distributed according to the Binomial law with parameters nl and θ, for l = 1, …, L. In contrast, the model under the general alternative postulates that Xl is distributed according to the Binomial law with parameters nl and θl, for l = 1, …, L. By assuming prior distributions in such a way that θ is elicited as an absolutely-continuous r.v. Θ defined on [0, 1] with probability density function (p.d.f.) $\pi$, while (θ1, …, θL) is elicited as the vector (Θ1, …, ΘL) of absolutely-continuous r.v.’s defined on [0, 1]^L with joint p.d.f. $\tilde\pi$, the Bayes factor of the homogeneity model against the general alternative is given by
$$B_{1,0} = \frac{\int_{0}^{1} \prod_{l=1}^{L}\binom{n_l}{x_l}\,\theta^{x_l}(1-\theta)^{n_l - x_l}\,\pi(\theta)\, d\theta}{\int_{[0,1]^L} \prod_{l=1}^{L}\binom{n_l}{x_l}\,\theta_l^{x_l}(1-\theta_l)^{n_l - x_l}\,\tilde\pi(\theta_1, \dots, \theta_L)\, d\theta_1 \cdots d\theta_L}.$$

If conjugate priors are considered, the r.v. Θ is assumed to be distributed according to the Beta law with parameters a and b, while (Θ1, …, ΘL) is a vector of r.v.’s with independent components, in such a way that each Θl is distributed according to the Beta law with parameters al and bl. It is worth noting that—in a similar setting—a slightly more general hierarchical model is considered by Kass and Raftery [44] (see also [45, p. 190]). Hence, the Bayes factor reduces to
$$B_{1,0} = \frac{B(a + y,\, b + n - y)/B(a, b)}{\prod_{l=1}^{L} B(a_l + x_l,\, b_l + n_l - x_l)/B(a_l, b_l)},$$
where—as usual—B(a, b) denotes the Euler Beta function with parameters a and b. In the case of non-informative Uniform priors, i.e. when a = b = 1 and al = bl = 1 for l = 1, …, L, it is apparent that B1,0 simplifies to
$$B_{1,0} = \frac{B(y + 1,\, n - y + 1)}{\prod_{l=1}^{L} B(x_l + 1,\, n_l - x_l + 1)}.$$
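
Under the uniform-prior expression above—itself a reconstruction in which the homogeneity model sits in the numerator—the Bayes factor can be computed on the log scale to avoid underflow. The counts in the example are placeholders, not the EXP2 data.

```python
from scipy.special import betaln   # log of the Euler Beta function

def log_bf_homogeneity(x, n):
    """log of B = m(homogeneity) / m(heterogeneity) with Uniform(0, 1) priors.

    Values far below zero indicate that homogeneity of the missing proportions
    across strata is very unlikely (assumed orientation of the Bayes factor).
    """
    y, n_tot = sum(x), sum(n)
    log_m_hom = betaln(y + 1, n_tot - y + 1)                         # pooled theta
    log_m_het = sum(betaln(xl + 1, nl - xl + 1) for xl, nl in zip(x, n))
    return log_m_hom - log_m_het

# Placeholder counts per stratum (not the EXP2 figures).
x = [10, 45, 12, 60]        # missing articles in each stratum's sample
n = [400, 600, 350, 700]    # stratum sample sizes
print(log_bf_homogeneity(x, n))
```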

The testing procedures developed above are applied to the data of EXP2 by considering the areas as the strata (see Table 2). At first, under the frequentist paradigm, the null hypothesis H0 of missing proportion homogeneity between strata is considered. The null hypothesis H0 can be rejected, since the P-value corresponding to the test statistic R was less than 10−16. Subsequently, under the Bayesian paradigm and with non-informative Uniform priors, the Bayes factor is computed. In turn, the missing proportion homogeneity is not likely, since B1,0 was less than 10−16. Thus, the conclusions are as follows. Actually, the adoption of stratified random sampling in EXP2 was a suitable design choice, since the population of articles has a structural partition into areas. However, missing data occurred in the stratified sample, since some reviewers refused to referee the assigned articles. Even if this issue is disturbing, it would be a minor drawback if the items were missed proportionally with respect to the strata. Indeed, as shown in Figs 1 and 2, the phenomenon is intrinsic to EXP2—owing to the different implementation of EXP2 with respect to EXP1. Generally, if data were missing at random between strata, the effect on the Cohen’s kappa estimator would presumably be weak. For a discussion of missing data in the design-based approach, see e.g. the monograph by Little and Rubin [46]. Unfortunately, on the basis of the previous results, we have assessed that the articles are not missed proportionally between the areas, but are missed according to an unknown random mechanism. As a matter of fact, if the data are missing not at random, corrections are much more difficult and unpredictable biases could arise [46]. As a consequence, the estimates for EXP2 should be considered very carefully, since in some areas the estimated proportion of missing articles is much higher than in the other areas: e.g. Area 6, with a missing rate of 231/1071 ≃ 21.6%, and Area 9, with a missing rate of 108/739 ≃ 14.6%. In addition, these different missing rates occur in the largest strata. Actually, the reasons why reviewers refused to handle the articles—or failed to provide the score in the required time—are not known, and this issue could introduce a further bias in the results of the assessment.

8 Discussion and conclusion

The Italian governmental agency for research evaluation ANVUR conducted two experiments for assessing the degree of agreement between bibliometrics and peer review. They were based on stratified random samples of articles, which were scored by bibliometrics and by informed peer review. Subsequently, concordance measures were computed between the ratings resulting from the two evaluation techniques. The aim of the two experiments was "to validate the dual system of evaluation" [4] adopted in the research assessments. In a nutshell, ANVUR preferentially used bibliometric indicators for evaluating articles in the research assessment exercises. When the bibliometric rating was inconclusive, ANVUR commissioned a pair of reviewers to evaluate the article: for these articles, peer review substituted for bibliometrics. Bibliometric and peer-review ratings were then summed up for computing the aggregate score of research fields, departments and institutions. The "dual system of evaluation" might have introduced major biases in the results of the research assessments if bibliometrics and peer review generated systematically different scores. A high level of agreement is therefore a necessary condition for the robustness of research assessment results. The two experiments were designed to test the degree of agreement between bibliometrics and peer review at the individual article level.

This paper reconsiders in full the raw data of the two experiments by adopting the same concordance measure—i.e. the weighted Cohen's kappa coefficient—and the same systems of weights used in EXP1 and EXP2. In order to analyze the experiments in the appropriate inferential setting, the design-based estimation of the Cohen's kappa coefficient and the corresponding confidence interval were developed and adopted for computing the agreement between bibliometrics and peer review in EXP1 and EXP2. Three suggestions are proposed for properly defining the population Cohen's kappa coefficients to be estimated. In one case, the suggested definition represents the suitable version of the coefficient estimated by ANVUR. The other two definitions are advisable for taking into account the numbers of articles discarded by ANVUR.
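As an illustration of the general estimation strategy, the following Python sketch expands each sampled article by its stratum weight before cross-tabulating the two ratings and computing a weighted Cohen's kappa; the simulated scores, the stratum sizes and the linear weighting scheme are illustrative assumptions and do not reproduce the exact design-based estimator, the weights or the data used in the experiments.

```python
import numpy as np


def weighted_kappa(table, weights):
    # Weighted Cohen's kappa for a K x K cross-classification table.
    p = table / table.sum()                            # joint proportions
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))  # chance agreement table
    po = (weights * p).sum()                           # observed weighted agreement
    pe = (weights * expected).sum()                    # expected weighted agreement
    return (po - pe) / (1 - pe)


def linear_weights(k):
    # w_ij = 1 - |i - j| / (k - 1): full credit on the diagonal and
    # partial credit decreasing linearly with the category distance.
    i, j = np.indices((k, k))
    return 1 - np.abs(i - j) / (k - 1)


def design_based_table(biblio, peer, stratum, pop_sizes, k):
    # Expand each sampled article by the stratum weight N_h / n_h before
    # cross-tabulating bibliometric (rows) and peer-review (columns) scores.
    table = np.zeros((k, k))
    for h, N_h in pop_sizes.items():
        mask = stratum == h
        w = N_h / mask.sum()
        for b, p in zip(biblio[mask], peer[mask]):
            table[b, p] += w
    return table


# Illustrative data: scores on a 4-category scale in two strata (areas).
rng = np.random.default_rng(0)
stratum = np.repeat(["Area 1", "Area 2"], [60, 40])
biblio = rng.integers(0, 4, size=100)
peer = np.clip(biblio + rng.integers(-2, 3, size=100), 0, 3)   # noisy second rating
pop_sizes = {"Area 1": 3000, "Area 2": 1200}                   # hypothetical N_h

table = design_based_table(biblio, peer, stratum, pop_sizes, k=4)
print(f"design-based linear-weighted kappa: {weighted_kappa(table, linear_weights(4)):.3f}")
```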

As to the agreement between bibliometrics and peer review in EXP1, the point and interval estimates of the various versions of the weighted Cohen's kappa indicate a degree of concordance that is—at most—weak, both for the aggregate population and for each scientific area. In EXP2, the degree of agreement between bibliometrics and peer review is generally even lower than in EXP1.

Results for Area 13, i.e. Economics and Statistics, deserve a separate consideration. In EXP1, the Cohen's kappa coefficient was estimated to be 54.17%. According to [6], this anomalously high value was possibly due to the modification of the experiment protocol in this area. Indeed, in EXP2—when an identical protocol was adopted for all the areas—the agreement for Area 13 was only slightly larger than, and still comparable with, that of the other areas.

Two further points have to be considered. First, the lower agreement registered in EXP2 was arguably due to the adopted systems of ratings, which were based on four categories in EXP1 and on five categories in EXP2. Second, the systems of weights developed by ANVUR tended to boost the value of the weighted Cohen's kappa coefficients with respect to other, more usual, systems of weights (see the S1 Table for the computations with linear weights). Hence, the estimates indicate that the "real" level of concordance between bibliometrics and peer review is likely to be even worse than weak in both EXP1 and EXP2.
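The sensitivity of the weighted Cohen's kappa to the adopted system of weights can be appreciated with a small numerical example: the sketch below evaluates the same hypothetical cross-classification table under linear weights and under an alternative scheme giving more credit to adjacent categories; both the table and the alternative weights are invented for illustration and are not those used by ANVUR.

```python
import numpy as np


def weighted_kappa(table, weights):
    # Weighted Cohen's kappa for a K x K table of counts.
    p = table / table.sum()
    po = (weights * p).sum()
    pe = (weights * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()
    return (po - pe) / (1 - pe)


k = 4
i, j = np.indices((k, k))
linear = 1 - np.abs(i - j) / (k - 1)                        # standard linear weights
adjacent = np.where(np.abs(i - j) <= 1, 1 - 0.2 * np.abs(i - j), 0.0)
# "adjacent" grants 80% credit to neighbouring categories and nothing beyond;
# it is an invented scheme, NOT the weighting adopted by ANVUR.

# Hypothetical cross-classification of bibliometric (rows) and peer-review
# (columns) ratings with many near-diagonal disagreements.
table = np.array([[30, 20,  5,  1],
                  [15, 25, 20,  5],
                  [ 5, 20, 25, 15],
                  [ 1,  5, 20, 30]], dtype=float)

print(f"linear weights:   kappa = {weighted_kappa(table, linear):.3f}")
print(f"adjacent weights: kappa = {weighted_kappa(table, adjacent):.3f}")
```

On this table the alternative scheme yields a visibly larger coefficient, which illustrates why the choice of weights matters when agreement levels are compared.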

The two experiments also investigated the agreement between the two reviewers who scored each article of the stratified random sample. For EXP1, the correct version of the estimates for the article population indicates that the agreement between the two reviewers tends to be lower than 0.30. An even slightly lower concordance level is obtained for EXP2. In sum, the agreement between pairs of reviewers is weak. In turn, Area 13 represented an exception, with the highest level of agreement in both experiments. As previously remarked, in contrast with the other areas, Area 13 adopted a ranking of journals for bibliometric evaluation. When peer reviewers were asked to evaluate a paper, they knew the ranking of journals. Thus, it is possible to conjecture that this very simple piece of information boosted the agreement between reviewers, since they tended to adopt the ranking of journals as a criterion for evaluating articles.

In sum, the two Italian experiments give concordant evidence that bibliometrics and peer review have a less than weak level of agreement at the individual article level. This result is consistent with the Metric Tide results [11, 47]. Furthermore, the experiments also show that the agreement between two peer reviewers is in turn very weak. If the agreement between reviewers is interpreted as an estimate of "peer review uncertainty" [2], this uncertainty is of the same order of magnitude as the uncertainty generated by the use of bibliometrics and peer review.

As to EXP2, a further problem arose from the presence of missing values, originating from the refusal of some peer reviewers to referee the assigned articles. Hence, the EXP2 results cannot be easily extended even to the population of journal articles submitted to the research assessment.

From the evidence presented in this paper, it is possible to draw a couple of research policy considerations. The first deals with the Italian research assessment exercises. The results of the experiments cannot be considered as validating the use of the dual method of evaluation adopted by ANVUR. At the current state of knowledge, it cannot be excluded that the use of the dual method introduced uncontrollable major biases in the final results of the assessments. Indeed, bibliometrics and peer review show a weak agreement. In particular, the evidence drawn from data in the official research reports [12, 17] shows that peer reviewers' scores were on average lower than bibliometric ones. Unbiased results at an aggregate level would be produced solely if the distribution of articles evaluated by the two methods was homogeneous across the various units of assessment (research fields, research areas, departments and universities). Official reports show that the distribution was not homogeneous. The proportion per research area of articles with an inconclusive bibliometric score, and consequently evaluated by peer review, varied from 0.9% to 26.5% in VQR1 (source: [12, Table 3.5]), and from 0.1% to 19.2% in VQR2 (source: [17, Table 3.5]). Therefore, the aggregate results for research fields, departments and universities might be affected by the proportion of research outputs evaluated by the two different techniques: the higher the proportion of research outputs evaluated by peer review, the lower the aggregate score. From publicly available data, it is possible to show that the average score at the research area level has—rather generally—a negative association with the percentage of papers evaluated by peer review. This holds for both VQR1 and VQR2, as shown in the S5–S8 Figs (data available as S2 File). These considerations do not allow one to exclude that the results of the two Italian research assessments are biased. As a consequence, their use for policy purposes and funding distribution is questionable.
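The association can be inspected, for instance, with the following sketch, which fits one regression line per research area in the spirit of the S6 and S8 Figs; the file name and the column names refer to a hypothetical CSV export of the S2 File and are assumptions, not its actual layout.

```python
import pandas as pd
from scipy.stats import linregress

# Hypothetical CSV export of the S2 File; the file name and the column names
# (area, avg_score, share_peer_review) are illustrative assumptions.
df = pd.read_csv("vqr1_fields.csv")

# One regression line per research area: average field score against the
# share of outputs scored by peer review.
for area, grp in df.groupby("area"):
    res = linregress(grp["share_peer_review"], grp["avg_score"])
    print(f"Area {area}: slope = {res.slope:.3f} (p = {res.pvalue:.3f})")
```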

Generally, the lesson from the two Italian experiments is that the use of a dual method of evaluation in the same research assessment exercise should at least be considered with extreme caution. A low agreement between bibliometrics and peer review at the level of the individual article indicates that metrics should not replace peer review at that level. The use of dual methods for reducing the costs of evaluation might dramatically worsen the quality of information obtained in a research assessment exercise.

Supporting information

S1 Fig. Joint distribution of peer review and bibliometrics evaluations for EXP1.

Count overlapping points (proportion). All research areas.

https://doi.org/10.1371/journal.pone.0242520.s001

(TIF)

S2 Fig. Joint distribution of peer review and bibliometrics evaluations for EXP1.

Count overlapping points (proportion). Separate plot for each research area.

https://doi.org/10.1371/journal.pone.0242520.s002

(TIF)

S3 Fig. Joint distribution of peer review and bibliometrics evaluations for EXP2.

Count overlapping points (proportion). All research areas.

https://doi.org/10.1371/journal.pone.0242520.s003

(TIF)

S4 Fig. Joint distribution of peer review and bibliometrics evaluations for EXP2.

Count overlapping points (proportion). Separate plot for each research area.

https://doi.org/10.1371/journal.pone.0242520.s004

(TIF)

S5 Fig. Average score and percentage of articles evaluated by peer review for research fields in VQR1.

All areas.

https://doi.org/10.1371/journal.pone.0242520.s005

(TIF)

S6 Fig. Regression lines for average score and percentage of articles evaluated by peer review for different research areas.

VQR1.

https://doi.org/10.1371/journal.pone.0242520.s006

(TIF)

S7 Fig. Average score and percentage of articles evaluated by peer review for research fields in VQR2.

All areas.

https://doi.org/10.1371/journal.pone.0242520.s007

(TIF)

S8 Fig. Regression lines for average score and percentage of articles evaluated by peer review for different research areas.

VQR2.

https://doi.org/10.1371/journal.pone.0242520.s008

(TIF)

S1 File. Raw anonymized data of the experiments EXP1 and EXP2.

https://doi.org/10.1371/journal.pone.0242520.s009

(XLSX)

S2 File. Average score and proportion of peer review evaluations at a scientific field level.

Data from the two Italian research assessment exercises VQR1 and VQR2.

https://doi.org/10.1371/journal.pone.0242520.s010

(XLSX)

S1 Table. Cohen’s kappa estimates with linear weights for EXP1 and EXP2.

4 Tables.

https://doi.org/10.1371/journal.pone.0242520.s011

(PDF)

References

1. Hicks D. Performance-based university research funding systems. Research Policy. 2012;41(2):251–261.
2. Traag VA, Waltman L. Systematic analysis of agreement between metrics and peer review in the UK REF. Palgrave Communications. 2019;5(1):29.
3. Kulczycki E, Korzeń M, Korytkowski P. Toward an excellence-based research funding system: Evidence from Poland. Journal of Informetrics. 2017;11(1):282–298. https://doi.org/10.1016/j.joi.2017.01.001.
4. Ancaiani A, Anfossi AF, Barbara A, Benedetto S, Blasi B, Carletti V, et al. Evaluating scientific research in Italy: The 2004–10 research evaluation exercise. Research Evaluation. 2015;24(3):242–255.
5. Pride D, Knoth P. Peer Review and Citation Data in Predicting University Rankings, a Large-Scale Analysis. In: Méndez E, Crestani F, Ribeiro C, David G, Correia Lopes J, editors. Digital Libraries for Open Knowledge. 22nd International Conference on Theory and Practice of Digital Libraries. Springer International Publishing; 2018.
6. Baccini A, De Nicolao G. Do they agree? Bibliometric evaluation versus informed peer review in the Italian research assessment exercise. Scientometrics. 2016;108(3):1651–1671.
7. Baccini A, De Nicolao G. Reply to the comment of Bertocchi et al. Scientometrics. 2016;108(3):1675–1684.
8. Baccini A, De Nicolao G. A letter on Ancaiani et al. 'Evaluating scientific research in Italy: the 2004-10 research evaluation exercise'. Research Evaluation. 2017;26(4):353–357.
9. Baccini A, De Nicolao G. Errors and secret data in the Italian research assessment exercise. A comment to a reply. RT A Journal on Research Policy and Evaluation. 2017;5(1). https://doi.org/10.13130/2282-5398/8872
10. Wouters P, Thelwall M, Kousha K, Waltman L, de Rijcke S, Rushforth A, et al. The Metric Tide: Literature Review (Supplementary Report I to the Independent Review of the Role of Metrics in Research Assessment and Management). HEFCE; 2015. Available from: http://www.hefce.ac.uk/media/HEFCE,2014/Content/Pubs/Independentresearch/2015/The,Metric,Tide/2015_metrictideS1.pdf.
11. HEFCE. The Metric Tide: Correlation analysis of REF2014 scores and metrics (Supplementary Report II to the Independent Review of the Role of Metrics in Research Assessment and Management). HEFCE; 2015. Available from: https://re.ukri.org/documents/hefce-documents/metric-tide-lit-review-2/.
12. ANVUR. Rapporto finale. Valutazione della qualità della ricerca 2004-2010 (VQR 2004-2010). ANVUR; 2013. Available from: http://www.anvur.org/rapporto/files/VQR2004-2010_RapportoFinale_parteprima.pdf; http://www.anvur.org/rapporto/.
13. Bertocchi G, Gambardella A, Jappelli T, Nappi CA, Peracchi F. Bibliometric Evaluation vs. Informed Peer Review: Evidence from Italy. Centre for Studies in Economics and Finance (CSEF); 2013. Available from: http://www.csef.it/WP/wp344.pdf.
14. Bertocchi G, Gambardella A, Jappelli T, Nappi CA, Peracchi F. Bibliometric evaluation vs. informed peer review: Evidence from Italy. Research Policy. 2015;44(2):451–466.
15. Benedetto S, Cicero T, Malgarini M, Nappi CA. Reply to the letter on Ancaiani et al. 'Evaluating Scientific research in Italy: The 2004–10 research evaluation exercise'. Research Evaluation. 2017;26(4):358–360.
16. Bertocchi G, Gambardella A, Jappelli T, Nappi CA, Peracchi F. Comment to: Do they agree? Bibliometric evaluation versus informed peer review in the Italian research assessment exercise. Scientometrics. 2016; p. 349–353.
17. ANVUR. Valutazione della qualità della ricerca 2011-2014. Rapporto finale. ANVUR; 2017. Available from: http://www.anvur.it/rapporto-2016/files/VQR2011-2014_RapportoFinale_2016.pdf.
18. Alfò M, Benedetto S, Malgarini M, Scipione S. On the use of bibliometric information for assessing articles quality: an analysis based on the third Italian research evaluation exercise; 2017. Available from: https://sti2017.paris/wp-content/uploads/2017/11/rp-m-alfo-et-al.pdf.
19. Abramo G, D'Angelo CA. Refrain from adopting the combination of citation and journal metrics to grade publications, as used in the Italian national research assessment exercise (VQR 2011–2014). Scientometrics. 2016;109(3):2053–2065.
20. Abramo G, D'Angelo CA. On tit for tat: Franceschini and Maisano versus ANVUR regarding the Italian research assessment exercise VQR 2011–2014. Journal of Informetrics. 2017;11(3):783–787. https://doi.org/10.1016/j.joi.2017.06.003.
21. Franceschini F, Maisano D. Critical remarks on the Italian research assessment exercise VQR 2011–2014. Journal of Informetrics. 2017;11(2):337–357. http://dx.doi.org/10.1016/j.joi.2017.02.005.
22. Cohen J. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement. 1960;20(1):37–46.
23. Sheskin DJ. Handbook of Parametric and Nonparametric Statistical Procedures. London: Chapman and Hall; 2003.
24. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70(4):213–220. http://dx.doi.org/10.1037/h0026256.
25. Strijbos JW, Martens RL, Prins FJ, Jochems WMG. Content analysis: What are they talking about? Computers & Education. 2006;46(1):29–48. https://doi.org/10.1016/j.compedu.2005.04.002.
26. Uebersax JS. Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin. 1987;101(1):140–146.
27. Fagerland MW, Lydersen S, Laake P. Statistical Analysis of Contingency Tables. Boca Raton: CRC Press; 2017.
28. Berry KJ, Johnston JE, Mielke PW. The Measurement of Association. A Permutation Statistical Approach. Cham: Springer Nature Switzerland; 2018.
29. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.
30. Altman DG. Practical statistics for medical research. Boca Raton: Chapman and Hall/CRC; 1991.
31. Koch GG, Gillings DB. Inference, Design Based vs. Model Based. In: Encyclopedia of Statistical Sciences. John Wiley and Sons; 2006. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/0471667196.ess1235.
32. Casella G, Berger RL. Statistical Inference (2nd ed.). Pacific Grove: Duxbury Press; 2002.
33. Thompson ME. Theory of Sample Surveys. London: Chapman and Hall; 1997.
34. Demnati A, Rao JNK. Linearization variance estimators for survey data (with discussion). Survey Methodology. 2004;30(1):17–34.
35. Quatember A. Pseudo-Populations: A Basic Concept in Statistical Surveys. New York: Springer; 2015.
36. De Raadt A, Warrens MJ, Bosker RJ, Kiers HAL. Kappa Coefficients for Missing Data. Educational and Psychological Measurement. 2019;79(3):558–576.
37. Strijbos JW, Stahl G. Methodological issues in developing a multi-dimensional coding procedure for small-group chat communication. Learning and Instruction. 2007;17(4):394–404. https://doi.org/10.1016/j.learninstruc.2007.03.005.
38. Gwet KL. Handbook of Inter-rater Reliability: the Definitive Guide to Measuring the Extent of Agreement among Multiple Raters. Gaithersburg: Advanced Analytics; 2014.
39. Wolfram Research Inc. Mathematica. 10th ed. Champaign, Illinois: Wolfram Research, Inc.; 2014.
40. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2019. Available from: https://www.r-project.org/.
41. Fleiss JL, Levin B, Myunghee CP. Statistical Methods for Rates and Proportions. Hoboken, New Jersey: John Wiley and Sons; 2003.
42. Johnson NL, Kemp AW, Kotz S. Univariate Discrete Distributions (3rd ed.). New Jersey: John Wiley and Sons; 2005.
43. Lehmann EL, Romano JP. Testing Statistical Hypotheses (3rd ed.). New York: Springer; 2005.
44. Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;90:773–795.
45. Albert J. Bayesian Computation with R (2nd ed.). New York: Springer; 2009.
46. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Hoboken: Wiley; 2020.
47. Wilsdon J, Allen L, Belfiore E, Campbell P, Curry S, Hill S, et al. The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management. HEFCE; 2015. Available from: https://tinyurl.com/yxljbdzm