On the agreement between bibliometrics and peer review: Evidence from the Italian research assessment exercises

This paper analyzes the concordance between bibliometrics and peer review. It draws evidence from the data of two experiments performed by the Italian governmental agency for research evaluation. The experiments were designed to validate the adoption, in the Italian research assessment exercises, of a dual system of evaluation in which some outputs were evaluated by bibliometrics and others by peer review. The two experiments were based on stratified random samples of journal articles. Each article was scored by bibliometrics and by peer review, and the degree of concordance between the two evaluations was then computed. The correct setting of the experiments is defined by developing the design-based estimation of the Cohen's kappa coefficient and some testing procedures for assessing the homogeneity of missing proportions between strata. The results of both experiments show that, for each research area of science, technology, engineering and mathematics, the degree of agreement between bibliometrics and peer review is at most weak at the individual article level. Thus, the outcome of the experiments does not validate the use of the dual system of evaluation in the Italian research assessments. More generally, the very weak concordance indicates that metrics should not replace peer review at the level of the individual article. Hence, the use of the dual system in a research assessment might worsen the quality of information compared with the adoption of peer review only or bibliometrics only.


Introduction
Efficient implementation of a research assessment exercise is a common challenge for policy makers. Even if attention is limited to scientific quality or scientific impact, there is a trade-off between the quality of information produced by a research assessment and its costs. Until now, two models have prevailed [1]: a first model based on peer review, such as the British Research Excellence Framework (REF), and a second model based on bibliometric indicators, such as the Australian Excellence in Research (ERA), for the years preceding 2018. The first model is considered more costly than the second. In the discussion on the pros and cons of the two models, a central topic is the agreement between bibliometrics and peer review. Most of the scholarly work has analyzed the REF by adopting a post-assessment perspective [2]. Indeed, results of the REF at various levels of aggregation are compared with those obtained by using bibliometric indicators. Clear statistical evidence on the concordance of bibliometrics and peer review would represent a very strong argument in favor of the substitution of the latter with the former. Indeed, the claim for such a substitution-based on agreement and lower costs-would likely appear pragmatic and hence more acceptable for academics than the argument based on the juxtaposition of "objective bibliometric data" and "subjective peer reviews" (among others, see e.g. [3]).
However, there are two problems hindering the adoption of the bibliometric model for research assessment. The first is how to handle the scientific fields for which bibliometrics is not easily applicable, namely the social sciences and humanities. The second is how to manage research outputs not covered in bibliographic databases, such as books or articles in national languages. In these cases, no substitution is possible and peer review appears to be the only possible tool for evaluating research outputs.
As a consequence, a third model of research assessment has emerged, in which bibliometrics and peer review are jointly adopted: some research outputs are evaluated by bibliometrics and others by peer review. The evaluations produced by the two techniques are subsequently mixed together for computing synthetic indicators at various levels of aggregation. The Italian governmental agency for research evaluation (ANVUR) extensively applied this model in its research assessment exercises (VQR), and called it the "dual system of evaluation" [4]. In reference to this model, the question of the agreement between bibliometrics and peer review has a constitutive nature. Indeed, a high agreement would ensure that the final results of a research assessment-at each possible level of aggregation-are not biased by the adoption of two different instruments of evaluation. In the simplest scenario, this happens when bibliometrics and peer review produce scores which substantially agree, for instance, when the research outputs evaluated by bibliometrics receive the same score by peer review-except for random errors. In contrast, let us consider a second scenario where the scores produced by bibliometrics and peer review do not agree: for instance, bibliometrics produces scores systematically lower or higher than peer review. In this more complex case, the disagreement might not be a problem only if the two systems of evaluation are distributed homogeneously, e.g. at random, among units of assessment. Even if the concordance is not accurate at the individual article level, the errors may offset each other at an aggregate level [2,5]. In sum, the agreement between bibliometrics and peer review is functional for validating the results of the assessment.
ANVUR tried to validate the use of the dual system of evaluation by implementing two extensive experiments on the agreement between bibliometrics and peer review, one for each national research assessment of the years 2004-2010 (VQR1) and 2011-2014 (VQR2). The two experiments are hereinafter indicated as EXP1 and EXP2, respectively. They consisted of evaluating a random sample of articles by using both bibliometrics and peer review, and, subsequently, of assessing their degree of agreement at the individual publication level. ANVUR presented the results of EXP1 and EXP2 as evidence of a substantial concordance between bibliometrics and peer review. In turn, this agreement would validate the use of the dual system of evaluation and the final results of the research assessments.
Two of the authors of the present paper documented the flaws of EXP1 and contested the interpretation of the data as indicative of a substantial agreement [6-9]. The present paper takes advantage of the recent availability of the raw data of the two experiments in order to deepen the analysis and reach conclusive results on issues that had remained open due to the sole availability of aggregated data. Therefore, this paper aims to replicate the ANVUR analysis in order to draw solid evidence on the concordance between bibliometrics and peer review.
The paper is organized as follows. In Section 2 the literature on the two Italian experiments is framed within the general discussion on the agreement between bibliometrics and peer review. Section 3 presents the structure of EXP1 and EXP2 by recalling the essential features of the Italian research assessment exercises. Section 4 introduces the main research questions on the sampling design and the measures of agreement. Section 5 develops the correct framework for the design-based estimation of the Cohen's kappa coefficient. Section 6 presents the estimates of the Cohen's kappa coefficients for EXP1 and EXP2, comparing the current results with ANVUR's findings. In Section 7, a further problem with missing data in EXP2 is presented and the homogeneity of missing proportions between scientific areas is assessed. Section 8 discusses the results and concludes with some suggestions for research evaluation policy.

A short review of the literature
Most of the literature on the agreement between bibliometrics and peer review considers the British REF. Overviews of this literature are provided by [2,5,10]. It is therefore possible to limit the discussion to a central issue which is functional to the development of this paper. By and large, results on agreement do not converge when different approaches and statistical tools are used. Notably, the analysis conducted by the Higher Education Funding Council for England (HEFCE) in the so-called Metric Tide report "has shown that individual metrics give significantly different outcomes from the REF peer review process, showing that metrics cannot provide a like-for-like replacement for REF peer review" [11]. This analysis was performed at the individual article level by comparing the quality profile attributed by peer reviews with a set of bibliometric indicators for articles submitted to the REF. Traag and Waltman [2] criticized the results of the Metric Tide report by arguing that the individual publication level "is not appropriate in the context of REF". They claimed that the appropriate level is the institutional one, since "the goal of the REF is not to assess the quality of individual publications, but rather to assess 'the quality of research in UK higher education institutions'. Therefore, the question should not be whether the evaluation of individual publications by peer review can be replaced by the evaluation of individual publications by metrics but rather whether the evaluation of institutions by peer review can be replaced by the evaluation of institutions by metric". In a similar vein, Pride and Knoth [5] documented that a high concordance between bibliometric and peer-review indicators for the REF is achieved when the analysis is conducted at the institutional level.
These claims should be framed in a "post-assessment" perspective, where the issue at stake is to verify the coherence between results obtained by applying one evaluative technique or the other at the desired institutional level. In the case of the REF, the coherence to be verified is between the adopted technique, i.e. peer review, and the alternative, i.e. bibliometrics. This viewpoint is very different from that developed in the Italian experiments and considered in this paper. In the present case, the question is whether it is possible to use bibliometrics and peer review interchangeably at the individual article level. To this end, the analysis of the agreement between bibliometrics and peer review at the level of individual publications is fully justified. In turn, Traag and Waltman [2] support the study of the concordance at the individual publication level when the issue is the possibility that bibliometrics replaces peer review at an individual level. In reference to the Metric Tide report, they explicitly wrote that "the analysis at the level of individual publications is very interesting. The low agreement at the level of individual publications supports the idea that metrics should generally not replace peer review in the evaluation of a single individual publication" [2].
As anticipated, ANVUR implemented EXP1 and EXP2 in order to justify the use of a dual system of evaluation in VQR1 and VQR2. As to EXP1, the results were initially published as part of the official report of the research assessment exercise [12]. In the official report the results are synthesized by stating that "there is a more than adequate concordance between evaluation carried out through peer reviews and through bibliometrics. This result fully justifies the choice (. . .) to use both techniques of assessment" [12, Appendix B, pp. 25-26, translation by the authors] (see also [6]). Ancaiani et al. [4] republished the complete results of EXP1, claiming a "fundamental agreement" between bibliometrics and peer review "supporting" the choice of using both techniques in the VQR1. Moreover, they also interpreted the experiment as indicating that "combining evaluations obtained with peer review and bibliometric methods can be considered more reliable than the usual practice of combining two or more different evaluations obtained by various reviewers of the same article".
The specific results obtained in EXP1 for the field of Economics and Statistics were widely disseminated. Bertocchi and coauthors published as many as five identical working papers in which they interpreted the results of EXP1 by claiming that bibliometrics and peer review "are close substitutes" (among others [13]). In the version finally published in a scholarly journal, they concluded that "the agencies that run these evaluations could feel confident about using bibliometric evaluations and interpret the results as highly correlated with what they would obtain if they performed informed peer review" [14].
The results and the interpretation of EXP1 were challenged by two of the authors of the present paper on the basis of published data only, since they were unable to access the raw data, at the time undisclosed by ANVUR (the whole thread of papers, comments and replies includes [6-9, 15, 16]). The first critical appraisal concerned the interpretation of the degree of concordance. Baccini and De Nicolao [6,7] argued that, according to the available statistical guidelines, the degree of concordance between bibliometrics and peer review has to be interpreted as "unacceptable" or "poor" for all the considered research fields. The only exception-confirmed by a statistical meta-analysis of the data-was Economics and Statistics, for which the protocol of the experiment was substantially modified with respect to the other fields. Baccini and De Nicolao [8,9] also raised some questions on the sampling protocol used for EXP1, which are considered in detail in this paper as well.
As for EXP2, the results were published in the official report [17] and presented at a conference [18]. The synthesis of the results apparently confirmed the outcome of EXP1. The results of EXP2, summarized in the conclusion of the report, state that there is a "non-zero correlation" "between peer review evaluation and bibliometric evaluation". The degree of agreement is "modest but significant. Of particular importance is the result that the degree of concordance (class and inter-row) between the bibliometric evaluation and the peer evaluation is always higher than the one existing between the two individual peer reviews" [17, Appendix B, p. 33, translation by the authors]. These results are interpreted as indicating that "the combined use of bibliometric indicators for citations and journal impact may provide a useful proxy for peer review judgements" [18].
As anticipated, this paper aims to draw definitive evidence from the two experiments. This analysis is possible since ANVUR agreed to disclose the anonymized individual data of both EXP1 and EXP2. The mail to the President of ANVUR containing the request is dated March 12th 2019. The decision to disclose the data was communicated by mail dated March 26th 2019. Access to the data was opened on April 9th 2019. It is therefore possible to replicate the results of EXP1 and EXP2, verifying in detail ANVUR's methods and calculations. Replication is only possible at the research area level since, according to a communication dated 16th March 2019, the data for the sub-areas "are no longer available" in the ANVUR archives. For a correct understanding of the research questions, the following section presents a description of EXP1 and EXP2 in the context of the Italian research assessments.

A brief description of the Italian experiments
EXP1 and EXP2 were designed and performed during VQR1 and VQR2, respectively. The Italian research assessment exercises aimed to evaluate research institutions, research areas and fields, both at the national and the institutional level (i.e. universities and departments). Synthetic indicators were obtained by aggregating the scores received by the research outputs submitted by the institutions. All researchers with a permanent position had to submit a fixed number-with few exceptions-of research outputs (3 in VQR1 and 2 in VQR2). VQR1 and VQR2 were organized in 16 research area panels. Research areas were distinguished between "bibliometric areas", i.e. science, technology, engineering and mathematics (namely Mathematics and Informatics (Area 1), Physics (Area 2), Chemistry (Area 3), Earth Sciences (Area 4), Biology (Area 5), Medicine (Area 6), Agricultural and Veterinary Sciences (Area 7), Civil Engineering (Area 8b), Industrial and Information Engineering (Area 9)), and "non-bibliometric areas", i.e. social sciences and humanities (namely Architecture (Area 8a), Antiquities, Philology, Literary studies, Art History (Area 10), History, Philosophy, Pedagogy and Psychology (Areas 11a and 11b), Law (Area 12), Economics and Statistics (Area 13), Political and Social Sciences (Area 14)).
Both research assessments evaluated the submitted research outputs by using a "dual system of evaluation" in which some outputs were evaluated by bibliometric algorithms and others by "Informed Peer Review" (IPR). Informed peer review indicates that reviewers were asked to evaluate a submitted research item by being provided with its complete metadata and, if available, with its bibliometric indicators. Actually, this dual system of evaluation concerned only the bibliometric areas plus Economics and Statistics (Area 13). Indeed, in the non-bibliometric areas, panels evaluated all the submitted research products exclusively by peer review. In the bibliometric areas, instead, while books, book chapters and articles in non-indexed journals were evaluated by IPR, journal articles were evaluated for the most part by applying bibliometric algorithms. VQR1 and VQR2 adopted two different bibliometric algorithms. Both algorithms combined the number of citations received by an article and a journal indicator, e.g. the impact factor. The complete description of the algorithms and their critical appraisal can be found in [6, 19-21]. Both algorithms were built in such a way that, if the two indicators were coherent, they generated a categorical score (B-score) and a corresponding numerical value used for computing aggregate results for institutions. Namely, in VQR1 there were four categories: Excellent (A, score 1), Good (B, score 0.8), Acceptable (C, score 0.5), Limited (D, score 0); in VQR2 there were five categories: Excellent (A, score 1), Elevated (B, score 0.7), Fair (C, score 0.4), Acceptable (D, score 0.1), Limited (E, score 0). If the two bibliometric indicators gave incoherent indications for an article, e.g. a high number of citations and a low impact factor or vice versa, the algorithm classified it as "IR" (Inconclusive Rating) and it was evaluated by IPR.
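The category-and-score scheme just described can be sketched as a small lookup table. The snippet below is a hypothetical simplification: the actual VQR algorithms combined citation counts and journal indicators through more elaborate rules, and the names `classify`, `citation_class` and `journal_class` are ours, not ANVUR's; only the category labels and numeric values are taken from the text.

```python
# Hypothetical sketch of the dual-indicator scoring logic described above.
# The real ANVUR algorithms were more elaborate; the coherence rule here
# (exact match of the two indicator classes) is a deliberate simplification.

VQR1_SCORES = {"A": 1.0, "B": 0.8, "C": 0.5, "D": 0.0}
VQR2_SCORES = {"A": 1.0, "B": 0.7, "C": 0.4, "D": 0.1, "E": 0.0}

def classify(citation_class, journal_class, scores=VQR1_SCORES):
    """Return (category, numeric score), or ("IR", None) when the two
    bibliometric indicators give incoherent indications."""
    if citation_class == journal_class:
        return citation_class, scores[citation_class]
    # Incoherent pair (e.g. high citations, low journal indicator): send to IPR.
    return "IR", None
```

Articles falling into the "IR" branch are exactly those that, in the experiments, received an IPR evaluation instead of a bibliometric score.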
In both VQR1 and VQR2, Area 13 (Economics and Statistics) did not adopt the bibliometric algorithms for evaluating articles. They were replaced by classifications of journals directly developed by the area panel, with the same number of categories as the algorithms. Therefore, all the articles received the score of the journal in which they were published, and no article was classified as IR.
IPR was identically organized in the two research assessments. A publication was assigned to two members of the area panel, who independently chose two anonymous reviewers. The two reviewers performed the IPR of the article by using a predefined format-slightly different between the two research assessments and also between panels in the same assessment. Each referee assigned a final evaluation according to the same final categories adopted for bibliometrics. These final evaluations are conventionally indicated as P1-score and P2-score. Then, the referee reports were received by the two members of the area panel, who formed a so-called "Consensus Group" (CG) for deciding the final score of the article (P-score).
In order to validate the dual system of evaluation, EXP1 and EXP2 considered only the "bibliometric areas" plus Area 13. The two experiments had a similar structure; Figs 1 and 2 report their flowcharts. The rationale of both experiments was very simple: a sample of the journal articles submitted to the research assessment was scored by the two methods of evaluation, i.e. the bibliometric algorithm and IPR. In this case, IPR involved two reviewers, according to the same rules adopted in the research assessment. These raw data were then used for analyzing (i) the agreement between the evaluations obtained through IPR (P-score) and through the bibliometric algorithms (B-score) and (ii) the agreement between the scores decided by the two reviewers (P1-score and P2-score). The agreement between the scores is computed by using the weighted Cohen's kappa coefficient [22], a popular index of inter-rater agreement for nominal categories (see e.g. [23]). A high level of

PLOS ONE
agreement between IPR and bibliometric scores was interpreted as validating the dual method of evaluation. EXP1 and EXP2 differed in the timing of their realization. EXP1 was carried out simultaneously with VQR1. Hence, the reviewers were unaware that they were participating in EXP1: they were unable to distinguish between papers of the EXP1 sample and those they had to evaluate for the research assessment. The only exception was Area 13, where panelists and referees knew that all the journal articles belonged to the EXP1 sample, since all the journal articles for the research assessment were evaluated automatically according to the journal ranking [6]. In contrast, EXP2 started after the conclusion of the activities of the research assessment. Therefore, panelists and reviewers knew that they were participating in EXP2. A second consequence of the different timing was that in EXP1 all the papers of the sample were peer-reviewed, since the successful administrative completion of the research assessment required the evaluation of all submitted articles. On the contrary, in EXP2 some papers did not receive a peer-review evaluation because some reviewers declined to provide one. Therefore, in EXP2 there were missing data in the sample, which were not accounted for by ANVUR when the concordance indexes were computed.

Measures of agreement, sampling and data
The first step of this work consists in replicating ANVUR's computations. This entails adopting the measure of agreement chosen by ANVUR, namely the Cohen's kappa coefficient and its weighted generalization, a commonly adopted measure of agreement between the classifications of two raters [22,24]. Although the Cohen's kappa coefficient has been criticized for some methodological drawbacks (for more details, see [25,26] among others), practitioners often adopt this index in order to assess inter-rater agreement for categorical ratings, while its weighted counterpart is preferred when the categories can be considered ordinal (see e.g. [27, p. 548] and [28, p. 596]). Rough guidelines for interpreting Cohen's kappa values are available; a survey is provided by [6]. The guideline generally adopted is the one by Fagerland et al. [27, p. 550], based on Landis and Koch [29] and slightly modified by Altman [30].
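As a concrete illustration of the index discussed above, the sketch below computes the weighted Cohen's kappa from a c × c contingency table of two ratings, using linear weights as one common choice; the verbal labels are in the spirit of the Landis and Koch guideline mentioned in the text, with commonly reported cut-offs rather than the exact values of [27].

```python
import numpy as np

def weighted_kappa(table, weights="linear"):
    """Weighted Cohen's kappa from a c x c contingency table of two ratings.

    With identity weights (w_lm = 1 iff l = m) this reduces to the
    unweighted Cohen's kappa."""
    t = np.asarray(table, dtype=float)
    c = t.shape[0]
    p = t / t.sum()                          # joint proportions
    rows, cols = p.sum(axis=1), p.sum(axis=0)
    i, j = np.indices((c, c))
    if weights == "linear":
        w = 1.0 - np.abs(i - j) / (c - 1)    # full credit on the diagonal
    else:                                    # any other value: identity weights
        w = (i == j).astype(float)
    p_obs = (w * p).sum()                    # weighted observed agreement
    p_exp = (w * np.outer(rows, cols)).sum() # weighted chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

def landis_koch_label(k):
    """Rough verbal guideline in the spirit of Landis and Koch [29];
    cut-offs are the commonly reported ones, not quoted from [27]."""
    for upper, label in [(0.0, "poor"), (0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if k < upper:
            return label
    return "almost perfect"
```

For instance, a diagonal table (perfect agreement) yields kappa equal to 1, while a table with independent margins yields kappa equal to 0.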
The replication of ANVUR's computations is surely useful, albeit not sufficient for a complete appreciation of the results of the two experiments. Indeed, some research questions should be carefully addressed. For EXP1 and EXP2, ANVUR [12, 17, Appendix B] adopted a stratified random sampling design, where the target population consisted of the journal articles submitted to the two research assessments. The sizes of the article populations in EXP1 and EXP2 are 99,005 and 77,159, respectively. The sample size was about 10% of the population size, i.e. 9,199 and 7,667 articles for EXP1 and EXP2, respectively. The stratified random samples were proportionally allocated with respect to the sizes of the research areas. The sizes of the strata in EXP1 and EXP2 are reported in Tables 1 and 2. Indeed, the Final Report remarks

that: "The sample was stratified according to the distribution of the products among the sub-areas of the various areas" [17, Appendix B, p. 1, our translation]. For EXP1, results were published at the sub-area level, while for EXP2 results were published only at the area level. Moreover, the raw data at the sub-area level are no longer available. A first research question deals with the statistical methodology adopted in the experiments. From this perspective, the two experiments were actually implemented in a design-based framework. Hence, their analysis requires a correct inferential setting in order to obtain the estimates of the considered concordance measures. To this aim, in Section 5 the model-based estimation of the weighted Cohen's kappa coefficient is revised and the design-based estimation of this coefficient is originally developed. On the basis of these theoretical results, it is possible to check whether ANVUR's estimates of agreement are correct. In particular, ANVUR's estimates of the Cohen's kappa coefficients and the corresponding confidence intervals may be compared with the appropriate design-based counterparts.
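The proportional allocation used in the two experiments (n_h = nN_h/N for stratum h) can be sketched as follows; the rounding rule (largest remainders) is our assumption, since the reports do not specify how the stratum sample sizes were rounded.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n proportionally to the stratum
    population sizes N_h, rounding to integers while keeping the total
    equal to n (largest remainder method; rounding rule assumed)."""
    N = sum(stratum_sizes)
    exact = [n * Nh / N for Nh in stratum_sizes]   # exact shares n*N_h/N
    alloc = [int(e) for e in exact]                # round down first
    # distribute the leftover units to the largest fractional remainders
    order = sorted(range(len(exact)),
                   key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in order[: n - sum(alloc)]:
        alloc[h] += 1
    return alloc
```

With stratum sizes proportional to round shares, e.g. 500, 300 and 200 out of 1,000 and n = 100, the allocation is simply 50, 30 and 20.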
ANVUR computed the final results of EXP1 and EXP2 by considering only a sub-sample of articles, not the whole sample. This is illustrated in Figs 1 and 2, where the sizes of the populations, of the samples and of the final sub-samples are reported. Indeed, ANVUR dropped from the computation of the concordance indexes the articles with an inconclusive bibliometric score (IR), i.e. articles that received an IPR evaluation but were excluded from the agreement estimation. For EXP1, the reduction of the sample due to the exclusion of the papers classified as IR was disclosed neither in ANVUR's official reports nor in [4]. Tables 1 and 2 report the sizes of the sub-samples for EXP1 and EXP2, respectively. The exclusion of the IR papers might have boosted the value of the agreement measures, as argued by Baccini and De Nicolao [8]. The conjecture is as follows. ANVUR removed from EXP1 the most problematic articles, those for which the bibliometric algorithm was unable to reach a score. It cannot be excluded that these articles were also particularly difficult to evaluate for peer reviewers. Hence, ANVUR calculated the agreement indicators on sub-samples of articles that were "more favorable" to agreement than the complete samples.
The second research question, therefore, deals with the adoption of concordance measures which take into account the number of IR articles that ANVUR dropped, as well as the number of missing articles. These articles could ideally be considered as belonging to a rating category for which agreement is not required. In such a case, there exist alternative variants of the weighted Cohen's kappa which may suitably manage this issue. Hence, in Section 5 the design-based estimation of these variants of the weighted Cohen's kappa is also developed. In turn, in Section 6 the resulting point estimates and the corresponding confidence intervals are computed for EXP1 and EXP2, respectively.
A third and last question-limited to EXP2-deals with the distribution of missing papers per research area, i.e. the papers for which a peer-review score is unavailable. As previously remarked, Table 2 reports the number of missing papers per area. Drawbacks would arise if the missing articles were distributed in a non-proportional way between the strata, since in this case some research areas would be more represented than others. ANVUR [17] claimed that this was not the case. Thus, in Section 7 a new testing procedure for the homogeneity of missing proportions between strata is developed and applied to the EXP2 data.
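A generic sketch of such a homogeneity check is the classical Pearson chi-square statistic on the strata × (missing/observed) counts; this is a textbook illustration, not the specific procedure developed in Section 7.

```python
def missing_homogeneity_stat(n_missing, n_sampled):
    """Pearson chi-square statistic for the homogeneity of missing
    proportions across L strata (research areas).

    n_missing[h] and n_sampled[h] are the missing count and the sample
    size of stratum h. Under the homogeneity hypothesis the statistic is
    approximately chi-square distributed with L - 1 degrees of freedom."""
    M, n = sum(n_missing), sum(n_sampled)
    p_hat = M / n                          # pooled missing proportion
    stat = 0.0
    for m_h, n_h in zip(n_missing, n_sampled):
        # compare observed vs expected counts, missing and non-missing cells
        for obs, exp in ((m_h, n_h * p_hat), (n_h - m_h, n_h * (1 - p_hat))):
            stat += (obs - exp) ** 2 / exp
    return stat, len(n_sampled) - 1        # (statistic, degrees of freedom)
```

For example, two strata of 100 articles each with 10 and 30 missing scores give a statistic of 12.5 on 1 degree of freedom, clearly rejecting homogeneity, while proportional missingness gives a statistic of 0.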
These three questions are addressed by using the raw data of the two ANVUR experiments. The articles in each database have a unique anonymous identifier. For each article, the available variables are: (i) the research area; (ii) the bibliometric score (B); (iii) the score assigned by the first reviewer (P1); (iv) the score assigned by the second reviewer (P2); (v) the synthetic peer-review score (P). Data are available as S1 File (downloadable also from https://doi.org/10.5281/zenodo.3727460).

Design-based estimation of the Cohen's kappa coefficient
As anticipated, both EXP1 and EXP2 adopted the weighted Cohen's kappa coefficient as the measure of agreement. In order to introduce our proposal for the design-based estimation of the Cohen's kappa coefficient, it is instructive first to revise its model-based counterpart (for a general discussion of the two paradigms, see e.g. [31]). In the model-based approach, two potential raters classify items into c categories, which are labeled on the set I = {1, . . ., c} without loss of generality. The couple of evaluations given by the raters for an item is modeled as a bivariate random vector, say (U, V), which takes values on the set I × I. More appropriately, (U, V) should be defined as a random element, since the categories are just labels, indexed by the first c integers for the sake of simplicity. The joint probability function of (U, V) is assumed to be

\[ P(U = l, V = m) = \vartheta_{l,m}, \]

where l, m ∈ I, while \vartheta_{l,m} \ge 0 and \sum_{l=1}^{c} \sum_{m=1}^{c} \vartheta_{l,m} = 1. Hence, the parameter space for the underlying model is actually given by

\[ \Theta = \Big\{ (\vartheta_{1,1}, \ldots, \vartheta_{c,c}) : \vartheta_{l,m} \ge 0, \ \sum_{l=1}^{c} \sum_{m=1}^{c} \vartheta_{l,m} = 1 \Big\}. \]

Moreover, it holds that

\[ \vartheta_{l+} = \sum_{m=1}^{c} \vartheta_{l,m}, \qquad \vartheta_{+l} = \sum_{m=1}^{c} \vartheta_{m,l} \]

are the marginal probability distributions of U and V, respectively. In practice, \vartheta_{l,m} represents the probability that an item is classified into the l-th category according to the first rating and into the m-th category according to the second rating. Similarly, \vartheta_{l+} and \vartheta_{+l} are the probabilities that the item is categorized into the l-th category according to the first rating and the second rating, respectively. Hence, the definition of the weighted Cohen's kappa in the model-based approach is given by

\[ \kappa_{w,M} = \frac{\sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} \vartheta_{l,m} - \sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} \vartheta_{l+} \vartheta_{+m}}{1 - \sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} \vartheta_{l+} \vartheta_{+m}}, \]

where the w_{lm}'s are weights which are suitably chosen in order to account for the magnitude of disagreement (see e.g. [27, p. 551]). In particular, the (usual) unweighted Cohen's kappa coefficient is obtained when w_{lm} = 1 if l = m and w_{lm} = 0 otherwise. In order to estimate the weighted Cohen's kappa under the model-based approach, let us assume that a random sample, say (U_1, V_1), . . ., (U_n, V_n), of n copies of (U, V) is available.
Thus, the maximum-likelihood estimators of the \vartheta_{l,m}'s, the \vartheta_{l+}'s and the \vartheta_{+l}'s are readily seen to be

\[ \hat{\vartheta}_{l,m} = \frac{1}{n} \sum_{i=1}^{n} 1_{\{(l,m)\}}(U_i, V_i), \qquad \hat{\vartheta}_{l+} = \sum_{m=1}^{c} \hat{\vartheta}_{l,m}, \qquad \hat{\vartheta}_{+l} = \sum_{m=1}^{c} \hat{\vartheta}_{m,l}, \]

where 1_B is the usual indicator function of a set B, i.e. 1_B(u) = 1 if u ∈ B and 1_B(u) = 0 otherwise. Thus, on the basis of the invariance property of maximum-likelihood estimation (see e.g. Theorem 7.2.10 by Casella and Berger [32]), the maximum-likelihood estimator of \kappa_{w,M} is provided by

\[ \hat{\kappa}_{w,M} = \frac{\sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} \hat{\vartheta}_{l,m} - \sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} \hat{\vartheta}_{l+} \hat{\vartheta}_{+m}}{1 - \sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} \hat{\vartheta}_{l+} \hat{\vartheta}_{+m}}. \]

Actually, \hat{\kappa}_{w,M} is the weighted Cohen's kappa estimator commonly adopted in practical applications. Finally, it should be remarked that the variance of \hat{\kappa}_{w,M} is usually estimated by means of large-sample approximations [41, p. 610].
Under the design-based approach, there exists a fixed population of N items which are classified into the c categories on the basis of two ratings. Hence, the j-th item of the population is categorized according to the first evaluation, say u_j ∈ I, and the second evaluation, say v_j ∈ I, for j = 1, . . ., N. It should be remarked that in this case the N couples (u_1, v_1), . . ., (u_N, v_N) are fixed and given. Thus, the "population" weighted Cohen's kappa coefficient may be defined as

\[ \kappa_{w} = \frac{\sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} p_{lm} - \sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} p_{l+} p_{+m}}{1 - \sum_{l=1}^{c} \sum_{m=1}^{c} w_{lm} p_{l+} p_{+m}}, \tag{1} \]

where

\[ p_{lm} = \frac{1}{N} \sum_{j=1}^{N} 1_{\{(l,m)\}}(u_j, v_j), \qquad p_{l+} = \sum_{m=1}^{c} p_{lm}, \qquad p_{+l} = \sum_{m=1}^{c} p_{ml}. \]

In this case, p_{lm} is the proportion of items in the population classified into the l-th category according to the first rating and into the m-th category according to the second rating. Similarly, p_{l+} and p_{+l} are the proportions of items categorized into the l-th category according to the first rating and the second rating, respectively. Thus, for estimation purposes, the Cohen's kappa coefficient (1) is conveniently expressed as a smooth function of population totals, i.e. the p_{lm}'s, the p_{l+}'s and the p_{+l}'s. It is worth remarking that (1) is a fixed population quantity under the design-based approach, while its counterpart \kappa_{w,M} under the model-based approach is an unknown quantity depending on the model parameters.
Let us now assume that a sampling design is adopted in order to estimate (1) and let us consider a sample of fixed size n. Moreover, let S denote the set of indexes corresponding to the sampled items, i.e. a subset of size n of the first N integers, and let $\pi_j$ be the first-order inclusion probability of the j-th item. As an example aimed at the subsequent application, let us assume that the population is partitioned into L strata and that $N_h$ is the size of the h-th stratum, with h = 1, . . ., L. Obviously, it holds $N = \sum_{h=1}^{L} N_h$. If a stratified sampling design is considered, the sample is obtained by drawing $n_h$ items in the h-th stratum by means of simple random sampling without replacement, in such a way that $n = \sum_{h=1}^{L} n_h$. Therefore, as is well known, it turns out that $\pi_j = n_h/N_h$ if the j-th item is in the h-th stratum (see e.g. [33]). When a proportional allocation is adopted, it also holds that $n_h = nN_h/N$, and hence it obviously follows $\pi_j = n/N$.
In order to obtain the estimation of (1), it should be noticed that
\[
\hat{p}_{lm} = \frac{1}{N}\sum_{j\in S} \frac{1}{\pi_j}\, 1_{\{l\}}(u_j)\,1_{\{m\}}(v_j), \qquad \hat{p}_{l+} = \sum_{m=1}^{c} \hat{p}_{lm}, \qquad \hat{p}_{+l} = \sum_{m=1}^{c} \hat{p}_{ml}
\]
are unbiased Horvitz-Thompson estimators of the population proportions $p_{lm}$, $p_{l+}$ and $p_{+l}$, respectively. Thus, by bearing in mind the general comments provided by Demnati and Rao [34] on the estimation of a function of population totals, a "plug-in" estimator of (1) is given by
\[
\hat{\kappa}_w = \frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat{p}_{lm} - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat{p}_{l+}\hat{p}_{+m}}{1 - \sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,\hat{p}_{l+}\hat{p}_{+m}}. \qquad (2)
\]
Even if estimator (2) is biased, its bias is negligible, since (1) is a differentiable function of the population totals with non-null derivatives (for more details on such a result, see e.g. [33, p. 106]).
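The plug-in estimator (2) under a stratified design can be sketched as follows (a minimal Python sketch; it assumes that the population size N and the first-order inclusion probabilities are known, and all names are ours):

```python
import numpy as np

def ht_weighted_kappa(u, v, pi, N, c, w):
    """Plug-in estimator of the population weighted kappa: the joint
    proportions p_{lm} are replaced by Horvitz-Thompson estimates."""
    p = np.zeros((c, c))
    for uj, vj, pij in zip(u, v, pi):
        p[uj - 1, vj - 1] += 1.0 / (pij * N)  # HT weight 1/pi_j, scaled by 1/N
    row, col = p.sum(axis=1), p.sum(axis=0)   # estimated marginal proportions
    p_obs = (w * p).sum()
    p_exp = (w * np.outer(row, col)).sum()
    return (p_obs - p_exp) / (1.0 - p_exp)

# Stratified SRSWOR: pi_j = n_h / N_h for every item of stratum h.
# Toy population of N = 100 items in two strata (N_h = 80 and 20),
# with n_h = 4 items sampled from each stratum; ratings agree perfectly.
pi = [4 / 80] * 4 + [4 / 20] * 4
u = [1, 2, 3, 4, 1, 2, 3, 4]
v = [1, 2, 3, 4, 1, 2, 3, 4]
k = ht_weighted_kappa(u, v, pi, N=100, c=4, w=np.eye(4))
print(k)  # ~1.0: perfect agreement in the sample
```

Note that under stratified SRSWOR the HT weights sum exactly to N, so the estimated proportions add up to one and a perfectly concordant sample returns a kappa of (numerically) one.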
As usual, variance estimation is mandatory in order to achieve an evaluation of the accuracy of the estimator. Since (2) is a rather involved function of sample totals, its variance may be conveniently estimated by the linearization method or by the jackknife technique (see e.g. [34] and references therein). Alternatively, a bootstrap approach, based on a pseudo-population method, may be suitably considered (for more details on this topic, see e.g. [35]).
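A pseudo-population bootstrap in this spirit can be sketched as follows (our own simplified implementation: it assumes that each ratio $N_h/n_h$ is close to an integer and, for simplicity, recomputes the sample-proportion version of the kappa estimator on each replicate):

```python
import numpy as np

def sample_kappa(pairs, c, w):
    # weighted kappa computed from the sample proportions
    p = np.zeros((c, c))
    for uj, vj in pairs:
        p[uj - 1, vj - 1] += 1.0 / len(pairs)
    row, col = p.sum(axis=1), p.sum(axis=0)
    p_exp = (w * np.outer(row, col)).sum()
    return ((w * p).sum() - p_exp) / (1.0 - p_exp)

def pseudo_pop_bootstrap_se(pairs, strata, N_h, n_h, c, w, B=500, seed=0):
    """1) replicate each sampled item of stratum h round(N_h/n_h) times to
       build a pseudo-population; 2) redraw B stratified SRSWOR samples of
       the original sizes; 3) return the standard deviation of the
       replicated kappa estimates."""
    rng = np.random.default_rng(seed)
    pseudo = {}
    for h in n_h:
        items = [pairs[j] for j in range(len(pairs)) if strata[j] == h]
        pseudo[h] = items * round(N_h[h] / n_h[h])
    reps = []
    for _ in range(B):
        boot = []
        for h in n_h:
            idx = rng.choice(len(pseudo[h]), size=n_h[h], replace=False)
            boot.extend(pseudo[h][i] for i in idx)
        reps.append(sample_kappa(boot, c, w))
    return float(np.std(reps, ddof=1))

# Toy data: 8 sampled (u, v) couples from two strata of sizes 40 and 20.
pairs = [(1, 1), (2, 2), (3, 3), (1, 2), (4, 4), (3, 4), (2, 1), (4, 3)]
strata = [1, 1, 1, 1, 2, 2, 2, 2]
se = pseudo_pop_bootstrap_se(pairs, strata, {1: 40, 2: 20}, {1: 4, 2: 4},
                             c=4, w=np.eye(4), B=200, seed=1)
print(round(se, 3))
```

A 95% percentile confidence interval is then obtained from the 2.5% and 97.5% empirical quantiles of the replicated estimates.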
It should be remarked that inconclusive ratings occur in EXP1 and EXP2 and, in addition, missing ratings are also present in EXP2. However, even if ANVUR does not explicitly state this issue, its target seems to be the sub-population of items with two reported ratings. Hence, some suitable variants of the Cohen's kappa coefficient have to be considered. In order to deal with an appropriate definition of the population parameter in this setting, the three suggestions provided by De Raadt et al. [36] could be adopted. For the sake of simplicity, let us suppose that inconclusive or missing ratings are classified into the c-th category. A first way to manage the issue consists in deleting all items which are not classified by both raters and applying the weighted Cohen's kappa coefficient to the items with two ratings (see also [37]). After some straightforward algebra, by letting $q = \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} p_{lm}$ and $\tilde{p}_{lm} = p_{lm}/q$, this variant of the population weighted Cohen's kappa coefficient may be written as
\[
\kappa^{(1)}_w = \frac{\sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\tilde{p}_{lm} - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\tilde{p}_{l+}\tilde{p}_{+m}}{1 - \sum_{l=1}^{c-1}\sum_{m=1}^{c-1} w_{lm}\,\tilde{p}_{l+}\tilde{p}_{+m}}, \qquad (3)
\]
where $\tilde{p}_{l+} = \sum_{m=1}^{c-1} \tilde{p}_{lm}$ and $\tilde{p}_{+l} = \sum_{m=1}^{c-1} \tilde{p}_{ml}$; the corresponding plug-in estimator $\hat{\kappa}^{(1)}_w$ (4) is obtained by replacing the $p_{lm}$'s with the Horvitz-Thompson estimates $\hat{p}_{lm}$. The second proposal by De Raadt et al. [36] for a variant of the weighted Cohen's kappa coefficient is based on Gwet's kappa [38]. The population weighted Gwet's kappa may be defined as
\[
\kappa^{(2)}_w = \frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}\,p_{lm} - p_{e,G}}{1 - p_{e,G}}, \qquad p_{e,G} = \frac{\sum_{l=1}^{c}\sum_{m=1}^{c} w_{lm}}{c(c-1)} \sum_{l=1}^{c} \bar{p}_l (1 - \bar{p}_l), \qquad (5)
\]
where $\bar{p}_l = (p_{l+} + p_{+l})/2$, with plug-in estimator $\hat{\kappa}^{(2)}_w$ (6). The third proposal by De Raadt et al. [36] for a variant of (1) stems from assuming null weights for the inconclusive or missing ratings, i.e. by assuming that $w_{lm} = 0$ if $l = c$ or $m = c$. Hence, this variant, denoted by $\kappa^{(3)}_w$ (7), is obviously defined as expression (1) evaluated with this modified weight matrix; its plug-in estimator is $\hat{\kappa}^{(3)}_w$ (8). The previous findings are applied to the data collected in EXP1 and EXP2 in the following section.
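The three population variants can be sketched as follows, given a c × c matrix p of joint proportions whose c-th row and column collect the inconclusive or missing ratings. Note that the Gwet-type chance term below follows the standard weighted (AC2-style) form and may differ in minor details from the exact definition adopted in the paper; all names are ours:

```python
import numpy as np

def kappa_variants(p, w):
    """Population variants of the weighted kappa when the c-th category
    collects inconclusive or missing ratings."""
    c = p.shape[0]
    # (3) deletion: renormalise over the items rated by both raters
    q = p[:c - 1, :c - 1].sum()
    p1 = p[:c - 1, :c - 1] / q
    w1 = w[:c - 1, :c - 1]
    row, col = p1.sum(axis=1), p1.sum(axis=0)
    pe = (w1 * np.outer(row, col)).sum()
    k1 = ((w1 * p1).sum() - pe) / (1.0 - pe)
    # (5) Gwet-type chance correction (weighted AC2-style denominator)
    po = (w * p).sum()
    pbar = (p.sum(axis=1) + p.sum(axis=0)) / 2.0
    pe_g = w.sum() / (c * (c - 1)) * (pbar * (1.0 - pbar)).sum()
    k2 = (po - pe_g) / (1.0 - pe_g)
    # (7) null weights for the c-th category
    w0 = w.copy()
    w0[c - 1, :] = 0.0
    w0[:, c - 1] = 0.0
    row, col = p.sum(axis=1), p.sum(axis=0)
    pe0 = (w0 * np.outer(row, col)).sum()
    k3 = ((w0 * p).sum() - pe0) / (1.0 - pe0)
    return k1, k2, k3

# Example: 20% of the items have an inconclusive/missing rating (3rd category).
p = np.array([[0.30, 0.05, 0.00],
              [0.05, 0.30, 0.00],
              [0.10, 0.10, 0.10]])
print(kappa_variants(p, np.eye(3)))
```

On such data the third variant is the most conservative one, since the discarded items deflate the observed agreement but still contribute to the chance term.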

Cohen's kappa coefficient estimation in the Italian experiments
The theoretical results presented in Section 5 can be applied to the raw data of the two experiments. Therefore, it is possible to compute appropriate estimates of the considered weighted Cohen's kappa coefficients for the agreement (i) between bibliometric (B) and peer-review ratings (P) and (ii) between the ratings of the first referee (P1) and the second referee (P2). The dot plots of the distributions of the ratings are provided as S1-S4 Figs.
Some preliminary considerations are required on the choice of the weights for the computation of Cohen's kappa. Let $W = (w_{lm})$ generally denote the square matrix of order c of the weights. The selection of the weights is completely subjective and the adoption of different sets of weights may obviously modify the concordance level. ANVUR presented results for two sets of weights in EXP1 and EXP2. The first set consisted of the usual linear weights, i.e. $w_{lm} = 1 - |l - m|/(c - 1)$. In such a case, the matrices of linear weights for EXP1 (c = 4) and EXP2 (c = 5) are given, respectively, by
\[
W = \begin{pmatrix}
1 & 2/3 & 1/3 & 0 \\
2/3 & 1 & 2/3 & 1/3 \\
1/3 & 2/3 & 1 & 2/3 \\
0 & 1/3 & 2/3 & 1
\end{pmatrix},
\qquad
W = \begin{pmatrix}
1 & 3/4 & 1/2 & 1/4 & 0 \\
3/4 & 1 & 3/4 & 1/2 & 1/4 \\
1/2 & 3/4 & 1 & 3/4 & 1/2 \\
1/4 & 1/2 & 3/4 & 1 & 3/4 \\
0 & 1/4 & 1/2 & 3/4 & 1
\end{pmatrix}.
\]
The second set of weights was originally developed by ANVUR and named "VQR-weights". The VQR-weights were based on the scores adopted in the research assessments, even if they appear counter-intuitive, since they attribute different weights to the same category distance. For example, in EXP1 a distance of two categories is weighted with 0.5 if it occurs between the first and the third category, while it is weighted with only 0.2 if it occurs between the second and fourth category. In order to reproduce ANVUR's results, only the sets of linear weights and VQR-weights are considered. In addition, for improving readability, the analysis and the comments are limited to the computations based on VQR-weights, while the results for linear weights are available as S1 Table.

At first, the estimation of (3), (5) and (7) is considered for the agreement of the bibliometric and peer-review ratings by means of the estimators (4), (6) and (8). The estimation was carried out for each area and for the global population in both EXP1 and EXP2. Variance estimation was carried out by means of the Horvitz-Thompson based bootstrap, stemming from the use of a pseudo-population, which is described by Quatember [35, p. 16]. The point and interval estimates are reported in Tables 3 and 4. The columns labeled "ANVUR" report the point and interval estimates provided by ANVUR [12,17].
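As a side note, the linear-weight matrices can be generated directly from the formula $w_{lm} = 1 - |l - m|/(c - 1)$ (a minimal sketch; the function name is ours):

```python
import numpy as np

def linear_weights(c):
    """Linear agreement weights: w_{lm} = 1 - |l - m| / (c - 1)."""
    idx = np.arange(c)
    return 1.0 - np.abs(idx[:, None] - idx[None, :]) / (c - 1)

print(linear_weights(4))  # EXP1: 4 categories, off-diagonal steps of 1/3
print(linear_weights(5))  # EXP2: 5 categories, off-diagonal steps of 1/4
```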
Moreover, in Figs 3 and 4 the estimates (4), (6) and (8) and the corresponding confidence intervals at the 95% confidence level are plotted in the "error-bar" style.

PLOS ONE
On the agreement between bibliometrics and peer review: Evidence from Italian research assessment exercises

Actually, the point estimates given by ANVUR correspond to those computed by means of (4). Thus, even if this issue is not explicitly stated in its reports [12,17], ANVUR focused on the sub-population of articles with two reported ratings and considered the estimation of (3). Hence, the Cohen's kappa coefficient assumed by ANVUR does not account for the size of inconclusive ratings in EXP1, and for the size of inconclusive or missing ratings in EXP2. Moreover, the confidence intervals provided by ANVUR, and reported in Tables 3 and 4, are the same as those computed by means of the packages psych (in the case of EXP1) and vcd (in the case of EXP2) of the software R [40]. Unfortunately, these confidence intervals rely on the model-based approximation for large samples described by Fleiss et al. [41, p. 610]. Thus, even if ANVUR has apparently adopted a design-based inference, the variance estimation is carried out in a model-based approach. The columns of Tables 3 and 4 corresponding to $\hat{\kappa}^{(1)}_w$ show the appropriate version of the ANVUR estimates, i.e. the design-based point estimates and the corresponding confidence intervals, which were computed by the bootstrap method. These confidence intervals are generally narrower than those originally computed by ANVUR, consistently with the fact that a stratified sampling design is carried out, rather than a simple random sampling design.
It is also convenient to consider the two alternative definitions of the weighted Cohen's kappa coefficient (5) and (7) and the corresponding estimators (6) and (8). These concordance measures take into account the sizes of the discarded articles, as formally explained in Section 5. From Tables 3 and 4, for both EXP1 and EXP2, the point and interval estimates corresponding to $\hat{\kappa}^{(2)}_w$ are similar to those corresponding to $\hat{\kappa}^{(1)}_w$. In contrast, the point and interval estimates corresponding to $\hat{\kappa}^{(3)}_w$ tend to be systematically smaller than those corresponding to $\hat{\kappa}^{(1)}_w$. Arguably, this outcome should be expected. Indeed, (7) is likely to be more conservative than (3) and (5), since it assigns null weights to IR and missing articles. By considering Fig 3, the first evidence is that Area 13, i.e. Economics and Statistics, is likely to be an outlier. In particular, point and interval estimates are identical when estimated by using (4), (6) or (8), since in Area 13 the use of simple journal ranking, as remarked in Section 3, did not produce IR scores. More importantly, in Area 13 the value of agreement for EXP1 is higher than 0.50 and much higher than the values of all the other areas. Baccini and De Nicolao [6,7] documented that in Area 13 the protocol of the experiment was substantially modified with respect to the other areas, which contributed to boost the concordance between bibliometrics and peer review. In contrast, from Fig 4, Area 13 cannot be considered an outlier in EXP2, even if it shows slightly higher values of agreement with respect to the other areas. Indeed, in EXP2 Area 13 adopted the same protocol as the other areas. Thus, it could be conjectured that the higher agreement was due to the exclusive use of journal ranking for attributing bibliometric scores.
Let us focus on the other areas in EXP1 and EXP2. The confidence intervals corresponding to $\hat{\kappa}^{(1)}_w$ and $\hat{\kappa}^{(2)}_w$ largely overlap. For most of the areas, the upper bound of the confidence intervals corresponding to $\hat{\kappa}^{(3)}_w$ is smaller than the lower bound of the confidence intervals corresponding to $\hat{\kappa}^{(1)}_w$ and $\hat{\kappa}^{(2)}_w$. Therefore, ANVUR's choice of discarding IR and missing articles presumably boosted the agreement between bibliometrics and peer review. Anyway, the upper bounds of the confidence intervals corresponding to $\hat{\kappa}^{(2)}_w$ are generally smaller than 0.40, and those corresponding to $\hat{\kappa}^{(3)}_w$ are generally smaller than 0.30. A baseline for interpreting these values is provided in Table 13.6 by Fagerland et al. [27, p. 550]. According to this guideline, a value of the simple Cohen's kappa less than or equal to 0.20 is considered a "poor" concordance and a value in the interval (0.20, 0.40] a "weak" concordance; values in the intervals (0.40, 0.60] and (0.60, 1.00] are considered as indicating a "moderate" and a "very good" concordance, respectively. However, it should be remarked that these guidelines refer to the simple Cohen's kappa. Hence, the small values of the weighted Cohen's kappa coefficients can be interpreted as indicating a concordance even worse than weak.
Subsequently, the estimation of the Cohen's kappa coefficient for the agreement between the ratings attributed to the articles by the two reviewers, i.e. P1 and P2, is also considered.
Thus, the estimation of (3) is computed for the population of articles, for the sub-population of articles receiving a Definite Bibliometric Rating (DBR) and for the sub-population of articles with an Inconclusive bibliometric Rating (IR). The point and interval estimates are reported in Tables 5 and 6, and displayed in Figs 5 and 6 in "error-bar" style. It should be remarked that, owing to the use of journal ranking, there are no IR articles for Area 13.
In Tables 5 and 6, the column labeled "ANVUR" reports the estimates provided by ANVUR [12,17]. In turn, ANVUR did not explicitly state that it aimed to estimate (3) in the sub-population of articles with a definite bibliometric rating. However, this issue can be inferred from Tables 5 and 6, where, barring specific errors in the ANVUR computation for some areas, the ANVUR point estimates correspond to $\hat{\kappa}^{(1)}_w$ for the sub-population DBR. The confidence intervals provided by ANVUR are the same as those computed by means of the packages psych and vcd of the software R [40]. Thus, in this case also, ANVUR has apparently adopted a design-based inference, even if variance estimation is carried out in a model-based approach. Therefore, in Tables 5 and 6 the column corresponding to $\hat{\kappa}^{(1)}_w$ for the sub-population DBR reports the appropriate version of the ANVUR point and interval estimates.

The point estimate of (3) between the two reviewers for the population of articles, i.e. the column corresponding to $\hat{\kappa}^{(1)}_w$ in Tables 5 and 6, is generally lower than 0.30, with the exception of Area 13. The confidence intervals corresponding to $\hat{\kappa}^{(1)}_w$ for the whole population overlap with the confidence intervals corresponding to $\hat{\kappa}^{(1)}_w$ for the sub-population DBR. From Figs 5 and 6, it is also apparent that $\hat{\kappa}^{(1)}_w$ for the whole population is generally greater than $\hat{\kappa}^{(1)}_w$ for the sub-population IR. This last issue confirms the conjecture by Baccini and De Nicolao [8] that articles for which the bibliometric rating was inconclusive were also the more difficult to evaluate for reviewers, as shown by the smaller degree of agreement for these papers.
For both experiments, ANVUR directly compared the concordance between P1 and P2 with the one between peer review and bibliometrics (see [9, p. 8] for a critique of this comparison): "the degree of concordance among different reviewers is generally lower than that obtained between the aggregate peer review and the bibliometric evaluation: in this sense, combining evaluations obtained with peer review and bibliometric methods can be considered as more reliable than the usual practice of combining two or more different evaluations obtained by various reviewers of the same article" [4]. Actually, they compared the level of agreement between bibliometrics and peer review (i.e. column ANVUR in Table 3) with the agreement of the two referees for the sub-population DBR (more precisely, column ANVUR in Table 5). When the appropriate estimates are considered, i.e. the second column in Table 5, it is apparent that Ancaiani et al.'s statement is no longer true. Hence, their policy suggestion cannot be considered as evidence-based. Actually, Ancaiani et al.'s statement appears true only for Area 13, where the concordance indexes between bibliometrics and peer review are much higher than the corresponding indexes between the two reviewers (see Tables 3 and 5). Also in this case, the exception of Area 13 is probably due to the modification of the protocol of the experiment that boosted the agreement between peer review and bibliometrics.
As to EXP2, the agreement between the two reviewers is similar to the agreement between bibliometrics and peer review, even in Area 13, where the experiment was implemented with a protocol identical to the other areas. These estimates are at odds with ANVUR's conclusions: "It is particularly important the result that the degree of agreement between the bibliometric and the peer evaluation is always higher than the one existing between the two individual peer reviews" [17]. Also in this case, ANVUR's conclusions were based on estimates computed on the sub-population of articles that, as previously remarked, boosted the values of agreement between bibliometrics and peer review.

Testing homogeneity of missing proportions between strata
In the case of EXP2, Section 4 considers the sizes of missing peer ratings as fixed and, accordingly, a design-based approach for the estimation of rating agreement is carried out. However, it could also be interesting to assess the homogeneity of missing proportions in the different areas by assuming a random model for the missing peer ratings, i.e. by considering a model-based approach for missing proportion estimation and testing. In order to provide an appropriate setting in such a case, let us suppose in turn a fixed population of N items partitioned into L strata. Moreover, a stratified sampling design is adopted and the notations introduced in Section 2 are assumed. Hence, each item in the h-th stratum may be missed with probability $\theta_h \in [0, 1]$, independently with respect to the other items. Thus, the size of missing items in the h-th stratum, say $M_h$, is a random variable (r.v.) distributed according to the Binomial law with parameters $N_h$ and $\theta_h$, i.e. the probability function (p.f.) of $M_h$ turns out to be
\[
P(M_h = m) = \binom{N_h}{m} \theta_h^{m} (1 - \theta_h)^{N_h - m}, \qquad m = 0, \dots, N_h.
\]
Let us assume that the r.v. $X_h$ represents the size of missing items of the h-th stratum in the sample. By supposing that the items are missed independently with respect to the sampling design, the distribution of the r.v. $X_h$ given the event $\{M_h = m\}$ is the Hypergeometric law with parameters $n_h$, $m$ and $N_h$, i.e. the corresponding conditioned p.f. is given by
\[
P(X_h = x \mid M_h = m) = \frac{\binom{m}{x}\binom{N_h - m}{n_h - x}}{\binom{N_h}{n_h}}.
\]
Hence, marginally, $X_h$ is distributed according to the Binomial law with parameters $n_h$ and $\theta_h$. Thus, by denoting with $x_h$ the observed value of $X_h$ and by letting $y = \sum_{h=1}^{L} x_h$, the likelihood function under the null hypothesis of homogeneity $H_0 : \theta_1 = \dots = \theta_L = \theta$ is given by
\[
L_0(\theta) \propto \theta^{y} (1 - \theta)^{n - y},
\]
while the likelihood function under the alternative hypothesis is given by
\[
L_1(\theta_1, \dots, \theta_L) \propto \prod_{h=1}^{L} \theta_h^{x_h} (1 - \theta_h)^{n_h - x_h}.
\]
Thus, the likelihood estimator of $\theta$ under the null hypothesis turns out to be $\hat{\theta} = Y/n$, where $Y = \sum_{h=1}^{L} X_h$. In addition, the likelihood estimator of $(\theta_1, \dots, \theta_L)$ under the alternative hypothesis turns out to be $(\hat{\theta}_1, \dots, \hat{\theta}_L)$, where $\hat{\theta}_h = X_h/n_h$. The likelihood-ratio test statistic could be adopted in order to assess the null hypothesis.
However, in the present setting the large-sample results are precluded, since the sample size n is necessarily bounded by N and the data sparsity could reduce the effectiveness of the large-sample approximations. A more productive approach may be based on conditional testing (see e.g. [43, Chapter 10]). First, the $\chi^2$ test statistic is considered, which is asymptotically equivalent in distribution to the likelihood-ratio test statistic and which in this case, after some algebra, reduces to
\[
R = \sum_{h=1}^{L} \frac{(X_h - n_h \hat{\theta})^2}{n_h \hat{\theta}(1 - \hat{\theta})}.
\]
It should be remarked that the r.v. Y is sufficient for $\theta$ under the null hypothesis. Hence, the distribution of the random vector $(X_1, \dots, X_L)$ given the event $\{Y = y\}$ does not depend on $\theta$. Moreover, under the null hypothesis, the distribution of the random vector $(X_1, \dots, X_L)$ given the event $\{Y = y\}$ is the multivariate Hypergeometric law with parameters y and $(n_1, \dots, n_L)$, i.e. the corresponding conditioned p.f. is
\[
P(X_1 = x_1, \dots, X_L = x_L \mid Y = y) = \frac{\prod_{h=1}^{L} \binom{n_h}{x_h}}{\binom{n}{y}},
\]
where $x_h \in \{\max(0, n_h - n + y), \dots, \min(n_h, y)\}$ and $\sum_{h=1}^{L} x_h = y$. Thus, by assuming the conditional approach, an exact test may be carried out. Indeed, if r represents the observed realization of the test statistic R, the corresponding P-value is
\[
P(R \ge r \mid \{Y = y\}) = \sum_{(x_1, \dots, x_L) : R(x_1, \dots, x_L) \ge r} P(X_1 = x_1, \dots, X_L = x_L \mid \{Y = y\}).
\]
Alternatively, under the Bayesian paradigm, the missing probability homogeneity between strata may be specified as the model $M_0$, which assumes that $X_l$ is distributed according to the Binomial law with parameters $n_l$ and $\theta$, for l = 1, . . ., L. In contrast, the model $M_1$ under the general alternative postulates that $X_l$ is distributed according to the Binomial law with parameters $n_l$ and $\theta_l$, for l = 1, . . ., L. By assuming prior distributions in such a way that $\theta$ is elicited as an absolutely-continuous r.v. defined on [0, 1] with probability density function (p.d.f.) $f_\theta$, while $(\theta_1, \dots, \theta_L)$ is elicited as a vector of absolutely-continuous r.v.'s defined on $[0, 1]^L$ with joint p.d.f. $f_{\theta_1, \dots, \theta_L}$, the Bayes factor is given by
\[
B_{1,0} = \frac{\int_{[0,1]^L} \prod_{l=1}^{L} \theta_l^{x_l}(1-\theta_l)^{n_l - x_l}\, f_{\theta_1, \dots, \theta_L}(\theta_1, \dots, \theta_L)\, d\theta_1 \cdots d\theta_L}{\int_{[0,1]} \theta^{y}(1-\theta)^{n - y}\, f_{\theta}(\theta)\, d\theta}.
\]
If conjugate priors are considered, the r.v. $\theta$ is assumed to be distributed according to the Beta law with parameters a and b, while $(\theta_1, \dots, \theta_L)$ is a vector of r.v.'s with independent components, in such a way that each $\theta_l$ is distributed according to the Beta law with parameters $a_l$ and $b_l$. It is worth noting that, in a similar setting, a slightly more general hierarchical model is considered by Kass and Raftery [44] (see also [45, p. 190]). Hence, the Bayes factor reduces to
\[
B_{1,0} = \frac{B(a, b)}{B(y + a, n - y + b)} \prod_{l=1}^{L} \frac{B(x_l + a_l, n_l - x_l + b_l)}{B(a_l, b_l)},
\]
where, as usual, B(a, b) denotes the Euler Beta function with parameters a and b. In the case of non-informative Uniform priors, i.e. when a = b = 1 and $a_l = b_l = 1$ for l = 1, . . ., L, it is apparent that $B_{1,0}$ simplifies to
\[
B_{1,0} = \frac{\prod_{l=1}^{L} B(x_l + 1, n_l - x_l + 1)}{B(y + 1, n - y + 1)}.
\]
The testing procedures developed above are applied to the data of EXP2 by considering the areas as the strata (see Table 2). At first, by assuming the frequentist paradigm, the null hypothesis $H_0$ of missing proportion homogeneity between strata is considered. The null hypothesis $H_0$ can be rejected, since the P-value corresponding to the test statistic R was less than $10^{-16}$. Subsequently, by assuming the Bayesian paradigm and non-informative Uniform priors, the Bayes factor is computed. In turn, the missing proportion homogeneity is not likely, since the Bayes factor of $M_0$ against $M_1$, i.e. $B_{0,1} = 1/B_{1,0}$, was less than $10^{-16}$. Thus, the conclusions are as follows. Actually, the adoption of stratified random sampling in EXP2 was a suitable design choice, since the population of articles has a structural partition into areas. However, missing data occurred in the stratified sample, since some reviewers refused to referee the assigned articles. Even if this issue is disturbing, it would be a minor drawback if the items were proportionally missed with respect to the strata.
Indeed, as shown in Figs 1 and 2, the phenomenon is intrinsic to EXP2, owing to its different implementation with respect to EXP1. Generally, if data were missed at random between strata, the effect on the Cohen's kappa estimator would presumably be weak. For a discussion on missing data in the design-based approach, see e.g. the monograph by Little and Rubin [46]. Unfortunately, on the basis of the previous results, we have assessed that the articles are not proportionally missed between the areas, but are missed according to an unknown random mechanism. As a matter of fact, if the data are missing not at random, corrections are much more difficult and unpredictable biases could arise [46]. As a consequence, the estimates for EXP2 should be considered very carefully, since in some areas the estimated proportion of missing articles is much higher than in the other areas: e.g. Area 6 with a missing rate of 231/1071 ≃ 21.6% and Area 9 with a missing rate of 108/739 ≃ 14.6%. In addition, these different missing rates occur in the largest strata. Actually, the reasons for which reviewers refused to handle the articles, or to provide the score in the required time, are not known, and this issue could introduce a further bias in the results of the assessment.
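The testing procedures above can be sketched as follows (a minimal Python sketch under uniform priors; the exact enumeration of the conditional support is replaced by Monte Carlo sampling from the multivariate Hypergeometric law, and all function names are ours):

```python
import numpy as np
from math import lgamma

def chi2_stat(x, n):
    """Homogeneity statistic R for the missing proportions across strata:
    x_h missing items out of n_h sampled in stratum h."""
    x, n = np.asarray(x, float), np.asarray(n, float)
    theta = x.sum() / n.sum()            # pooled ML estimate under H0
    return float(((x - n * theta) ** 2 / (n * theta * (1.0 - theta))).sum())

def conditional_pvalue(x, n, B=10000, seed=0):
    """Monte Carlo version of the exact conditional P-value: (X_1, ..., X_L)
    is drawn from the multivariate Hypergeometric law given Y = y."""
    rng = np.random.default_rng(seed)
    y, r = int(np.sum(x)), chi2_stat(x, n)
    draws = rng.multivariate_hypergeometric(np.asarray(n, int), y, size=B)
    return float(np.mean([chi2_stat(d, n) >= r for d in draws]))

def log_bayes_factor_10(x, n):
    """log B_{1,0} under uniform Beta(1, 1) priors:
    B_{1,0} = prod_l B(x_l + 1, n_l - x_l + 1) / B(y + 1, n - y + 1)."""
    lbeta = lambda a, b: lgamma(a) + lgamma(b) - lgamma(a + b)
    y, ntot = int(np.sum(x)), int(np.sum(n))
    return sum(lbeta(xl + 1, nl - xl + 1) for xl, nl in zip(x, n)) \
        - lbeta(y + 1, ntot - y + 1)

# Markedly heterogeneous missing counts: small P-value and log B_{1,0} > 0.
print(conditional_pvalue([0, 20], [50, 50]))
print(log_bayes_factor_10([0, 20], [50, 50]))
```

Working on the log scale avoids the numerical underflow that the very large EXP2 counts would otherwise cause in the Beta-function ratios.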

Discussion and conclusion
The Italian governmental agency for research evaluation ANVUR conducted two experiments for assessing the degree of agreement between bibliometrics and peer review. They were based on stratified random samples of articles, which were classified by bibliometrics and by informed peer review. Subsequently, concordance measures were computed between the ratings resulting from the two evaluation techniques. The aim of the two experiments was "to validate the dual system of evaluation" [4] adopted in the research assessments. Indeed, in a nutshell, ANVUR preferentially used bibliometric indicators for evaluating articles in the research assessment exercises. When the bibliometric rating was inconclusive, ANVUR commissioned a pair of reviewers to evaluate the article: for these articles, peer review substituted for bibliometrics. Bibliometric and peer-review ratings were then summed up for computing the aggregate scores of research fields, departments and institutions. The "dual system of evaluation" might have introduced major biases in the results of the research assessments if bibliometrics and peer review generated systematically different scores. A high level of agreement is a necessary condition for the robustness of research assessment results. The two experiments were designed to test the degree of agreement between bibliometrics and peer review at an individual article level.
This paper reconsiders in full the raw data of the two experiments by adopting the same concordance measure, i.e. the weighted Cohen's kappa coefficient, and the same systems of weights used in EXP1 and EXP2. In view of analyzing the experiments in the appropriate inferential setting, the design-based estimation of the Cohen's kappa coefficient and the corresponding confidence interval were developed and adopted for computing the agreement between bibliometrics and peer review in EXP1 and EXP2. Three suggestions are proposed for properly defining the population Cohen's kappa coefficients to be estimated. In one case, the suggested definition represents the suitable version of the coefficient estimated by ANVUR. The other two definitions are advisable for taking into account the sizes of the articles discarded by ANVUR.
As to the agreement between bibliometrics and peer review in EXP1, the point and interval estimates of the considered versions of the weighted Cohen's kappa indicate a concordance degree that can be considered, at most, weak, for the aggregate population and for each scientific area. In EXP2 the degree of agreement between bibliometrics and peer review is generally even lower than in EXP1.
Results for Area 13, i.e. Economics and Statistics, deserve a separate consideration. In EXP1, the Cohen's kappa coefficient was estimated to be 54.17%. According to [6], this anomalously high value was possibly due to the modification of the experiment protocol in this area. Indeed, in EXP2, when an identical protocol was adopted for all the areas, the agreement for Area 13 was only slightly higher than, and still comparable with, that of the other areas.
Two further points have to be considered. First, the lower agreement registered in EXP2 was arguably due to the adopted systems of ratings, which are based on four categories in EXP1 and on five categories in EXP2. Second, the systems of weights developed by ANVUR tended to boost the value of the weighted Cohen's kappa coefficients with respect to other, more usual, systems of weights (see the S1 Table providing the computations for linear weights). Hence, the estimates indicate that the "real" level of concordance between bibliometrics and peer review is likely to be worse than weak in both EXP1 and EXP2.
The two experiments also investigated the agreement between the two reviewers who scored each article of the stratified random sample. For EXP1, the correct version of the estimates for the article population indicates that the agreement between the two reviewers tends to be lower than 0.30. A slightly lower concordance level is obtained for EXP2. In sum, the agreement between pairs of reviewers is weak. In turn, Area 13 represented an exception, with the highest level of agreement in both experiments. As previously remarked, in contrast with the other areas, Area 13 adopted a ranking of journals for bibliometric evaluation. When peer reviewers were asked to evaluate a paper, they knew the ranking of journals. Thus, it is possible to conjecture that this very simple information boosted the agreement between reviewers, since they tended to adopt the ranking of journals as a criterion for evaluating articles.
In sum, the two Italian experiments give concordant evidence that bibliometrics and peer review have a less-than-weak level of agreement at an individual article level. This result is actually consistent with the Metric Tide results [11,47]. Furthermore, they also show that the agreement between two peer reviewers is in turn very weak. If the agreement between reviewers is interpreted as an estimate of "peer review uncertainty" [2], this uncertainty is of the same order of magnitude as the uncertainty generated by the use of bibliometrics and peer review.
As to EXP2, a further problem arose from the presence of missing values originated by the refusal of some peer reviewers to referee articles of the sample. For EXP2, the results cannot be easily extended even to the population of journal articles submitted to the research assessment.
From the evidence presented in this paper, it is possible to draw a couple of research policy considerations. The first deals with the Italian research assessment exercises. The results of the experiments cannot at all be considered as validating the use of the dual method of evaluation adopted by ANVUR. At the current state of knowledge, it cannot be excluded that the use of the dual method introduced uncontrollable major biases in the final results of the assessments. Indeed, bibliometrics and peer review show a weak agreement. In particular, the evidence drawn from data in the official research reports [12,17] shows that peer reviewers' scores were on average lower than bibliometric ones. Unbiased results at an aggregate level would be produced solely if the distribution of articles evaluated by the two methods were homogeneous across the various units of assessment (research field, research area, departments and universities). Official reports show that the distribution was not homogeneous. The proportions per research area of the articles with an inconclusive bibliometric score, and consequently evaluated by peer review, varied from 0.9% to 26.5% in VQR1 (source: [12, Table 3.5]) and from 0.1% to 19.2% in VQR2 (source: [17, Table 3.5]). Therefore, the aggregate results for research fields, departments and universities might be affected by the proportion of research outputs evaluated by the two different techniques: the higher the proportion of research outputs evaluated by peer review, the lower the aggregate score. From publicly available data, it is possible to show that the average score at the research area level has, rather generally, a negative association with the percentage of papers evaluated by peer review. This issue actually holds for VQR1 and VQR2, as shown in the S5-S8 Figs (data available as S2 File). These considerations do not permit excluding that the results of the two Italian research assessments are biased.
As a consequence, their use for policy purposes and funding distribution is questionable.
Generally, the lesson from the two Italian experiments is that the use of a dual method of evaluation in the same research assessment exercise should at least be considered with extreme caution. A low agreement between bibliometrics and peer review at the level of the individual article indicates that metrics should not replace peer review at that level. The use of the dual method for reducing evaluation costs might dramatically worsen the quality of information obtained in a research assessment exercise.