Peer Review Evaluation Process of Marie Curie Actions under EU’s Seventh Framework Programme for Research

We analysed the peer review of grant proposals under Marie Curie Actions, a major EU research funding instrument, which involves two steps: an independent assessment (Individual Evaluation Report, IER) performed remotely by 3 raters, and a consensus opinion reached during a meeting by the same raters (Consensus Report, CR). For 24,897 proposals evaluated from 2007 to 2013, the association between average IER and CR scores was very high across different panels, grant calls and years. Median average deviation (AD) index, used as a measure of inter-rater agreement, was 5.4 points on a 0-100 scale (interquartile range 3.4-8.3), overall, demonstrating a good general agreement among raters. For proposals where one rater disagreed with the other two raters (n=1424; 5.7%), or where all 3 raters disagreed (n=2075; 8.3%), the average IER and CR scores were still highly associated. Disagreement was more frequent for proposals from Economics/Social Sciences and Humanities panels. Greater disagreement was observed for proposals with lower average IER scores. CR scores for proposals with initial disagreement were also significantly lower. Proposals with a large absolute difference between the average IER and CR scores (≥10 points; n=368, 1.5%) generally had lower CR scores. An inter-correlation matrix of individual raters' scores of evaluation criteria of proposals indicated that these scores were, in general, a reflection of raters’ overall scores. Our analysis demonstrated a good internal consistency and general high agreement among raters. Consensus meetings appear to be relevant for particular panels and subsets of proposals with large differences among raters’ scores.


Introduction
Peer review plays a central role in evaluating scientific research, both when it is communicated in scientific journals [1,2] and when it is submitted to granting bodies [3]. While peer review in journals has been widely studied and has emerged as a separate research discipline [4], knowledge about peer review of grant applications is relatively scarce [5,6]. A 2007 Cochrane systematic review of the effectiveness of the grant peer review process found little empirical evidence for the effects of grant peer review [3]. Recent qualitative and quantitative research on grant peer review at institutional, funding agency or national levels identified the variability of expert reviewers' (raters') assessment criteria as an important aspect of the grant review process, either within the same review system [7-17] or between different review systems [18]. Some studies have reported differences between grant reviews in different research disciplines, with greater agreement between raters in humanities compared to other fields [13]. Whereas some studies did not find advantages of panel discussions over individual raters' assessments in improving the reliability of grant peer review [8,14], others demonstrated the practical value of panel meetings for a subset of grant applications [19]. Except for the study by Mutz et al. [13], which used more than 8,000 proposals from the Austrian Science Fund to model inter-rater reliabilities, observational studies of the reliability of grant peer review used much smaller numbers, ranging from about 30 [12] to more than 800 proposals [14,20].
We investigated applications to one of the largest research granting schemes of the European Union, the "Marie Curie Actions" (MCA; the acronym is used here to refer to all Marie Curie-related actions, programmes and fellowships), within the EU's Seventh Framework Programme for Research and Technological Development (FP7). Since 1996, "Marie Curie" has become synonymous with EU funding dedicated to the mobility, capital enhancement and potential improvement of research human resources. Over time, the MCA gained a reputation among the research community, and is today one of the most popular and well-respected EU research funding programmes. Conceived to support training, international mobility and career development of researchers within and beyond Europe, the MCA budget has grown with successive research Framework Programmes and has typically represented 8-10% of their total funding.
In the FP7 context, the MCA budget amounted to 4750 million Euro (for the 2007-2013 period) [21], and comprised a set of different schemes, each targeting a specific audience and addressing a particular aspect of the policy objectives detailed in the 'People' Specific Programme [22]. Among those schemes, the 'Initial Training Networks' (ITN) was the main instrument to fund training of early-stage researchers via research training programmes organised by consortia of research-performing organisations. This scheme represented about 40% of the total MCA budget, and has been the most competitive in terms of success rate (the ratio of funded projects vs. evaluated proposals). The 'Industry-Academia Partnerships and Pathways' (IAPP) was dedicated to the transfer of knowledge between the public and commercial/ private research sectors. The 'Individual Fellowships' offered grants to support internationally mobile individual experienced researchers (typically post-doctoral researchers), and comprised 3 different schemes, of which the Intra-European Fellowships (IEF) addressed mobility within Europe. More detailed information on the scope and objectives of each of these schemes is available in the legal documents setting up FP7 and the 'People' Specific Programme [22]. The MCA are particularly appreciated by the research community for the excellent research training they offer, as well as attractive working conditions, career development and perspectives, and knowledge-exchange opportunities, at all stages of research careers, via cross-border and cross-sector mobility. Despite the diversity of schemes and target audiences, the MCA share some common features across the programme. 
One of them, of fundamental importance, is the centralized process of submission and evaluation of proposals against a pre-defined set of criteria, managed since 2009 by the Research Executive Agency (REA), and leading every year to the award of hundreds of grants and fellowships selected for funding. The number of MCA applications has increased dramatically in recent years, especially for the 'Individual Fellowships'-by far the most popular actions in terms of number of submitted proposals. From 2007 to 2013, the number of 'Individual Fellowships' proposals submitted annually increased threefold, from about 2600 to more than 8100 proposals. During this 7 year period, about 51,000 MCA proposals were submitted in response to more than 60 calls for proposals, accounting for almost a third of all FP7 submitted proposals (about 160,000 proposals have been submitted in the context of nearly 500 calls for proposals).
The MCA submission and evaluation process consists of the following steps: i) preparation, drafting and submission of a complete proposal by the applicant(s) before the deadline stipulated in the official documents of the call for proposals; ii) eligibility checks performed by the REA services to ensure that the submitted proposals comply with the requirements specified in the Work Programme for the call in question; iii) allocation of each proposal to a set of (at least) 3 external expert reviewers (raters), based on the best possible match between the expertise of available raters and the scientific field of the proposal, while also ensuring a fair representation of raters' nationalities and gender balance and checking for conflicts of interest; iv) a remote evaluation phase, in which raters assess the proposals allocated to them against 4 (for ITN and IAPP) or 5 (for IEF) evaluation criteria and draft an Individual Evaluation Report (IER) for each proposal, so that each proposal has (a minimum of) 3 IERs; v) consensus meetings organised in Brussels, in the presence of all raters, where each proposal is discussed by the 3 raters who evaluated it remotely and one of them, acting as rapporteur, prepares a Consensus Report (CR) with the comments and scores agreed by all 3 raters; vi) creation of the Evaluation Summary Report (ESR), which is the final version of the CR sent to applicants. The ESR score always corresponds to the CR score.
The aim of the present study was to use the large set of data available from the MCA grants applications to examine its peer-review evaluation process, in particular the agreement among raters in the different phases of the evaluation workflow.

Data sources
The data for this study consisted of n = 24,897 proposals (n = 74,691 individual evaluation reports, i.e. reviews), representing nearly half of all applications submitted to MCA calls for proposals in FP7. The actions considered for this study were: IEF, as a representative of the 'individual fellowships'; ITN, as a representative of large multi-beneficiary grants; and IAPP, as a representative of schemes with a smaller number of applications. The calls selected were the following: IAPP, from 2007 to 2009 and for 2011 (4 calls); ITN, 2008 and from 2010 to 2012 (4 calls); IEF, from 2007 to 2013 (7 calls). No calls for proposals were organized for IAPP in 2010 or for ITN in 2009. Also, the ITN call in 2007 was organized as a 2-stage submission process and was thus not considered for this study. The calls for proposals for IAPP in 2012 and 2013 and for ITN in 2013 were organized with the support of a different information technology tool, which did not allow data extraction for the current study.
The data analysed in this study corresponds to steps iv) and v) of the MCA submission and evaluation process described in the previous section. All proposals for which IER and CR were recorded, i.e. which have been evaluated by raters, were included in this study, including those where the proposal was evaluated but withdrawn by the applicant or declared ineligible at any later stage (those cases represent less than 0.15% of the total number). Also, for those cases where a fourth rater was assigned to the proposal (less than 0.2% of all proposals), the fourth IER was disregarded.
Proposals were evaluated in different scientific panels, each with a separate pool of raters: Chemistry, Economic and Social Sciences/Humanities, Information Science/Engineering, Environment/Geosciences, Life Sciences, Mathematics, and Physics. In order to process the data, IER and CR scores given for each evaluation criterion, by each of the 3 raters, were extracted and sorted, together with the final score (given on a 0-100 point scale). Each criterion is scored using a scale from 0 (fail) to 5 (excellent). The final score is then calculated using weighting factors for each individual evaluation criteria and then multiplied by 20 to get the 0-100 score. IEF proposals were evaluated against 5 criteria: Science and Technology (S&T) Quality, Training, Researcher, Implementation, and Impact; whereas ITN and IAPP were evaluated against 4 criteria: S&T Quality, Training (ITN)/Transfer of Knowledge (IAPP), Implementation, and Impact. This means that, in an example of an ITN call, if a proposal gets the score of 4.2 (out of max. 5) for criterion 1, which weighs 30%, 4.7 for criterion 2 (20% weight), 3.8 for criterion 3 (30% weight) and 4.4 for criterion 4 (20% weight), then the composite score is calculated as 4.2×0.3+4.7×0.2+3.8×0.3+4.4×0.2 = 4.22; the final score is then 4.22×20 = 84.40 (out of max. 100).
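The scoring rule above can be illustrated with a minimal sketch that reproduces the ITN example from the text. The function name `final_score` and the input checks are illustrative assumptions, not part of the MCA evaluation software; only the weights, criterion scores and the multiplication by 20 come from the description above.

```python
def final_score(criterion_scores, weights):
    """Weighted composite of 0-5 criterion scores, rescaled to 0-100.

    Hypothetical helper: the name and the validation are assumptions
    made for illustration only.
    """
    if len(criterion_scores) != len(weights):
        raise ValueError("one weight is needed per criterion")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    # Composite score on the 0-5 scale.
    composite = sum(s * w for s, w in zip(criterion_scores, weights))
    # Multiply by 20 to obtain the 0-100 final score.
    return composite * 20


# ITN example from the text: four criteria weighted 30/20/30/20%.
itn_scores = [4.2, 4.7, 3.8, 4.4]
itn_weights = [0.30, 0.20, 0.30, 0.20]
print(round(final_score(itn_scores, itn_weights), 2))  # 84.4
```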

Data analysis
The final dataset was complete and had no missing values for individual evaluation scores. To evaluate the inter-rater agreement of the three raters who evaluated an individual proposal, we calculated the Average Deviation (AD) index [23] for each proposal. Although interpreted as a measure of inter-rater agreement [24], the AD index is actually a measure of disagreement: it is the average absolute difference between the scores of individual raters and the average score of all raters. If the AD index is equal to zero, there is perfect agreement between the raters. We decided to use this as the main measure of inter-rater agreement rather than other measures, such as the intraclass correlation (ICC) or the within-group agreement index (r_wg), which are sometimes used in similar situations [13,25], because the AD index does not require the specification of a null distribution and estimates inter-rater disagreement in the units of the original scale, making it easier to understand and interpret; it is therefore considered a more pragmatic measure [23,24]. Moreover, simulation research has singled out the AD index as performing well relative to other inter-rater agreement indices [24-26]. We also calculated intraclass correlation coefficients (ICC), which are often used in assessing raters' agreement [13]; we used the one-way random ICC because different raters rated different proposals.
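As a minimal sketch of the AD index described above, the following computes the mean absolute deviation of the raters' scores from their mean for a single proposal. The function name `ad_index` and the example scores are hypothetical; the analyses in this study were run in SPSS, not with this code.

```python
from statistics import mean


def ad_index(scores):
    """Average Deviation index around the mean for one proposal.

    `scores` holds the raters' final IER scores on the 0-100 scale.
    Hypothetical helper written for illustration only.
    """
    centre = mean(scores)
    # Average absolute distance of each rater's score from the mean score.
    return mean([abs(s - centre) for s in scores])


# Made-up scores: identical ratings give AD = 0 (perfect agreement).
print(ad_index([70, 70, 70]))            # 0
# Made-up scores: deviations of 5, 0 and 5 points give AD = 10/3.
print(round(ad_index([80, 85, 90]), 2))  # 3.33
```

Because the index is expressed in the units of the rating scale, an AD of 3.33 here reads directly as "raters were, on average, about 3 points away from their common mean".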
Categorical data were presented as absolute and relative frequencies, and continuous data as means and standard deviations (for normally distributed data) or medians and interquartile ranges (for non-normally distributed data). Although we analysed the data for the whole population of proposals and did not perform sampling, where statistical analysis may be redundant [27], we compared the datasets from different evaluation panels and calls. One-way ANOVA was used to test differences in CR scores between panels, paired samples t-test was used to test differences between the CR and the average of the 3 Individual Evaluation Reports (AVIER) scores. Pearson's correlation coefficient was used to test bivariate associations (between CR and AVIER, inter-correlations of IER criteria of different raters and between IER and CR scores for separate criteria) and principal component analysis was used to investigate the latent structure of IER criteria ratings. Simple linear regression analyses were used to assess the relationship of CR or AVIER as a dependent and AD index as a predictor variable. Because of the size of the dataset, the level of significance was set at p<0.01 for all statistical tests. All statistical analyses were performed using SPSS 17 for Windows (SPSS Inc., Chicago, IL, USA).

Results
Overall, the mean (± standard deviation, SD) final CR score for all proposals was 79.8±11.0 (0-100 scale), with a bell-shaped distribution skewed to the left, i.e. with most proposals at higher scores (Fig A in S1 File), and with consistently lower scores in the Economic and Social Sciences/Humanities, Information Science/Engineering and Mathematics panels (Table 1). There were no significant differences in CR or AVIER scores over different years (Fig B in S1 File).
We first tested the association between the CR and AVIER scores. Pearson's correlation coefficients for all comparisons were around 0.95 and above, indicating a nearly perfect correlation between the CR and AVIER scores (Table 2). We next examined whether there were any systematic differences between the CR and AVIER scores.
Differences were negligible (about 1 point on 0-100 scale), varied non-systematically in both directions (Table 2), and showed a normal distribution (Fig C in S1 File). Furthermore, 61.4% of all proposals had less than 2 points difference between AVIER and CR scores. The fact that they reached statistical significance was a result of the very high number of proposals and did not reveal any practically meaningful differences between the CR and AVIER scores. The AD index for individual evaluation criteria was also very low, indicating that raters agreed very well in their independent evaluations of the proposals (Table A in S1 File).
We also calculated the AD index for the IER scores. The distribution of AD indices was positively skewed for all proposals, as well as for all subgroups of proposals, showing very good inter-rater agreement for the majority of proposals (Fig 1). Medians and interquartile ranges were therefore used as descriptives (Table 2). Median AD indices showed good general agreement among raters. The overall median AD index was 5.4 points (on a 0-100 scale), and for three quarters of all proposals it was equal to or below 8.3 points. We also calculated one-way random intraclass correlation coefficients for Individual Evaluation Reports (IER) for all proposals (ICC = 0.67, 95%CI = 0.66-0.68) and for separate evaluation panels. Consistent with the AD indices, ICCs indicated good inter-rater agreement (Table A in S1 File).

Table 1. Mean consensus report (CR) scores (±standard deviation, SD) across evaluation panels for all proposals and for proposals with disagreements among raters in their Individual Evaluation Report (IER) score and between Consensus Report (CR) and average Individual Evaluation Report (AVIER) scores.

Table 2. Associations and differences between consensus reports (CR) and average individual evaluation report (AVIER) scores and inter-rater agreement (average deviation index, AD index) for all proposals, and for different actions and panels.

Table 3. Distribution of proposals with disagreements among raters in their Individual Evaluation Report (IER) score and between Consensus Report (CR) and average Individual Evaluation Report (AVIER) scores across evaluation panels.

There were, however, cases where the agreement was lower. Possible disagreements could be due to two typical situations: either one rater scores a proposal in a completely different way than the other two raters, or all three raters differ in their scores. To test these possibilities, we first isolated the proposals where there was high agreement between two out of three raters. We defined the cut-off for this difference in the following way: 1) two raters agreed if the difference between their scores was less than or equal to 5, because 5.4 was the median AD for all proposals; and 2) the third rater disagreed with the other two if the AD index for the proposal was equal to or greater than 10, because this would put it above the third quartile of all the AD indices for IER scores. There were 1424 (5.7%) such cases (Table 3). Pearson's correlation between CR and AVIER scores, although somewhat lower than for the total population of proposals, was still very high (r CR/AVIER = 0.913, p<0.001). The difference between CR and AVIER scores, although statistically significant due to the large sample size, was again extremely small (CR-AVIER ...; Table B in S1 File). Next we isolated the proposals where there was a disagreement between all three raters, defined as the difference between each pair of IER scores being equal to or greater than 10 points, putting them above the third quartile of all AD indices. There were 2075 (8.3%) such cases (Table 3).
Pearson's correlation between CR and AVIER scores was again very high (r CR/AVIER = 0.917, p<0.001). The difference between CR and AVIER scores was not statistically significant (CR-AVIER = 0.2; p = 0.093, paired samples t-test). The findings were similar across evaluation panels, but IAPP and ITN had a consistently greater percentage of proposals with differences among raters when compared to IEF (Table B in S1 File).
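The two disagreement patterns defined above can be sketched as a small classifier. This is one possible reading of the stated cut-offs (a pair of scores within 5 points combined with a proposal AD index of at least 10 for the one-versus-two pattern; every pairwise difference of at least 10 points for the all-three pattern). The function name, the labels and the example scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def classify_disagreement(scores, pair_tol=5.0, ad_cut=10.0, all_cut=10.0):
    """Label one proposal's three IER scores by disagreement pattern.

    Thresholds follow the cut-offs quoted in the text; the labels and
    the order of the checks are illustrative assumptions.
    """
    pair_gaps = [abs(a - b) for a, b in combinations(scores, 2)]
    centre = mean(scores)
    ad = mean([abs(s - centre) for s in scores])
    if min(pair_gaps) >= all_cut:
        # Every pair of raters differs by at least 10 points.
        return "all_three_disagree"
    if min(pair_gaps) <= pair_tol and ad >= ad_cut:
        # Two raters are close, but overall deviation is large.
        return "one_vs_two"
    return "no_flagged_disagreement"


# Made-up score triplets, one per pattern.
print(classify_disagreement([90, 88, 60]))  # one_vs_two
print(classify_disagreement([60, 75, 90]))  # all_three_disagree
print(classify_disagreement([80, 82, 85]))  # no_flagged_disagreement
```

Under this reading the two flagged patterns cannot overlap, since the first requires a pair within 5 points and the second requires every pair to be at least 10 points apart.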
To assess the relationship between AD indices as a measure of inter-rater agreement and the scores of the proposals, we first performed a simple linear regression analysis with CR or AVIER scores as a dependent and AD index as a predictor variable. Both measures were significantly and negatively associated with AD index (AVIER = 86.4-1.1×AD, R² = 0.19, p<0.001 and CR = 86.3-1.0×AD, R² = 0.15, p<0.001; Fig 2), indicating more disagreement for proposals with lower scores. We found statistically significant differences in CR scores depending on the initial agreement (Table 1) and grouped together the proposals where either one rater differed from the other two who agreed or where all of them differed among themselves and compared this subgroup to all other proposals. CR scores were on average 9 points lower in cases where raters initially disagreed (no initial disagreement: M±SD = 81.0±10.1 vs. initial disagreement: M±SD = 72.3±13.0; p<0.001, independent samples t-test).
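To make the reported regression lines concrete, the sketch below evaluates them at the overall median AD index (5.4) and at an AD of 10. The coefficients are taken from the equations above; the function names are hypothetical. Given the modest R² values (0.19 and 0.15), these are average trends rather than predictions for individual proposals.

```python
def predict_avier(ad):
    """Fitted AVIER score from the reported line AVIER = 86.4 - 1.1 x AD."""
    return 86.4 - 1.1 * ad


def predict_cr(ad):
    """Fitted CR score from the reported line CR = 86.3 - 1.0 x AD."""
    return 86.3 - 1.0 * ad


# Evaluate both lines at the median AD index and at the AD >= 10 cut-off.
for ad in (5.4, 10.0):
    print(ad, round(predict_avier(ad), 2), round(predict_cr(ad), 2))
# 5.4 80.46 80.9
# 10.0 75.4 76.3
```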
We also isolated a group of proposals where the absolute difference between the CR and the AVIER scores was equal to or greater than 10 points (Table 3). There were 368 (1.5%) such cases. Positive and negative differences were almost equally distributed (180 or 48.9% positive and 188 or 51.1% negative differences), indicating that, when the CR corrected the initial AVIER score, it was as likely to increase it as to decrease it. However, these proposals had significantly lower CR scores than other proposals (69.3±19.8 vs 79.8±11.0, respectively; p<0.001, independent samples t-test). The findings were similar across scientific panels, but again IAPP and ITN calls had more proposals with such differences when compared to IEF (Table B in S1 File). For all subgroup comparisons (Table 2, Table B in S1 File), the Economic and Social Sciences/Humanities panel had a higher proportion of proposals with disagreements among raters and differences between the CR and AVIER scores.
To investigate whether there was a possible pattern in scoring individual IER criteria (5 for IEF and 4 for ITN or IAPP), we investigated the inter-correlations of these criteria (Table 4 and Table C in S1 File). Our interest was in comparing the correlations of different raters' scores for the same criterion with the correlations of the same rater's scores for different criteria across proposals. If correlations of different raters' scores for the same criterion were high, it would mean that, by those criteria, they assess specific characteristics of the proposal with a similar level of objectivity (i.e. different raters score the same item similarly). If, on the other hand, correlations of the same rater's scores for different criteria were high, it would imply that a rater decides upon a general score for the proposal and uses that score as a reference point to score each individual criterion. We found low correlations between different raters' scores for the same criterion and the same proposal and high correlations between the same rater's scores for different criteria for the same proposal (Table 4), which indeed suggests that raters scored proposals in a more holistic way and, generally, assessed each criterion in relation to the other criteria for the same proposal (Table C in S1 File).
Finally, to take this analysis a step further, we performed a principal components analysis with the evaluation criteria. This procedure is usually done when investigating a latent structure that underlies a set of items (in our case, criteria scored by three raters). The analysis is performed on a set of scores of different items and the resulting solution can reveal a smaller number of components, or factors, that underlie individual scores. In other words, resulting components may be understood as an empirical suggestion as to how the items can be grouped together based on what they actually measure. In our case, if criteria represented measures of specific, i.e. different, aspects of proposals that were objectively measured, we would expect four (ITN and IAPP calls)/five (IEF call) components corresponding to the four/five criteria, with each component including the three raters' scores of the same characteristic or aspect (i.e. criterion) of proposals.
The results from the analysis of the data pointed in another direction, already visible in the inter-correlation matrices (Table 4, S3 Table). We extracted three components, each representing a single rater, which confirmed our previous conclusion that criteria scores reflected the rater's global score rather than specific aspects of the proposal. The three-component solution explained a large portion of variance (73%) and component loadings were very high (all above 0.7).

Discussion
Our analysis of a large dataset of grant proposals submitted to the Marie Curie Actions over the FP7 lifetime demonstrates good internal consistency and overall high agreement among expert reviewers (raters). It is difficult to compare our results directly to other studies that analysed the reliability of individual vs panel grant reviews, mainly because of the differences in the review processes, the number of raters reviewing a single proposal and statistical approaches [8,14]. Furthermore, the MCA evaluation procedure does not include a common panel assessing more than one proposal, but rather operates with groups of raters (almost always composed of 3 experts) who discuss their evaluations in consensus meetings for each individual proposal.
Studies that looked at the value of a larger panel [8,14] showed that panel discussions of proposals contribute to the reduction of disagreement among individual raters but do not contribute to the overall improvement of reliability of the final score. In our study, the agreement between raters was in general very high, both among their independent assessments and when these were compared with the score reached after discussion at the consensus meetings. This finding was consistent over different types of grant schemes and over the years of the MCA. It may at least in part be related to the high quality of the submitted proposals and the high competitiveness of MCA calls, as indicated by generally high scores for the whole study population (distribution of scores shifted to high scores, S1 Fig). Overall, disagreement was greater for proposals with lower scores. Also, when raters initially disagreed in their independent evaluation, this disagreement influenced the final scores, which were generally lower. Our finding of greater disagreement among raters for proposals with lower scores contrasts with the study of Cicchetti [17], who reported greater agreement of reviewers for grants with lower ratings, as well as for rejection recommendations in scientific journals. The differences may be related to distinct peer review processes in grant programmes and in journals. For example, the greater reviewer agreement on manuscript rejection in the Cicchetti study concerned predominantly psychology journals [17], whereas medical journals report either higher agreement for acceptance than for rejection recommendations (84% vs. 31% agreement rate in a general medical journal [28]) or poor agreement on any recommendation [29]. Intraclass correlation coefficients for IERs and individual evaluation criteria were also high, confirming high agreement among raters and complementing the data from the AD index analysis.
More recently, comparisons of average deviation measures against theoretically expected distributions were suggested [24]. We could not directly compare AD indices for IER from our study to published significance criteria for different null distributions because IER scores are on a 0-100 scale and the published theoretical criteria provide cut-off values for scales with 4, 5 or 7 categories. When we transformed IER scores to a 5-point scale and calculated AD indices, the median AD index was 0.27 (Q1-Q3 = 0.16-0.41), which was lower than the cut-off values for all published null distributions [24], indicating statistically significant agreement at the p<0.05 level. The consistency and high agreement of raters' scores, as well as the consistency of an individual rater's scores across individual evaluation criteria, indicate that, at least for some of the proposals, the remote assessments and their average score (AVIER) can provide a reliable final judgment of the proposal. This is an important finding in view of the high competition for MCA funding, particularly for the individual fellowships scheme. In times of financial crisis, national governments rarely increase their budget for research and innovation. Therefore, Horizon 2020, the new EU Framework Programme for Research and Innovation for the period 2014-2020, is perceived by many as the main instrument to fund competitive research at the European level. As a consequence, Horizon 2020 funding schemes may face oversubscription, with a possible increase in the number of applications for grants over the years. This means that the current evaluation process will struggle to be both financially and logistically sustainable, as more submitted proposals imply more experts, more time-consuming review processes, and more logistical constraints (e.g. space and facilities to organize the consensus meetings).
Previous studies support the rationale of using more remote, distance-based approaches as a good alternative for peer review processes, given the environmental impact and cost savings involved when compared to on-site evaluations [30]. The results of our study suggest that it may be safe and appropriate, across the different types of schemes and evaluation panels, to predominantly use a remote-based evaluation process and the average score of individual assessments as the proposal's final score. This is especially true for the individual-driven grant applications. Our study also identifies sets of proposals, constituting about 15% of all proposals, that may need more discussion in order to reach consensus on the final score. These are proposals with a large difference between one rater and the other two, large differences among all three raters, or a large difference between the average of the initial (remote) scores (AVIER) and the final consensus score. Furthermore, IAPP and ITN calls had a greater proportion of proposals with disagreements, suggesting that the evaluation of complex proposals, involving partnerships of several research groups with multidisciplinary and inter-sectorial features, requires a more elaborate review procedure. As the number of proposals submitted to ITN and IAPP is much smaller than to the individual fellowships, a two-step review procedure, as presently in place, with both remote and consensus phases, appears to be justified for these schemes. Consensus meetings and discussion among raters may be especially relevant for the evaluation panels in the fields of Economic Sciences, Social Sciences and Humanities, due to the significantly higher proportion of proposals with disagreement.
Future research should be directed to more qualitative approaches to understanding reviewers' decision-making processes and their reasoning during consensus meetings, as well as differences among research fields. The few qualitative studies of the grant application peer review process have uncovered important aspects of peer review as a social phenomenon involving collaborative decision-making processes, social expectations, professional interests and interpersonal issues [7,11].
Our study was retrospective and we cannot make claims about what influences scoring or agreement among reviewers. Also, the AD index has a limitation as a measure of inter-rater reliability because it represents disagreement in units of a rating scale. Therefore the AD indices obtained in our study may not be comparable to those from other studies where raters used different scales. Although there are suggestions for more refined approaches to estimating inter-rater reliability in cases like our study [31,32], we decided to use the AD index because of its simplicity and straightforward interpretation [23-26]. Despite these limitations, we can be confident in our observations and in the reliability of our results because of the very large sample of data accumulated over 7 years.
In conclusion, this study provides the first comprehensive analysis of the grant review process of one of the major EU research granting programmes. It demonstrates the consistency and reliability of the MCA review process and identifies critical parts that can be simplified and improved without impacting on the high quality and reliability of its outcome. The question remains whether the analysis of the peer review procedure can be used to predict outcome measure of research success, such as publications, citations and/or patents [33]. We recommend continuous quality assurance analyses, such as the one performed in this study, as useful tools to follow and analyse trends of peer review processes and their relation to research outcomes in order to allow adequate and more efficient decisions on evaluation procedures to ensure excellence in research.
Supporting Information
S1 File. Description of proposals and inter-rater agreements. Fig A: Distribution of proposals by their Consensus Report (CR) scores. Fig B: Mean Consensus Report (CR) scores for different years. Fig C: Distribution of proposals by their differences between Consensus Report (CR) and average Individual Evaluation Report (AVIER) scores. Table A: Inter-rater agreement (average deviation index, AD index) for individual evaluation criteria across all evaluation panels. Table B: Distribution of proposals, across panels and type of action, where a) one rater disagrees with the other two raters; b) all raters disagree with each other; c) the difference between the Consensus Report (CR) and average Individual Evaluation Report (AVIER) score is large.