The authors have declared that no competing interests exist.
The author(s) have made the following declarations about their contributions: Conceived and designed the experiments: AEW NS. Performed the experiments: AEW NS. Analyzed the data: AEW NS. Contributed reagents/materials/analysis tools: AEW NS. Wrote the paper: AEW NS.
Because both subjective post-publication review and the number of citations are highly error-prone and biased measures of the merit of scientific papers, journal-based metrics may be a better surrogate.
The assessment of scientific publications is an integral part of the scientific process. Here we investigate three methods of assessing the merit of a scientific paper: subjective post-publication peer review, the number of citations gained by a paper, and the impact factor of the journal in which the article was published. We investigate these methods using two datasets in which subjective post-publication assessments of scientific publications have been made by experts. We find moderate, but statistically significant, correlations between the scores of two assessors rating the same paper, and between assessor score and the number of citations a paper accrues. However, we show that assessor score depends strongly on the journal in which the paper is published, and that assessors tend to over-rate papers published in journals with high impact factors. If we control for this bias, we find that the correlation between assessor scores, and between assessor score and the number of citations, is weak, suggesting that scientists have little ability to judge either the intrinsic merit of a paper or its likely impact. We also show that the number of citations a paper receives is an extremely error-prone measure of scientific merit. Finally, we argue that the impact factor is likely to be a poor measure of merit, since it depends on subjective assessment. We conclude that all three measures of scientific merit considered here are poor; in particular, subjective assessments are an error-prone, biased, and expensive method by which to assess merit. We argue that the impact factor may be the most satisfactory of the methods we have considered, since it is a form of pre-publication review. However, we emphasise that it is likely to be a very error-prone measure of merit that is qualitative, not quantitative.
Subjective assessments of the merit and likely impact of scientific publications are routinely made by scientists during their own research, and as part of promotion, appointment, and government committees. Using two large datasets in which scientists have made qualitative assessments of scientific merit, we show that scientists are poor at judging scientific merit and the likely impact of a paper, and that their judgment is strongly influenced by the journal in which the paper is published. We also demonstrate that the number of citations a paper accumulates is a poor measure of merit, and we argue that, although it is likely to be poor, the impact factor of the journal in which a paper is published may be the best measure of scientific merit currently available.
How should we assess the merit of a scientific publication? Is the judgment of a well-informed scientist better than the impact factor (IF) of the journal the paper is published in, or the number of citations that a paper receives? These are important questions that have a bearing upon both individual careers and university departments. They are also critical to governments. Several countries, including the United Kingdom, Canada, and Australia, attempt to assess the merit of the research being produced by scientists and universities and then allocate funds according to performance. In the United Kingdom, this process was known until recently as the Research Assessment Exercise (RAE); it has since been replaced by the Research Excellence Framework (REF).
In a recent attempt to investigate how good scientists are at assessing the merit and impact of a scientific paper, Allen et al. had papers arising from Wellcome Trust funded research scored independently by pairs of expert assessors; these assessments form one of the two datasets we analyse here.
Subsequently, Wardle used ratings from the F1000 database to ask whether such subjective expert assessments predict the number of citations a paper goes on to receive, and found only a weak relationship.
The RAE and similar procedures are time-consuming and expensive. The last RAE, conducted in 2008, cost the British government £12 million to perform.
Here we investigate three methods of assessing the merit of a scientific publication: subjective post-publication peer review, the number of citations a paper accrues, and the IF. We do not attempt to define merit rigorously; it is simply whatever qualities in a paper lead a scientist to rate it highly, which likely depends largely upon the paper's perceived importance. We also largely restrict our analysis to the assessment of merit rather than impact; for example, as we show below, the number of citations, which is a measure of impact, is a very poor measure of the underlying merit of the science, because the accumulation of citations is highly stochastic. We have considered the IF, rather than other measures of journal impact, of which there are many, because it is the most widely used.
To investigate methods of assessing scientific merit we used two datasets in which experts had made subjective assessments of published papers: one from the Wellcome Trust (WT) and one from the Faculty of 1000 (F1000) database.
If scientists are good at assessing the merit of a scientific publication, and they agree on what merit is, then there should be a good level of agreement between assessors. Indeed, assessors gave the same score in 47% and 50% of cases in the WT and F1000 datasets, respectively (see the tables below), although, as the expected values in those tables show, much of this agreement is expected by chance alone.
| | Second assessor: 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| First assessor: 1 | 60 (42) | 97 | 13 | 0 |
| 2 | 104 | 229 (222) | 76 | 1 |
| 3 | 12 | 59 | 42 (23) | 8 |
| 4 | 0 | 3 | 6 | 6 (0.3) |
The table gives the number of papers rated 1 to 4 for the WT data. Figures in parentheses are the numbers expected by chance alone. Note that the ordering of assessors is of no consequence in the WT data, since the assessments were performed simultaneously and independently.
| | Second assessor: Recommended | Must Read | Exceptional |
| --- | --- | --- | --- |
| First assessor: Recommended | 365 (295) | 197 | 39 |
| Must Read | 240 | 255 (223) | 76 |
| Exceptional | 46 | 66 | 44 (19) |
The table gives the number of papers rated recommended, must read, or exceptional for F1000 papers where both assessments were made within 12 months of each other. Figures in parentheses are the numbers expected by chance alone. Note that the second assessor scored the paper after the first and may have known the score the first assessor gave.
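The chance expectations in parentheses follow directly from the marginal totals of each table. A minimal sketch using the WT table above (the computation is standard; the variable names are ours):

```python
import numpy as np

# Observed WT contingency table from above (rows: first assessor,
# columns: second assessor, scores 1-4).
obs = np.array([[60, 97, 13, 0],
                [104, 229, 76, 1],
                [12, 59, 42, 8],
                [0, 3, 6, 6]])

n = obs.sum()
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)

# Counts expected if the two assessors scored papers independently.
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected.diagonal(), 1))  # ~[42, 222, 23, 0.3]

# Observed agreement vs agreement expected by chance alone.
print(obs.trace() / n)       # ~0.47, the 47% quoted above
print(expected.trace() / n)  # ~0.40
```

Note that the chance agreement for the WT data is about 40%, so the observed 47% is only modestly better than chance.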
Strikingly, as Allen et al. noted in their original analysis of the WT data, assessor score depends strongly on the journal in which the paper is published, with papers in high-IF journals tending to receive higher scores.
We can attempt to quantify the relative influence of IF and merit on assessor score by assuming that the number of citations is a measure of merit and then regressing assessor score against IF and the number of citations simultaneously; in essence this procedure asks how strong the relationship is between assessor score and IF when the number of citations is held constant, and between assessor score and the number of citations when IF is held constant. This analysis shows that assessor score is more strongly dependent upon the IF than the number of citations, as judged by the standardized regression gradients (WT: IF, bs = 0.39, citations, bs = 0.16; F1000: IF, bs = 0.30, citations, bs = 0.12). The analysis underestimates the effect of the IF, because the number of citations is itself affected by the IF of the journal in which the paper was published, so part of the influence of the IF on assessor score is credited to the citation term.
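A minimal sketch of how such standardized gradients can be computed; the column names and the simulated data-generating model are ours, purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
merit = rng.normal(size=500)  # hypothetical latent merit
df = pd.DataFrame({
    "impact_factor": merit + rng.normal(size=500),
    "citations": merit + rng.normal(size=500),
    "score": merit + rng.normal(size=500),
})

# z-score all variables so the fitted gradients are standardized (b_s)
# and directly comparable between predictors.
z = (df - df.mean()) / df.std(ddof=0)
fit = sm.OLS(z["score"],
             sm.add_constant(z[["impact_factor", "citations"]])).fit()
print(fit.params[["impact_factor", "citations"]])
```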
The strength of the relationship between assessor score and the IF can be further illustrated by considering papers, in the largest of our datasets, that have similar numbers of citations to each other: those distributed around the mean of the F1000 dataset, with between 90 and 110 citations (see the figure summary below).
[Figure: mean assessor score for F1000 papers with between 90 and 110 citations, grouped by the IF of the journal. The numbers of papers in each category are 131, 194, and 128 for IF<10, 10<IF<20, and IF>20, respectively.]
If we remove the influence of IF upon assessor score, the correlations between assessor scores drop below 0.2 (partial correlations between assessor scores controlling for IF: WT, r = 0.15; the F1000 value is similarly below 0.2).
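Such partial correlations can be computed by the standard residual method; a minimal sketch (the function and variable names are ours):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z
    from both: regress each on z, then correlate the residuals."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    res_x = x - np.polyval(np.polyfit(z, x, 1), z)
    res_y = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(res_x, res_y)[0, 1]

# e.g. partial_corr(score_assessor1, score_assessor2, impact_factor)
```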
| Journal | n | Correlation between assessor scores | n | Correlation between assessor score and the number of citations |
| --- | --- | --- | --- | --- |
| Cell | 114 | 0.23 | 203 | 0.11 |
| Current Biology | 28 | −0.16 | 103 | 0.23 |
| Development | 22 | −0.18 | 100 | −0.089 |
| Journal of Biological Chemistry | 14 | 0.44 | 219 | 0.15 |
| Journal of Cell Biology | 29 | −0.022 | 103 | 0.22 |
| Journal of Neuroscience | 12 | −0.063 | 133 | −0.057 |
| Journal of the American Chemical Society | 22 | 0.42 | 126 | 0.043 |
| Molecular Cell | 32 | −0.049 | 121 | 0.15 |
| Nature | 217 | 0.15 | 375 | 0.20 |
| Neuron | 34 | 0.24 | 116 | 0.13 |
| PNAS | 115 | 0.32 | 531 | 0.093 |
| Science | 199 | 0.019 | 355 | 0.15 |
| Average | | 0.11 | | 0.11 |

p<0.001.
We can quantify the performance of assessors as follows. Let us consider an additive model in which the score given by an assessor depends upon the merit of the paper plus some error. Under this model the correlation between assessor scores is expected to be $r = \sigma_m^2/(\sigma_m^2 + \sigma_a^2)$, where $\sigma_m^2$ is the variance in merit among papers and $\sigma_a^2$ is the error variance associated with a single assessment. The weak correlations we observe therefore imply that the error variance is large relative to the variance in merit.
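For completeness, the expected correlation follows directly from the additive model; a short derivation, with notation as in the Methods:

```latex
% Additive model: an assessor's score is the paper's merit plus an
% independent error, s_i = m + e_i with Var(e_i) = \sigma_a^2.
\[
  \operatorname{Cov}(s_1, s_2) = \operatorname{Var}(m) = \sigma_m^2,
  \qquad
  \operatorname{Var}(s_i) = \sigma_m^2 + \sigma_a^2 ,
\]
\[
  r = \frac{\operatorname{Cov}(s_1, s_2)}
           {\sqrt{\operatorname{Var}(s_1)\operatorname{Var}(s_2)}}
    = \frac{\sigma_m^2}{\sigma_m^2 + \sigma_a^2}
  \quad\Longrightarrow\quad
  \frac{\sigma_a^2}{\sigma_m^2} = \frac{1 - r}{r}.
\]
```

For example, an observed correlation of r = 0.2 implies an error variance four times the variance in merit.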
Overall it seems that subjective assessments of science are poor; they do not correlate strongly with each other, and they appear to be strongly influenced by the journal in which the paper was published, with papers in high-ranking journals being afforded higher scores than their intrinsic merit warrants.
Scientists appear to be poor at assessing the intrinsic merit of a publication, but are they better at predicting the future impact of a scientific paper? There are many means by which impact might be assessed; here we consider the simplest of these, the number of citations a paper has received. As with the correlation between assessor scores, the correlation between assessor score and the number of citations a paper has accumulated is modest (e.g., WT, r = 0.38).
Part of the correlation between assessor scores and the number of citations may be due to the fact that assessors rate papers in high-IF journals more highly and that papers in high-IF journals accrue more citations; consistent with this, the within-journal correlations between assessor score and the number of citations are much weaker, averaging just 0.11 (see the table above).
An alternative to the subjective assessment of scientific merit is the use of bibliometric measures such as the IF of the journal in which the paper is published or the number of citations the paper receives. The number of citations a paper accumulates is likely to be subject to random fluctuation: two papers of similar merit will not accrue the same number of citations even if they are published in similar journals. We can infer the relative error variance associated with this process as follows. Let us assume that the number of citations within a journal is due to the intrinsic merit of the paper plus some error. The correlation between assessor score and the number of citations is therefore expected to be $r = \sigma_m^2/\sqrt{(\sigma_m^2 + \sigma_a^2)(\sigma_m^2 + \sigma_c^2)}$, where $\sigma_c^2$ is the error variance associated with the accumulation of citations.
If we assume that assessor scores and the number of citations are unaffected by the IF of the journal, then we estimate the error variance associated with citations to be approximately 1.5 times the variance in merit (WT data).
The IF might potentially be a better measure of merit than either a post-publication assessment or the number of citations, since several individuals are typically involved in a decision to publish, so the error variance associated with their combined assessment should be lower than that associated with the number of citations; although such benefits can be partially undermined by having a single individual determine whether a manuscript should be reviewed, or by rejecting manuscripts if one review is unsupportive. Unfortunately, it seems likely that the IF will also be subject to considerable error. If we combine the judgments of, say, two referees and an editor under the additive model used above, the error variance of their mean assessment is still roughly $\sigma_a^2/3$, which, given our estimates of $\sigma_a^2$, remains comparable to the variance in merit itself.
Our results have some important implications for the assessment of science. We have shown that scientists are poor at estimating the merit of a scientific publication; their assessments are error prone and biased by the journal in which the paper is published. In addition, subjective assessments are expensive and time-consuming. Scientists are also poor at predicting the future impact of a paper, as measured by the number of citations a paper accumulates. This appears to be due to two factors: scientists are not good at assessing merit, and the accumulation of citations is a highly stochastic process, such that two papers of similar merit can accumulate very different numbers of citations just by chance.
The IF and the number of citations are also likely to be poor measures of merit, though they may be better measures of impact. The number of citations is a poor measure of merit for two reasons. First, the accumulation of citations is a highly stochastic process, so the number of citations is only poorly correlated with merit. It has previously been suggested that the error variance associated with the accumulation of citations is small, based on the strong correlation between the number of citations in successive years; however, a strong correlation between successive years is expected even if accumulation is highly stochastic, because the same chance effects persist from year to year. Second, the number of citations a paper accrues depends on the journal in which it is published, independently of the paper's merit.
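A minimal simulation, of our own construction rather than the authors' analysis, shows one way a strong successive-year correlation can arise with no merit signal at all, when citation counts are cumulative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_papers = 5000

# Deliberately extreme assumption (ours, for illustration): every paper
# has identical merit, so yearly citation counts are pure noise.
year1 = rng.poisson(5.0, size=n_papers)
year2 = rng.poisson(5.0, size=n_papers)

# Cumulative counts in successive years still correlate strongly,
# simply because the year-2 total contains the year-1 total.
cumulative_y1 = year1
cumulative_y2 = year1 + year2
print(np.corrcoef(cumulative_y1, cumulative_y2)[0, 1])  # ~0.71
```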
The IF is likely to be poor because it is based on subjective assessment, although it does have the benefit of being a pre-publication assessment, and hence not influenced by the journal in which the paper has been published. In fact, given that the scientific community has already made an assessment of a paper's merit in deciding where it should be published, it seems odd to suggest that we could do better with post-publication assessment. Post-publication assessment cannot hope to be better than pre-publication assessment unless more individuals are involved in making the assessment, and even then it seems difficult to avoid the bias in favour of papers published in high-ranking journals that seems to pervade our assessments. However, the correlation between merit and IF is likely to be far from perfect. In fact, the available evidence suggests there is little correlation between merit and IF, at least amongst low-IF journals. The IF depends upon two factors: the merit of the papers being published by the journal, and the effect that the journal has on the number of citations for a given level of merit. In the most extensive analysis of its kind, Lariviere and Gingras compared pairs of near-identical papers published in different journals and found that the copy appearing in the higher-IF journal accumulated substantially more citations, demonstrating that the journal itself, and not just the merit of the paper, affects the number of citations.
The IF has a number of additional benefits over subjective post-publication review and the number of citations as measures of merit. First, it is transparent. Second, it removes the difficult task of determining which papers should be submitted to an assessment exercise such as the RAE or REF: is it better to submit a paper in a high-IF journal, a paper that has been highly cited even though it appears in a low-IF journal, or a paper that the submitter believes is their best work? Third, it is relatively cheap to implement. And fourth, it is an instantaneous measure of merit.
The use of the IF as a measure of merit is unpopular with many scientists, a dissatisfaction that has recently found its voice in the San Francisco Declaration on Research Assessment (DORA).
It has been argued that the IF is a poor measure of merit because the variation in the number of citations accumulated by papers published in the same journal is large. However, this argument implicitly assumes that the number of citations is a good measure of merit; as we have shown, much of the within-journal variation in citations is likely to be stochastic rather than a reflection of variation in merit.
The REF will be performed in the United Kingdom in 2014. The assessment of publications forms the largest component of this exercise. This will be done by subjective post-publication review, with citation information being provided to some panels. However, as we have shown, both subjective review and the number of citations are very error-prone measures of merit, so it seems likely that these assessments will also be extremely error prone, particularly given the volume of assessments that need to be made. For example, sub-panel 14 in the 2008 version of the RAE assessed ∼9,000 research outputs, each of which was assessed by two members of a 19-person panel; each panel member therefore assessed an average of just under 1,000 papers within a few months. We have also shown that assessors tend to overrate science in high-IF journals, and although the REF instructs panels not to use journal impact factors in their assessments, it seems unlikely that assessors will be able to avoid this bias.
In our research we have not been able to address another potential problem for a process such as the REF. It seems very likely that assessors will differ in their mean score—some assessors will tend to give higher scores than other assessors. This could potentially affect the overall score for a department, particularly if the department is small and its outputs scored by relatively few assessors.
The REF actually represents an unrivalled opportunity to investigate the assessment of scientific research and to assess the quality of the data produced by such an exercise. We would therefore encourage the REF to have all components of every submission assessed by two independent assessors and then investigate how strongly these are correlated and whether some assessors score more generously than others. Only then can we determine how reliable the data are.
In summary, we have shown that none of the measures of scientific merit that we have investigated is reliable. In particular, subjective peer review is error prone, biased, and expensive; we must therefore question whether using peer review in exercises such as the RAE and the REF is worth the huge amount of resources spent on it. Ultimately, the only way to obtain a (largely) unbiased estimate of merit is to have pre-publication assessment, by several independent assessors, of manuscripts devoid of authors' names and addresses. Nevertheless, this will be a noisy estimate of merit unless we are prepared to engage many reviewers for each paper.
We compiled subjective assessments from two sources. The largest of these datasets was from the F1000 database, in which Faculty members rate papers as "recommended", "must read", or "exceptional". The second was from the Wellcome Trust (WT), in which papers arising from WT-funded research were scored on a four-point scale by two assessors working simultaneously and independently.
Because most journals are poorly represented in each dataset, we estimated the within- and between-journal variance in the number of citations as follows: we rounded the IF to the nearest integer, grouped journals according to this integer value, and then performed ANOVA on those groups for which we had ten or more publications.
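A sketch of how this grouping and ANOVA might be set up, assuming a data frame with illustrative columns `impact_factor` and `citations` (the authors' exact implementation is not specified):

```python
import pandas as pd
from scipy import stats

def citation_anova(df):
    """Round each paper's journal IF to the nearest integer, pool papers
    whose journals share that integer value, and run a one-way ANOVA on
    citation counts across all groups containing ten or more papers."""
    df = df.assign(if_class=df["impact_factor"].round().astype(int))
    groups = [g["citations"].to_numpy()
              for _, g in df.groupby("if_class") if len(g) >= 10]
    return stats.f_oneway(*groups)
```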
The error variance in assessment, relative to the variance in merit, can be estimated as follows. Let us assume that the score ($s$) an assessor gives a paper depends linearly on the paper's merit ($m$) plus some error ($e$) with variance $\sigma_a^2$, i.e., $s = m + e$. The correlation between the scores of two independent assessors is then expected to be $r = \sigma_m^2/(\sigma_m^2 + \sigma_a^2)$, where $\sigma_m^2$ is the variance in merit, so $\sigma_a^2/\sigma_m^2 = (1 - r)/r$.
If we similarly assume that the number of citations a paper accumulates depends linearly on the merit plus some error with variance $\sigma_c^2$, then the correlation between assessor score and the number of citations is expected to be $r = \sigma_m^2/\sqrt{(\sigma_m^2 + \sigma_a^2)(\sigma_m^2 + \sigma_c^2)}$; combined with the correlation between two assessors, this allows $\sigma_c^2/\sigma_m^2$ to be estimated.
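The two expressions above can be inverted to recover the variance ratios from observed correlations; a minimal sketch (the function names and the numeric inputs are ours, for illustration only, not the estimates reported here):

```python
def assessor_error_ratio(r_aa):
    """sigma_a^2 / sigma_m^2 implied by the correlation r_aa between two
    assessors, using r_aa = sigma_m^2 / (sigma_m^2 + sigma_a^2)."""
    return (1.0 - r_aa) / r_aa

def citation_error_ratio(r_aa, r_ac):
    """sigma_c^2 / sigma_m^2 implied by the assessor-assessor correlation
    r_aa and the score-citation correlation r_ac, using
    r_ac = sigma_m^2 / sqrt((sigma_m^2+sigma_a^2)(sigma_m^2+sigma_c^2))."""
    k_a = assessor_error_ratio(r_aa)
    # r_ac^2 = 1 / ((1 + k_a) * (1 + k_c))  =>  solve for k_c
    return 1.0 / (r_ac ** 2 * (1.0 + k_a)) - 1.0

# Placeholder correlations, for illustration only.
print(assessor_error_ratio(0.3))       # ~2.3
print(citation_error_ratio(0.3, 0.3))  # ~2.3
```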
We are very grateful to the Faculty of 1000 and Wellcome Trust for giving us permission to use their data. We are also grateful to Liz Allen, John Brookfield, Juan Pablo Couso, Stephen Curry, and Kevin Dolby for helpful discussion.
Abbreviations: F1000, Faculty of 1000; IF, impact factor; RAE, Research Assessment Exercise; REF, Research Excellence Framework; WT, Wellcome Trust.