Conceived and designed the experiments: DK NL CK. Performed the experiments: DK NL CK. Analyzed the data: DK NL CK. Wrote the paper: DK NL CK.
The authors have declared that no competing interests exist.
The Working Group on Peer Review of the Advisory Committee to the Director of NIH has recommended that at least 4 reviewers should be used to assess each grant application. A sample size analysis of the number of reviewers needed to evaluate grant applications reveals that a substantially larger number of evaluators are required to provide the level of precision that is currently mandated. NIH should adjust their peer review system to account for the number of reviewers needed to provide adequate precision in their evaluations.
On February 21, 2008 the recommendations of the Working Group on Peer Review of the Advisory Committee to the Director of the National Institutes of Health (NIH) were posted on the internet
Thus, the Advisory Committee has left the actual number of reviewers to evaluate each grant application ambiguous. No guidelines were provided to determine the number of reviewers that would be needed. Consequently, we have conducted a statistical analysis to provide guidance in arriving at appropriate numbers. Our analysis shows an inherent statistical inconsistency in the NIH peer review recommendations concerning the number of reviewers. We also demonstrate how crucial this number is and how it influences the precision of the eventual score.
For each grant proposal, reviewers from the relevant scientific community are asked to report their evaluations on a predefined scale. The average grade obtained through this process is considered a valid estimate of the “true” value of the proposal.
The survey sample size is a crucial parameter in determining whether we can rely on these mean estimates. Elementary sampling techniques give us the minimum number of respondents that are needed for the evaluation procedure to deliver reliable estimates:
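The equation itself did not survive in this text. A standard sample-size relation consistent with the figures quoted in the text (four reviewers and σ = 1 giving a precision of one unit) is sketched below. It assumes a 95% confidence level with the normal quantile z rounded from 1.96 to 2, and takes L to be the half-width of the confidence interval around the mean score. Both choices are assumptions rather than the authors' stated definitions.

```latex
n \;=\; \frac{z^{2}\,\sigma^{2}}{L^{2}} \;\approx\; \frac{4\,\sigma^{2}}{L^{2}}
\qquad (z \approx 2 \text{ for } 95\% \text{ confidence})
```

Here n is the number of reviewers, σ is the standard deviation of individual scores, and L is the desired precision of the mean score.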
There are two important implications of this equation. First, the inverse relationship between n and L indicates that more reviewers are needed to obtain a more fine-grained or precise evaluation. Moreover, this relation is nonlinear, so that each gain in precision requires a disproportionately larger number of reviewers.
Second, the standard deviation σ of a population is typically not observed and needs to be estimated. Since the data necessary to estimate σ for the review of biomedical research proposals have not been collected in a statistically robust sampling system, we have relied on a model system of peer review in which short movie proposals were reviewed on a scale from 1 to 5 by undergraduate students [Lacetera, Kaplan, Kaplan, submitted]. We used short movie proposals in order to increase the potential sample size, since all undergraduate students could be considered expert enough to grade the proposals. In this study 10 proposals were scored by an average of 48 reviewers. The standard deviations of the scores averaged approximately 1.0 across the proposals, and the standard deviation of these values was itself considerably less than 0.1. Therefore, we estimate σ to be equal to 1. Obviously, a more accurate estimate of the standard deviation can eventually be obtained for each form of application requested by NIH, although it should be clear that a large number of independent evaluators is required to make any estimate of σ reliable.
Using equation (1), we can assess the effect of having 4 reviewers for each proposal. With four reviewers and a standard deviation of 1, the review would be expected to distinguish applications at the level of the unit interval:
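The arithmetic can be checked with a short Python sketch. It assumes the standard relation L = zσ/√n with z rounded to 2 for roughly 95% confidence. That form is an assumption chosen because it reproduces the unit precision stated above, and the function names are illustrative rather than taken from the paper.

```python
import math

Z_95 = 2.0  # assumed: z-value for ~95% confidence, rounded from about 1.96


def precision_achieved(sigma, n, z=Z_95):
    """Half-width of the confidence interval for a mean of n scores."""
    return z * sigma / math.sqrt(n)


def reviewers_needed(sigma, precision, z=Z_95):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the precision."""
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round((z * sigma / precision) ** 2, 6))


# Four reviewers with sigma = 1 resolve scores only to about one unit.
print(precision_achieved(1.0, 4))

# Reviewers required for the finer precisions discussed in the text.
for target in (0.1, 0.01):
    print(target, reviewers_needed(1.0, target))
```

Under these assumptions, a precision of 0.1 calls for 400 reviewers and a precision of 0.01 calls for 40,000, compared with the 4 that yield unit precision.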
Yet, in the evaluation of grant proposals NIH currently uses a 41-grade scale with a range of scores from 1.0 to 5.0
The gap between the precision needed to allocate funds fairly and the number of reviewers required to achieve that precision exposes a major inconsistency underlying NIH peer review. With only four reviewers evaluating each application, an allocation system that requires a precision level in the range of 0.01 to 0.1 is not statistically meaningful and consequently not reliable. Moreover, the 4 reviewers NIH proposes are not independent, which further degrades the precision that could otherwise be obtained.
Consequently, NIH faces a major challenge. On the one hand, a fine-grained evaluation is mandated by their review process. On the other hand, for such a criterion to be consistent and meaningful, an unrealistically large number of mutually independent evaluators would need to be involved for each and every proposal.
Further insights can be derived from the analysis of expression (1). The value of σ is a measure of the underlying variability in the ratings. The minimum number of reviewers for any given degree of ratings precision decreases with decreasing standard deviations. The standard deviation across ratings is also an indicator of the degree of agreement among different reviewers. If the standard deviation is small, for instance equal to 0.01 instead of our previous working estimate of 1.0, there is essentially consensus among the referees. If σ = 0.01, then the following relation holds:
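The relation is missing from this text. Assuming the standard form L = zσ/√n with z rounded to 2 for roughly 95% confidence (an assumption, chosen because it reproduces the unit precision quoted for σ = 1 and four reviewers), it would plausibly read:

```latex
L \;=\; \frac{z\,\sigma}{\sqrt{n}} \;\approx\; \frac{2 \times 0.01}{\sqrt{4}} \;=\; 0.01
```

Under this assumption, four reviewers would suffice for a precision of 0.01 only if the reviewers were in near-total agreement.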
Our estimate of σ is not based on an analysis of biomedical research experts judging research projects close to their area of specialty. Scoring standard deviations for large numbers of experts obtained in a statistically acceptable sampling system have not been collected. Instead, as described above, we have used a model system that has allowed us to readily collect opinion data about proposals with undefined potential. Although we believe our estimate is reasonable, it is informative to visualize how the sample size estimate varies with different values of standard deviation for a level of precision of 0.1 (
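The dependence of sample size on σ can be tabulated with a brief sketch. It assumes n = (zσ/L)² with z rounded to 2 for roughly 95% confidence, and the σ values listed are illustrative choices, not data from the study.

```python
import math


def reviewers_needed(sigma, precision, z=2.0):
    """Smallest n with z * sigma / sqrt(n) <= precision (z = 2 is an assumption)."""
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round((z * sigma / precision) ** 2, 6))


PRECISION = 0.1  # the precision level discussed in the text

# Hypothetical standard deviations spanning strong to weak reviewer agreement.
for sigma in (0.1, 0.25, 0.5, 1.0, 1.5):
    n = reviewers_needed(sigma, PRECISION)
    print(f"sigma = {sigma:<4}  reviewers needed = {n}")
```

Because n scales with σ², halving the standard deviation cuts the required number of reviewers by a factor of four.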
The importance of scoring accuracy ultimately relates to the rank ordering of proposals. In our model system there were 5 movie proposals with mean scores ranging from 3.46 to 3.64. We have analyzed how the rank ordering of these 5 proposals varied as the number of randomly included reviewers increased from 1 to 40 (
The 5 proposals were closely spaced with mean scores of 3.46 to 3.64. Proposals that had the same score were given an averaged rank; the figures changed little by assigning proposals with the same score the highest
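The kind of rank-ordering analysis described here can be sketched as follows. The scores are synthetic (drawn at random, not the study's data), the sketch assumes every reviewer scored every proposal, and the function names are hypothetical. Tied proposals receive an averaged rank, as in the text.

```python
import random
import statistics


def average_ranks(means):
    """Rank 1 = highest mean; tied proposals share the average of their ranks."""
    order = sorted(range(len(means)), key=lambda i: -means[i])
    ranks = [0.0] * len(means)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next proposal has the same mean.
        while end + 1 < len(order) and means[order[end + 1]] == means[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the group
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_trajectories(scores, max_reviewers, rng):
    """Ranks of all proposals as reviewers are added one at a time in random order.

    scores[p][r] is the score reviewer r gave proposal p.
    """
    reviewer_order = list(range(len(scores[0])))
    rng.shuffle(reviewer_order)
    trajectories = []
    for n in range(1, max_reviewers + 1):
        included = reviewer_order[:n]
        means = [statistics.mean([s[r] for r in included]) for s in scores]
        trajectories.append(average_ranks(means))
    return trajectories


# Synthetic data: 5 closely matched proposals, 48 reviewers, scores from 1 to 5.
rng = random.Random(0)
scores = [rng.choices([1, 2, 3, 4, 5], weights=[1, 3, 8, 9, 4], k=48)
          for _ in range(5)]

trajectories = rank_trajectories(scores, 40, rng)
for n in (1, 4, 10, 40):
    print(n, trajectories[n - 1])
```

Printing the ranks at a few reviewer counts shows whether the ordering of closely scored proposals has stabilized. With real data, repeating the shuffle many times would show how much of the instability is due to the particular reviewers drawn.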
It is clear from our analysis that NIH needs to adjust their peer review system to account for low precision evaluations. Additionally, it would be valuable to determine the standard deviations of scores given by independent reviewers. This information could be used to obtain more appropriate estimates of σ and consequently would be invaluable in designing and implementing a statistically rational system of social choice for NIH.
Our data demonstrate that funding decisions will vary widely with the number of reviewers in considering proposals that are closely scored. Making choices between applications that vary by less than 1 will require larger numbers of reviewers than NIH has been contemplating. Recognition of the statistical inconsistencies of NIH peer review will allow for the implementation of new policies that take into consideration the accepted relationship between the number of reviewers, the precision of scoring needed, and the standard deviation of the scores given.
The Working Group also recommended shortening the length of the application, although no specific suggestions were included [2]. Obviously, the length of the application affects the number of reviewers that could feasibly be used for scoring: more reviewers can be used for shorter applications.
It is commonly accepted that NIH will not fund clinical trials that do not include a cogent sample size determination. It is ironic that NIH insists on this analysis for clinical studies but has not recognized its value in evaluating its own system of peer review. We posit that this analysis should be considered in the revisions of NIH scientific review.
The NIH peer review structure has not been based on rigorous application of statistical principles involving sampling