How Criterion Scores Predict the Overall Impact Score and Funding Outcomes for National Institutes of Health Peer-Reviewed Applications.

Understanding the factors associated with successful funding outcomes of research project grant (R01) applications is critical for the biomedical research community. R01 applications are evaluated through the National Institutes of Health (NIH) peer review system, where peer reviewers are asked to evaluate and assign scores to five research criteria when assessing an application's scientific and technical merit. This study examined the relationship of the five research criterion scores to the Overall Impact score and the likelihood of being funded for over 123,700 competing R01 applications for fiscal years 2010 through 2013. The relationships of other application and applicant characteristics, including demographics, to scoring and funding outcomes were studied as well. The analyses showed that the Approach and, to a lesser extent, the Significance criterion scores were the main predictors of an R01 application's Overall Impact score and its likelihood of being funded. Applicants might consider these findings when submitting future R01 applications to NIH.


Introduction
The National Institutes of Health (NIH) is the world's leading biomedical and behavioral research organization and spends about three-quarters of its nearly $30.1 billion budget on extramural grants that support research in universities, medical schools and research institutions [1]. Peer review is the cornerstone of the NIH's extramural research program: applications for research funding are vetted through the peer review process [2]. Over the years, the NIH has made periodic efforts to improve its peer review system to ensure fairness and efficiency in evaluating grant applications. The most recent effort began in June of 2007 [3], and the resulting enhancements to the peer review system were implemented, in phases, beginning in 2009 [4].


Data
Data were obtained from the Information for Management, Planning, Analysis and Coordination II (IMPAC II) system, the database of record for information collected from NIH extramural grant applications, awards and applicants during the receipt, review and award management process. For each application, data were obtained on whether the application was funded, its final Overall Impact score, and its five research criterion scores, which were delinked from the reviewers providing them. The research criterion scores were calculated for each criterion by averaging all of the individual criterion scores available for a particular application. In addition, data were extracted on other characteristics related to the application (such as whether it was a new or renewal application), the applicant (such as applicant demographics and personal NIH funding history) and the applicant's institution (such as the institutional funding history with NIH). All demographic data were self-reported, on a voluntary basis, by the applicants. Data on the Scientific Review Group (SRG) in which each application was reviewed were also obtained. See Table 1 for a full list of variables evaluated for each application. Descriptive summary statistics, as well as correlations between the five criterion scores and the Overall Impact score, were produced.

Notes to Table 1:
a. Overall Impact score averages only include discussed applications.
b. Other application and applicant characteristics evaluated, but not shown here due to space limitations, are: Council round of review, human or animal subject concerns, solicitation type (unsolicited, program announcement or request for application), locus of review (Center for Scientific Review v. other NIH Institutes and Centers), review group type (standing study section v. special emphasis panel), direct costs requested, number of years of support requested, the NIH administering Institute or Center (IC), the geographical region of the institution and the previous NIH funding history of the applicant.
c. A new application is a type 1 application. A type 2 application is a renewal, also known as a competing continuation. A type 3 application can be a competing revision for additional support to expand the scope of study, or a non-competing administrative supplement for additional support to cover increased costs. A type 9 application is a renewal for which the awarding Institute or Center changes.
d. An application submitted for the first time is an A0 application, or initial submission. A previously submitted unfunded A0 application resubmitted for new funding consideration is an A1 application, or first resubmission. A previously unfunded A1 application resubmitted for new funding consideration is an A2 application, or second resubmission. The resubmission policy in place for applications submitted during the study period, FY 2010-FY 2013, can be found at http://grants.nih.gov/grants/guide/notice-files/NOT-OD-09-003.html.
e. A new investigator is a principal investigator who has not previously competed successfully as a principal investigator for a substantial independent research award. A new investigator who is within 10 years of completing his/her terminal research degree, or within 10 years of completing medical residency (or equivalent), is considered an early stage investigator. A principal investigator who is not a new investigator is an experienced investigator. A list of NIH grant activities that do not disqualify a principal investigator from being considered a new investigator can be found at http://grants.nih.gov/grants/new_investigators/index.htm.
f. An application with only one principal investigator (PI) is a single PI application. An application with more than one principal investigator is a multiple PI (MPI) application.
g. An application involving (1) only human subjects for research is a humans only application, (2) only animal subjects for research is an animals only application, (3) both human and animal subjects for research is a humans and animals application, and (4) neither human nor animal subjects for research is a no humans or animals application.
h. An application's rank is based on the rank order of its submitting organization or institution with respect to the total amount of NIH research grant funding received by that organization, compared to all other organizations, over the five-year period prior to the fiscal year of the application. The lower the rank, the higher the previous level of NIH funding.
i. The type of the institution or organization submitting the application.
j. Race of a principal investigator is the racial category self-reported by the principal investigator. Applications whose principal investigator reports more than one race category, or with multiple principal investigators who report different race categories, are included in the 'Other' category.
k. Ethnicity of a principal investigator is the ethnicity selection self-reported by the principal investigator. Applications with multiple principal investigators who report different ethnicities are included in the 'MPI Multiple Ethnicities' category.
l. Gender of a principal investigator is the gender selection self-reported by the principal investigator. Applications with multiple principal investigators who report different genders are included in the 'MPI Multiple Gender' category.
m. Degree represents the highest degree attained by a principal investigator. Applications with multiple principal investigators reporting more than one degree type are included in the 'MPI Multiple Degree Types' category. The 'Other' degree category includes degree types such as veterinary, dental and unknown degrees.
n. Age of a principal investigator is calculated by subtracting the principal investigator's birth year from the application's fiscal year. Applications with multiple principal investigators who fall into different age group categories are included in the 'MPI Multiple Age Groups' category. Those with an erroneous computed age (less than 24 or greater than 90) or a missing birth date are included in the 'Unknown' age category.

Models
Two general models were developed: 1) the Impact model, a linear regression model with the Overall Impact score serving as the dependent variable; and 2) the Funding model, a logistic regression model with the likelihood of being funded serving as the dependent variable. The five research criteria were used as the main predictors in both models, controlling for other application and applicant characteristics delineated in Table 1. Both models controlled for the FY of the application to account for changes in the distribution of Overall Impact scores or funding patterns over time. Hierarchical random effects models, with applications clustered by SRG, were employed to account for possible differences in scoring behavior and funding outcomes between peer review groups. In addition to controlling for the potential clustering of scores by SRG, the use of random effects, by way of intraclass correlations, allowed for the decomposition of the total variation in the models into two categories: within-SRG variation and between-SRG variation [16][17][18]. Three sub-models were developed in a step-wise fashion to assess the marginal contribution of each set of characteristics in both general models. Sub-model A focused on the five research criterion scores, including any significant interactions between them. Sub-model B added the other control variables to sub-model A. Sub-model C was identical to sub-model B, but removed the criterion scores. Sub-model C served to illustrate how the various application and applicant characteristics appeared to be associated with the Impact score and relative odds of funding when the quality of the application, as measured by the criterion scores, was not taken into account.
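The hierarchical setup described above (a random intercept for each SRG, with the intraclass correlation splitting within- from between-SRG variation) can be sketched in a few lines. This is an illustrative Python/statsmodels reconstruction on synthetic data, not the authors' Stata code; all variable names and the data-generating process are assumptions, with coefficients chosen to echo the reported estimates.

```python
# Minimal sketch of a random-intercept "Impact" model with applications
# clustered by SRG, plus the intraclass correlation (ICC). Synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_srg, per_srg = 40, 50
srg = np.repeat(np.arange(n_srg), per_srg)
srg_effect = rng.normal(0, 3, n_srg)[srg]      # between-SRG variation

df = pd.DataFrame({
    "approach":     rng.uniform(1, 9, n_srg * per_srg),
    "significance": rng.uniform(1, 9, n_srg * per_srg),
    "srg":          srg,
})
# Overall Impact driven mainly by Approach, then Significance, plus an
# SRG-level random intercept and residual noise.
df["impact"] = (10 + 7.6 * df["approach"] + 3.4 * df["significance"]
                + srg_effect + rng.normal(0, 8, len(df)))

# Random-intercept (hierarchical) linear model: clusters defined by SRG.
fit = smf.mixedlm("impact ~ approach + significance",
                  data=df, groups=df["srg"]).fit()

# ICC = between-SRG variance / (between-SRG + within-SRG variance).
var_srg = fit.cov_re.iloc[0, 0]
icc = var_srg / (var_srg + fit.scale)
print(fit.params["approach"], icc)
```

A logistic counterpart for the Funding model follows the same clustering idea; statsmodels handles the linear case most directly, which is why the sketch uses the Impact model.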
Because applications designated as not discussed (ND) at SRG meetings are not assigned Overall Impact scores, only the 71,651 applications that were discussed in SRG meetings and assigned Overall Impact scores from FY 2010 to FY 2013 were used to fit the Impact model. ND applications were retained in the Funding model because their funding outcomes were known and data on the five research criterion scores were still available. However, applications precluded from being considered for funding were removed, i.e., those with unresolved human subject or animal concerns and resubmitted applications for which a previous version had been funded. Removing these applications left 111,533 R01-equivalent applications for the Funding model.
Data analyses were performed using Stata 13 (StataCorp). Model estimates and their 95% confidence intervals (CIs) were computed, and the Funding model results were expressed as odds ratios. For ease of interpretation, the signs of the criterion score coefficients in the Funding model were inverted before exponentiation, so that odds ratios greater than unity represent the increase in the odds of funding associated with a one-unit decrease (improvement) in the given criterion score. Results were considered statistically significant at a P-value of less than 0.05, using 2-sided tests.
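The sign-inversion convention above is simple arithmetic: if beta is the raw logistic coefficient per one-point increase in a criterion score (lower scores being better), the reported odds ratio is exp(-beta). The short sketch below also converts such an odds ratio into a probability change at an assumed baseline; the 6.2 figure for Approach comes from the Results, while the 10% baseline funding probability is purely an illustrative assumption.

```python
# How to read the "inverted" odds ratios in the Funding model.
import math

beta_approach = -math.log(6.2)                  # raw coefficient: worse score -> lower odds
or_per_improvement = math.exp(-beta_approach)   # inverted OR per one-point improvement

# Translate the odds ratio into probabilities at an assumed 10% baseline.
base_p = 0.10
base_odds = base_p / (1 - base_p)
new_odds = base_odds * or_per_improvement
new_p = new_odds / (1 + new_odds)
print(round(or_per_improvement, 1), round(new_p, 2))
```

Note that a large odds ratio does not translate linearly into probability: the same factor of 6.2 moves a 10% chance to roughly 40%, not to 62%.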
The NIH Office of Human Subjects Research Protections was consulted and determined this work to be classified as a program evaluation that did not require human subjects research review by an Institutional Review Board.

Results
Fig 1 shows the distributions of the Overall Impact score and the criterion scores as boxplots. The criterion scores for Approach had the greatest variability and the highest (worst) scores, with an interquartile range (IQR) of 2.0 and a median of 4.3. The criterion scores for Significance and Innovation both had IQRs of 1.2 and medians of 3.0. The Investigator(s) and Environment criterion scores were clustered in the low score ranges, with medians of 2.0 and IQRs of 1.0, indicating that most applications received excellent marks for Investigator(s) and Environment.

Table 2 provides the correlations among the five research criterion scores and between each criterion score and the Overall Impact score. All criteria had moderate to high correlations with one another, ranging from 0.55 between Significance and Environment to 0.75 between Investigator(s) and Environment. Environment had the lowest correlation with the Overall Impact score, whereas Approach had the highest (0.44 and 0.84, respectively).

Table 1 shows that average Overall Impact scores and funding rates varied widely with application characteristics. For example, new (type 1) applications had an average Overall Impact score of 37.1 and a funding rate of 14.2%, while renewal (type 2) applications fared better, with an average Overall Impact score of 30.9 and a funding rate of 30.1%. Initial submissions (A0s) had an average Overall Impact score of 38.1 and a funding rate of 11.2%, whereas resubmissions (A1s) had a more favorable average Overall Impact score and funding rate (31.7 and 30.6%, respectively). Applications from Early Stage Investigators (ESIs) had an average Overall Impact score of 38.4 and a 17.6% funding rate, whereas applications from experienced investigators had a better average Overall Impact score and funding rate (33.9 and 18.8%, respectively).
Applications submitted by white principal investigators (PIs) had an average Overall Impact score of 34.8 and a funding rate of 19.0%; in contrast, applications submitted by black PIs had poorer outcomes (average Overall Impact score: 38.1; funding rate: 11.8%). Applications from male PIs had an average Overall Impact score of 35.3 and a funding rate of 17.9%, whereas those from female PIs fared slightly worse (36.2 and 16.4%, respectively).

Fig 2 shows boxplot distributions of the Overall Impact score by IC, with IC names masked. Median scores varied considerably by IC, from 33 to 50.5, and IQRs ranged from 15 to 22. Fig 3 shows the percentage of reviewed applications that were funded by each IC, which ranged widely from 7.1% to 28.9%. The rank orders of the Overall Impact scores and funding rates by IC, shown in Figs 2 and 3, respectively, do not match as might be expected: ICs with better (lower) ranges of Overall Impact scores did not necessarily have higher funding rates. This is due, in part, to differences between ICs in the number of applications received and in available grant funding, and it demonstrates the importance of controlling for IC, particularly in the Funding model.

S1 and S2 Tables are similar to Table 1, except that they show summary statistics for discussed and ND applications, respectively. Comparing the two tables, ND applications had worse (higher) mean scores on all five research criteria than discussed applications, and the Approach criterion had the worst mean score in both groups. Among discussed applications, the Approach criterion was also the most variable, with a higher standard deviation than the other criterion scores, underscoring its importance in predicting the Overall Impact score among discussed applications.
In contrast to discussed applications, which had an overall 29.8% funding rate over the study period, ND applications had almost no chance of being funded (only one ND application was funded in FY 2010-2013).
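The descriptive statistics reported above (medians, IQRs, and score correlations as in Fig 1 and Table 2) can be sketched as follows. The data are synthetic, with a shared "merit" factor inducing the kind of cross-criterion correlation the paper describes; the column names and generating process are assumptions, not NIH data.

```python
# Sketch: medians, interquartile ranges, and pairwise correlations for
# two synthetic criterion scores that share a common underlying factor.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
common = rng.normal(0, 1, n)                         # shared "merit" factor
scores = pd.DataFrame({
    "Approach":     4.3 + 1.5 * common + rng.normal(0, 1.0, n),
    "Significance": 3.0 + 0.9 * common + rng.normal(0, 0.6, n),
})

medians = scores.median()
iqrs = scores.quantile(0.75) - scores.quantile(0.25)  # IQR per criterion
corr = scores.corr()                                  # Pearson correlations
print(medians["Approach"], iqrs["Approach"], corr.loc["Approach", "Significance"])
```

Because the common factor loads more heavily on the synthetic Approach score, it comes out both more variable and more strongly tied to overall merit, mirroring the pattern the paper reports.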
The Impact model and Funding model results are shown in Tables 3 and 4, separated by sub-model. In sub-model A, with the independent variables limited to the criterion scores, all five criteria were highly significant in the Impact model, with coefficients in rank order for Approach, Significance, Innovation, Investigator(s) and Environment estimated at 7.6 (95% CI, 7.5 to 7.7), 3.4 (3.3 to 3.5), 1.4 (1.3 to 1.5), 1.0 (0.9 to 1.0) and -0.2 (-0.3 to -0.1), respectively. That is, a one-point improvement in the Approach score was associated with a 7.6-point improvement in the Overall Impact score, controlling for the other criterion scores. The Funding model results for sub-model A had coefficients in the same rank order, with odds ratio estimates of 6.2 (5.9 to 6.5), 2.1 (2.0 to 2.2), 1.5 (1.4 to 1.6), 1.0 (1.0 to 1.1) and 0.9 (0.8 to 0.9), respectively; e.g., for every one-point improvement in the Approach score, the odds of funding increased by a factor of 6.2. There was a highly significant interaction between Approach and Significance in both the Impact and Funding models. Sub-model A of the Funding model correctly predicted [...] of funded applications and 94.7% of unfunded applications, for an overall correct prediction rate of 89.3%. The intraclass correlation coefficient, which measures the amount of variation accounted for by SRGs, was 4.2% in the Impact model and 17.8% in the Funding model; i.e., an application's criterion scores were much better indicators of its review and funding outcomes than the SRG in which it was reviewed.

In sub-model B, which adds the full set of application and applicant controls to sub-model A, the coefficients of the criterion scores were largely unchanged. For the Funding model, the only major departure from sub-model A was that the Investigator(s) odds ratio increased to 1.4 (1.3 to 1.5), showing that applications with better Investigator(s) criterion scores had better odds of funding once the other application and applicant characteristics were taken into account. Many of the application control factors had statistically significant relationships with the Overall Impact score and the odds of funding. Of note, renewal applications were predicted to have Overall Impact scores 0.7 (-0.8 to -0.6) points lower (better) than otherwise identical new applications, and their odds of funding were predicted to be 1.4 (1.3 to 1.5) times greater. First resubmissions (A1s) were predicted to have Overall Impact scores 1.3 (-1.5 to -1.2) points lower and odds of funding 2.2 (2.1 to 2.3) times greater than otherwise identical initial submissions (A0s). Applications submitted by ESIs were predicted to have Overall Impact scores 1.2 (-1.5 to -0.8) points lower and odds of funding 2.6 (2.2 to 3.1) times greater than otherwise identical applications from experienced investigators. Applications submitted by black PIs had Overall Impact scores 0.6 (0.1 to 1.1) points higher (worse) than applications submitted by white PIs with the same measured characteristics, though there was no statistically significant difference in the odds of funding. Applications submitted by female PIs had slightly better Overall Impact scores (0.2 [-0.3 to -0.1] points lower) than those submitted by male PIs, but the odds of funding were not statistically different, all else equal. See Tables 3 and 4 for the full set of control variables. Sub-model B improved the model fit and predictive accuracy of sub-model A by a very small amount, approximately one percentage point in each case.
Differences among subgroups in the application and applicant control variables increased substantially in sub-model C, which omits the criterion scores from the full model (sub-model B). Renewal applications were predicted to have Overall Impact scores 3.5 (-3.7 to -3.3) points lower and odds of funding 2.2 (2.1 to 2.3) times greater than new ones. First resubmissions were predicted to have Overall Impact scores 5.6 (-5.8 to -5.4) points lower and odds of funding 3.7 (3.6 to 3.8) times greater than initial submissions. In contrast to sub-model B, applications submitted by ESIs were predicted to have Overall Impact scores 1.3 (0.7 to 1.9) points higher (worse) than those from experienced investigators, and their funding advantage was reduced to an odds ratio of 1.5 (1.4 to 1.7). Therefore, the ESI advantage in Overall Impact scores and funding odds was observed only after controlling for the criterion scores. Applications submitted by black PIs and female PIs appeared less likely to be funded, with odds ratios falling to 0.7 (0.6 to 0.8) and 0.9 (0.9 to 0.9), respectively, and becoming statistically significant in the absence of the criterion scores. The amount of variation explained by sub-model C was low (R² = 16.9%), and the overall correct prediction rate was lower, 80.7% (only 9.6% for funded applications and 97.7% for unfunded applications).

Discussion
The Impact and Funding model results demonstrate that the criterion scores are the best predictors of an application's Overall Impact score and its likelihood of receiving funding. The model fit statistics support this observation: the R², or variation explained, and the correct prediction rate improved by only about one percentage point when going from models that included only the criterion scores to those that also included all the other application and applicant control factors. Furthermore, when the criterion scores were removed from the full model, the variation explained and the correct prediction rate fell markedly, while the coefficients of the control variables increased in magnitude and many became statistically significant. Among the criterion scores, there was a clear hierarchy in each criterion's relationship with the Overall Impact score and funding odds. In both the Impact model (which contained only discussed applications) and the Funding model (which contained both discussed and non-discussed applications), the Approach score had the strongest association, with more than double the effect of the next largest predictor, the Significance score. The predictive effect of the Environment score was very small and ran in a counterintuitive direction, with better Environment scores associated with worse Overall Impact scores and funding odds, all else equal. This finding suggests that some applications with poor Overall Impact scores can nonetheless receive strong Environment scores, even after controlling for the other criterion scores. Furthermore, in another set of models (not shown here), in which whether an application was discussed served as the dependent variable, the criterion score coefficients followed the same rank order, with Approach by far the largest predictor of whether an application was discussed.
The criterion scores were moderately to strongly correlated with one another, because highly meritorious applications tended to score well on all five criteria and less meritorious applications tended to score poorly across the board. As in Lindner et al. [15], these relatively high correlations raised concerns about multicollinearity (MC). MC does not cause bias when estimating coefficients in a correctly specified model, but it can increase the variability of the estimates [19]. This concern was mitigated by the large number of applications in the model [20], which reduced the standard errors of the criterion score estimates. The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is increased because of collinearity with the other independent variables. The literature on MC typically points to VIF scores above 4 as potential signs of multicollinearity problems, though this is only a rule of thumb [21]. No criterion score had a VIF above 2.2 in any of the models.
The summary statistics revealed relatively large differences in Overall Impact scores and funding outcomes between applications with different characteristics, such as the gap in funding rates between new and renewal applications. Sub-model C, which controlled for the different application characteristics simultaneously but excluded the criterion scores, still exhibited these large differences. However, the multivariate models that took the application's criterion scores into account explained many of the apparent differences in outcomes among different sorts of applications. One notable exception is that ESI applications (and, to a lesser extent, other applications submitted by New Investigators) had a small advantage in the Impact model and a large advantage in the Funding model. This finding reflects NIH policy, which strives to support new investigators on new R01-equivalent awards at success rates comparable to those of established investigators submitting new applications.
Consistent with the findings of Ginther et al. [11], the present study found large differences in NIH R01 funding rates by race when the influence of criterion scores was not taken into account. Criterion scores were introduced in FY 2010 and thus were not available for the applications evaluated by Ginther et al. Differences in outcomes by gender were also found in the summary data of the present study. These demographic differences diminished or disappeared once the criterion scores were included in the full models. However, bias cannot be ruled out, particularly in the first stage of peer review, where small but statistically significant differences remain in the Impact model. To ensure fairness, NIH is undertaking an extensive review of potential bias in the peer review system (see http://acd.od.nih.gov/prsub.htm). In contrast to the Impact model, the Funding model showed almost no differences in funding outcomes by demographics once all the measured characteristics of the application were taken into account.

Conclusion
The research criterion scores, specifically the Approach and, to a lesser extent, the Significance scores, are the most important predictors of an R01 application's Overall Impact score and its likelihood of being funded. Other factors, such as the New Investigator status of the applicant, are also associated with outcomes, particularly funding outcomes. But the model results show that the quality of the application, as measured by the criterion scores, is the best predictor of an application's eventual success. Applicants might consider these findings when preparing future R01 applications to NIH.