Abstract
Peer review is a decisive factor in selecting research grant proposals for funding. The usefulness of peer review depends in part on the agreement of multiple reviewers’ judgments of the same proposal, and on each reviewer’s consistency in judging proposals. Peer reviewers are also instructed to disregard characteristics that are not among the evaluation criteria. However, characteristics such as the gender identity of the investigator or reviewer may be associated with differing evaluations. This experiment sought to characterize the psychometric properties of peer review among 605 experienced peer reviewers and to examine possible differences in peer review judgments based on peer reviewer and investigator gender. Participants evaluated National Institutes of Health-style primary reviewers’ overall impact statements that summarized the study’s purpose, its overall evaluation, and its strengths and weaknesses in five criterion areas: significance, approach, investigator, innovation, and environment. Evaluations were generally consistent between reviewers and within reviewers over a two-week period. However, there was less consistency in judging proposals with weaknesses. Regarding gender differences, women reviewers tended to provide more positive evaluations, and women investigators received better overall evaluations. Unsuccessful grant applicants use reviewer feedback to improve their proposals, a task made more challenging by inconsistent reviews. Peer reviewer training and calibration could increase reviewer consistency, which this study’s results suggest is especially relevant for proposals with weaknesses. Evidence of systematic differences in proposal scores based on investigator and reviewer gender may also indicate the usefulness of calibration and training. For example, peer reviewers could score practice proposals and discuss differences prior to independently scoring assigned proposals.
Citation: Schmaling KB, Gallo SA (2024) An experimental study of simulated grant peer review: Gender differences and psychometric characteristics of proposal scores. PLoS ONE 19(12): e0315567. https://doi.org/10.1371/journal.pone.0315567
Editor: Fanli Jia, Seton Hall University, UNITED STATES OF AMERICA
Received: June 12, 2024; Accepted: November 27, 2024; Published: December 17, 2024
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: The data underlying the results presented in the study are available from https://doi.org/10.17605/OSF.IO/92QEZ.
Funding: This work was supported by the National Science Foundation (www.nsf.gov), award numbers 1951132 (SG) and 1951251 (KS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Peer review is a crucial component in the determination of grant applications that receive funding and manuscripts that are published. The importance of peer reviewers’ input is tempered by concerns about reviewers’ reliability and about reviewers’ potential lack of impartiality regarding characteristics that are not among the evaluation criteria, such as gender.
Two types of reliability are valuable in the investigation of the psychometric soundness of grant evaluation: interrater agreement, the consistency of the scores of different reviewers evaluating the same application; and test-retest reliability, the consistency of the scores of the same reviewer across time. Demonstrating reliability is important because validity presupposes reliability.
Interrater reliability has been the focus of most grant peer review reliability research. This focus is sensible because between-reviewer variability makes it less likely that an application is funded [1], and funding is needed to conduct most research and to have a successful scientific career. A recent study [2] summarized the interrater reliability reported in seven previous studies, which found multiple-rater intraclass correlations (ICCs) of 0.00 to 0.50. However, several of these studies had used a restricted range of applications, e.g., the top 18%. The authors conducted de novo data analyses using the full range of applications from two different sources of peer review and found better ICCs of 0.61 and 0.64. In additional studies, ICCs of 0.41 were reported among reviewers of proposals to the Canadian [3] and Swiss [4] funding agencies, but the reliability coefficients for applications to a UK funding agency were less than 0.20 [5]. Across different funders, diverse disciplines, and different metrics of agreement, interrater reliability has ranged from zero to 0.64. No reports of grant peer reviewer test-retest reliability were found in the literature.
For many years, peer reviewers for the US National Institutes of Health (NIH) have judged proposals on five component scores—significance, innovation, investigator, approach, and environment—which informed an overall impact score. (This system is scheduled to change in 2025 [6].) Using Pearson correlations to measure the associations of the six NIH scores with each other, the strongest relationship was between the overall impact score and the approach (r = 0.84) [7]. Other studies have likewise found the scientific approach to be the component that drives the overall impact score [8].
Between-reviewer inconsistency could occur if some reviewers used information not among the evaluation criteria. Gender identity—both peer reviewer and investigator gender—is one potential variable associated with differences between reviewers’ scores. For example, a recent review found differences in some grant outcomes by investigator gender, such as the success rate of resubmitted proposals [9]. A Canadian study [3] found that female investigators received worse scores (0.02 to 0.06 points lower on a 1.0 (poor) to 4.9 (excellent) scale, depending on scientific area and previous funding success) and that female reviewers gave worse scores than male reviewers (0.05 points lower). An Austrian study [10] found small gender differences in scores: women investigators were associated with 0.37 fewer points than men on a 0 (poor) to 100 (excellent) rating scale. In a US study, gender was a significant predictor of NIH overall impact scores: female investigators received better scores (0.2 points lower) than men on a 1 (exceptional) to 9 (poor) scale [7]. These studies suggest that grant proposal evaluations may differ by reviewer and investigator gender.
The purpose of the present study was to characterize the component scores’ associations with the overall impact score, evaluate interrater and test-retest reliability, and examine the effects of reviewer and investigator gender on reviewers’ scores. We posed the following research questions: Which component scores are most strongly associated with the overall impact score? Do proposal scores achieve satisfactory levels of interrater and test-retest reliability? Is the gender of the reviewer, the investigator, or both associated with differences in proposal scores? To investigate these questions, mock overall impact statements (OISs) for putative NIH investigator-initiated research grant (R01 mechanism) applications were used as the research materials. Participants with recent NIH-style review experience rated the OISs as if they were unassigned reviewers who provide scores based on the assigned reviewers’ OISs.
Materials and methods
The design, participants, and measures for this study were described in detail previously [11]. These methods are summarized below.
Participants and procedure
Participants were recruited from two sources. The first source comprised 10,990 people in the American Institute of Biological Science’s (AIBS) Scientific Peer Advisory and Review Services database who had reviewed for or submitted applications to one of the funders for whom AIBS had conducted peer review. The second source comprised 1,678 scientists listed on sixteen 2020 NIH integrated review group rosters; these NIH reviewers’ email addresses were identified through web searches. Only scientists affiliated with US-based institutions and with US-based email addresses were included. Between these two sources, 12,253 emails were delivered. Fig 1 is a flow diagram showing, for each study, the numbers of participants who received delivered email invitations and responded and, for the first study, the numbers who were ineligible or eligible, provided incomplete or complete data, and had missing demographic data.
Participants were emailed invitations to participate with a link to complete the study online using Qualtrics™. The email described how they were identified (AIBS or NIH roster) and invited them to participate if they had done one or more NIH-style reviews in the last 5 years. One reminder invitation was sent two weeks after the initial invitation to those who had neither completed the survey nor opted out of further contact. A total of 725 scientists accessed the study link and 605 completed the study (see Fig 1). The target sample size of 600 was determined by a priori power analyses based on data from a pilot study [11].
Measures
The study components were administered in the following order, unless randomly determined as noted below.
The email invitations and the consent form described the study and the inclusion criterion, which was having conducted one or more NIH-style reviews in the last 5 years. This research was deemed exempt from human subjects regulations by the Washington State University Office of Research Assurances and granted exemption from IRB review consistent with 45 CFR 46.104(d)(2). Demographic information was collected, including degrees, year the earliest doctoral degree was earned, gender identity, English as first language, racial/ethnic identities, and the number of study sections/review committees on which the participant had served in the past five years for the NIH, Agency for Healthcare Research and Quality (AHRQ), Department of Defense (DoD), National Science Foundation (NSF), and other agencies. If the participant reported serving on zero NIH, AHRQ, or DoD study sections, which all use similar review criteria and scoring procedures, they were told they did not meet the inclusion criterion and thanked for their interest.
Participants who met the inclusion criteria received two overall impact statements (OISs) to read and score. The process for the development and pilot testing of these OISs was described previously [11]. One OIS was the same for all participants, which we refer to herein as the control OIS. It described an outstanding proposal with no weaknesses in any of the criterion areas, and made no reference to the gender of the investigator (i.e., by name or pronouns). The other OIS was randomly assigned, which we refer to herein as the comparison OIS: the principal investigator’s gender was male or female (gender-specific names and pronouns were used). Also, the narrative suggested either no risk or weakness, or some risk in the scientific approach, the investigator, or both. There were six possible comparison OISs based on the combination of two categories of gender by three levels of risk (low PI risk and higher approach risk; higher PI risk and low approach risk; higher PI risk and higher approach risk) (S1 Fig). The order in which the participants received the two OISs was randomly determined. The randomization occurred after participants met inclusion criteria and was designed to produce balanced numbers of participants by gender, randomized OIS order (control or comparison), and randomized version of the comparison OIS. But unequal numbers of participants could result if randomized participants did not provide complete data. Anonymized study data are available on the Open Science Framework [12].
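To illustrate the assignment logic, a minimal sketch of balanced (block) randomization across the comparison OIS versions and presentation orders is shown below. The condition labels and block structure are assumptions for illustration only, not the study’s actual Qualtrics implementation.

```python
import itertools
import random

# Hypothetical sketch of balanced (block) randomization across the
# 2 (investigator gender) x 3 (risk level) comparison OIS versions
# and 2 presentation orders; the study's actual survey logic may differ.
GENDERS = ["female", "male"]
RISKS = ["risky_approach", "risky_investigator", "risky_both"]
ORDERS = ["control_first", "comparison_first"]

def make_assignments(n_participants):
    """Yield one condition per eligible participant, shuffling within
    complete blocks of all 12 gender x risk x order combinations so
    cell sizes stay as balanced as possible."""
    cells = list(itertools.product(GENDERS, RISKS, ORDERS))
    assignments = []
    while len(assignments) < n_participants:
        block = cells[:]
        random.shuffle(block)
        assignments.extend(block)
    return assignments[:n_participants]

# Example: assign 605 eligible participants.
assignments = make_assignments(605)
print(assignments[0])  # e.g., ('male', 'risky_both', 'control_first')
```

As the text notes, cell sizes could still become unequal if randomized participants did not complete the study.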
Test-retest methods
The methods of the test-retest reliability component of the study follow. The first 186 participants who completed the first study, and who gave permission to be potentially recontacted for a future related study, were invited to complete “a similar study” two weeks after they completed the initial assessment. These participants were balanced across the six randomly chosen combinations of investigator gender and risk for the comparison OIS. Eighty-nine participants began the second study and 83 completed it (see Fig 1). The target sample size of 80 was based on an a priori power analysis from the pilot study. Each participant received the same OISs for both studies. The two assessments differed only in how the scoring options were described. The first study provided both the standard NIH numeric scale and each number’s adjectival definition, i.e., 1 = exceptional to 9 = poor, and the second study provided only adjectives without the numbers to decrease recall cueing from the first study. For example, a rating of “1/exceptional” in study one corresponded to a rating of “exceptional” in study two, which was coded with a value of 1: the pairs of ratings from each participant were used to calculate test-retest reliability.
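As an illustration of the recoding step, a minimal sketch follows; it assumes the standard NIH adjectival anchors for all nine score points (only some of which are stated in the text), and the function and variable names are hypothetical.

```python
# Hypothetical sketch of recoding the second study's adjective-only
# ratings back onto the 1 (exceptional) to 9 (poor) NIH scale so each
# participant contributes (time 1, time 2) pairs for test-retest ICC.
ADJECTIVE_TO_SCORE = {
    "exceptional": 1, "outstanding": 2, "excellent": 3, "very good": 4,
    "good": 5, "satisfactory": 6, "fair": 7, "marginal": 8, "poor": 9,
}

def pair_ratings(time1_scores, time2_adjectives):
    """Return numeric (time 1, time 2) pairs for one participant."""
    return [(s1, ADJECTIVE_TO_SCORE[a2])
            for s1, a2 in zip(time1_scores, time2_adjectives)]

print(pair_ratings([1, 2], ["exceptional", "outstanding"]))  # [(1, 1), (2, 2)]
```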
Statistical analysis
Descriptive statistics were used to characterize the participants, and univariate statistics were used to compare those who participated in study one with NIH reviewers; the participants in study one with those in study two; and men with women participants. Spearman correlations were used to examine the component scores’ associations with the overall impact score, separately for each OIS (control; risky approach; risky investigator; and risky approach and investigator). For interrater reliability, percent agreement (the percentage endorsing the modal score) was used, and single-measure intraclass correlations (ICCs) were calculated for absolute agreement with mixed effects (reviewer effects were random and component scores were fixed), with 95% confidence intervals, for the set of six scores. ICC values of 0.00 to 0.40 are considered to represent poor agreement, 0.41 to 0.59 fair, 0.60 to 0.74 good, and 0.75 to 1.00 excellent [13]. For test-retest reliability, single-measure ICCs were calculated between ratings across the two studies, separately by OIS, and Bland-Altman plots were used to examine whether the first and second ratings were systematically biased.
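The analyses were conducted in SPSS. Purely as an illustrative sketch, a roughly analogous single-measure, absolute-agreement ICC could be computed in Python with the pingouin package as follows; the file and column names are placeholders, and the exact correspondence to the SPSS mixed-model specification described above is an assumption.

```python
import pandas as pd
import pingouin as pg

# Sketch of a single-measure, absolute-agreement ICC with reviewers as
# raters and the six scores (overall impact + five criteria) as the
# rated targets. `long` needs one row per reviewer x score with columns:
#   reviewer_id, score_name, rating
long = pd.read_csv("control_ois_scores_long.csv")  # hypothetical file

icc = pg.intraclass_corr(data=long, targets="score_name",
                         raters="reviewer_id", ratings="rating")
# ICC2 is the single-measure, absolute-agreement, two-way model.
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])
```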
The final set of analyses examined investigator and reviewer gender differences in scores, using only the 604 participants who endorsed a binary gender identity. For the control OIS, Q-Q plots of the scores’ distributions revealed the presence of outliers, and Levene’s test for equality of error variances showed heterogeneity of variances by reviewer gender for the innovation and approach scores. Consequently, nonparametric Mann-Whitney U tests were used to determine whether women and men reviewers differed on each of the six control OIS scores. The Benjamini-Hochberg critical value was calculated to adjust for the number of tests. Effect sizes were reported as partial eta-squared (η2).
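For illustration, a minimal sketch of this analysis in Python is shown below. It uses Benjamini-Hochberg-adjusted p-values, which yield the same accept/reject decisions as comparing raw p-values against the critical values; the file and column names are placeholders.

```python
import pandas as pd
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hypothetical sketch of the reviewer-gender comparisons on the control
# OIS: one Mann-Whitney U test per score, then a Benjamini-Hochberg
# adjustment across the six tests.
df = pd.read_csv("control_ois_scores_wide.csv")  # hypothetical file
scores = ["overall", "significance", "investigator",
          "innovation", "approach", "environment"]

pvals = []
for col in scores:
    women = df.loc[df["reviewer_gender"] == "woman", col]
    men = df.loc[df["reviewer_gender"] == "man", col]
    stat, p = mannwhitneyu(women, men, alternative="two-sided")
    pvals.append(p)

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(pd.DataFrame({"score": scores, "p": pvals,
                    "p_BH": p_adj, "significant": reject}))
```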
For the comparison OIS scores, homogeneity of error variances (Levene’s test) and of covariance matrices (Box’s test) were inspected for violations of these assumptions: none of these tests were significant. Therefore, MANOVA was used to examine differences in comparison OIS scores by reviewer gender, investigator gender, and their interaction, with the control OIS score as a covariate. The covariate was used to control for reviewers’ general scoring tendencies. Demographic differences between men and women reviewers were also considered as possible covariates. Effect sizes were reported as partial eta-squared (η2). All analyses were conducted using IBM SPSS V28.0 (New York, NY).
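Again for illustration only, a hypothetical sketch of the corresponding MANOVA (with the control OIS overall score as a covariate) using statsmodels follows; the column names are placeholders, and the assumption checks described above would still be needed.

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical sketch of the MANOVA on the six comparison OIS scores,
# with reviewer gender, investigator gender, their interaction, and the
# control OIS overall score as a covariate.
df = pd.read_csv("comparison_ois_scores.csv")  # hypothetical file

model = MANOVA.from_formula(
    "overall + significance + investigator_score + innovation + "
    "approach + environment ~ reviewer_gender * investigator_gender "
    "+ control_overall",
    data=df,
)
print(model.mv_test())  # Pillai's trace, Wilks' lambda, etc., per term
```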
Results
Participants
As shown in Fig 1, complete data for the OIS ratings were provided by 605 participants for the first study and 83 participants for the second, test-retest study. The 605 participants were mostly White (74.7%; Asian, 17.7%; Latinx, 3.1%; Black/African American, 1.2%; Native American/Indigenous, 0.2%; multiethnic, 2.6%; missing, 0.5%) and male (63.8%; female, 36.0%; nonbinary, 0.2%), and most held only PhDs (68.8%; only MDs, 15.1%; MD/PhDs, 11.2%; other doctoral degrees with or without MDs and/or PhDs, 4.6%; missing, 0.3%). On average, participants had received their first doctoral degree 29.0 (SD = 9.65) years previously and had participated on 9.97 (SD = 7.99) NIH-style panels in the past five years. Men and women participants did not differ in the number of NIH-style panels in the last five years, but women had earned their first doctoral degrees more recently (27.92 years) than men (29.63 years; t(590) = -2.09, p = 0.037, d = -0.18), were more likely than men to have only PhD degrees (74.65% vs. 65.71%; χ2(1) = 5.18, p = 0.023), and were less likely than men to identify as multiethnic or non-White (17.9% vs. 29%; χ2(1) = 9.03, p = 0.003).
For comparison, in 2018, men comprised 65.5% and Whites comprised 69.5% of all NIH reviewers [14]. Our sample thus contained slightly fewer men (by about 2 percentage points) and more Whites (by about 5 percentage points) than NIH reviewers. Years since the first doctoral degree has not, to our knowledge, been reported for NIH reviewers; however, in 2018, 83.3% of NIH peer reviewers held the academic title of “Professor” or “Associate Professor,” indicative of substantial seniority. We could not locate peer reviewer demographic information for AHRQ or DoD, and 9.3% of our sample had more peer review experience with those agencies than with NIH, so we cannot evaluate the similarity of those agencies’ reviewers to our sample.
The subset of 83 participants who completed the second, test-retest study did not differ significantly from those who completed only the first study in years since their first doctoral degree, gender, having only PhDs, or White ethnicity. However, those who completed the second study had participated in fewer panels in the last five years (M = 8.35, SD = 7.42) than those who completed only the first study (M = 10.23, SD = 8.05; t(603) = 1.98, p = 0.048, d = 0.125).
Interrater agreement
Table 1 shows the average and modal ratings for the overall impact score and the five criterion scores separately for each OIS. The overall impact score for the control OIS was most frequently rated a 2, corresponding to an “outstanding” proposal. The overall impact scores for the comparison OISs with risky investigator, approach, or both were most frequently rated a 2, 3, and 5, respectively, corresponding to “outstanding,” “excellent,” and “good” proposals.
Percentage agreement, calculated as the percentage of participants who rated the OIS with the modal score, was used as a measure of interrater agreement. For the control OIS, the most frequent scores were 1 or 2, with up to 61% of participants (for environment) endorsing the modal score. For the comparison OISs, the modal scores varied more between criteria, as would be expected because of the experimental manipulations. For the OIS with a risky approach, the approach was most frequently rated a 5 (25% agreement). For the OIS with a risky investigator, this criterion was most frequently rated a 3 (30% agreement). For the OIS with both a risky approach and investigator, the approach was most frequently rated a 5 (29% agreement) and the investigator was most frequently rated a 4 (31% agreement). Decreasing agreement was associated with increasing risk: the average percentage of participants who endorsed the modal criterion scores in the control condition (52% agreement) was 18 percentage points greater than in the condition with both a risky investigator and approach (34% agreement).
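For clarity, a minimal sketch of the percent-agreement calculation (the share of reviewers endorsing the modal score) follows; the ratings in the example are made up.

```python
import pandas as pd

# Hypothetical sketch of the percent-agreement measure: the share of
# reviewers who gave the modal (most common) score for a criterion.
def modal_agreement(ratings):
    """Return (modal score, percent of raters endorsing it)."""
    counts = pd.Series(ratings).value_counts()
    mode = counts.idxmax()
    return mode, 100 * counts.max() / len(ratings)

# Example with made-up ratings for one criterion:
print(modal_agreement([2, 2, 1, 2, 3, 2, 1, 2]))  # (2, 62.5)
```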
The single-measure intraclass correlations (ICCs) follow, by OIS: control OIS, ICC = 0.614 (95% CI = 0.542, 0.675); risky approach OIS, ICC = 0.276 (95% CI = 0.139, 0.409); risky investigator OIS, ICC = 0.258 (95% CI = 0.147, 0.370); and both risky approach and investigator, ICC = 0.271 (95% CI = 0.149, 0.392).
Associations of the component scores with the overall impact score
As shown in Table 2, approach scores had the strongest relationship with the overall impact score for all OISs.
Test-retest reliability
Across the six scores, the average ICC for the control OIS was 0.52, and for the comparison OISs it was 0.31 for risky approach, 0.33 for risky investigator, and 0.47 for risky approach and investigator. Table 3 shows the test-retest ICCs separately for each score and each OIS. Environment tended to be rated inconsistently. There was more variability in reviewers’ test-retest consistency for the comparison OISs than for the control OIS, and among the comparison OISs, single sources of risk were rated more inconsistently than were two sources of risk.
Visual inspection of the Bland-Altman plots of the difference between a reviewer’s two scores versus their average score revealed no evidence of systematic test-retest bias for the overall impact score for either the control or the comparison OIS (see S2 Fig). These visual impressions were supported by statistical tests of the relationship between the difference and average scores, which were not significant.
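As a sketch of how such a plot is constructed, the following hypothetical Python snippet plots each reviewer’s test-retest difference against their mean score, with the bias line and 95% limits of agreement; the paired ratings in the example are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sketch of a Bland-Altman plot for test-retest overall
# impact scores: each point is one reviewer's (time 1, time 2) pair.
def bland_altman(t1, t2):
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    mean = (t1 + t2) / 2           # average of the two ratings
    diff = t1 - t2                 # time 1 minus time 2
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)  # 95% limits of agreement
    plt.scatter(mean, diff)
    plt.axhline(bias, linestyle="--")
    plt.axhline(bias + loa, linestyle=":")
    plt.axhline(bias - loa, linestyle=":")
    plt.xlabel("Mean of the two ratings")
    plt.ylabel("Difference (time 1 - time 2)")
    plt.show()

# Example with made-up paired ratings:
bland_altman([2, 3, 2, 4, 5], [2, 2, 3, 4, 5])
```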
Gender comparisons
Mann-Whitney U tests were used to examine differences in scores based on reviewer gender for each component of the control OIS (see Table 4). According to Benjamini-Hochberg critical values, there were significant differences between women and men reviewers for all criterion scores except environment. In all cases, women reviewers gave significantly lower (better) scores than men reviewers. The magnitude of this difference was 11% for the overall impact score.
Next, MANOVA was used to examine differences in comparison OIS scores by reviewer gender, investigator gender, and their interaction, with the control OIS score as a covariate. (The control OIS score had a significant effect, but none of the three reviewer characteristics that differed between men and women reviewers—years since first doctoral degree, having only PhD degrees, identifying as White only—had significant multivariate effects, so they were omitted from the analyses reported here.) Multivariate tests found significant main effects for reviewer gender (F(6, 595) = 2.41, p = .026, η2 = .024) and investigator gender (F(6, 595) = 2.16, p = .046, η2 = .021), but the reviewer by investigator gender interaction was not significant (F(6, 595) = 0.80, p = .568, η2 = .008). Table 5 shows the results by score. Innovation differed significantly by reviewer gender, with women reviewers giving lower (better) scores than men reviewers (estimated marginal means = 2.12 and 2.39, respectively). Women investigators received lower (better) overall scores than men investigators (estimated marginal means = 3.28 and 3.57, respectively).
Discussion
This study found that peer reviewers’ reliability was moderate overall, whether it was examined as agreement with the modal score or test-retest reliability after a two-week delay. However, reliability decreased when the proposal had areas of weakness. For example, the proportion of reviewers who used the modal score for the control OIS was halved for the manipulated criteria (approach, investigator, or both) in the comparison OIS.
In terms of associations among the component scores, this study reinforces previous studies’ results [7, 8]: of the five components of the NIH review criteria, the approach score had the strongest association with the overall impact score. Furthermore, although the descriptions of significance, innovation, and environment were the same for all of the comparison OISs, ratings of these criteria seemed to have been negatively influenced, or contaminated, by weaknesses in the approach and/or the investigator. While speculative, this pattern of results was consistent with negativity bias [15]: weaknesses were more influential and salient than strengths. The test-retest data also reflected the salience of weaknesses: test-retest agreement was greater for the components that were manipulated in the comparison OISs—risky investigator and/or approach—than test-retest agreement for these components in the control OIS. In other words, reviewers may have remembered weaknesses more than strengths and scored the former more reliably [16]. We were unable to validate this potential interpretation by, for example, follow-up interviews with the participants. Such work remains to be done in future research.
This study adds to the extant literature on peer reviewers’ reliability. Reviewers were relatively consistent with each other, on par with previous studies [2–5]. Reviewers were also consistent with themselves over a two-week period. However, better reliability was observed for an outstanding application than for one with areas of relative weakness. Given that research funding is highly competitive, it is arguably most important that reviewers reliably identify outstanding proposals—which they did in this study. If more research funding were to become available, the reliable identification of good and very good proposals that might qualify for funding would become more essential. Improved reviewer reliability could also lead to more consistent feedback to investigators. Investigators can be frustrated with inconsistent or contradictory feedback [17]. Resubmission is often required for success, and responsiveness to previous reviews is an important component in the evaluation of resubmitted proposals. Reviewer training videos and calibration discussions are intended to enhance reliability, but do not do so consistently [18, 19]. Further work along these lines would be valuable.
This study also examined differences in scores based on reviewer and investigator gender. For the control OIS, women peer reviewers gave significantly better scores to all criteria except environment than did men peer reviewers, accounting for small amounts (1%) of the variance in scores. For the comparison OIS (with control OIS scores as covariates), women reviewers gave better innovation scores than did men reviewers, accounting for 2% of the variance in scores. These results are consistent with some previous studies that found women reviewers to give better scores than men [7], for example in the sciences [20]. But they are inconsistent with other studies that found women reviewers to give worse scores than men reviewers [3, 10, 21]. We do not know why women reviewers gave better scores, which could be due to more positive or less critical mindsets, desire to nurture other scientists, or other factors. But as there is no gold standard, we cannot determine who is accurate, and who is biased.
Our study also found that investigators’ gender accounted for about 2% of the variance in scores for proposals with weaknesses. The effect sizes and gender differences in scores may seem small, but small differences can determine whether a proposal is funded. For example, among 2023 R01 applications to the National Institute of General Medical Sciences, the likelihood of funding decreased with scores at approximately the 22nd percentile and above (overall impact scores are converted to percentiles for each review panel) (see Fig 4 in reference [22]). Reviewers gave better overall impact scores to women investigators than to men, contrary to studies that have found women to receive worse scores [3, 10]. However, no differences by investigator gender were revealed for the other criteria. We offer two speculative reasons for this gender difference. First, participants may have guessed that investigator gender was manipulated in the study and wished to appear fair; however, only three (0.5%) participants left comments that referenced gender. Second, a woman’s name and pronouns may cue positive bias [23], whether to counteract historical discrimination against women scientists or to mitigate concerns about appearing sexist. Finally, there were no interactions between reviewer and investigator gender for the overall or component scores, suggesting a lack of gender affinity bias—a preference for others of the same gender.
Limitations
Although a minority of the invited reviewers participated, the characteristics of those who did were similar to NIH reviewers at the time. More recently, however, NIH reviewers have included fewer men and Whites; these changes in NIH reviewer demographics mean that our sample reflects current panels’ demography less well than the panels of several years ago. This study was modeled on the role of unassigned reviewers in NIH study sections, who must score a proposal they have not read based on the reviews of the assigned reviewers. In actual panels, unassigned reviewers’ scores are very similar (0.06 points greater) to assigned reviewers’ scores [24]. In this study, however, unlike actual NIH study sections, no meetings and discussions occurred that would have also informed the reviewers’ scores. No gender was specified for the investigator of the proposal without weaknesses (the control OIS), so we cannot know if women investigators would have been evaluated more positively than men, as they were for proposals with weaknesses. Furthermore, the basic science in the OISs could have been interpreted as research that women scientists do, possibly leading to more positive evaluations of the control OIS by women reviewers and to better overall scores for women investigators in the comparison OIS. This possibility is tempered by evidence that both men and women MDs and life science PhDs have stronger implicit and explicit associations of men with science than of women [25].
Conclusions
Experienced grant peer reviewers are relatively reliable with each other and with themselves over a two-week period when they evaluate a proposal without weaknesses. Women reviewers gave this strong proposal better scores than did men reviewers. Peer reviewer reliability suffers, however, when evaluating proposals with weaknesses. In proposals with weaknesses, women reviewers continued to give better scores than men, but only in one area: innovation. Furthermore, proposals from women investigators were scored more favorably, but only on the overall impact score: no investigator gender differences were found for the other criterion scores (significance, innovation, investigator, approach, environment). Further research to determine the sources of and rationales for these inconsistencies would be valuable. Gender differences in reviewers’ scores for the same OIS may also suggest the need for calibration, as do differences in scores by investigator gender. Although gold standards for “correct” scores do not exist, peer reviewers could score practice OISs and discuss differences prior to independently scoring new proposals. And, if these findings are replicated, reviewers could be mindful of any tendencies to evaluate proposals differently based on gender identity.
Supporting information
S2 Fig. Bland-Altman plots for the control and comparison OISs.
https://doi.org/10.1371/journal.pone.0315567.s002
(DOCX)
References
- 1. Graves N, Barnett AG, Clarke P. Funding grant proposals for scientific research: retrospective analysis of scores by members of grant review panel. BMJ. 2011;343:d4797. pmid:21951756
- 2. Erosheva EA, Martinková P, Lee CJ. When zero may not be zero: A cautionary note on the use of inter-rater reliability in evaluating grant peer review. J R Stat Soc Ser A Stat Soc. 2021;184(3):904–19.
- 3. Tamblyn R, Girard N, Qian CJ, Hanley J. Assessment of potential bias in research grant peer review in Canada. CMAJ. 2018;190(16):E489–e99. pmid:29685909
- 4. Reinhart M. Peer review of grant applications in biology and medicine. Reliability, fairness, and validity. Scientometrics. 2009;81(3):789–809.
- 5. Jerrim J, Vries R. Are peer reviews of grant proposals reliable? An analysis of Economic and Social Research Council (ESRC) funding applications. Soc Sci J. 2023;60(1):91–109.
- 6. National Institutes of Health. Simplifying review of research project grant applications. 2023. https://grants.nih.gov/policy/peer/simplifying-review.htm
- 7. Eblen MK, Wagner RM, RoyChowdhury D, Patel KC, Pearson K. How criterion scores predict the overall impact score and funding outcomes for National Institutes of Health peer-reviewed applications. PLoS One. 2016;11(6):e0155060. pmid:27249058
- 8. Kaatz A, Lee YG, Potvien A, Magua W, Filut A, Bhattacharya A, et al. Analysis of National Institutes of Health R01 application critiques, impact, and criteria scores: Does the sex of the principal investigator make a difference? Acad Med. 2016;91(8):1080–8. pmid:27276003
- 9. Schmaling KB, Gallo SA. Gender differences in peer reviewed grant applications, awards, and amounts: a systematic review and meta-analysis. Res Integr Peer Rev. 2023;8(1):2. pmid:37131184
- 10. Mutz R, Bornmann L, Daniel HD. Heterogeneity of inter-rater reliabilities of grant peer reviews and its determinants: a general estimating equations approach. PLoS One. 2012;7(10):e48509. pmid:23119041
- 11. Gallo SA, Schmaling KB. Peer review: Risk and risk tolerance. PLoS One. 2022;17(8):e0273813. pmid:36026494
- 12. Schmaling KB, Gallo SA. Gender differences and psychometric characteristics of proposal scores in simulated grant peer review. 2024. osf.io/92qez
- 13. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994;6(4):284–90.
- 14. Center for Scientific Review, National Institutes of Health. Demographics of CSR reviewers. https://public.csr.nih.gov/AboutCSR/Evaluations#reviewer_demographics
- 15. Rozin P, Royzman EB. Negativity bias, negativity dominance, and contagion. Pers Soc Psychol Rev. 2001;5(4):296–320.
- 16. Pier EL, Brauer M, Filut A, Kaatz A, Raclaw J, Nathan MJ, et al. Low agreement among reviewers evaluating the same NIH grant applications. Proc Natl Acad Sci U S A. 2018;115(12):2952–7. pmid:29507248
- 17. Gallo SA, Schmaling KB, Thompson LA, Glisson SR. Grant review feedback: Appropriateness and usefulness. Sci Eng Ethics. 2021;27(2):18. pmid:33733708
- 18. Sattler DN, McKnight PE, Naney L, Mathis R. Grant peer review: Improving inter-rater reliability with training. PLoS One. 2015;10(6):e0130450. pmid:26075884
- 19. Hesselberg JO, Fostervold KI, Ulleberg P, Svege I. Individual versus general structured feedback to improve agreement in grant peer review: A randomized controlled trial. Res Integr Peer Rev. 2021;6(1):12. pmid:34593049
- 20. Jayasinghe UW, Marsh HW, Bond N. A multilevel cross-classified modelling approach to peer review of grant proposals: the effects of assessor and researcher attributes on assessor ratings. J R Stat Soc Ser A Stat Soc. 2003;166(3):279–300.
- 21. Severin A, Martins J, Heyard R, Delavy F, Jorstad A, Egger M. Gender and other potential biases in peer review: Cross-sectional analysis of 38 250 external peer review reports. BMJ Open. 2020;10(8):e035058.
- 22. Villanueva-Under R, Lorsch J. Application and funding trends in fiscal year 2023. March 20, 2024. https://loop.nigms.nih.gov/2024/03/application-and-funding-trends-in-fy23/
- 23. Jampol L, Rattan A, Wolf EB. A bias toward kindness goals in performance feedback to women (vs. men). Pers Soc Psychol Bull. 2023;49(10):1423–38. pmid:35751137
- 24. Johnson VE. Statistical analysis of the National Institutes of Health peer review system. Proc Natl Acad Sci U S A. 2008;105(32):11076–80. pmid:18663221
- 25. Smyth FL, Nosek BA. On the gender-science stereotypes held by scientists: explicit accord with gender-ratios, implicit accord with scientific identity. Front Psychol. 2015;6:415. pmid:25964765