How do authors’ perceptions of their papers compare with co-authors’ perceptions and peer-review decisions?

How do author perceptions match up to the outcomes of the peer-review process and the perceptions of others? In a top-tier computer science conference (NeurIPS 2021) with more than 23,000 submitting authors and 9,000 submitted papers, we surveyed the authors on three questions: (i) their predicted probability of acceptance for each of their papers, (ii) their perceived ranking of their own papers based on scientific contribution, and (iii) the change in their perception of their own papers after seeing the reviews. The salient results are: (1) Authors overestimated the acceptance probability of their papers roughly three-fold: the median prediction was 70% for an approximately 25% acceptance rate. (2) Female authors exhibited a marginally higher (statistically significant) miscalibration than male authors; predictions of authors invited to serve as meta-reviewers or reviewers were similarly calibrated to each other, but better calibrated than those of authors who were not invited to review. (3) Authors' relative ranking of the scientific contribution of two of their own submissions generally agreed with their predicted acceptance probabilities (93% agreement), but in a notable 7% of responses authors predicted a worse outcome for their better paper. (4) The author-provided rankings disagreed with the peer-review decisions about a third of the time; when co-authors ranked their jointly authored papers, they disagreed at a similar rate, about a third of the time. (5) At least 30% of respondents for both accepted and rejected papers said that their perception of their own paper improved after the review process. The stakeholders in peer review should take these findings into account when setting their expectations from peer review.


Introduction
Peer review is used widely in scientific research for quality control as well as for selecting 'interesting' research. However, a number of studies have documented low agreement among reviewers (Cicchetti, 1991; Bornmann et al., 2010; Obrecht et al., 2007; Fogelholm et al., 2012; Lawrence and Cortes, 2014; Pier et al., 2017; Cortes and Lawrence, 2021), and researchers often lament various problems with peer review (Akst, 2010; McCook, 2006; Rennie, 2016). On the other hand, surveys of researchers about their general perception of peer review reveal that researchers across various scientific disciplines consider peer review to be important, yet in need of improvements (Ware, 2016; Taylor and Francis group, 2015; Ware, 2008; Mulligan et al., 2013; Nicholas et al., 2015). But how do author perceptions of their submitted papers match up with the outcomes of the peer-review process? We investigate this question in this work.
We conduct a survey-based experiment at the Neural Information Processing Systems (NeurIPS) 2021 conference, a top-tier conference in the field of machine learning. The conference had over 9,000 papers submitted by over 23,000 authors. The conference has traditionally accepted 20-25% of the submitted papers, and in 2021 the acceptance rate was 25.8%.
We design and execute a survey to understand authors' perceptions about their submitted papers as well as their perceptions of the peer-review process in relation to their papers. In particular, we ask three questions:
• It is well known that the peer-review process (at the NeurIPS conference) has a low acceptance rate and a high amount of disagreement between reviewers (Lawrence and Cortes, 2014; Cortes and Lawrence, 2021; Beygelzimer et al., 2021). Do authors take this into account when setting their expectations from the peer-review process? Specifically, we aim to understand the calibration of authors with respect to the review process, by asking them to predict the probability of acceptance of their submitted paper(s).
• Motivated by authors' frequent lament that the paper they thought was their best was rejected while the one they thought had lower scientific merit was accepted, we aim to quantify the discrepancy between the authors' and the reviewers' relative perceptions of papers, by asking authors to rank their submissions in terms of their perceived scientific contribution and comparing these rankings against the acceptance decisions.
• Finally, while the two questions above measure authors' perceptions before the review process, we also measure their perceptions afterwards, by asking authors whether seeing the reviews changed their perception of their own paper.
We then analyze how author perceptions align with the outcomes of the peer-review process and the perceptions of co-authors. The results of this work are useful to set expectations from the peer-review process, identify its fundamental limitations, and help guide the policies that the community implements as well as future research on improving peer review. The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we present details of the questions asked to the participants (authors of submitted papers). We provide basic statistics of the responses in Section 4 and our main analysis of the responses in Section 5. We conclude with a discussion in Section 6.

Related work
There are a number of papers in the literature that conduct surveys of authors. Frachtenberg and Koster (2020) survey authors of accepted papers from 56 computer science conferences. The survey was conducted after these papers were published. Questions pertained to the paper's history (amount of time needed to write it; resubmission history) and the authors' opinions about the conference's rebuttal process. The respondents were also asked whether they found the reviews helpful in improving their paper: 34.1% of the respondents said the reviews were 'very helpful,' 52.7% said they were 'somewhat helpful,' and 13.2% said they were 'not at all' helpful. Frachtenberg and Koster (2020) also find that, for these questions, responses from different authors of the same paper were usually very similar. Similar surveys asking authors whether peer review helped improve their paper have also been conducted in other fields (Weller, 1996; Mulligan et al., 2013; Patat et al., 2019). It is important to note that this question is different from our third question, which asks whether authors' own perception of the quality of their own paper changed after the review process. Our question pertains to the same (version of the) paper and the perception before and after the reviews; on the other hand, their question pertains to two different versions of the paper (the initial submission and the revision following reviewers' suggestions) and whether there was an improvement across the versions. Philipps (2021) surveys authors of research proposals on their perception of random allocation of grant funding, and does find support for such randomized decisions, which have now also been implemented (Heyard et al., 2022). Resnik et al. (2008) and Fanelli (2009) conduct or analyze surveys of authors regarding breaches of ethics. While computer science was not their focus, within computer science as well there have been discoveries of breaches of ethics in the peer-review process (Littman, 2021; Vijaykumar, 2020; Jecmen et al., 2020; Wu et al., 2021; Jecmen et al., 2022).
Several other surveys (Ware, 2016; Taylor and Francis group, 2015; Ware, 2008; Mulligan et al., 2013; Nicholas et al., 2015) find strong support for peer review among researchers, while also finding that researchers see a need to improve peer review.
The work of Gardner et al. (2012) is closest to ours. They conduct a survey in the Australasian Association for Engineering Education (AAEE) annual conference in 2010 and 2011, comprising a total of 70 papers and 140 reviews. The survey asked authors to rate their own papers and also to rate the reviews. Their survey received responses from 23 authors in 2010 and from 37 authors in the 2011 edition. They found that overall 75% of authors rated their paper higher than the average of the reviewers' ratings for their paper. Furthermore, their survey found that the academic rank of the respondent was not correlated with the accuracy of the respondent's prediction of the reviews. Anderson (2009) offers a somewhat tongue-in-cheek commentary pertaining to authors' perceptions: "if authors systematically overestimate the quality of their own work, then any paper rejected near the threshold is likely to appear (to the author) to be better than a large percentage of the actual conference program, implying (to the author) that the program committee was incompetent or venal. When a program committee member's paper is rejected, the dynamic becomes self-sustaining: the accept threshold must be higher than the (self-perceived) merit of their own paper, encouraging them to advocate rejecting even more papers." Within the machine learning community, Rastogi et al. (2022) survey reviewers about the visibility of papers submitted to a conference that anonymizes authors, and about intentionally searching online for assigned papers. Our current work contributes to a tradition in machine learning venues of experimentation aimed at understanding and improving the peer-review process (Lawrence and Cortes, 2014; Shah et al., 2018; Tomkins et al., 2017; Stelmakh et al., 2020, 2021a,b; Cortes and Lawrence, 2021; Beygelzimer et al., 2021; Stelmakh et al., 2022).
See Shah (2022) for a more extensive discussion about research on the peer-review process and associated references.

Questionnaire
Our experiment was conducted in two phases. Phase 1 was conducted shortly after the paper-submission deadline, and Phase 2 was conducted after the authors were shown their initial reviews. We asked two questions during Phase 1 and one question during Phase 2, as described below. All of the questions were optional. Authors were told that their responses would not be seen by anyone during the review process and would not affect the decisions on their papers. The study protocol was approved by an independent institutional review board (IRB). A more detailed description of the privacy and confidentiality of responses can be found in Appendix A.
Phase 1: The first phase was conducted four days after the deadline for submission of papers, and was open for ten days. All authors of submitted papers were asked the following question:
• Acceptance probability. What is your best estimate of the probability (as a percentage) that this submission will be accepted? Please use a scale of 0 to 100, where 0 = "no chance of acceptance" and 100 = "certain to be accepted." Your estimate should reflect only how likely you believe it is that the paper will be accepted at NeurIPS, which may or may not reflect your perception of the actual quality of the submission. For context, over the past four years, about 21% of NeurIPS submissions were accepted.
Every author who had authored more than one submitted paper was also asked the following second question:
• Paper quality ranking. Rank your submissions in terms of your own perception of their scientific contributions to the NeurIPS community, if published in their current form. Rank 1 indicates the submission with the greatest scientific contribution; ties are allowed, but please use them sparingly.
Notice that the two questions differ in two ways. The acceptance probability question asks for a value (chance of acceptance), and this value represents the authors' perception of the outcomes of the peer-review process for their paper. On the other hand, the paper quality ranking question asks for a ranking, and furthermore, pertains to the author's own perception of the scientific contribution made by their paper.
Phase 2: The second phase was conducted after the authors could see the (initial) reviews. This phase comprised a single question, and participants were told they could answer it irrespective of whether they had participated in Phase 1:
• Change of perception. After you read the reviews of this paper, how did your perception of the value of its scientific contribution to the NeurIPS community change (assuming it was published in its initially submitted form)? [Select any one of the following options.]
o My perception became much more positive
o My perception became slightly more positive
o My perception did not change
o My perception became slightly less positive
o My perception became much less positive
More details about the timeline and instructions are provided in Appendix A.

Basic statistics
In this section, we provide some basic statistics pertaining to the experiment.

NeurIPS 2021 conference:
• Total number of papers submitted to the conference: 9,034.
• Total number of unique authors who submitted papers to the conference: 23,882.
• Total number of author-paper pairs: 37,100.
• Percentage of submitted papers that were eventually accepted to the conference: 25.8%.
We now move on to discuss the responses to the three questions.

"Acceptance probability" question:
• Number of papers with at least one response: 6,278 (69.5%).

"Paper quality ranking" question:
• Number of authors with more than one submission: 6,237.
• Total number of author-paper pairs for these authors: 19,455.

"Change of perception" question:
• Number of papers remaining after reviews were released (as some were rejected/withdrawn): 8,765.
• Number of author-paper pairs remaining: 36,103.
Response rates and breakdown: The overall response rates in our experiment are broadly in the ballpark of the response rates of other surveys in computer science. The survey by Nobarany et al. (2016) in the CHI 2011 conference had a response rate of 16%. Rastogi et al. (2022) conduct multiple surveys: an anonymous survey in the ICML 2021 and EC 2021 conferences had response rates of 16% and 51% respectively; a second, non-anonymous opt-in survey in EC 2021 had a response rate of 55.78%. Frachtenberg and Koster (2020) survey authors of accepted papers in 56 computer systems conferences, with response rates ranging from 0% to 59% across these conferences. The survey by Gardner et al. (2012) was opt-in in 2011, and their response rate was 28%.

We used gender self-reported in OpenReview profiles. The conference had 23,581 author-paper pairs where the author's self-reported gender was "male" and 3,328 author-paper pairs where it was "female". We omit other gender-based subgroups in our analysis, owing to concerns about privacy and noise due to the small sample size of responses from these groups. Further, 7,432 author-paper pairs did not have a self-reported gender of the author. In Phase 1, the response rate among author-paper pairs with self-reported gender "male" was 30.9%, that among self-reported gender "female" was 24.7%, and among the rest was 22%.
In terms of seniority, while we do not have a perfect measure of seniority, we use the role within the NeurIPS 2021 reviewing process as a proxy. We consider three levels of seniority. Ordered by decreasing seniority, these levels comprise: (1) authors who were invited to serve as area chairs or senior area chairs at NeurIPS 2021, whom we refer to as "meta-reviewers"; (2) authors who were invited to serve as reviewers; and (3) authors in neither of the two aforementioned groups. The conference saw 3,834 author-paper pairs for authors invited as meta-reviewers, 10,938 pairs for authors invited as reviewers, and 19,941 for those in neither list. The response rate (for acceptance probability) was 21% among authors invited as meta-reviewers, 28.9% among authors invited to review, and 29.8% for those in neither list.
In terms of paper outcomes, out of all the responses to Phase 1 (acceptance probability) of the experiment, 27% of the responses pertained to papers that were eventually accepted. Thus, in Phase 1 we did not see any large non-response bias with respect to the papers that were eventually accepted or rejected.
The "change of perception" question (Phase 2) was asked after the authors saw the reviews.Some authors with unfavorable reviews withdrew their papers before this phase.Some other papers were rejected before this phase for reasons such as formatting violations.As a result, out of all the responses to Phase 2 of the experiment, there was a significantly higher representation of accepted papers: 39.8% of the responses pertained to papers that were eventually accepted.Out of the 4,435 responses (author-paper pairs) in this phase, 3,259 were from authors who self-identified as male, 310 from authors who self-identified as female, and 866 from those who did not provide a gender or those who did not self-identify as male or female.In terms of participation in the review process, 324 responses were from authors who were invited to serve as meta-reviewers, 1,544 from authors who were invited to serve as reviewers, and 2,567 from neither.

Main analysis and results
We now present the main results.

Calibration in prediction of acceptance
We begin by looking at the responses to the acceptance probability question, and comparing them with the actual acceptance decisions. In Figure 1a, we plot the relation between the responses given by authors to this question and the actual acceptance rates for these papers. Here, the blue dots represent predicted values provided by at least 50 respondents each; these together comprise 94% of all responses.
We find that there is a nearly three-fold overestimation overall: the median of the acceptance probabilities estimated by the respondents is 70% and the mean is 67.7%. In comparison, we had primed respondents by mentioning that the acceptance rate in the past four years was about 21%. (The acceptance rate at NeurIPS 2021 ended up being 25.8%.) The fact that participants over-predict aligns with studies in other settings (Alpert and Raiffa, 1982; Anderson et al., 2012) that also find overconfidence effects. Also observe in Figure 1a that, interestingly, the authors' predictions track perfect calibration quite well for responses up to 35%, whereas responses greater than (to the right of) 35% are uncorrelated with the actual acceptance rate.
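To make the construction of a plot like Figure 1a concrete, here is a minimal sketch of the computation; the paper does not describe its exact pipeline, and the function and parameter names (`calibration_curve`, `min_samples`) are ours.

```python
import numpy as np

def calibration_curve(predictions, accepted, min_samples=50):
    """Empirical acceptance rate for each distinct predicted probability.

    predictions: array of author-predicted chances in [0, 100]
    accepted:    array of 0/1 final decisions, same length
    Only prediction values given by at least `min_samples` respondents
    are kept, mirroring the filtering described for Figure 1a.
    """
    predictions = np.asarray(predictions)
    accepted = np.asarray(accepted)
    points = []
    for p in np.unique(predictions):
        mask = predictions == p
        if mask.sum() >= min_samples:
            points.append((p, 100 * accepted[mask].mean()))
    return points  # list of (predicted %, observed acceptance rate %)
```

Perfect calibration corresponds to these points lying on the diagonal, which is the reference line drawn in Figure 1a.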
In Figure 1b, we sort the responses in descending order (on the x-axis) and plot the values of these responses (y-axis). We make separate plots for papers that were eventually accepted and those that were eventually rejected. We see that these two plots track each other quite closely, with papers that were eventually accepted having slightly higher predictions. We also observe indications of overestimation here: more than 5% of responses predict a 100% chance of their paper getting accepted, and about 50% of responses predict a chance of 75% or higher, whereas fewer than 15% of responses provide a prediction smaller than 40%.

Role of demographics
Next, we look at the role of demographics in calibration. For this, we first define the calibration error in prediction of acceptance for any author. Based on Section 5.1 and Figure 1a, we note that responses were on average overly confident, that is, the predicted probability of acceptance was higher than the observed rate of acceptance. Further, we also observe that within each demographic-based subgroup, authors on average predicted a higher acceptance probability for their submissions than the acceptance rate within that subgroup. We thus know the direction of miscalibration of each subgroup.
We measure the calibration error of any subgroup in terms of the mean Brier score (i.e., squared loss). The Brier score (Brier, 1950) is a strictly proper scoring rule that measures the accuracy of probabilistic predictions: given a prediction (a value in the interval [0, 1] representing the probability of acceptance) and the outcome (accept = 1, reject = 0), the Brier score equals the square of the difference between the prediction and the outcome. To get a sense of the value of the Brier score, if 25% of the papers are accepted and all respondents provide a prediction of 0.25, then the mean Brier score equals 0.1875; if all respondents provide a prediction of 0.8, then the mean Brier score equals 0.49. In our analysis, we had decided in advance to execute statistical tests comparing the calibration of male and female authors and of reviewers and meta-reviewers; we had decided not to compare the remaining subgroups due to the possibility of high heterogeneity among them. We provide the main details of our analysis in this subsection, and additional details in Appendix B.
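As a quick sanity check of these reference values, here is a minimal sketch of the mean Brier score computation (the function name `mean_brier_score` is ours):

```python
import numpy as np

def mean_brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities in [0, 1]
    and binary outcomes (accept = 1, reject = 0)."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predictions - outcomes) ** 2))

# Reproducing the values from the text: with a 25% acceptance rate,
# a uniform prediction of 0.25 scores 0.1875 and 0.8 scores 0.49.
outcomes = np.array([1.0] * 25 + [0.0] * 75)
print(mean_brier_score(np.full(100, 0.25), outcomes))  # 0.1875
print(mean_brier_score(np.full(100, 0.80), outcomes))  # 0.49
```

Note that always predicting the base rate (0.25 here) is not the minimum achievable score; a respondent who could perfectly foresee each decision would score 0.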


Gender
We compute the average calibration error for each gender subgroup, weighted to account for confounding by the other demographic factors of seniority and geographical region (see Appendix B for details). See Figure 2a for the average calibration error for the "male", "female" and "not reported" subgroups, where "not reported" comprises authors who did not provide gender information in their OpenReview profile. We do not report statistics for other gender subgroups, which are very small, in order to preserve respondent privacy.
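The exact weighting scheme is given in Appendix B and not reproduced here; as a hedged illustration, one standard approach is to post-stratify over the confounder cells so that every subgroup is evaluated on the same demographic mixture. The sketch below assumes that setup, and all names (`weighted_calibration_error`, `target_weights`) are ours.

```python
import numpy as np

def weighted_calibration_error(brier, strata, target_weights):
    """Stratum-weighted mean Brier score for one subgroup.

    brier:          numpy array of per-response Brier scores in the subgroup
    strata:         numpy array of seniority-x-region cell labels, same length
    target_weights: dict mapping cell label -> weight; using the same
                    weights for every subgroup compares them on a common
                    demographic mixture (post-stratification).
    """
    total, weight_sum = 0.0, 0.0
    for cell, w in target_weights.items():
        mask = strata == cell
        if mask.any():  # skip cells with no responses in this subgroup
            total += w * brier[mask].mean()
            weight_sum += w
    return total / weight_sum
```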
In many fields of science, there is research showing a confidence gap between female and male participants (Dahlbom et al., 2011; Bench et al., 2015), where men are generally found to overestimate and women to underestimate. At NeurIPS 2021, we tested for significance of the difference in calibration error between male and female authors, using as the test statistic the difference in mean calibration errors between female authors and male authors. We find that there is a statistically significant difference (p = 0.0012) at level 0.05. However, note that the effect size, the difference in the calibration errors between female authors (0.44) and male authors (0.40), is very small (0.04).
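The paper specifies the test statistic but defers the test's details to Appendix B; a permutation test on this statistic is one standard choice, and the sketch below assumes that choice (the name `permutation_test` is ours).

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(errors_a, errors_b, n_permutations=10_000):
    """Two-sided permutation p-value for the difference in mean
    calibration errors between two subgroups."""
    errors_a, errors_b = np.asarray(errors_a), np.asarray(errors_b)
    observed = errors_a.mean() - errors_b.mean()
    pooled = np.concatenate([errors_a, errors_b])
    n_a = len(errors_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling under the null
        diff = pooled[:n_a].mean() - pooled[n_a:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_permutations + 1)  # add-one smoothing
```

This simple sketch ignores the demographic weighting discussed above; the actual analysis weights the statistic as described in Appendix B.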

Seniority
We now investigate the role of seniority in authors' calibration of the probability of acceptance. As mentioned in Section 4, we consider three subgroups defined by the authors' reviewing role as a proxy for seniority, namely, authors invited to serve as meta-reviewers, authors invited to serve as reviewers, and the remaining authors. Figure 2b shows the average calibration error for these three subgroups, weighted to account for confounding by other demographics (see Appendix B for details). Further, we tested for significance of the difference in the average calibration error between the sets of meta-reviewers and reviewers. As in Section 5.2.1, we consider the difference in the mean calibration errors as the test statistic. The difference in calibration error between meta-reviewers (0.33) and reviewers (0.36) is 0.03, and the difference is not statistically significant (p = 0.055) at level 0.05. As mentioned earlier, we had a priori decided not to run any statistical tests on the "neither" group.

Prediction of acceptance vs. perceived scientific contribution
We investigate the consistency between the predictions by authors about the acceptance of their papers and the scientific contribution (paper quality) of those papers as perceived by the same authors. There were a total of 6,024 pairs of papers by the same author where the author provided responses to both questions for both papers. We break down the responses in Figure 3.
Of particular interest are the first two bars in Figure 3, which comprise responses where the same author provided a strict ranking of two papers they authored in terms of their perceived quality, and also gave distinct probabilities of acceptance for the two papers. We find that there is a significant amount of agreement: the two rankings agree in 92.6% of responses. However, there is a noticeable 7.4% of responses where the authors think that peer review is more likely to reject the better of their two papers.
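For concreteness, the pairwise agreement reported here (and in Figures 4 and 5 below) can be computed as in the following minimal sketch, written under our reading of the figures; the name `strict_pair_agreement` is ours. Both inputs are oriented so that larger means better, so rank lists (where rank 1 is best) should be negated before calling.

```python
from itertools import combinations

def strict_pair_agreement(scores_1, scores_2):
    """Fraction of paper pairs on which two assessments agree in strict order.

    scores_1, scores_2: dicts mapping paper id -> value from two sources
    (e.g., an author's predicted acceptance probabilities vs. the same
    author's negated quality ranks, or two co-authors' negated ranks).
    Pairs tied under either source are excluded, matching the "strict
    ranking" bars in Figures 3-5.
    """
    agree = disagree = 0
    for p, q in combinations(sorted(scores_1), 2):
        d1 = scores_1[p] - scores_1[q]
        d2 = scores_2[p] - scores_2[q]
        if d1 == 0 or d2 == 0:  # tie under either source: excluded
            continue
        if (d1 > 0) == (d2 > 0):
            agree += 1
        else:
            disagree += 1
    return agree / (agree + disagree)
```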

Agreements between co-authors, and between authors and peer-review decisions
We first look at author-provided rankings of their perception of the scientific contribution (paper quality) of multiple papers they authored. We compare these rankings with the outcomes (accept or reject) of the peer-review process. We show the results in Figure 4. In particular, observe that among the situations where the decisions for the two papers were different and the author-provided ranking was strict (first two bars of Figure 4), authors' rankings disagreed with the decisions 34% of the time. (An analysis comparing the ranking of papers by authors' predicted acceptance probabilities against the final decisions yields results very similar to those in Figure 4.)

Figure 4: Comparing authors' ranking of their perceived scientific contribution (paper quality) and the decisions from the peer-review process. This plot is based on 10,171 such responses. In particular, the first two bars enumerate the agreement and disagreement when the author-provided ranking is strict and one of the papers is accepted while the other is rejected.

We now compute agreements between co-authors in terms of their perceived scientific contribution (paper quality) of a pair of jointly authored papers. We show the results in Figure 5. Interestingly, among the pairs where both authors gave a strict ranking, they disagreed 32% of the time, approximately the same level of disagreement as we saw between authors and reviewers. This high level of disagreement between co-authors about the scientific contribution of their jointly authored papers has implications for research on peer review. Many models of peer review (Roos et al., 2012; Ge et al., 2013; Tomkins et al., 2017; MacKay et al., 2017; Wang and Shah, 2019; Ding et al., 2022; Heyard et al., 2022) assume the existence of some "true quality" of each paper. Our result raises questions about such an assumption: if there were such a true quality, then it is perhaps the authors who would know it best, at least in a relative sense, but as we saw above, authors do not seem to agree. In a recent work, Su (2021) proposes the novel idea of asking each author to submit a ranking of their submitted papers. Under the assumption that this author-reported ranking is a gold standard, Su (2021) then proposes to modify the review scores to align with the reported ranking. However, our observation that co-authors disagree substantially about this ranking violates the gold-standard assumption that underlies this proposal.

Figure 6: Change in authors' perceptions of their own papers after seeing the reviews. The five bars in each plot represent the five options: much more positive ("++"), slightly more positive ("+"), did not change ("0"), slightly more negative ("-"), much more negative ("--"). The three subfigures depict responses pertaining to all, accepted, and rejected papers, and are based on 4,435, 1,767, and 2,668 such responses respectively.

Change of perception
We now analyze the responses to the question posed to authors in the second phase of the experiment on whether the review process changed their perception of their own paper(s). We plot the results in Figure 6.
Given significant non-response bias in this phase with respect to acceptance decisions (Section 4), we also separately plot the responses pertaining to accepted and rejected papers.
We observe that among both accepted and rejected papers, about 50% of the responses indicated a change in the authors' perceived opinion about their own papers. Furthermore, even among rejected papers, over 30% of responses mention that the reviews made their perception more positive. While past studies (Frachtenberg and Koster, 2020; Weller, 1996; Mulligan et al., 2013; Patat et al., 2019) document whether the review process helps improve the paper, the results in Figure 6 show that it also results in a change of authors' perceptions of their own papers about half the time.

Limitations and discussion
We discuss some key limitations. The 26.7% response rate in Phase 1, and particularly the 12.3% response rate in Phase 2, introduces concerns about non-response bias, whereby non-respondents might have given different answers than respondents. We provide statistics pertaining to non-response bias in Section 4, and attempt to mitigate confounding with respect to the observables of demographics and paper outcomes (specifically, in Section 5.2 and Section 5.5). Importantly, however, the observables alone cannot capture all the ways in which data may be missing not at random, and this caveat ought to be kept in mind when interpreting our results. A second limitation of this study is that respondents may not have answered fully honestly. For example, if respondents believed that there was even a small chance their answers might leak to reviewers or co-authors, this would incentivize them to exaggerate the probability of their paper being accepted (an effect which would indeed be consistent with the pattern we observed). We took pains to mitigate this effect by assuring the authors of the privacy and security of their responses, and further, by asking them not to discuss their responses with others (see Appendix A).
These limitations notwithstanding, this study has several implications for improving the peer-review process. First, the fact that authors vastly overestimated the probability that their papers would be accepted suggests it would be useful for conference organizers and PhD supervisors to attempt to recalibrate expectations prior to each conference. This might mitigate disappointment from conference rejections.
The disagreements we document around paper quality, between co-authors as well as between authors and reviewers, suggest that, as previous work has also found, assessing paper quality is an extremely noisy process. A complementary study on the consistency of decisions made by independent committees of reviewers, also run at NeurIPS 2021, showed high levels of disagreement between reviewers (Beygelzimer et al., 2021). Specifically, 10% of submitted papers were assigned to two independent committees (reviewers, area chairs, and senior area chairs) for review, and for 23% of these papers the committees arrived at different acceptance decisions. While it may be tempting to attribute this disagreement solely to flaws in the peer-review process, if even co-authors, who know their own work as well as anyone, have significant disagreements about the ranking of their papers, then perhaps it is fundamentally hard or impossible to objectively rank papers.
The outcomes of paper submissions should thus be taken with a grain of salt, mindful of the inherent randomness and arbitrariness of the process and the arguable lack of a fully objective notion of paper quality. Realizing that the rejections which generally follow paper submissions do not necessarily result from a lack of merit, but often just from bad luck and subjectivity, would be both accurate and healthy for the academic community. More broadly, as a community, we may take these findings into account when deciding on our policies and perceptions pertaining to the peer-review process and its outcomes. We hope the results of our experiment encourage discussion and introspection in the community.
to provide their rebuttal (response to the initial reviews) any time until August 10. We invited authors to participate in the Phase 2 survey on August 12. The peer-review process concluded on September 28, 2021, with the announcement of final decisions.
Instructions. In both the Phase 1 and Phase 2 surveys, authors were provided information regarding the privacy and confidentiality of their survey responses. They were informed that during the review process, only the authors themselves could view their responses, in addition to the administrators of OpenReview.net (the conference-management platform used by NeurIPS 2021). It was emphasized that authors' responses could not affect the outcome of the review process and that the responses would not be visible to their co-authors, reviewers, area chairs, or senior area chairs at any point in time. Regarding the analyses and subsequent dissemination of the findings from the experiment, the survey mentioned: "After the review process, the survey responses will be made available to the NeurIPS 2021 program chairs and Workflow chairs for statistical analyses. Any information shared publicly will be anonymized and only reported in an aggregated manner that protects your identities." For the purposes of analysis, responses and profiles were accessed algorithmically via the OpenReview API. Further, authors were also told: "To allow authors to freely provide their opinions and keep samples as independent as possible, please do not discuss your answers to these survey questions with other NeurIPS 2021 authors (including your co-authors), or ask others about their responses."

In Phase 1 of the experiment, we asked authors with multiple submissions to rank their submissions. The instructions for providing the ranking were as follows: "Rank your papers in terms of your own perception of their scientific contributions to the NeurIPS community, if published in their current form. Rank 1 indicates the paper with the greatest scientific contribution; ties are allowed, but please use them sparingly. In the table entry for each submission below, there is a pull-down menu called "Paper Ranking." Please click on it and specify the rank for that submission." Finally, among the 6,237 authors with multiple submissions, 32 authors (0.5%) provided a ranking for only one of their submissions. We exclude these responses from the analysis of the ranking.

Figure 1: Authors' predictions of the probability of acceptance of their papers. (a) Authors' predicted chance of acceptance versus the actual acceptance rate for each response value. The diagonal line represents perfect calibration and the (blue) dots represent authors' responses. (b) Authors' predictions for papers that were eventually accepted (thin yellow line) and rejected (thick blue line). The x-axis represents the fraction of responses with predicted chance greater than or equal to the corresponding value on the y-axis.

Figure 2: Comparing authors' calibration error (Brier score) in prediction of acceptance across subgroups based on gender and seniority level. The error bars indicate 95% confidence intervals, obtained via bootstrapping.

Figure 3: Comparing authors' (relative) predicted acceptance probability and perceived paper quality for any pair of papers authored by them. This plot is based on 6,024 such responses. In particular, the first two bars enumerate the amount of agreement and disagreement, respectively, among responses of any author who provided a strict ranking between the two papers for both questions.

Figure 5: Comparing co-authors' rankings of their perceived scientific contribution (paper quality) of a pair of papers that both have authored. This plot is based on 1,357 such responses. In particular, the first two bars enumerate the agreement and disagreement of co-authors when they both provide strict rankings of their papers.