Cooperation between Referees and Authors Increases Peer Review Accuracy

Peer review is fundamentally a cooperative process between scientists in a community who agree to review each other's work in an unbiased fashion. Peer review is the foundation for decisions concerning publication in journals, awarding of grants, and academic promotion. Here we perform a laboratory study of open and closed peer review based on an online game. We show that when reviewer behavior was made public under open review, reviewers were rewarded for refereeing and formed significantly more cooperative interactions (13% increase in cooperation, P = 0.018). We also show that referees and authors who participated in cooperative interactions had an 11% higher reviewing accuracy rate (P = 0.016). Our results suggest that increasing cooperation in the peer review process can lead to a decreased risk of reviewing errors.


Introduction
Peer review is the foundation for decisions concerning publication in journals, awarding of grants, and academic promotion. However, the anonymous nature of peer review is increasingly under scrutiny [1-4], and some journals have considered or already moved to open peer review [3,5-7]. Debates about the utility and ethics of anonymity have led to questions concerning whether there is any science behind peer review [8], to calls for an evidence-based rationale for peer review [9], and to debates about alternative practices of peer review [2-4].
Despite its central role in the scientific process, the underlying social dynamics and accuracy of peer review under alternative systems are difficult to study. It is perhaps not surprising, then, that there are few reliable studies of peer review: conclusive randomized controlled studies require the cooperation and coordination of journals, editors, and authors within an academic community. It has been argued that many existing studies are inconclusive or suffer from methodological defects, primarily concerns about the robustness of author or reviewer blinding [10]. Moreover, these studies focus on review quality [11-16] rather than on the correctness or impact of the results, which can only be assessed retrospectively, after scientific consensus is achieved.
Here we develop a theoretical model for peer review which can be described in terms of payoffs for author and referee behavior. We analyze the theoretical model to determine the properties of optimal strategies under both open and closed peer review. We then develop a model system in the form of an online game launched from the Amazon EC2 cloud to collect data to both support our theoretical model and evaluate accuracy and social dynamics under peer review. Using our model system, we perform experiments to collect quantitative data about the social behavior of referees in anonymous (closed) and non-anonymous (open) peer review. These data represent the first direct quantitative measurements of peer review accuracy under alternative peer reviewing systems. Using these data we show that: (1) under open review peer reviewers are rewarded for refereeing, in contrast to closed review, (2) reviewers and authors are significantly more likely to cooperate under open review than under closed review, and (3) cooperative peer reviewing behavior leads to higher review accuracy.

Theoretical Model
Definition of the Peer Review Game. In our model there are K players participating in a game for a total of T units of time. Each player in the game participates in two activities: (1) solving problems and (2) reviewing solutions of their peers. For player k, the total time spent reviewing, T^r_k, and the total time spent solving, T^s_k, must be less than the total time allocated for playing the game: T^r_k + T^s_k <= T. Over the course of the game, player k submits N^s_k solutions and reviews N^r_k solutions for other players. Let s_{ikj} denote the ith solution for player k, which is reviewed by player j. For each solution there is a corresponding time the solution was submitted, t^s_{ikj}, and a time the reviewer completed the review, t^r_{ikj}. For player k the number of accepted papers at time t is the sum of the indicators that each of their submitted solutions is accepted up to that point:

A_k(t) = \sum_{\{i : t^r_{ikj} < t\}} 1(s_{ikj} accepted).

The payoff is proportional to the number of accepted solutions, which reflects the commonly held belief of ''publish or perish'' in academia. So the expected payoff for player k at time t is:

E[A_k(t)] = \sum_{\{i : t^r_{ikj} < t\}} p_{ikj},

where p_{ikj} is the probability that solution i for player k is accepted by player j. The payoff is a function of the number of submitted solutions and the probability that each solution is accepted. The probability a solution is accepted is a function of the submitter, the reviewer, the time the solution is reviewed, and the solution itself:

p_{ikj} = f(s_{ikj}, t^r_{ikj}, k, j),
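The payoff quantities above can be sketched in a few lines of code; the function names and the data layout are illustrative assumptions, not part of the model specification.

```python
# Hypothetical sketch of the payoff quantities defined above.
def accepted_count(reviews, t):
    """A_k(t): number of player k's solutions accepted before time t.
    `reviews` is an assumed list of (review_time, accepted) pairs."""
    return sum(1 for t_r, accepted in reviews if t_r < t and accepted)

def expected_payoff(acceptance_probs):
    """E[A_k(t)]: sum of the acceptance probabilities p_ikj of the
    solutions reviewed by time t."""
    return sum(acceptance_probs)

# One solution accepted before t = 15, one rejected, one reviewed later:
print(accepted_count([(5, True), (10, False), (20, True)], 15))  # 1
```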
where f(·) is a non-negative function mapping the solution, the review time, the solver, and the reviewer onto [0,1]. Player k can increase their payoff by increasing the number of solutions they submit or by increasing the probability each solution is accepted. An alternative is a competitive payoff, where the payoff function is proportional to the difference between a player's number of accepted solutions and the maximum over all other players' accepted solutions. In this case, the expected payoff is:

E[A_k(t) - \max_{l \neq k} A_l(t)].

Closed Peer Review (CPR). Under closed peer review, the model for the acceptance probability for solution s_{ikj} is:

p_{ikj} = a(s_{ikj}) + b(k) + c(j) + k(A(t^r_{ikj})).

Here there is an effect for the solution itself, a(·), which may reflect a large number of factors about the solution, including the type of problem or the time spent on the solution. There is also an effect for the solver, b(·), since some solvers are more likely to submit correct solutions than others. Each reviewer may choose to accept or reject problems at a different rate, which we model by c(·). Under CPR the public information is the number of solutions that each player has submitted and had accepted by another player. A(t^r_{ikj}) is a vector of the cumulative number of accepted solutions for each player at time t^r_{ikj}. The function k(·) quantifies the influence of this information on the probability solution s_{ikj} is accepted.
At any given time point a player can choose between three different strategies: (1) solve and submit a problem, (2) review a problem and reject, or (3) review a problem and accept. The first strategy has the potential to improve a player's payoff by increasing the number of submitted solutions. If a player chooses either of the first two strategies, no other player's score will increase. If the player chooses strategy (3), then another player's score will increase; however, that player will not know who accepted their solution. Under CPR, if a player chooses strategy (2) or (3) they reduce the amount of time they can spend solving problems and therefore reduce their expected payoff. However, no other player will be aware of this choice, since reviews are anonymous and only the cumulative number of accepted solutions for each player is known. In this game, there is no increase to the payoff function for reviewing. Therefore, each player maximizes their expected payoff by always choosing strategy (1) and never reviewing, so this strategy profile is the Nash equilibrium [17].
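The equilibrium argument can be illustrated with a toy calculation; the time costs and acceptance probability below are made-up numbers, not estimates from the game.

```python
# Under CPR, time spent reviewing yields no payoff, so the expected payoff
# is maximized by spending all time solving. All constants are assumptions.
SOLVE_TIME, TOTAL_TIME, P_ACCEPT = 4.0, 40.0, 0.5

def cpr_expected_payoff(time_reviewing):
    """Expected accepted submissions when time_reviewing units go to reviews."""
    n_solved = (TOTAL_TIME - time_reviewing) // SOLVE_TIME
    return n_solved * P_ACCEPT  # reviewing contributes nothing under CPR

# Any time diverted to reviewing strictly lowers the expected payoff:
print([cpr_expected_payoff(t) for t in (0.0, 8.0, 16.0)])  # [5.0, 4.0, 3.0]
```

Because the payoff is monotone decreasing in time spent reviewing, "always solve" dominates, which is the Nash equilibrium described above.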
Open Peer Review (OPR). Under OPR the model for the acceptance probability for solution s_{ikj} includes the same terms as CPR, along with terms that encode the influence of the current public and private information available to each player:

p_{ikj} = a(s_{ikj}) + b(k) + c(j) + k(A(t^r_{ikj})) + g(R(t^r_{ikj}), R^a(t^r_{ikj})) + j(R_{kj}(t^r_{ikj}), R^a_{kj}(t^r_{ikj})).
The model includes a term, g(·), that is a function of the vector of the cumulative number of solutions reviewed and accepted by each player. The functions k(·) and g(·) encode the public information available to each player. Under the open system, player k also knows the cumulative number of times player j has reviewed their solutions, R_{kj}(t^r_{ikj}), and accepted their solutions, R^a_{kj}(t^r_{ikj}), at the time of the review. The function j(·) quantifies the effect of this information on the probability of acceptance.
Under OPR it is possible that a player may incur some benefit by reviewing for other players. Specifically, if a player has previously accepted solutions for player j, they may improve the probability their own solution is accepted through the function j(·). Similarly, if they are a generous reviewer to all the other players, player j may again be more sympathetic and the probability of acceptance may be increased through the function g(·). The residual benefit of reviewing may carry over to future times, so the functions g(·) and j(·) are functions of the cumulative reviews and acceptances up to time point t^r_{ikj}. Under OPR, a player still has the same three strategy choices at any given time point: (1) solve and submit a problem, (2) review a problem and reject, or (3) review a problem and accept. However, under OPR a player may incur some increase in their probability of acceptance if they choose strategy (2) or (3), particularly strategy (3). Under this model, additional Nash equilibria may be possible. To calculate these equilibria, substantial additional assumptions are required about the benefit of reviewing, the time a review costs, and the timing of additional reviews. Since the payoff functions now depend continuously on the number of solutions accepted and reviewed at each time point, the game must be modeled as a continuous game. Theoretical analysis of OPR represents a potentially fruitful area for future research.
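As a concrete illustration, an additive version of the OPR acceptance probability might look like the following sketch. The functional form, weights, and clipping are assumptions for illustration, not the model fit to the experimental data.

```python
# Illustrative OPR acceptance probability: CPR-style effects plus the
# reciprocity roles played by g(.) and j(.). All weights are assumed.
def p_accept_opr(base, solver_effect, reviewer_effect,
                 n_reviews_by_solver, n_accepts_of_reviewer,
                 g_weight=0.01, j_weight=0.05):
    """Probability that reviewer j accepts a solution from player k."""
    p = base + solver_effect + reviewer_effect
    p += g_weight * n_reviews_by_solver    # generosity to the community, g(.)
    p += j_weight * n_accepts_of_reviewer  # direct reciprocity, j(.)
    return min(1.0, max(0.0, p))           # acceptance probability lies in [0, 1]
```

A player who has reviewed often and previously accepted this reviewer's solutions sees a higher acceptance probability, which is the benefit of strategies (2) and (3) described above.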
Relative Payoff of Reviewing and Solving. It is not difficult to argue that in science, the payoff for solving problems is significantly greater than the payoff of reviewing submissions. The only way to change this ordering is to decrease the payoff for solving problems or to increase the payoff for reviewing problems, or both. The former might be achieved in situations where the information available to the community causes the community to punish a player by reducing the acceptance rate of the player's submissions [13]. The latter might be achieved by increasing the time spent reviewing and rejecting the submissions of other players. An example would be if a player could somehow reject all the submissions of a strong competitor, without knowledge of these actions being provided to the community.

Experimental Results
Setup. Our model system for peer review was an online game launched from the Amazon EC2 cloud and played by 7-10 individuals over a fixed period. Players were graduate students, postdoctoral fellows, research scientists, or principal investigators, all of whom were members of a single research laboratory. The game was designed to replicate several components of editorial peer review: (1) most reviewers know the authors of the papers they referee, (2) peer review is usually performed within relatively small communities of individuals [18], and (3) peer review involves repeated interactions between referees and authors. The game's interface presented players with multiple-choice questions similar to those found on the Graduate Record Exam (GRE) [19]. At any point in the game a player chose between solving problems or reviewing (accepting or rejecting) solutions submitted by other players. The software also played the role of journal editor and randomly assigned submitted solutions to players for review. At the end of the game, the two players with the largest number of accepted submissions received monetary rewards, reflecting the conventional publish-or-perish academic incentives.
Individual games were played in either a closed mode or an open mode. In the closed mode, the reviewers were anonymous; in the open mode, reviewer identities and reviewing behavior were public.

Experimental results agree with the theoretical model and previous studies of peer review. To mimic the dynamics of a small community of scientists, we recruited individual research laboratories to play the Peer Review Game (Materials and Methods). We recruited members of six research laboratories at Johns Hopkins University to play the Peer Review Game in closed mode (3 labs, n = 8, 8, and 9 players) and open mode (3 labs, n = 7, 10, and 8 players). Each laboratory played the game for T = 40 minutes. We collected a total of 1,143 solutions and 666 reviews over the course of the six experiments. Overall, 62% of the submitted solutions were correct. Peer review did lead to an increase in accuracy: only 39% of rejected solutions were correct, while 78% of accepted solutions were correct. We first evaluated our experimental model by comparing our results to predictions of our theoretical model, previous results on iterated games, and previous studies of peer review.
In the open system each solution a player accepted led to an increased probability their own next submission would be accepted (2% increase per accepted solution, P = 0.047). Our theoretical analysis suggested a similar potential increase in probability for helpful reviewing behavior. Under closed review players were not rewarded for reviewing additional submissions; that is, there was no significant difference in the probability a player's submissions would be accepted for each additional review (0.8% decrease per accepted solution, P = 0.30). Under the open system one of the top two reviewers was always one of the winners of the game, suggesting that reviewers were rewarded for their good behavior toward other players (Materials and Methods). This result agrees with both our theoretical analysis and the results of previous studies of iterated games, which showed that costly punishment is negatively associated with payoff; in other words, ''winners don't punish'' [20].
Review times were not significantly different between open and closed review (2 seconds longer on average for closed games, P = 0.31), consistent with observations from randomized controlled trials [12]. However, in the closed games players spent a higher proportion of their time solving problems instead of reviewing (Figure 2). Finally, overall reviewing accuracy was statistically indistinguishable between open and closed peer review (1% more accurate under closed review, P = 0.762). This result agrees with previous studies of open and closed peer review, which showed no statistically significant difference in review quality between the two systems [14].
Open review leads to increased cooperation, which leads to increased review accuracy. An important question is whether making reviewing behavior public facilitates cooperation. For each experiment we calculated a pair-wise measure of cooperation between players (Materials and Methods). The open review experiments showed more cooperative connections than the closed experiments (22% versus 9%, respectively, P = 0.018, Figure 3). It was not immediately clear that cooperation between referees and authors would increase reviewing accuracy. Intuitively, one might expect that players who cooperate would always accept each other's solutions, regardless of whether they were correct. However, we observed that when a submitter and reviewer acted cooperatively, reviewing accuracy actually increased by 11% (P = 0.016). The difference in accuracy was significant even after adjusting for the fact that some solvers had higher accuracy than others (11% increase in accuracy, P = 0.039). The increase in reviewing accuracy was mediated by cooperative interactions between players, since overall accuracy was comparable under open and closed peer review (1% more accurate under closed review, P = 0.762).

Discussion
We have developed both a theoretical and experimental model for peer review. Our theoretical model allows exploration of the relative impact of alternative systems and incentives for peer review. A basic analysis of the theoretical model suggests that the current system of anonymous peer review discourages reviewing activities. Further exploration of the model under alternative systems and incentives may be helpful in evaluating alternative models of review going forward. Using our experimental model, we were able to collect the first data on social interactions and accuracy under alternative peer review models. Our experimental results both substantiate our theoretical model and agree with previous studies of peer review systems. We have also shown that one mechanism for increased cooperation is making reviewer information public. But other mechanisms for improving cooperation in the review process may exist; for example, reducing calls for unnecessary experiments has recently been suggested as a potential improvement in the reviewing process [21]. Our results indicate that improved cooperation does in fact lead to improved reviewing accuracy. These results suggest that in this era of increasing competition for publication and grants, cooperation is vital for accurate evaluation of scientific research.

The Peer Review Game
We developed a peer review game that can be played by two or more players. The game was developed as an Amazon Machine Image (AMI) that can be launched from the Amazon Elastic Compute Cloud [22], using the vWorker online development platform [23]. Players were directed to the website of a temporary web server and logged on with a user name and password. When the investigator initiated the game, the players were shown a task selection page (Figure 4). They could choose to solve a problem or to review a problem from their list of pending reviews. If a player chose to solve a problem, a GRE-like problem was selected from a database and displayed on their screen (Figure 5). The GRE problems used for the experiment were based on problems from the website [19]. If they chose to review a problem, they were shown a solution submitted by one of their peers (Figure 6). They could choose to either accept or reject the solution. The program acted as editor, randomly assigning problems to players for peer review.
In both the open and closed games reviewers were shown the identity of the player who solved the problem. Under the open system, solvers were also shown the identity of the player who acted as peer reviewer for their solution. Throughout the game, information was projected onto a screen at the front of the room. In the closed mode, the number of solutions each player had submitted and had accepted was displayed (Figure 7a). In the open mode, the number of solutions each player had reviewed and accepted for one of their peers was also displayed (Figure 7b).
At the beginning of each game, the players were read the instructions for the appropriate mode (closed or open) as described in the following sections. The investigator then initiated a session of the Peer Review Game that lasted for T = 40 minutes in each case. Nametags were given to each subject with their anonymous subject ID at the beginning of the experiment and players were permitted to speak to one another during the course of the experiment.

Recruitment
Six laboratories at the Johns Hopkins Medical School and Johns Hopkins Bloomberg School of Public Health were recruited to participate in the peer review experiment. Laboratories consisted of graduate students, postdoctoral fellows, research scientists, and principal investigators. Participating laboratories were offered $50 for each 10 participating members of the lab, a complimentary lunch, and the potential for two lab members to earn $5 each. Written informed consent was obtained from all participants in the study. Recruitment was performed with approval from the Johns Hopkins Bloomberg School of Public Health IRB, project number 3316.

Group dynamics measurement
Next we estimated a measure of cooperation or obstruction between subjects i and j. The baseline observed acceptance probability for subject i, P_i, was calculated as A_i / N_i, where A_i is the number of solutions accepted by subject i and N_i is the number of solutions reviewed by subject i. We computed the observed probability that subject i accepts a solution submitted by subject j, P_ij, as A_ij / N_ij, where A_ij is the number of solutions accepted by subject i which were submitted by subject j, and N_ij is the number of solutions reviewed by subject i which were submitted by subject j. The difference d_ij = P_ij - P_i gives a measure of the change in the probability subject i accepts a solution from subject j relative to their overall acceptance rate. Similarly, we can calculate d_ji as the symmetric measurement. If d_ij and d_ji are both positive, then the interaction between the two subjects is cooperative. Similarly, if both values are negative, the interaction between the two subjects is obstructive. We calculated the total number of possible interactions under both the open and closed peer review experiments. Among these, we identified the number that were cooperative. We then performed a two-sample test of proportions to evaluate whether there was more cooperation under OPR than CPR.
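The pairwise measure d_ij can be computed directly from review records. The tuple layout below is a hypothetical data format, not the one used in the study.

```python
# Sketch of the cooperation measure d_ij = P_ij - P_i defined above.
from collections import defaultdict

def deltas(reviews):
    """Return {(i, j): P_ij - P_i} from (reviewer, solver, accepted) tuples."""
    acc, tot = defaultdict(int), defaultdict(int)
    acc_pair, tot_pair = defaultdict(int), defaultdict(int)
    for i, j, accepted in reviews:
        tot[i] += 1
        acc[i] += accepted
        tot_pair[(i, j)] += 1
        acc_pair[(i, j)] += accepted
    return {(i, j): acc_pair[(i, j)] / tot_pair[(i, j)] - acc[i] / tot[i]
            for (i, j) in tot_pair}

def cooperative(d, i, j):
    """An interaction is cooperative when both directed deltas are positive."""
    return (i, j) in d and (j, i) in d and d[(i, j)] > 0 and d[(j, i)] > 0

# A and B accept each other more often than their baselines: a cooperative pair.
reviews = [("A", "B", 1), ("A", "B", 1), ("A", "C", 0),
           ("B", "A", 1), ("B", "C", 0)]
d = deltas(reviews)
print(cooperative(d, "A", "B"), cooperative(d, "A", "C"))  # True False
```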

Outcome modeling
In all outcome modeling, the unit of observation is one reviewed problem. Each reviewed problem has a solver and a reviewer and is associated with a particular study type, either open or closed.
To control for differences in behavior between individual participants, the models described below were fit using a mixed-model framework, with all models including separate random effects for solvers and reviewers. Model fitting was done in the statistical programming language R [24] using the function glmer from the package lme4, with a linear link and assuming a Gaussian distribution for the random effects [25]. In a general form the random effects model can be written as:

y_ijt = m + \sum_k \beta_k x_ijtk + u_i + v_j + e_ijt,

where y_ijt is the outcome of interest related to a review at time t by subject j of a solution submitted by subject i; m is the mean outcome over the whole data set; x_ijtk is the kth covariate of interest, which has effect size \beta_k; u_i is a random effect associated with subject i; and v_j is a random effect associated with subject j. We assume that u_i, v_j, and e_ijt are mean-zero Normal random variables with variances \sigma^2_u, \sigma^2_v, and \sigma^2_e, respectively. To assess the impact of previous review performance by a subject on the chance that solutions submitted by that subject will be accepted, we associated to each reviewed problem the number of solutions accepted by the problem submitter, up to the time the problem was reviewed. In the open framework, this value was known to all study participants, including the reviewer; in the closed framework, this value was unknown.
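The general random-effects model can be mimicked with a toy data-generating simulation; the variance components and effect sizes below are assumed values for illustration, and this sketch is not the lme4 fit used in the paper.

```python
# Toy simulation of y_ijt = m + sum_k beta_k * x_ijtk + u_i + v_j + e_ijt.
import random

random.seed(0)
m, beta = 0.5, [0.02]                            # assumed fixed effects
u = {i: random.gauss(0, 0.1) for i in range(8)}  # solver random effects u_i
v = {j: random.gauss(0, 0.1) for j in range(8)}  # reviewer random effects v_j

def simulate_outcome(i, j, x):
    """One draw of the outcome for solver i, reviewer j, and covariates x."""
    e_ijt = random.gauss(0, 0.05)                # residual variation e_ijt
    return m + sum(b * xk for b, xk in zip(beta, x)) + u[i] + v[j] + e_ijt
```

Fitting such a model in R with lme4, as in the paper, would use a formula along the lines of y ~ x + (1 | solver) + (1 | reviewer), which specifies crossed random intercepts for solvers and reviewers.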
Modeling the acceptance probability of a submission as a function of this covariate, the study type, and their interaction, we assessed the change in acceptance probability for each solution accepted by the submitter, in either the open or closed review setting. Define a_ijt to be the indicator that solution s_ijt is accepted. The model is then:

a_ijt = m + \beta_1 R^a_it + \beta_2 S_ij + \beta_3 R^a_it S_ij + u_i + v_j + e_ijt,

where R^a_it is the number of solutions reviewed and accepted by subject i by time t, and S_ij is an indicator of the study type that subjects i and j participated in (taking a value of 0 for closed review and 1 for open review). In this model u_i is a random effect representing the solver, v_j is a random effect representing the reviewer, and e_ijt represents residual variation not due to reviewer or solver effects.
To assess the impact of the open or closed scenarios on review quality, we associated to each reviewed problem an indicator of whether the review was accurate, given that we know the correctness of the submitted solution.
We defined the variable c_ijt to be an indicator of whether solution s_ijt was correctly reviewed (i.e., accepted if correct, rejected if incorrect). To assess the impact of cooperation on review accuracy, for each reviewed problem we defined a 0-1 indicator O_ijt which takes a value of 1 if subjects i and j have a cooperative interaction. We then fit the model:

c_ijt = m + \beta_1 O_ijt + u_i + v_j + e_ijt,

where all terms are as defined above.
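The accuracy indicator c_ijt is simple to compute; this helper is a hypothetical sketch of that definition.

```python
# c_ijt: indicator that a review matched the truth about the solution.
def review_correct(solution_correct, accepted):
    """1 if the review was accurate (accept a correct solution or reject
    an incorrect one), 0 otherwise."""
    return int(solution_correct == accepted)

# Accepting an incorrect solution is an inaccurate review:
print(review_correct(False, True))  # 0
```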
To ensure the effect observed in this model is not due only to the increased accuracy of the solutions submitted by the problem solver, for each reviewed problem we defined a three-level factor, with level 0 indicating that neither the solver nor the reviewer was part of a cooperative pair, 1 indicating that only the solver was part of a cooperative pair, and 2 indicating that both the solver and the reviewer were part of a cooperative pair. Calling this variable Q_ijt, we then fit the model:

c_ijt = m + \beta_1 1(Q_ijt = 1) + \beta_2 1(Q_ijt = 2) + u_i + v_j + e_ijt,

where all terms are as defined above.
We also modeled accuracy as a function of study type alone to determine whether one scenario produced more accurate reviews. We fit the model:

c_ijt = m + \beta_1 S_ij + u_i + v_j + e_ijt,

where all terms are as defined above.

Instructions for the Closed Peer Review Games
Purpose of research project. This research is being done to evaluate open and closed peer review systems experimentally. Peer review is the process by which scientific research is evaluated for publication in journals. The goal of this study is to determine whether anonymous (closed) or non-anonymous (open) peer review results in more correct research being accepted.
Why you are being asked to participate. You are being asked to participate in the study because you are a graduate student, postdoctoral research fellow, scientist, or faculty member at Johns Hopkins University and are representative of the population of individuals who will participate in the peer review process.
Procedures. Once the experiment begins, you will be asked to answer multiple choice questions similar to questions on the graduate record exam (GRE). After you submit your answer, the solution will be randomly assigned to another participant in the study for review. The reviewer can either choose to accept or reject the solution. The reviewer will know your subject ID. However subjects who submit solutions will not know the ID of the reviewer of their solution. Throughout the course of the experiment you will act as both a reviewer and a problem solver. You may spend as much time as you like on either task. The experiment will last for forty minutes. I will now show you example screens from the experiment website and you may ask questions about the study procedure.
Risks/discomforts. You may experience some stress since you will be asked to answer GRE-like problems and review the solutions of your peers. However, the only interaction you will have with other participants will be through the anonymous subject IDs.
Payment. The two individuals with the most accepted answers at the conclusion of the experiment will receive $5. The payment will be in cash immediately following the experiment. If you leave the study early you will lose your opportunity to win the cash prizes distributed at the end of the experiment.
Protecting data confidentiality. All research projects carry some risk that information about you may become known to people outside of a study. We minimize these risks by not connecting your responses to any information that could be used to identify you. All data collected during this experiment will only be connected with the anonymous subject ID you have been assigned.
Protecting subject privacy during data collection. Your responses and reviews will not be personally associated with you. All interaction will be performed based on the anonymous subject IDs you have been assigned.
What happens if you leave the study early? You may leave the study at any time without penalty.

Instructions for the Open Peer Review Games
Purpose of research project. This research is being done to test open and closed peer review systems experimentally. Peer review is the process by which scientific research is evaluated for publication in journals. The goal of this study is to determine whether anonymous (closed) or non-anonymous (open) peer review results in more correct research being accepted.
Why you are being asked to participate. You are being asked to participate in the study because you are a graduate student, postdoctoral research fellow, scientist, or faculty member at Johns Hopkins University and are representative of the population of individuals who will participate in the peer review process.
Procedures. Once the experiment begins, you will be asked to answer multiple choice questions similar to questions on the graduate record exam (GRE). After you submit your answer, the solution will be randomly assigned to another participant in the study for review. The reviewer can either choose to accept or reject the solution. The reviewer will know your subject ID and you will know the reviewer ID for each solution after it is reviewed. Throughout the course of the experiment you will act as both a reviewer and a problem solver. You may spend as much time as you like on either task. The experiment will last for forty minutes. I will now show you example screens from the experiment website and you may ask questions about the study procedure.
Risks/discomforts. You may experience some stress since you will be asked to answer GRE-like problems and review the solutions of your peers. However, the only interaction you will have with other participants will be through the anonymous subject IDs.
Payment. The two individuals with the most accepted answers at the conclusion of the experiment will receive $5. The payment will be in cash immediately following the experiment. If you leave the study early you will lose your opportunity to win the cash prizes distributed at the end of the experiment.
Protecting data confidentiality. All research projects carry some risk that information about you may become known to people outside of a study. We minimize these risks by not connecting your responses to any information that could be used to identify you. All data collected during this experiment will only be connected with the anonymous subject ID you have been assigned.
Protecting subject privacy during data collection. Your responses and reviews will not be personally associated with you. All interaction will be performed based on the anonymous subject IDs you have been assigned.
What happens if you leave the study early? You may leave the study at any time without penalty.

Reproducible Research
To conform with the standards of reproducible research, R [24] scripts and R data objects that reproduce all analyses performed in this paper have been posted at: http://www.biostat.jhsph.edu/~jleek/peerreview/

Informed Consent
Written informed consent was obtained from all participants in this study. This specific study was approved by the Johns Hopkins Bloomberg School of Public Health IRB with project number 3316.