Conceived and designed the experiments: PW. Performed the experiments: PW. Analyzed the data: PW IEH. Contributed reagents/materials/analysis tools: PW IEH. Wrote the paper: PW IEH. Designed the software used in analysis: PW IEH.
The authors have declared that no competing interests exist.
The literature is not univocal about the effects of Peer Review (PR) within the context of constructivist learning. Due to the predominant focus on using PR as an assessment tool, rather than a constructivist learning activity, and because most studies implicitly assume that the benefits of PR are limited to the reviewee, little is known about the effects upon students who are required to review their peers. Much of the theoretical debate in the literature is focused on explaining
The purpose of the study is to investigate whether writing PR feedback causes students to benefit in terms of: perceived utility of statistics, actual use of statistics, better understanding of statistical concepts and associated methods, changed attitudes towards market risks, and the outcomes of the decisions they made.
We conducted a randomized experiment, assigning students at random to PR or non–PR treatments, and used two cohorts with different time spans. The paper discusses the experimental design and all the software components used to support the learning process: Reproducible Computing technology, which allows students to reproduce or re–use statistical results from peers, Collaborative PR, and an AI–enhanced Stock Market Engine.
The results establish that the writing of PR feedback messages causes students to experience benefits in terms of Behavior, Non–Rote Learning, and Attitudes, provided the sequence of PR activities is maintained for a sufficiently long period.
Due to the rapid advance in computer technology, Peer Review (PR) has become an important practice in higher education in a wide variety of fields and educational settings
Some educators and educational researchers perceive PR as a formative assessment and grading tool rather than a collaborative learning activity
However, and even if it is primarily viewed as a formative incentive, PR practices may restrict a learner's freedom to experiment, to be creative and to collaborate in the joint construction of knowledge and the negotiation of alternatives through debate and argumentation
In contrast to PR as a formative grading tool, little is known about the effects upon students who are required to review their peers because almost all empirical PR studies focus on the effect on the receiver of the feedback, i.e. the reviewee
Most importantly, and
Nevertheless, in the literature there seems to be a theory-driven belief that PR activities stimulate constructivist learning — or in other words
Even though the process of PR may seem to play an important role, as a formative assessment tool or as a constructivist learning activity, we cannot neglect the fact that there are only a few studies in which the effects on learning outcomes are actually tested
Fortunately, the availability of various e–learning tools that we developed
The concept of peer review-based learning in university-level statistics education is largely uncharted. This may seem strange because the need to critically review statistical papers has never been disputed
The implementation of PR in educational practice through online technology has been advocated and studied by several educational researchers
We investigated the causal effects of Reproducible Computing (RC) technology
We investigate the effects of PR on perceived utility, learning outcomes (true understanding of the underlying statistical concepts), attitudes towards trading, and actual trading activities. The rationale is that measuring students' understanding of statistical concepts alone is insufficient to describe the potential effects of competing learning approaches: changes in actual behavior and attitudes (such as risk aversion) may be equally important, if not more so.
With the exception of perceived utility, all effects are measured by means of objective and accurate observations. This is possible through the use of innovative RC technology and the XSE which have been seamlessly integrated into the learning environment used in this experimental research.
In line with our earlier research which has been focused on reproducibility of statistical computing
The experiment was conducted with several ethical considerations in mind, which are briefly listed here:
There was informed consent from the students. All students in this study had the opportunity to indicate whether or not they wanted to participate in the experiment. This was achieved through a selection menu within the VLE (the choices were stored electronically and could not be forged because the students were required to log on to the VLE). During the lectures, students received detailed information about the experiment. If they chose not to participate, they were required to work on an offline assignment about an article covering roughly the same topics as those introduced in the experiment
All data were anonymized by replacing student names with unique, non-informative numbers.
The collected data did not contain any sensitive information.
The results of the experimental measurements were not used to grade students. Rather, students were graded on their active participation in either the experiment or the alternative (offline) assignment.
The experimental treatments under investigation were in no way related to the core statistics curriculum and did not influence student performance at the final examination. In other words, the treatments in the stock market game neither advantaged nor disadvantaged any student's performance in the statistics course.
In most situations, an official approval by an Institutional Review Board (or Ethical Committee) is not required for educational research, as is exemplified by the exemption of “
The stock market game is part of our extracurricular offerings. This means that permission to organize the game was granted too.
The experiment was embedded in a compulsory undergraduate statistics course for business students in Belgium. The emphasis of the course was on constructivist learning, based on more than 70 different statistical techniques which cover the following topics: explorative data analysis, hypothesis testing, multiple linear regression, univariate time series analysis, and nonparametric statistics. We used a statistics handbook which was translated from English to Dutch and covers most of the topics of the course
For each technique, students had one or more web-based software modules available within the R Framework, which was developed at the University of Leuven
The software is freely available online at
In order to implement this course within a setting of constructivism for a large student population, we introduced a strict assignment–review mechanism. This is illustrated in
The semester ended with a final examination consisting of a series of objective multiple-choice questions which referred to a large document containing raw computational output (charts and tables about several data series). The examination was intended to test understanding of statistical concepts rather than rote memorization. More precisely, the exam was designed to test whether students were able to:
identify the computational output that was relevant to the question
interpret the output in terms of the question
critically investigate if the underlying assumptions of analyses were satisfied
The main sections of the statistics course were built around a series of research-based workshops (labeled WS1, WS2, …) that required students to reflect and communicate about a variety of statistical problems at various levels of difficulty. The problems were carefully designed and tested over a long period. Each workshop contained questions about common datasets and questions about individual data series provided to students — this dual structure promoted both collaboration between students and individual work. The top (blue) puzzle pieces in
Each week there were two (compulsory) lectures which are labeled L1, L2, etc. With the exception of the first and last week, each lecture consisted of the following two parts:
one or several illustrated solutions of the previous week's workshop assignment, based on good and bad examples of archived computations generated by students and the educator
an introduction to next week's assignment including a reading list and an illustration
During each week, students were required to work on their workshop assignment and — at the same time — write peer reviews (labeled Rev1, Rev2, …) about (an average of) six assignments that were submitted by peers. Each review was based on a rubric of a minimum of three criteria and required students to submit a workshop score and an extended feedback message for each criterion. In
The PR process was supported by newly developed software based on a so-called content-based design of the Virtual Learning Environment (VLE), which can be shown to be more efficient than traditional PR implementations
As one may have noted, this feedback-oriented process is similar to the peer review procedure for an article submitted to a scientific journal. The process of peer review is an important aspect of scientific endeavor, and may help us achieve learning goals with respect to attitudes (through peer review experiences) and skills (through construction of knowledge). The key idea behind this constructivist application is that students are empowered to interact with reproducible computations from peers and the educator. Students are required to play the role of active scientists who investigate problems, present solutions, and review the work of peers. Access to web-based
The actual experiment was conducted in parallel to the regular course activities as is illustrated in
had sufficient background knowledge of statistical concepts
had already experienced several rounds of peer review
were able to use the statistical software and blogging features.
Rather than using regular statistical topics as the subject of experimental study, we opted to use the annual Stock Market Game (SMG), based on the XSE software, as a vehicle to measure learning outcomes. The SMG has a long tradition at several Business Schools in Belgium and the underlying XSE software is stable and thoroughly tested because it was originally developed for EURONEXT and the European Commission, for the purpose of training and research. The participants in the experiment were required to learn about a series of new statistical techniques that can be used to analyze stock market time series and make informed decisions about the investment strategy that is employed.
For instance, one of the assignments that we introduced (XA1, XA2, etc. in
The XSE software allows students to interact with the R Framework in real time. This implies that participants are able to send the stock market time series to any web-based R module for analysis
In order to drive trading on the stock market, the game administrator (or educator) is able to influence the news messages that are sent to the traders. If the administrator sends good news about a company into the trading room then there is a good chance that some participants launch orders to buy the shares of that company. In any case, the computer trader, if activated, will respond to the news messages and change its limit prices according to a large number of heuristic rules which are based on our analysis of actual (typical) market reactions that can be associated with similar news messages. The consequence of this mechanism is that the stock prices will fluctuate according to what “normally” happens on the real stock market.
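The news-driven behavior of the computer trader described above can be sketched roughly as follows. This is a hypothetical Python illustration: the function name, sentiment encoding, and impact magnitude are our own assumptions, not the actual XSE heuristics, which comprise a large number of rules derived from observed market reactions.

```python
# Hypothetical sketch of a news-driven limit-price heuristic, loosely
# modeled on the behavior described above. Names and magnitudes are
# illustrative assumptions, not the actual XSE implementation.

def adjust_limit_prices(bid, ask, sentiment, impact=0.02):
    """Shift the computer trader's bid/ask limit prices toward the news.

    sentiment: +1 for good news, -1 for bad news, 0 for neutral news.
    impact: fractional price shift per unit of sentiment.
    """
    shift = 1.0 + impact * sentiment
    return bid * shift, ask * shift

# Good news about a company nudges the computer trader's quotes upward,
# so buy orders from participants tend to execute at rising prices.
new_bid, new_ask = adjust_limit_prices(9.90, 10.10, sentiment=+1)
```

Under such a rule, prices drift in the direction of the administered news, which is what makes the stock prices fluctuate according to what "normally" happens on a real market.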
In principle, the administrator is able to steer the market through the manipulation of corporate or general news messages. However, the wealth of the computer trader is limited and can be changed by the administrator. This implies that the influence of orders made by human participants may become much stronger than the impact of the AI-enabled computer trader. In other words, if participants behave irrationally, the stock market prices will show statistical properties that deviate from what could normally be expected
The SMG was used to obtain objective measurements of students' ability to apply newly acquired statistical knowledge to solve new and challenging problems. Before the actual measurement was made, participants only knew that they would be required to design a profitable financial investment and implement it through trading activities on the stock market during a period of a few hours. We kept the measurement window relatively short to ensure that participants had to work under time pressure, with little opportunity to communicate or collaborate with each other.
During the first weeks of the semester, we introduced the basic concepts of the experiment and explained the rules of engagement described in the Ethical Considerations subsection. In the statistics course, 314 students completed the final examination. From this group, we excluded (or had no data for) students who:
did not want to participate (and chose to do the alternative, offline assignment)
were not able to complete the entire experiment (due to illness, etc.)
dropped out or wished to discontinue the experiment
did not complete the experimental trading activities within the specified deadline
had prior knowledge of the statistics course or the stock market game (e.g., students who had to retake the course, or who had played the SMG before)
As a result, we had valid data from a total of 154 students for statistical analysis.
We announced the date and exact time during which the experimental investment strategy would have to be designed and implemented. The actual description of the challenge, however, was unknown to the students and only revealed at the start of the measurement period. Moreover, students did not know beforehand what the market circumstances would be like during the measurement period. In the tutorials (Tut1, Tut2, etc. in
At the beginning of the measurement period, one analyzes the stock market time series and creates three piles, conveniently called Long, Short, and Neutral.
We put all the stocks for which we predict an increase onto the Long pile. The stocks which are predicted to decline belong to the Short pile. All remaining stocks are in the Neutral pile.
Once all stocks have been placed in the appropriate piles, we buy the shares in the Long pile and sell the ones in the Short pile. Note: on the stock market it is possible to sell shares that one does not already own. In essence, one “borrows” the shares from a third party (the broker) and sells them, hoping that prices will fall. At some time in the future, the short seller must buy back the borrowed shares (even if the share price has increased). For obvious reasons, short selling is subject to several limitations. Obviously, the stocks from the Neutral pile are not held in the portfolio.
We hold the Long and Short position until the end of the measurement period. After that we evaluate the profits (or losses) for the investment portfolio.
The above investment strategy is referred to as a “market-neutral arbitrage strategy” (MNAS), which is often used by hedge funds and may be supported by statistical analysis. In theory, the MNAS works in “bullish” (rising) and “bearish” (declining) markets, as long as one is able to correctly pick the stocks that go into the Long, Short, and Neutral piles. Within the context of our experiment, students had complete freedom to choose how they would make their investment decisions. Since they did not know that our main interest was in the application of statistical techniques (the task was presented as a trading game), there was no obligation to use any statistical analysis; indeed, some students made their decisions based on economic intuition rather than empirical evidence.
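The pile-building step of the MNAS can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own (the prediction threshold, function name, and ticker symbols are hypothetical); it is not part of the XSE software, and students were free to form predictions however they wished.

```python
# Minimal sketch of the market-neutral arbitrage strategy (MNAS)
# described above. Threshold and ticker names are illustrative.

def build_piles(predicted_returns, threshold=0.01):
    """Classify stocks into Long, Short, and Neutral piles.

    predicted_returns: dict mapping ticker -> predicted fractional return.
    Stocks predicted to rise beyond the threshold go Long (buy),
    those predicted to fall beyond it go Short (sell borrowed shares),
    and the rest stay Neutral (not held in the portfolio).
    """
    piles = {"Long": [], "Short": [], "Neutral": []}
    for ticker, r in predicted_returns.items():
        if r > threshold:
            piles["Long"].append(ticker)
        elif r < -threshold:
            piles["Short"].append(ticker)
        else:
            piles["Neutral"].append(ticker)
    return piles

piles = build_piles({"AAA": 0.05, "BBB": -0.03, "CCC": 0.002})
# AAA is bought, BBB is sold short, CCC is left out of the portfolio.
```

The Long and Short positions are then held until the end of the measurement period, after which the portfolio's profit or loss is evaluated.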
Based on the findings in usability and technology acceptance research, we may expect that our technologydriven approach to constructivist education is affected by several aspects that pertain to students' attitudes and emotional experiences. The study of
statistical anxiety (which is a multidimensional concept)
statistical software selfefficacy
computer attitude
perceived ease of use
The perceived usefulness construct, in turn, affects behavioral intentions to use the software in the future. Other studies, such as
One of the subscales of particular importance for this study is “
In an attempt to make Statistics more attractive to students, we have tried to implement a more practical approach than what is done in a traditional or typical statistics course. The constructivist approach to statistics education may seem promising in this regard because it encourages students to experiment, communicate, and experience statistical problems in a more natural or practical environment. It is therefore interesting to investigate whether it is possible to gain students' interest in the subject (of Statistics) through constructivist learning activities such as Peer Review — after all, Statistical Analysis may well be seen as an acquired taste.
Due to its academic and practical importance, we formulate the following null and alternative hypotheses about utility:
The Utility Hypothesis implies that utility or usefulness affects the intention to use statistical software at some undefined time in the future, for the purpose of solving some undefined problem. In other words, this hypothesis implicitly assumes a long-term effect for general purposes.
In contrast to this, the study of
Even if PR does not improve the perceived utility of statistics, it might still have an impact on actual use for the purpose of solving particular problems. In addition, it should be noted that the actual (short-run) behavior of students can, unlike perceived utility, be objectively measured, because all statistical computations are performed within the R Framework, which maintains detailed historical records of computing activity. If constructivism, by means of PR, is claimed to be beneficial, it should lead to changes in actual behavior in the short run, even if the problem occurs outside of the regular course (i.e. the SMG).
In line with current tradition in educational research, the pedagogical paradigm of constructivism is believed to support non-rote learning
For this reason we treat the Non-Rote Learning Hypothesis as the most important hypothesis in this study. Even if constructivism (by means of PR) cannot affect behavior or perceived utility, we at least hope to find evidence that it helps students understand statistical concepts well enough to solve particular problems with the correct type of analysis (for which the underlying assumptions are satisfied).
We specify the Non-Rote Learning Hypothesis as follows:
The literature review of
Learning outcomes in academic education are not only expressed in terms of skills (as described in the NonRote Learning Hypothesis) but also relate to attitudes. In the curricular definition of our academic courses it is often specified what type of attitudes should be changed or improved. In daily practice however, one rarely sees any evidence that a course truly affects student attitudes, let alone that attitudes would be estimated through the use of surveys or based on objective measurements. In our experiment, we had the opportunity to investigate this matter based on objective measurements of trading actions which are closely related to students' attitudes.
Within the XSE software, students could submit orders to buy and sell shares according to the rules of the EURONEXT exchange. One of those rules specifies that traders have the option to submit Market Orders (MO) or Limit Orders (LO). A MO is simply a request to buy or sell shares in a certain quantity. The price at which the trade should take place is not specified by the submitter of the MO. Therefore, the exchange will search for the “best” counter party that is currently available. The price at which the trade is executed is simply the “best bid” (highest bid price) or “best ask” (lowest ask price) of all available counter parties. In contrast, an LO allows the trader to specify a quantity and a limit price. For instance, if the trader wishes to buy shares at a limit price of EUR 10 per share, then the order will only be executed if there is a counter party with a MO or a LO which specifies a selling limit price that is not higher than EUR 10.
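The difference between MOs and LOs described above can be illustrated with a toy matching rule. The following Python sketch is a simplified, assumption-based illustration of the EURONEXT-style rules (the function name and order-book layout are our own); it is not actual exchange or XSE code.

```python
# Toy illustration of the order rules described above: a market order
# (MO) executes against the best available counter party, while a limit
# order (LO) only executes within its limit price.

def best_execution(side, order_book, limit=None):
    """Return the execution price for one share, or None if no match.

    side: "buy" or "sell"; order_book: {"asks": [...], "bids": [...]}.
    A MO (limit=None) takes the best ask/bid; a buy LO executes only
    if the best ask does not exceed the limit (symmetrically for sells).
    """
    quotes = (sorted(order_book["asks"]) if side == "buy"
              else sorted(order_book["bids"], reverse=True))
    if not quotes:
        return None                       # no counter party at all
    best = quotes[0]
    if limit is None:                     # MO: take the best quote
        return best
    if side == "buy" and best <= limit:   # buy LO: pay at most `limit`
        return best
    if side == "sell" and best >= limit:  # sell LO: get at least `limit`
        return best
    return None                           # limit not met: order rests

book = {"asks": [10.20, 10.05], "bids": [9.95, 9.80]}
best_execution("buy", book)               # MO buys at the best ask, 10.05
best_execution("buy", book, limit=10.00)  # LO at 10.00 does not execute
```

The sketch makes the attitude measurement concrete: an MO always executes, even at an extreme counter price, whereas an LO protects the trader when liquidity is low.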
In the experimental tutorials it is clearly explained how the order system of the stock exchange works and how this is related to what is commonly called “market liquidity” (i.e. the property that ensures that shares can be sold or bought quickly and without large price changes). As explained before, the XSE is not a simple simulation of stock prices — it
many human, and relatively inexperienced, participants would enter the market at roughly the same time
if all human traders make the same decisions there will be no counter party available (the counter offer from the computer trader would soon be completely executed which leads to a situation where the best counter offer is made by another human participant and which may well have an extreme limit price)
some (smart) participants submitted buy and sell LOs at extreme prices, knowing that in times of stress, many traders would simply submit MOs. These participants were effectively hoping for chaos, because chaos would allow them to make large profits.
Students did not know before or during the experiment how large the impact of the computer trader would be. They also did not know that their choice of order (MO or LO) was of particular interest in our experiment. In other words, there was no indication or information about the importance of MOs versus LOs that could have affected the outcome of the experiment. In addition, it is important to understand that students learned not only about statistical techniques, but also about the statistical properties of the stock market and how these affect traders. Only those students who fully understood the mechanism of the stock market and its statistical properties had the opportunity to learn or acquire the attitude that trading during the MNAS implementation period would be potentially dangerous.
The attitude hypothesis is formulated as:
We defined the statement “to use Limit Orders more often” according to the ratio
Whenever
During the preparations of the experiment we did not know whether our intended illiquidity would work or not. In other words, we were uncertain whether the fluctuations on the market would be most strongly affected by the students or the computer trader. Based on the AI rules in the computer trader software, we knew that under normal circumstances (i.e. the situation where students would not have a dominant impact on prices) certain stocks would rise and others would fall. As a consequence, the outcome (in terms of profit) of the MNAS investment strategy was known under the condition that students' impact on prices would not be dominant. It is therefore interesting to investigate whether the PR treatment would cause students to achieve higher profits or not.
The outcome hypothesis is as follows:
On the other hand, if students would dominate the price fluctuations on the market, the outcome of any rational investment strategy would be highly uncertain and contaminated by irrational behavior from inexperienced participants, as is predicted by
The treatment under investigation is PR or more precisely, the submission of PR feedback messages to other students. As is explained in the empirical analysis of
It is for this reason that we embedded the same feedback mechanism in the experiment as was used in the regular course. There was only one crucial difference: the students in the randomly selected control group did not participate in PR but received ordinary feedback from the educator. Additionally, the control group students were required to correct mistakes from the previous workshop, which was to be submitted together with the next one. In other words, the control group followed an ordinary cycle of feedback as is encountered in many courses. Other than that there was no difference between the control and treatment groups. The assignments were identical and all students were assigned completely at random, which implies that measured differences (the so–called effects) can be interpreted in terms of causality.
Based on personal experiences and (unpublished) preliminary research, we believe that PR is only beneficial when it is applied frequently and over a longer period of time. This hypothesis is in line with our conclusions from focus group discussions in which students reported that PR is a “new learning method” which requires time to get used to. Our estimate was that a consecutive series of (at least) three rounds of PR would be necessary to obtain a beneficial effect. For this reason, we decided to conduct the experiment for two different cohorts: one with 2 full rounds of PR about large assignments, and one with 4 full rounds of PR about medium-sized assignments. It is our expectation that the treatment effect of PR would work at least as well, if not better, in the 4–round group as compared to the 2–round group. As a consequence, each of the five hypotheses is examined for each of the two cohorts, yielding a total of ten statistical hypotheses.
The timeline of the experiment is outlined briefly because it has important repercussions for understanding the results of the experiment. There are three phases in the experiment, conveniently labeled A, B, and C.
Phase A was the preparation period, needed to ensure that the stock market's statistical properties were suitable for performing a MNAS. More precisely, the game administrator manipulated the news messages such that half of the companies' stock prices were (slowly) rising and the other half were (slowly) falling. The overall stock market index was neither bullish nor bearish and displayed a flat line, as can be seen in panel A of
At the end of phase A, students received detailed information about the task they had to perform. There was not enough time to start collaborating because students were required to specify their investment decisions at the start of phase B. Again, students were not required to use statistical techniques — they had complete freedom to make their decisions. However, any student who wished to use statistics had no other data available than the historical prices of phase A and the associated news messages. In other words, students had every (statistical) reason to believe that circumstances during phase B would remain the same as in phase A (in Economics this is called the “ceteris paribus” condition).
On the other hand, students also knew that a large group of peers would be implementing the MNAS during phase B. They knew, based on economic theory outlined in the tutorials, that this could have consequences for the statistical properties of the stock market. It was therefore important to stay online during phase B and to use LOs instead of MOs. The instructions for students clearly indicated that they were required to:
determine the stocks that went into the Long, Short, and Neutral piles
submit the buy and sell orders
not change the portfolio during phase B (regardless of any new information that became available)
This implies that the measurements of the experimental outcomes for the Behavior, Non-Rote Learning, and Attitude Hypotheses are based on the actions taken at the start of phase B. The actual change of the market index during phases B or C is entirely irrelevant. Only the Outcome Hypothesis could be affected by the actual events during phases B or C (for instance, if the stock market were to behave erratically).
Phase C was intended to provide students with an opportunity to trade freely, without any restrictions. Students were allowed to liquidate the MNAS portfolio and change their investment strategy. The students knew that we would be interested in the accumulated profits/losses at the end of phase C. For this reason, many students continued trading activities in an effort to improve their performance, even though this did not count for the grades they received. As explained before, the outcome hypothesis only makes sense if the stock market behaves (more or less) rationally during phases B and C.
Fisher's Exact Test (FET) is appropriate for the analysis of our experimental study. The underlying assumptions of the FET are the same as for traditional
Why is it that we expect low frequencies in certain cells of the contingency table? The reason is related to the way the experiment was conducted:
Roughly half the student population was randomly assigned to the PR treatment group (the other half forms the control group).
Not all students in the PR and Control groups were actively participating in the experiment. For this reason, we measured the degree of activity of all students through objective, quantitative observations which were collected through the RC technology and the XSE. We discarded from the dataset the data of all students who did not actively participate.
Some of the experimental measurement frequencies are expected to be low. For instance, the correct application of statistical techniques to investigate and implement the MNAS strategy is rather difficult for our student population. We know this because the MNAS strategy used to be taught in another course, to a very similar student population.
One of the implicit assumptions of the FET is that the row and column sums are predetermined by the researchers
In our first draft of the experimental design, we intended to use four different cohorts, each of which would have been subdivided into four randomized treatment groups: the maturationist group (having access to RC and PR but without any guidance from the educator), the worked-example group (with access to RC but not PR), the constructivist group (with access to RC, PR, and educator guidance), and the control group. However, when we examined the statistics from active student participation in the regular course, we were able to estimate that an experimental design with 4 different 2×2 tables would have resulted in row and column sums which were too low to have reasonable confidence, even when the FET analysis is used. After all, the fact that FET analysis works for “small samples” does not imply that one will be able to estimate the treatment effects with sufficient accuracy. Hence, we decided to reduce the design to two different 2×2 tables for each hypothesis X — the structure is outlined in
Hypothesis X

                        2 rounds of PR                 4 rounds of PR
                 No Effect   Effect   Total     No Effect   Effect   Total
  No Treatment
  Treatment
  Total
Due to the reduction of the number of treatments and cohorts we were fairly confident (before the start of the experiment) that
Another reason why the FET is an appropriate choice of test is that it is possible to use the Odds Ratio (OR), which can be easily interpreted and tested statistically (with confidence intervals and p-values) within the R language used in the RC technology. The OR is simply the odds of success in the treatment group relative to the odds of success in the control group. Hence, it provides us with an effect size that is easily understood: the OR states how much more likely it is to obtain the desired outcome when the treatment is applied, compared to when it is not. It is therefore obvious that the treatment is beneficial when the OR is (much) larger than one. The statistical hypothesis test is performed against the Null Hypothesis that
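The OR and the one-sided FET p-value for a 2×2 table can be computed directly from the hypergeometric distribution, using only the standard library. The counts below are illustrative examples of our own, and the sample OR shown here differs slightly from the conditional maximum-likelihood estimate that R's fisher.test reports.

```python
# Sketch of the odds ratio and one-sided Fisher exact test for a
# 2x2 table [[a, b], [c, d]] with rows = treatment/control and
# columns = effect/no effect. Counts are illustrative assumptions.
from math import comb

def odds_ratio(a, b, c, d):
    """Sample OR: odds of the effect under treatment vs. control."""
    return (a * d) / (b * c)

def fisher_one_sided(a, b, c, d):
    """One-sided (greater) Fisher exact p-value.

    Conditions on the table margins and sums the hypergeometric
    probabilities of all tables at least as extreme (cell `a` as
    large as observed, or larger).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    return p

# Treatment group: 8 of 20 show the effect; control group: 2 of 20.
print(odds_ratio(8, 12, 2, 18))        # 6.0
print(fisher_one_sided(8, 12, 2, 18))  # significant at the 5% level
```

With these illustrative counts the effect is roughly six times more likely under treatment, and the one-sided p-value falls below 0.05, mirroring the kind of output reported in the results tables below.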
There are three datasets in this study which are available online (
As explained before,
During phase C, students were allowed to trade freely. In an attempt to make up for the massive losses incurred during phase B, many students continued their trading activities, often accompanied by risk taking. As a result, phase C was very volatile even though the news messages that were still sent into the trading room gave no reason for high volatility. Even more remarkable is the observation that after the end of phase C (which was also the end of the official experiment), trading activities remained much higher than during the pre-experiment period. Many students continued trading even though this was not expected of them, nor did they receive any credit for trading after the experiment. The post-experiment period clearly shows a continuation of high volatility which slowly converges to “normal” levels.
Utility Hypothesis: does PR cause students to find statistics more useful?

              2 rounds of PR       4 rounds of PR
Odds Ratio    0.6262178            1.643887
OR 95% CI     [0.08793619, Inf[    [0.640524, Inf[
OR p-value    0.8272               0.2238
LR            0.34442              1.0017
Pearson       0.32536              1.0210
Behavior Hypothesis: does PR increase the use of statistical analysis?

              2 rounds of PR       4 rounds of PR
Odds Ratio    5.844957             2.452065
OR 95% CI     [1.824762, Inf[      [0.9542063, Inf[
OR p-value    0.003997             0.06038
LR            8.6575               3.2209
Pearson       9.4379               3.3332
Non-Rote Learning Hypothesis: does PR cause students to choose the correct analysis?

              2 rounds of PR       4 rounds of PR
Odds Ratio    0                    6.855875
OR 95% CI     [0.000000, Inf[      [1.628410, Inf[
OR p-value    1                    0.009827
LR            1.7718               7.1273
Pearson       1.0462               8.1774
Attitude Hypothesis: does PR cause students to use Limit trades more often?

              2 rounds of PR       4 rounds of PR
Odds Ratio    1.617527             3.466403
OR 95% CI     [0.5476096, Inf[     [1.311374, Inf[
OR p-value    0.28                 0.01498
LR            0.73630              5.8512
Pearson       0.74725              6.1918
Outcome Hypothesis: does PR cause students to yield better trading results?

              2 rounds of PR       4 rounds of PR
Odds Ratio    0.6195883            1.073807
OR 95% CI     [0.1632266, Inf[     [0.430141, Inf[
OR p-value    0.8562               0.5373
LR            0.59457              0.021790
Pearson       0.57413              0.021805
The Utility Null Hypothesis is not rejected in either cohort of the experiment. This implies that there is no evidence that PR causes students to perceive statistics as more generally and practically relevant. It does not, however, imply that there is no causal relationship: the hypothesis testing framework only works in a confirmatory way and cannot be used to dismiss an alternative hypothesis entirely.
In addition, it is interesting to note that the OR increases (while the p-value decreases) when the number of PR cycles is raised from 2 to 4. While this does not allow us to conclude anything definitive, it is consistent with the hypothesis that PR could affect perceived utility in the long run. Perhaps more than 4 rounds of PR are needed before a significant effect can be measured; this would not be surprising because students often associate solutions (in this case statistical analyses) with very specific problems rather than general ones. Only after many examples, and after a long time, may one realize that statistical solutions are generally and practically useful.
Both experimental cohorts show a significant impact of PR on the actual use of statistical techniques. The effect in the 2-round cohort appears larger than in the 4-round cohort, probably because the overall level of usage of statistical techniques in the 4-round cohort was substantially higher: a relatively larger proportion of non-treatment students in the 4-round cohort used statistics than in the 2-round cohort. Hence the increase caused by PR in the 4-round cohort is smaller, and the most plausible explanation is that students in the non-treatment group have more opportunity to experiment with statistical techniques when learning takes place in smaller and more frequent assignments.
From the results it can be concluded that PR causes deep (non-rote) learning within the cohort with 4 rounds of review. The OR is large and implies that students with PR are almost seven times more likely to use the appropriate statistical analysis than students who received traditional feedback. There is no benefit from PR in the 2-round cohort, which does not come as a surprise for the reasons explained before. Together, the results suggest that about three consecutive rounds of PR constitute a threshold for the beneficial effect to occur. It is also possible that the effect grows with the number of PR cycles; this, however, is a hypothesis that would require more research.
During the design phase and preparation of the experiment, we did not believe that the null of the Attitude Hypothesis would be rejected. We would have been content to find only a significant impact of PR on non-rote learning; as it turns out, however, PR
The results clearly demonstrate that in the 4-round cohort, PR causes students to use LOs more often than MOs. This is not the case for the 2-round cohort, which supports the hypothesis that PR is only beneficial when it is applied frequently.
This hypothesis has become obsolete because of the erratic price changes during phases B and C, which were caused by the students themselves. Before the start of the experiment, we anticipated (or even hoped) that this would happen, even though it would invalidate the results for the Outcome Hypothesis. There were two important reasons why it was still worthwhile to maintain this hypothesis:
We did not know for sure that the crash would occur. Therefore, it was still scientifically appropriate to formulate the hypothesis.
There is now compelling evidence that true understanding of the underlying statistics and economics is relevant and may have serious repercussions for the behavior of financial markets. Future generations of students may now see the consequences of rote learning.
The ORs in
Behavior effect: PR increases the use of statistical analysis

                 2 rounds of PR                4 rounds of PR
                 No Effect  Effect  Total      No Effect  Effect  Total
No Treatment        68.6     31.4    100          59.8     40.2    100
Treatment           31.4     68.6    100          40.2     59.8    100
Total               100      100     200          100      100     200

Non-rote learning effect: PR causes students to choose the correct analysis

                 2 rounds of PR                4 rounds of PR
                 No Effect  Effect  Total      No Effect  Effect  Total
No Treatment          –        –     100          65.4     34.6    100
Treatment             –        –     100          34.6     65.4    100
Total               100      100     200          100      100     200

Attitude effect: PR causes students to use Limit trades more often

                 2 rounds of PR                4 rounds of PR
                 No Effect  Effect  Total      No Effect  Effect  Total
No Treatment          –        –     100          63.4     36.6    100
Treatment             –        –     100          36.6     63.4    100
Total               100      100     200          100      100     200
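Percentage tables of this kind are the Binomial Effect Size Display (BESD) of Rosenthal and Rubin: the Pearson chi-square statistic of a 2×2 table is converted into the phi coefficient r, and the two success rates are then shown as 50(1 + r) and 50(1 − r) percent. A minimal sketch, where the sample size n = 68 is our own assumption for illustration:

```python
# Binomial Effect Size Display: convert a Pearson chi-square statistic
# from a 2x2 table into a pair of success percentages via phi = sqrt(chi2/n).
# The sample size n = 68 used below is an assumption for illustration.
import math

def besd(chi2, n):
    """Return (treatment %, control %) success rates implied by phi."""
    r = math.sqrt(chi2 / n)          # phi coefficient
    return 50 * (1 + r), 50 * (1 - r)

# Pearson chi-square of the 2-round Behavior test, taken from the tables above
treat, control = besd(chi2=9.4379, n=68)
print(f"treatment success: {treat:.1f}%, control success: {control:.1f}%")
```

Under the assumed n, this reproduces the 68.6 / 31.4 split shown in the Behavior table; the attraction of the BESD is that an abstract chi-square value becomes an immediately readable difference in success rates.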
Without a doubt, PR is one of the more important learning tools offered by the pedagogical paradigm of constructivism. In spite of the many empirical studies that touch on the importance of PR, there is little or no hard evidence for the hypothesis that PR leads to non-rote learning. More importantly, there seems to be a tendency to neglect the fact that PR may have completely different implications for reviewees and reviewers. Our research design explicitly takes this difference into account by comparing control-group students who received normal instructor-based feedback with treatment students who actively submitted feedback to their peers. Based on our fully randomized experiment, there is compelling evidence that the submission of PR feedback causes deep learning (Non-Rote Learning Hypothesis), changes the actions undertaken to solve specific problems under uncertainty (Behavior Hypothesis), and impacts attitudes which may lead to different behavior in the long run (Attitude Hypothesis). These effects are not only statistically significant but also substantial in terms of their underlying OR and Binomial Effect Size.
The Outcome Hypothesis became obsolete due to the stock market crash caused by the students in the aftermath of the MNAS implementation, a pure consequence of irrational behavior on the part of a substantial proportion of the student population, which had little experience with and understanding of the underlying concepts from economics and statistics. As a consequence, we were not able to demonstrate improved investment outcomes in the treatment group as compared to the control group. On the other hand, the crash was predicted by economics
There are good reasons to believe that the unfavourable perception of the practical relevance of statistics is an important source of potential dissatisfaction which may lead to rote learning. Unfortunately, we were not able to confirm that the PR treatment improves students' perception of relevance, which, however, in no way implies that there is no impact. As a matter of fact, the OR in the 4-round treatment group is higher than in the 2-round group (while the corresponding p-value drops from 0.83 to 0.22). It is still possible that PR
Finally, we would like to point out that the experimental design in this study, while classical and straightforward, is characterized by several unique features that strengthen our confidence in the results presented. Firstly, with the exception of the Utility Hypothesis, all experimental observations are based on objective measurements generated by innovative educational technology. This not only improves our confidence in the quality of the data, but also gives us much stronger control over the circumstances in which the experiment was conducted (precise timing, the ability to deny certain features to some groups, detection of inactive students, etc.). Secondly, the experiment is embedded in a challenging game which has a history of many years and is known to be enjoyable and captivating. This is illustrated by the fact that intensive trading continued even after the experiment had ended, which is likely to have contributed to the success of the experiment. It is our assertion that the game setting contributed to the students' motivation to perform well and to make the right decisions. Last but not least, the measured learning outcomes lie outside the regular curriculum, which implies that the challenge students faced was to solve an entirely new problem situated in a realistic environment, for which no textbook recipe could be applied. This ensures that the learning outcomes can truly be interpreted as insights acquired through non-rote learning rather than plain memorization of facts and theories. In addition, this aspect of the experiment also ensured that there was no discrimination against the students in the control group, because the learning outcomes of the experiment did not count towards the final results of the statistics course.