Replication crisis or crisis in replication? A reinterpretation of Shanks et al.
Posted on 25 Apr 2013 at 11:43 GMT
This comment was submitted for publication in Plos One. It was under editorial consideration for about six weeks. After six weeks, I heard that my commentary was rejected and that the Shanks team was allowed to make changes to their original article on the basis of my commentary. It was, in other words, used as a late review. This was done without my permission.
Recently, scientists have become more aware of the importance of replication, especially of experiments that have far-reaching implications for the development of both theory and applications. This increased awareness is clearly visible in psychology as well. Shanks and colleagues published a paper in Plos One in which they report nine experiments on the ‘professor-primes-intelligence effect’ (e.g., Dijksterhuis & van Knippenberg, 1998), one of the many demonstrations of a broader set of findings whereby priming affects people’s overt behavior (see Loersch & Payne, 2012, for a recent review and theoretical integration). Shanks and colleagues concluded that they failed to replicate the finding that priming participants with professors leads to improved performance on a general knowledge test whereas priming participants with hooligans hampers it.
Shanks and colleagues discuss various possible reasons for their null-findings, the one they seem to prefer being that the effects others have found are false positives. However, another possibility is that there are moderators at play that we do not understand yet, and a final possibility is that the Shanks team did not conduct their experiments properly. As Shanks himself recently said in the Times Higher Education: “null-effects are easily obtained when one’s methods are poor”.
Let us briefly look at the first possibility. Could it be that the older papers have reported false positives, perhaps because of extensive p-hacking? This is possible, but unlikely. Variations and extensions of the effect of intelligence or stupidity related primes on intellectual performance have been obtained in at least ten different laboratories, in at least 22 different experiments, and in at least seven different countries (Bry, Follenfant & Mayer, 2008; Dijksterhuis, Spears et al., 1998; Dijksterhuis & van Knippenberg, 1998; Dijksterhuis & van Knippenberg, 2000; Galinsky, Ku & Wang, 2008; Haddock, Macrae & Fleck, 2002; Hansen & Wänke, 2009; LeBoeuf & Estes, 2004; Lowery, Eisenberger, Hardin & Sinclair, 2005; Nussinson, Seibt, Häfner & Strack, 2010; Schubert & Häfner, 2003; Stel, Zeelenberg, van Dijk & Kutsal, 2012). These experiments are not all exact replications of the original paper (Dijksterhuis & van Knippenberg, 1998), but they all showed, one way or another, that priming people with the concept of intelligence (often the stereotype of professors) or stupidity changes people’s intellectual performance. On PsychFileDrawer (http://www.PsychFileDrawe...), three more replication attempts can be found, one of them successful, two of them not. However, the quality of these experiments cannot be assessed on the basis of the information available on the website.
Moreover, the effects of priming intelligence are usually large, and we can theoretically account for them quite well. For instance, Hansen and Wänke (2009) showed an interesting and intuitively compelling mediator, namely self-efficacy. In addition, Galinsky and colleagues (2008) showed that taking the perspective of a cheerleader negatively affects intellectual performance, and their effect was mediated by a reduction in self-perceived intelligence.
Before thinking about the possibility of moderators, let’s have a closer look at the experiments of the Shanks team. In their experiments, they primed professors or hooligans (or exemplars such as Einstein), measured participants’ scores on IQ tests or general knowledge tests, and never found any statistically significant priming effects. But are their experiments any good?
Experiment 1 has two cells, each with 20 participants. Twenty-eight participants were run in the UK and 12 in Sweden, and their ages ranged from 19 to 79. This is all very unusual, and suggests that the circumstances under which the participants were recruited and under which they participated were unorthodox. Were they recruited on the streets? Even if the circumstances under which the data were gathered were not as strange as the demographics suggest, we can conclude that the group of participants was way too heterogeneous for such a small sample.
Experiment 2 has two cells with only eight participants each, ranging in age from 18 to 60. This does not make sense. The heterogeneous and incredibly small sample indeed caused problems: The researchers administered both a pre- and postmeasure of intelligence, and already on the premeasure the participants primed with professors scored considerably higher than the participants primed with hooligans.
Experiment 3 has two conditions with 20 participants each. Some of the necessary information to evaluate this experiment is missing in the paper, but this experiment could have been reasonably good. Importantly, it is the only experiment in the paper explicitly reported to have been conducted in cubicles.
Experiment 4 has 50 participants in each of two cells. Hence, there is more than enough power, but again, more information is needed to evaluate this experiment. Unfortunately, cubicles are not mentioned, suggesting that the experiment was indeed not done in cubicles. Still, the quality of this experiment could be passable.
In Experiment 5 the researchers focus on two manipulations. One is the priming of a category (professor) versus an exemplar (Einstein). The first is known to lead to assimilation, the second to contrast (Dijksterhuis, Spears et al., 1998). In addition, they study the effects of similarity versus dissimilarity focus. A similarity focus generally leads to assimilation, whereas a dissimilarity focus usually leads to contrast (e.g., Mussweiler, 2003). The Shanks team chose to run only two of the four possible conditions and bizarrely, they ran the professor-similarity and the Einstein-dissimilarity conditions. This is highly problematic because of the obvious confound: They manipulated two variables at the same time.
Experiment 6 has the exact same conditions, and hence the exact same confound, as Experiment 5. Moreover, the participants are informed about the fact that they are primed and that these primes may affect their performance. Earlier research has shown (e.g., Dijksterhuis, Bargh & Miedema, 2000) that this usually eliminates priming effects, presumably because it makes participants reactant.
Experiment 7 is a conceptual replication of a recently retracted experiment of which the data were fabricated. In addition, this study did not have enough power (twelve people per condition) and on top of that, importing the ingroup/outgroup manipulation of the original paper into the professor/hooligan prime domain led to an unfortunate manipulation, to put it politely. Priming people with a professor who studied at UCL is not a problem, but did participants really believe the prime story about a hooligan who had studied at UCL? Was this at least measured with a manipulation check? Such a check is not reported. The only secondary data that are reported pertain to a different manipulation check measuring the perceived intelligence of the two primes (professor versus hooligan). This measure yielded extremely low SD’s (both cell means have a SD of .06 on a 5-point scale, leading the F-value for the comparison to be 608.51).
Experiment 8 is, in some ways, quite nice, although even this experiment has some mystifying aspects. It is said that the same questionnaire was used as in Experiment 4. However, Experiment 4 was done in Australia and in Experiment 8 participants were paid in pounds, suggesting it was done in the UK. I suppose it is theoretically possible to construct a questionnaire that is appropriate for use in both the UK and Australia, but it is telling that the authors did not take the trouble of devising a tailor-made questionnaire.
That being said, Experiment 8 is interesting. It has four conditions, with 20 participants each, and the authors not only test the effects of a professor prime by comparing it with a hooligan prime, but also look at the effects of incentives by comparing participants who are paid for a correct answer with participants who are not paid. The data show that incentives indeed significantly improved performance. Participants primed with professors also outperformed participants primed with hooligans, but the p-value for this comparison was .12 (d = .38). In a genuine replication effort aimed at uncovering the truth, such a finding may have led to the intention to replicate this particular experiment with 40 or 50 participants per cell. Did the Shanks team indeed do this in their final experiment?
Alas, they did not. Experiment 9 has six conditions, administered with 66 participants in total. They used three different primes (professor, control, hooligan) and two different construal instructions (interdependent versus independent). Two of the conditions had only nine participants. The data were collected in the UK and in Greece, and “approximately evenly distributed across conditions”. The cell means (representing percentages of questions answered correctly) ranged from 40 to 53.3, the SD’s from 22.7 to 30.1. In this experiment, chance performance would lead to a score of 33% correct (20 questions with three choice options each). The combination of the extremely high SD’s and the low sample size is troublesome.
It turns out the authors only analyzed the answers to the five most difficult questions (out of 20 questions in total). This being a heavily underpowered study, this analysis obviously didn’t show anything. However, in the Supporting Online Materials, the authors also report their analyses on the answers to all questions and they find an interesting trend towards assimilation under independent conditions (means for professor 75.0%, control 71.0%, hooligan 68.6%) and a trend in the direction of contrast under interdependent conditions (means for professor 64.3%, control 68.8%, hooligan 72.8%). Now the direction of this effect may raise eyebrows, and as said, these effects are far from statistically significant, but one would almost conclude that in their final two experiments the Shanks team had finally developed the tacit, or perhaps explicit, knowledge to get their paradigm to show at least some sensitivity to their manipulations. It would have been better for the field if they had replicated their Experiments 8 and 9 with much larger samples, but they chose not to do so.
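To give a rough sense of what "much larger samples" could mean for the professor-versus-hooligan comparison in Experiment 8, the sketch below applies the standard normal-approximation formula for a two-group comparison to the reported d = .38. The two-sided alpha of .05 and the 80% power target are conventional assumptions, not values taken from the Shanks paper or from this comment, and the function name is hypothetical. Under those assumptions the requirement comes out at roughly a hundred participants per prime condition; an exact t-based calculation would ask for slightly more.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, using the normal approximation.

    d     : standardized effect size (Cohen's d)
    alpha : two-sided significance level (assumed, conventional)
    power : desired power (assumed, conventional)
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# d = .38 is the effect size reported for the prime comparison in Experiment 8.
required_n = n_per_group(0.38)
print(required_n)
```

Because the observed d is itself a noisy estimate from a small study, the figure should be read as an order of magnitude rather than as a target.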
Where does this leave things? It seems as if the Shanks team made beginner’s mistakes in their experimental designs, and they ran most or even all of their experiments under noisy and unprofessional circumstances. In only one experiment does the protocol of the original priming studies (Dijksterhuis & van Knippenberg, 1998) seem to have been followed. This protocol dictates that the experiment is done in individual cubicles, which is important as we know that other people can function as primes. And indeed, even if you use cubicles, it is best to give all instructions in a computerized, standardized way, rather than by experimenters running in and out with new booklets (see Dijksterhuis & van Knippenberg, 1998, Experiment 1, for details on the first demonstration of the effect).
Is it possible to draw a conclusion from the series of studies Shanks and colleagues reported? Most of the individual experiments are so noisy that the data are in fact not at all incompatible with moderately sized priming effects and hence do not provide useful input in the debate. In addition, some experiments, such as 1, 2 and 9, are underpowered and too different from the original work to count as replications. They were not done among students as participants (or at least some of the participants were not students), and the DV (at least for Experiments 1 and 2) did not consist of a general knowledge task. However, we can perform a meta-analysis across the remaining six experiments (that were at least all administered with students as participants and with a general knowledge test as a DV). In all these experiments, Shanks and colleagues ran a professor priming condition, of which it is known that it improves intellectual performance. This condition was compared with a condition in which hooligans were primed or with a condition in which Einstein was primed. Both primes were shown to hamper performance on a general knowledge test in our seminal experiments (Dijksterhuis, Spears et al., 1998; Dijksterhuis & van Knippenberg, 1998). Interestingly, the data (with a total N of 313) yield a combined effect (Z=1.513) that is almost significant in the predicted direction (p < .065), whereby participants primed with professors outperformed participants primed with hooligans or Einstein. This goes to show that an effect that is reasonably large and fairly robust, such as the professor-primes-intelligence effect, can even withstand poor experimentation.
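For readers who want to check the step from the combined Z to the one-tailed p-value, a minimal sketch follows. The comment does not say which combination method was used; the equal-weight Stouffer formula shown here is only one common choice and is an assumption, as are the function names. The per-experiment z-scores are not listed in the comment, so only the reported combined value of 1.513 is converted, and its upper-tail probability comes out at roughly .065, in line with the figure given above.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_z(z_scores):
    """Combine independent z-scores with equal weights (Stouffer's method).

    Assumption: this is one common way to pool experiments; the comment
    does not state which method produced Z = 1.513.
    """
    return sum(z_scores) / sqrt(len(z_scores))


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1 - NormalDist().cdf(z)


# Convert the combined Z reported in the comment to a one-tailed p-value.
p_combined = one_tailed_p(1.513)
print(round(p_combined, 4))
```

Given the six per-experiment z-scores, `stouffer_z` would return the pooled statistic to pass to `one_tailed_p`.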
I do not think the above analysis requires us to think about moderators explaining the null-effects too elaborately, but it is interesting to speculate on why the effects that Shanks found (if interpreted as effects in the first place) were so much smaller than the effects found by many others. If there is a single explanation for why the effects of Shanks and colleagues are weak or absent, it probably has to do with the difficulty of the questionnaire. The overall mean percentage of correct answers to the general knowledge test in the Shanks experiments (with the exception of Experiment 9, in which an easier questionnaire was used but only the analysis on the most difficult questions was mentioned in the main paper, and of Experiments 1 and 2, in which an altogether different DV was used) is just over 40. Given that the effect of priming is mediated by self-efficacy and confidence, it is not surprising that the effect is weaker for questionnaires that are too difficult. If we control for chance, in the Shanks studies the average participant knew the answer to only one out of five questions (the median probably being something like one out of seven). In some conditions of some experiments this was as low as one out of ten. Perhaps needless to say, this can ruin self-efficacy. Indeed, out of the 22 published “professor priming” experiments referred to earlier, seventeen used a general knowledge questionnaire as a DV, and of the fifteen experiments that allowed the difficulty of the questionnaire to be assessed on the basis of the article, all except one revealed considerably higher percentages of correct answers than the Shanks et al. experiments (the exception being LeBoeuf & Estes, 2004, Experiment 2).
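The "control for chance" step can be made explicit with the usual correction for guessing on multiple-choice items. The sketch below is only an illustration: the number of response options in the general knowledge tests of Experiments 3 to 8 is not stated in this comment, so four options is an assumption (chosen because, with about 40% correct, it reproduces the "one out of five" figure), and the function name is hypothetical. The three-option case corresponds to the format described for Experiment 9.

```python
def proportion_known(p_correct, n_options):
    """Correct an observed proportion correct for blind guessing.

    Assumes a participant either knows an answer or guesses uniformly
    among n_options alternatives.
    """
    chance = 1 / n_options
    return (p_correct - chance) / (1 - chance)


# About 40% correct with four options (assumed) leaves one in five known.
known_four = proportion_known(0.40, 4)
# The same 40% with three options (as in Experiment 9) leaves one in ten.
known_three = proportion_known(0.40, 3)
print(round(known_four, 2), round(known_three, 2))
```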
What are the implications for replication efforts? Most of the experiments of Shanks and colleagues seem to have been done with (groups of) students as experimenters, for instance as student projects that were partly run by overseas students over their holidays in their home countries. That groups of students act as experimenters is fine - we have all done such experiments - but in our Nijmegen lab we almost always use such projects for pilot testing and even with these projects we strongly prefer them to be executed under proper laboratory conditions. And although we make an occasional exception, we are hesitant to publish studies conducted this way.
Some have said that such student projects are ideal for replication research, but I disagree. I think replications should be strongly encouraged (see Koole & Lakens, 2012) but not delegated to less professional circumstances. In fact, if anything, the bar should be even higher for replication studies than for our regular work, because replication studies in which original findings are not replicated potentially damage the reputation of other people, and, equally important, the reputation of science. For the same reason, I think replication studies should be overpowered rather than underpowered. Ideally, replication studies of experiments should only be conducted by faculty members, PhD students, closely supervised individual Master students working on their thesis, or professional RA’s. More importantly, careful experimentation is about controlling an environment, about reducing noise, and about precision in following protocol. If an experiment is done according to a specified protocol, with a homogeneous group of participants, with computerized instructions, in a good, quiet lab with individual cubicles, the replication experiment should do this too. Do not get me wrong here, we are not always doing rocket science. It is certainly not true that all published priming studies were done under perfect circumstances, but many of them were, and replicators have to mimic the circumstances as closely as possible. If, for whatever reason, they cannot do this, then they should not try to replicate.
The Shanks et al. paper also goes to show that when you try to replicate someone’s findings (or when you have tried and failed but want to continue), it is strongly advisable to contact the author(s) of the original paper(s) to talk about details and to openly discuss what may have caused an initial replication to fail. They know things that one should know. Numerous priming researchers could have told the Shanks team that, like most experimental research in psychology, behavioral priming research generally requires sufficiently controlled experimental procedures and measurements. They could have told the Shanks team that behavioral priming is ideally done in cubicles, and that the circumstances under which the data of Experiments 1 and 2 were collected are really far from ideal for almost any type of social cognition research. They could have told them that the manipulation used in Experiments 5 and 6 has a confound, and that one should not simply assume that the same general knowledge test can be used on different sides of the planet. Finally, they could have suggested that the general knowledge test they used was in all likelihood too difficult. It would have saved many people a lot of time and resources, not to mention frustration.
Finally, I think we should not publish sub-standard experiments, especially not when they are replication experiments. It is counterproductive for our efforts to improve psychological research. More and more people now agree that replication efforts are useful and even necessary, but endeavours such as the Shanks et al. paper will only lead to skepticism about (non-)replications. Moreover, publishing sub-standard experiments is harmful to colleagues, it is misleading to readers, and it is damaging to science.