Fist-clenchingly poor science
Posted by jonsimons on 28 Apr 2013 at 04:51 GMT
I’m a memory researcher who likes to encourage research that explores novel and interesting hypotheses about memory. I also really dislike being negative. But I’m afraid this paper is shockingly poor quality science, and to see it published in such a prominent journal and receive such widespread media coverage is rather depressing.
There are so many flaws it is difficult to know where to begin, but I’d be hard pressed not to fail this paper if it was presented to me as an undergraduate project report. First, the entire premise of the study lacks foundation. Despite what the authors suggest, there is no evidence that fist clenching engages the lateral frontal brain regions associated with memory (e.g., Umetsu et al., 2002). Even if it did, the old notion that left lateral frontal areas support encoding and right lateral frontal areas support retrieval is now largely discredited (e.g., Fletcher & Henson, 2001; Owen, 2003). So the hypothesis that unilateral fist clenching might improve memory has little basis in the published literature.
Moving on to the experiment itself, for a study involving healthy volunteers rather than some perhaps rare patient population, group sizes as low as 9 per group are simply laughable. That a priori power analyses might have motivated such numbers is difficult to believe. The uneven group sizes and the large number of additional groups whose data are not reported also do not inspire confidence. The results are then limited in that numerous post-hoc group comparisons are undertaken, in one critical case following a non-significant ANOVA, and with no correction for multiple comparisons. The results that do emerge from such a flawed analysis are largely trends which, however much the authors may describe them as ‘strong trends’, are still not significant. Critically, this includes the difference between the R/L-clench group and the no-clench control group, which the authors admit was not significant despite repeating throughout the title, abstract, discussion, and press release, that clenching resulted in superior memory. I could go on, but I think I have said enough.
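As a rough illustration of the power concern raised here, the sketch below estimates how many participants per group a two-group comparison would need for a given standardized effect size. It uses the common normal approximation rather than an exact t-based calculation, and the effect sizes, alpha, and target power are illustrative assumptions, not values taken from the paper.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, using the normal approximation.
    d is the assumed standardized effect size (Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Illustrative effect sizes only (conventional small, medium, large).
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions, even a conventionally large effect (d = 0.8) calls for roughly 25 participants per group, and an exact t-based calculation would give a slightly higher figure.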
Perhaps this doesn’t really matter; poor quality science is published and press releases churnalised by a gullible media all the time. But I think this rather does matter, because there is a huge push underway to encourage scientists to publish their best research in open access journals, of which PLOS is a flagship example. The main obstacle to achieving this objective is the still widespread prejudice that you hear in common rooms everywhere that open access journals will publish any old rubbish. This is obviously untrue: PLOS regularly publishes excellent, high quality research (e.g., Shanks et al., this very week), and flawed research is often seen in traditional journals, not just in PLOS. But every time such fist-clenchingly poor science as the current paper is published, the prejudice is reinforced and the cause of open access publishing undermined. Thus, while I’m sure everyone involved is dedicated and scrupulous, it is paramount that PLOS works harder to increase its editorial standards to reduce the chances of such embarrassingly weak science being published. The open access cause depends on it!
A comprehensive and well-argued objection to a poor piece of work.
Wow, first the Dijksterhuis/Shanks spat, and now this. Remind me never to submit my work to an open-commentary journal!
Seriously, I'm sure your commentary was intellectually spot-on, but many young scientists will be reading these exchanges and wondering if such shark-infested waters are really for them. Is this kind of public criticism the face of open-access publishing? Because it's a bit - with all due respect - savage.
We are all capable of "fist-clenchingly poor science" - we get attached to hypotheses and sometimes fail to see the wood for the trees, and peer review can be a bit of a leaky sieve. But usually no great harm is done by these isolated mistakes, and indeed, sometimes the work done to correct them leads to other findings. However, if people become afraid to present their work publicly or even leave science altogether then we are all the poorer for it.
Kate, is there really no harm when poorly executed research (eg research with extremely low sample sizes) is published by respected peer-reviewed journals? Do you see no harm even when very unconvincing results get extensive media attention (the PLoS One Media Coverage page for this article cites 6 high profile media reports), resulting in perhaps millions of people assuming to be a proven fact something that may be nothing more than a remote possibility?
Don't you think science reporters assume (and have no choice but to assume) that referees at respected journals will have done the quality vetting for them, and that the results will have good scientific credibility?
You seem to be saying, oh well, they can't really assume that, because we all know that good journals do often publish very weak research--and that's just the way it is, and that's just fine and we have to live with it. Why is it fine? Won't a public recognition of this inevitably result in a severe lowering of public respect for scientists? If peer review does not filter for methodological rigor, then wouldn't the educated public be sensible to develop a complete distrust of scientists and everything they do? (Based on the tone of comments one sees below science stories in the Economist, such an attitude may already be growing...)
You also say that the kind of quick reviewing that Jon Simons has done here is bad because it may scare off young scientists. Should people publish work that they are not prepared to have intelligent people discuss in a critical fashion? Or take tax money to do it?
If people who are fearful of having their work examined and discussed avoid open journals like PLoS One, that will produce adverse selection for the old-style journals, and result in better quality work and more vibrant discussion appearing in the open journals.
I say kudos to Jon Simons for taking the time to read and comment on this article. PLoS One was supposed to provide a framework for this sort of discussion, but it never happened very much. Maybe it is starting to happen now...
RE: RE: RE: Fist-clenchingly poor science
Spinifex replied to Hal_Pashler on 28 Apr 2013 at 15:36 GMT
My problem isn't with commentary or criticism of the science itself, it's with how it is expressed. It's one thing to say "These results need to be treated with caution because the study was under-powered and the analyses didn't correct for multiple comparisons". It's another to say "This is fist-clenchingly, shockingly poor science that wouldn't even have been acceptable as an undergraduate project". One is a factual comment on the science, the other is just an emotion-laden and potentially hurtful attack.
RE: RE: RE: RE: Fist-clenchingly poor science
Gosh, in my life I'm almost always making your kind of point, in the interests of both efficiency and the undervalued leverage that good will provides a community. But here, I really have to differ. First, Simons registers a strong opinion that's clear and well-backed, with detail provided that is best thought of as a proxy for the kind of courtesy you'd like. The spirit of an open journal should welcome such, even if it sounds horrible. That's what I'm buying with my click, as a consumer of the piece: a clarity that isn't to be found in your antiseptic, muted alternative. Second, there is an informal quid pro quo required as the result of this story going over the wire unremittingly: one cannot rebut breathless, deceptive articles with murmured talk within the community of "need to be treated with caution", because it isn't fair to people like me who are relatively uninformed, who read this stuff in magazines and start clenching our fists during meetings. The language of the study itself plainly welcomed such broad coverage instead of expressing the caution that was obviously warranted, the kind quite common in good studies. I think that burbling enthusiasm needs a certain kind of response by the community, and I hope this is only the start. Especially since they hit it out of the park publicity-wise with such a modest plan and effort.
Third, the emotion is in the response for a damn good reason, for once- as an effort at clear communication of the stakes, on a vital topic. Many of us are very concerned that the open journal succeed, and that it does so before the year 2200. Market forces that support a 2 billion dollar a year journal industry make that possibility quite tentative, despite the obvious advantages to science. Poor science and bold attempts at mass promulgation of such could be a disaster at any random time for the movement toward open access. Then there are others who are fearful that the recent pall on psychology due to low power results and scams will spread to shut down healthy enthusiasm (and funding) for neurology, especially now that an even worse result in statistical power for neurological work has arisen, in a similar meta-analysis.
I applaud your desire not to scare scientists away from publishing. I would (and do) put it differently, though: competent science should not have to deal with the simplistic, cursory, or sweeping judgments that seem motivated by status needs, or inattention, or the ironic, incredibly strong cognitive biases evident among scientists. But this study is hopefully quite the special case, so here I must respectfully differ. In this particular context, we are well past the "should be treated with caution" phrasing so familiar to all of us. After all, some of us could well do with being frightened out of publishing. Or at least frightened enough to use a kazoo instead of a trumpet to sound the news of our findings.
RE: RE: RE: RE: Fist-clenchingly poor science
I agree completely, the Simon guy could have done a much better job phrasing his comment. There are much more tactful (read: less sophomoric) ways to get the same basic point across. Too much chest bumping, not enough fist clenching.
RE: RE: RE: Fist-clenchingly poor science
anonymous_reviewer replied to Hal_Pashler on 29 Apr 2013 at 18:31 GMT
Hal is completely correct. Look, how many times have we all read the abstract in PLoS One and rolled our eyes? I posted a few factual comments and questions over the last couple of years, but nobody ever responded to them. This is not the way PLoS One was supposed to be. Open access only works well if there is a serious post-publication review process.
I think the phrase "if you can't stand the heat, stay out of the kitchen" applies here (no pun intended). Over centuries, science has profited from academic discussions, and yes, sometimes even heated debates. This is how eventually the best science survives. When some scientists then refrain from publishing here: well, so be it. When you are not prepared to respond to well articulated and thoughtful criticism such as the above comment by Jon, publishing your scientific work might not be a good idea at all.
I would actually like to see critical commentary and the ensuing discussion unfold on the PLOS commenting system much more often. It seems underused thus far.
Indeed, Jon formulated some things a bit harshly, but that's part of the game. Being overly nice never settled a debate, did it?
So thanks to all who took the time to comment!
Dear Dr. Simons,
Thank you for your interest in our work. Below, please find responses to your comments. Please let us know if we can provide any additional information.
Ruth E. Propper, Ph.D.
Director, Cerebral Lateralization Laboratory
Montclair State University
Montclair, NJ 07043
1. Comment: First, the entire premise of the study lacks foundation. Despite what the authors suggest, there is no evidence that fist clenching engages the lateral frontal brain regions associated with memory (e.g., Umetsu et al., 2002).
Response: We respectfully disagree. As cited in our paper, please see Harmon-Jones (2006). Unilateral right-hand contractions cause contralateral alpha suppression and approach motivational affective experience. Psychophysiology, 43, 598-603.
We would also like to point out that, from a scientific perspective, lacking a foundation for a premise does not invalidate research findings. A potentially weak premise is, in fact, further supported by data. The contrary is not true: data are not invalidated by a weak premise; weak premises are either supported or refuted by data. In this case, weak premise or not (and we argue that it is not), the data are supportive.
2. Comment: Even if it did, the old notion that left lateral frontal areas support encoding and right lateral frontal areas support retrieval is now largely discredited (e.g., Fletcher & Henson, 2001; Owen, 2003). So the hypothesis that unilateral fist clenching might improve memory has little basis in the published literature.
Response: We respectfully disagree that the HERA model has been discredited, as would the authors of the following paper (cited in our manuscript):
Habib, Nyberg, & Tulving (2003). Hemispheric asymmetries of memory: The HERA model revisited. Trends in Cognitive Sciences, 7, 241-245.
3. Comment: Moving on to the experiment itself, for a study involving healthy volunteers rather than some perhaps rare patient population, group sizes as low as 9 per group are simply laughable.
Response: The use of small sample sizes, particularly in experiments involving relatively novel procedures, adheres to common convention, and actually increases the possibility of a Type II error (failure to reject the null hypothesis). Furthermore, as we point out, the effect sizes are quite large, regardless of sample size, indicating that replications with larger samples may show larger effects.
4. Comment: That a priori power analyses might have motivated such numbers is difficult to believe.
Response: Given the exploratory nature of the experiment, effect sizes were unknown, and it would have been imprudent to guess what the effect sizes would be. Even if we had, the effect sizes obtained are large, and the non-significant p values reflect small sample sizes, not small effects.
5. Comment: The uneven group sizes and the large number of additional groups whose data are not reported also do not inspire confidence.
Response: Statistical methods correct for uneven group sizes. The data reported in the present work focus on unilateral hand clenching conditions, whereas those to be reported in the future involve bilateral clenching.
6. Comment: The results are then limited in that numerous post-hoc group comparisons are undertaken, in one critical case following a non-significant ANOVA, and with no correction for multiple comparisons.
Response: Post hoc analyses are conducted following ANOVAs, and so we do not understand how this limits the results. The ANOVA that did not reach traditional significance was p<.08, and was in the same direction as the other, significant ANOVAs. Therefore, it seemed reasonable, and the reviewers obviously agreed, to include post hoc testing of these results. Additionally, corrections for multiple comparisons are not necessary, particularly when comparisons are limited, and groups of analyses all converge on the same results. Please see our response to comment 9, below, regarding a rigid focus on p values to the exclusion of other methods of analysis reporting. Psychologists have adopted reporting effect sizes for just such confusion regarding p values.
7. Comment: The results that do emerge from such a flawed analysis are largely trends which, however much the authors may describe them as ‘strong trends’, are still not significant.
8. Response: First, as reported in the manuscript, there are two ANOVAs and several post-hoc tests that reach ‘traditional’ significance.
Regardless of the specifics in our paper, criticisms of rigid focus on p values have been outlined by others, and in fact were a major topic of discussion of the Association for Psychological Science. It is beyond the scope of this response to go into detail of those arguments; we did, however, include measures of effect sizes in our paper, which were large, as another metric of our results.
9. Comment: Critically, this includes the difference between the R/L-clench group and the no-clench control group, which the authors admit was not significant despite repeating throughout the title, abstract, discussion, and press release, that clenching resulted in superior memory.
Response: We believe you are misinterpreting our findings, and our paper generally. In fact, the R/L clench group was statistically superior to the other hand clenching groups, a point that we made clear in our manuscript. The R/L-clench group was also numerically superior to the no clench group; again, a point that we took pains to make clear. Additionally, although this latter relationship was not traditionally statistically significant, it was in the predicted direction which, given the hypothesis, the novelty of the paradigm, and the small sample size, is in fact noteworthy.
In conclusion, while we respect the comments of Dr. Simons, we’d like to point out that the peer reviewers of this paper did not hold these same opinions. Furthermore, we would like to thank Professor Simons for giving us the opportunity to highlight important differences between p values and effect sizes in statistical analyses, as well as methods of hypothesis testing generally. We hope that the readings and comments here inspire others to continue this line of research. The scientific enterprise depends on replications and extensions, and we look forward to learning such results.
Finally, we would like to point out that our research should not be conflated with media reporting of such research. As many know, misinterpretations of scientific information, no matter how well explained, do occur. We nevertheless applaud those who attempt to convey scientific information to the public, as such attempts help the public, as well as policy makers, understand the importance of the scientific enterprise generally.
I am sincerely sorry if some people felt that my original comment was objectionable. That truly was not my intention. Like most scientists, I have seen numerous papers published that I would have liked to comment on, but until now, I have typically let them go (also like most scientists, I suspect). In this instance, I was encouraged by others to utilise PLOS’s commenting facility to express my concerns about the scientific quality of this paper. I am very glad that Dr Propper appears not to have been too upset by the tone of my comment, and has taken the time to reply with a clear and detailed response to the scientific issues raised. I disagree with some of her arguments, but appreciate her willingness to discuss her research which, as I said before, I think to be novel and interesting.
As it happens, several of the points made by Dr Propper have already been addressed by The Neurocritic in a blog post concerning this paper (to which she may also like to respond):
For example, The Neurocritic and a commenter on his/her blog go into detail about the very great difficulties associated with inferring the localisation of neural sources based on scalp EEG effects. In short, it is problematic to use the alpha suppression effects reported by Harmon-Jones as evidence of lateral frontal brain activity.
I am very happy to agree to disagree with Dr Propper regarding the validity of the HERA model. Its proponents argue that it is correct, and others point to data that are inconsistent with its predictions. No doubt such debate will continue.
One area in which I strongly, but respectfully, disagree with Dr Propper is the justification for the small sample sizes in the experiment as acting in some way as a control against erroneous conclusions. It is well established that small sample sizes do not merely increase the possibility of Type II errors, a point that is made very clearly in Button et al.’s (2013) recent analysis in Nature Reviews Neuroscience, which I recommend to anyone interested in greater statistical understanding of these issues.
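One way to make this point concrete is the positive predictive value (PPV) relation discussed in the low-power literature, PPV = (power × R) / (power × R + alpha), where R is the pre-study odds that a tested effect is real. The sketch below simply evaluates that relation; the power levels and the value of R are illustrative assumptions, not estimates for this study.

```python
def ppv(power, alpha, prior_odds):
    """Probability that a statistically significant finding reflects a
    true effect, given the study's power, the significance threshold,
    and the pre-study odds (prior_odds) that the effect is real."""
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)

# Illustrative assumption: 1 in 5 tested hypotheses is true (odds of 0.25).
for power in (0.2, 0.5, 0.8):
    print(f"power = {power}: PPV = {ppv(power, 0.05, 0.25):.2f}")
```

With these assumed inputs, a significant result from a study with 20% power is no more likely to be true than false, whereas at 80% power about four in five significant results would reflect real effects.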
Dr Propper’s points relating to the large effect sizes reported in the paper, and whether or not the primary hypothesis has been supported statistically, are addressed in detail in the Neurocritic’s blog post, so she might like to respond to the concerns raised there.
In conclusion, I am grateful to Dr Propper for responding to my criticisms of this paper. There are a number of areas on which we disagree but that of course is very much what science is all about. Such post-publication online debate is a welcome development, in my view, and will hopefully benefit all of us in increasing our understanding of the scientific strengths and limitations of the published articles we read, and of the fascinating underlying phenomena that we are all interested in discovering.
My major comment on the paper is that the critical R/L condition does not differ from NENR. In Figure 3, there is a notation that this difference is p<.09. However, I believe this must be incorrect, as d = 0.284 for this comparison. In addition, the text says "...a trend for the NENR to score higher than the Lenc/Lrec (p = .09, d = .82)." The reported effects point in the direction of R/L being better than the other clenching condition, but not against the NENR control. If the major comparison is to be between the different clenching conditions, then a 2 (encoding) x 2 (retrieval) ANOVA would be appropriate.
I made this point in a post on my blog:
A commenter on this post made a good point about "the danger of inferring effect sizes with such small sample sizes."
I welcome your reply either here or on my blog. Thank you.
I'd agree with the Neurocritic here; my first thought on looking at the stats was that a 2x2 ANOVA would have been a better approach for analyzing the data. The control condition, no clenching, does not seem to serve much purpose here - clenching with the non-optimal hand for the relevant stage seems control enough. Perhaps you could have used the control condition as a baseline, so that each other group's scores were expressed as a difference from the baseline condition.
Regarding the post-hoc tests; by definition, Fisher's PLSD should only be run when the overall ANOVA was significant - that's what the P (protected) means. Essentially, there is no control for FWE rate here at all, which is not ideal. I think it's pretty clear that correction for multiple comparisons should be used when performing all possible pairwise comparisons. It's not generally advisable to do post-hoc tests when the overall ANOVA was not significant, though it makes sense in some cases, depending on your research question. I'm not so sure of that here though.
Now, if you were doing a subset of all possible comparisons, having planned to do so on a priori grounds, and portioned out your data into a set of independent, orthogonal contrasts, you'd have a reasonable defense for not correcting. For example, an appropriate comparison may have been to compare the control group against the combination of all other groups before contrasting some subsets of those individual groups against each other (e.g. Renc/Lrec versus Lenc/Rrec would be an obvious comparison). You could also test each "active" group against the control group, though this would be non-orthogonal and you'd probably want to correct for multiple comparisons again.
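To illustrate why uncorrected pairwise testing is a concern, here is a minimal sketch of how the familywise error rate grows with the number of comparisons. It assumes five groups purely for illustration and treats the tests as independent, which pairwise comparisons sharing groups are not, so the figure is a rough guide rather than an exact rate for this design.

```python
from math import comb

def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive across m tests,
    under the simplifying assumption that the tests are independent."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-comparison threshold that keeps the familywise rate at or below alpha."""
    return alpha / m

k = 5           # illustrative number of groups (an assumption)
m = comb(k, 2)  # all possible pairwise comparisons among k groups
print(f"{m} comparisons, familywise error about {familywise_error(m):.2f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha(m):.3f}")
```

Under these assumptions, ten uncorrected comparisons at alpha = .05 give roughly a 40% chance of at least one spurious "significant" difference.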
I think you're quite right to raise issues regarding the strict use of p values (which might seem slightly ironic given the above). But if you're basing your inferences on effect sizes, it would be really quite helpful to include confidence intervals on those effect sizes (and confidence intervals on means). As an example, using the calculator here - http://gunston.gmu.edu/ce... - I calculated the effect size and CI for the contrast of Renc/Lrec versus Lenc/Rrec, for which you would presumably expect the largest difference in performance - they activate the "wrong" and "correct" hemispheres. This gives d = .8398 with 95% CI -0.0995 to 1.7791. Note that these are very wide confidence intervals, and that they include 0. Thus, the data do not provide a very precise estimate of the effect size (presumably because of the low sample size), and are compatible with a range of effect sizes which includes no effect at all. One should be extremely cautious in interpreting such an effect.
Typically, immediately after posting, I noticed I used an n of 10 instead of 11 for the L/R group, which means the d and CI should be 0.8578 and -.0623 to 1.778 in the above example.
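For readers who want to try this kind of check themselves, the sketch below computes an approximate confidence interval for Cohen's d using a standard large-sample formula for its standard error. This is not necessarily the method used by the online calculator mentioned above, and the d value and group sizes passed in are placeholders chosen for illustration.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval for Cohen's d, based on the
    large-sample standard error for a two-group standardized difference."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return d - z * se, d + z * se

# Placeholder inputs: d = 0.84 with 10 participants per group (assumed sizes).
low, high = d_confidence_interval(0.84, 10, 10)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

With groups this small, the interval spans zero even for a nominally large d, which is the same qualitative picture as the calculator output quoted above.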
This episode is an excellent example of why cognitive scientists should not rely so much on null hypothesis significance testing when they try to understand the meaning of the data they gather. As many thoughtful people have pointed out over the years, it is simply a "poor way of doing science". One example: http://www.psych.umn.edu/....
To anyone who will seriously consider it, a Bayesian approach toward extracting useful information from data may make more sense.
Dr. Propper, thank you for your kind and dignified initial response to points raised. Drs. Button and Munafo have a very helpful delineation of the social and language dimensions of the challenges to study completeness which they address nicely in an interview here: http://bigthink.com/neuro... . They are authors of the recent study Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, & Munafò MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14 (5), 365-76. In that interview, they note several things that seem to intersect with the study at hand helpfully, and they counter your above positive assertion re the use of small studies. Of particular note is Dr. Button's request to be explicit as to whether a study is exploratory or confirmatory, particularly when small samples are involved. Dr. Propper, I think it is the language used in your study that seemed to indicate it wasn't exploratory. I did a cursory reading of the beginning and the discussion, and was left thinking I should figure out how to clench my fists properly as needed, when a more detailed reading indicated, to me at least, that replication was required before I should take the study seriously. The last paragraph of the discussion section, in particular, is an unmitigated hurrah of sorts about the robustness of the results. I suggest it would've been helpful to me, as well as to media interests, for the paper to be exploratory in tone. I couldn't find a single place in the whole writeup where the language was cautionary: no mention of sample size constraints, and nothing about suggested potential followups. Such are common aspects of writeups I would've liked to see here.
I'm hoping for a response to my contention of two weeks ago that the Abstract and Discussion language of your paper might have been more exploratory in tone. To be clear, I'm referring to language and overall tone exemplified by:
- "these results are striking"
- "the sizes of the effects...tended to be large or very large, with only two comparisons being in the medium range"
- "the findings presented here offer the exciting possibility that..."
To be fair, these are reasonable assertions to make in certain contexts. And I do think that researchers are faced with the difficult tension between encouraging coverage of their science and emphasizing any tentative aspects or potential limits of the study. I'm hoping to get your sense, in retrospect, of whether you achieved that tension satisfactorily. I apologize for any seeming appurtenance/impertinence; that's not intended. I just believe in the power of this kind of forum to leverage broader insights into potential improvements, or to highlight important dimensions I'm not considering.
With due respect to everyone involved, I claim that our usual reliance on null-hypothesis significance testing to extract information from the data we gather is what inevitably leads us into hopeless intellectual quagmires. Sir Ronald Fisher invented statistical methods that simply do not work when the goal is to understand human behavior. I defer to Paul Meehl to make my point more explicitly:
Jon, thank you for commenting. PLOS ONE welcomes debate and critique of published papers, but we have commenting guidelines that aim to keep the discussion constructive. Please see http://www.plosone.org/st....
We do remove comments where they breach our guidelines. I don't think this comment warrants removal - and it has spurred a useful debate including a response from the authors - but comments such as calling the work an 'undergraduate project report' distract from the issues with the science, as the ensuing thread shows. This is why we have our guidelines. If we stay respectful and 'play the ball, not the man', we will have a constructive debate.
Can further comments in this thread please focus on the conduct and reporting of the science?
RE: A reminder of comment guidelines
MattJHodgkinson replied to MattJHodgkinson on 29 Apr 2013 at 10:46 GMT
To correct myself, Jon actually said "I’d be hard pressed not to fail this paper if it was presented to me as an undergraduate project report" rather than literally calling it an "undergraduate project report".
RE: A reminder of comment guidelines
JAnderson replied to MattJHodgkinson on 01 May 2013 at 00:56 GMT
As an associate editor for PLOS you have a responsibility to remove comments that are offensive to either the original author or other participants in the comments section. I am of the opinion that Jon's original comment is highly offensive, inflammatory, and should be deleted. This type of comment makes a mockery of both the science and PLoS One; an apology is not sufficient justification for this type of online behavior by scientists. This is a professional scientific journal, not a place for people to openly insult each other and the original work. If PLoS One gains a reputation for taking $1350 of an author's dollars and then openly supporting unprofessional remarks in its hosted forums, this will represent a major setback in the open access model. Scientists have long wanted to comment on scientific articles, but there needs to be better quality control over what is allowed in these forums.
RE: RE: A reminder of comment guidelines
MattJHodgkinson replied to JAnderson on 01 May 2013 at 13:45 GMT
Thank you for your comment, Prof. Anderson. While we fully agree that Dr. Simons’ comments should have been better worded and more polite, he has apologised and the comment has begun a mostly constructive debate about the conduct and reporting of this study, to which the authors have responded. Several non-constructive side comments have been removed. PLOS ONE values post-publication discussion and critique of published articles and, on balance, we believe that it is best to have this comment thread remain.